About a month ago Mike Lee, my bodyguard and majordomo, was trying to replicate our biggest crashing bug in Delicious Library by running it in gdb and replicating a user's actions. This bug actually wasn't even particularly common, but since we don't have very many crashing bugs in Delicious Library (compared to other projects in which I've been involved), and since I wanted to go drinking and not hang out and watch him screw around with Delicious Library, I took an interest in what he was doing.
Mike was loading certain books from Amazon.com to re-create the library of a user who reported that when he loaded or reloaded some items they would crash Delicious Library every time. Note the emphasis here. Whenever you hear "every time" in a bug report you should breath a sigh of relief, because you'll be able to find the bug. If something is reproducible, it's fixable. Period. It may be a pain, you may have to start cutting out large parts of your program and doing a binary search of your code, but you can fix it if it's reproducible.
In this bug, loading certain items would always crash DL for the user, while loading other items always worked fine. Other users had different items that would cause a crash, and the vast majority of users had no items that would cause a crash. We also couldn't get any items to crash on our machines.
To sum up: we have a crasher bug, that's bad, but it's reproducible for the user, that's good. Different users have different cases that cause crashes, that's bad again, and Mike couldn't reproduce the bug himself, and that's real bad. Also, I want to go drink, which is good in small doses but bad in large doses. What to do?
To his credit, Mike had taken the right first step (try to duplicate the conditions of the crash), but he was starting to spend too long on it (and more time debugging is less time boozing). It was time to move on.
Step two was to examine the exception that was being thrown right before the crash.
NSArchiverArchiveInconsistency - *** NSUnarchiver: inconsistency between written and read data for object 0x0Hmm, doesn't mean anything to me. Except, let's look up "NSArchiverArchiveInconsistency" on Google. A little reading informed me that it can be thrown when reading in bad data from an archive.
An archive? But I don't use NSUnarchivers in Delicious Library, do I? The next part of step two was to search my project in Xcode for NSUnarchiver and its superclass NSCoder. I found that I didn't use them anywhere NEAR the Amazon book-loading code. This made no sense! Sure, NSUnarchiver is also used in NIB loading, but, again, this wasn't anywhere near that kind of code.
Ok, move on. When you're looking for a bug, the best thing to do is keep an open mind. Play with ideas, try things out, then move on. Don't get hung up on one approach. Just try to get a feel for the whole system and the answer will come to you.
Step three was to look at the backtrace:
|0x96323f10: _NSExceptionHandlerExceptionRaiser (in ExceptionHandling)|
0x92896f3c: +[NSException raise:format:] (in Foundation)
0x92893ff0: __decodeObject_old (in Foundation)
0x928a5060: -[NSURLCache _diskCacheGet:] (in Foundation)
0x928a4b34: -[NSURLCache cachedResponseForRequest:] (in Foundation)
0x928a3f94: -[NSURLConnection(NSURLConnectionInternal) _performLoad] (in Foundation)
0x928a3a00: __resourceLoaderPerform (in Foundation)
0x9074afc4: ___CFRunLoopDoSources0 (in CoreFoundation)
0x9074a43c: ___CFRunLoopRun (in CoreFoundation)
0x90749ebc: _CFRunLoopRunSpecific (in CoreFoundation)
0x928a3760: +[NSURLConnection(NSURLConnectionInternal) _resourceLoadLoop:] (in Foundation)
0x9287c2b4: _forkThreadForFunction (in Foundation)
0x9002c3b4: __pthread_body (in libSystem.B.dylib)
Oooh, that's ugly. This isn't the main thread. This isn't even a thread I created. This is Apple's thread. I never even called any of these methods, this is something that got called as a side-effect from my code.
Now, let me state something unequivocally: 98% of the time when you think you've found a bug that is not your fault, it really is your fault. The other 2% of the time... well, it's probably your fault as well.
However, in this case... well, damn. We're deep in Apple's code. But I *know* Apple's got lots of code that uses urls and the web and stuff. I mean, hey, Safari, right? They've tested this. I don't believe their dang URL cache stuff has crashers this bad in it, unless I'm doing something odd to trigger them. So I can't just say, "Meh, Apple's fault, too bad, bitches," and be done with it.
Let's look closer at the trace. At the top, we're handling an exception. Big surprise, we knew that. Below that, we're raising the exception, which we also expected. Under that, we're decoding an object. Alright, so now we've confirmed where we suspected our "NSArchiverArchiveInconsistency" exception came from -- Apple was trying to decode an object and hit something it did not like. Check. But what was being decoded, from where did it come, and why was it corrupt? There are a lot of questions we need to answer.
The first interesting line for us is the one that starts
0x928a5060, because it says,
[NSURLCache _diskCacheGet:]. Now, wait a minute. NSURLCache? That's a class I've heard of. I look that up in Apple's documentation and it does what I might expect; it's for caching stuff that is loaded off the web.
Now it's time to let you in on something that you didn't know, so you couldn't have solved this mystery up to this point: in Delicious Library we load book descriptions from Amazon.com. In August of 2005 we released our 1.5.2 version, which disabled Apple's built-in disk caching of data that is downloaded using the Cocoa frameworks. This was done because the disk cache could conceivably grow quite large (with cover images) and server no purpose: there was never any need to look at the cached data, as whenever we reloaded from Amazon.com we really wanted the latest data, not a cached version.
Back up a second. We're getting a crash in NSURLCache, in a method called
_diskCacheGet:, but I just said that Delicious Library does not use a disk cache. Something's wrong here, right? We've got two assumptions that don't go together. You can smell the cognitive dissonance. This is your best friend. When you smell this smell, you know you're close to solving the whole thing.
What's the next step? Check your assumptions. Never say, "Well, I did blank, I know I did blank, blank is done, and that's the blank story!" Look at the actual code that does blank, and make DAMN SURE it really blanks.
How did we disable the disk cache in 1.5.2? We called
[[NSURLCache sharedURLCache] setDiskCapacity:0].
What exactly does this do? Check Apple's documentation. "Sets the receiver’s on-disk cache capacity to diskCapacity, measured in bytes... The on-disk cache will truncate its contents to diskCapacity, if necessary."
Now it's time for a leap of logic. We know that we're not having trouble when we're writing to the disk cache, we're having trouble reading from the disk cache (we turned off writing, and the backtrace clearly shows us loading from the disk, not writing to it). And, about the cache limiting call we added in 1.5.2: it doesn't actually prevent us from reading the disk cache, it just keeps us from writing any new data to the disk cache.
So, what happens if someone has been using Delicious Library from 1.5 or before (so they already have a disk cache, because it was enabled back then), and then they upgrade to 1.5.2, and reload one of the books they loaded under 1.5?
You wouldn't expect this to crash, because either the book is still in the disk cache, in which case it'd load from the cache (and at worst just have old data), or the cache has been correctly truncated, so the book would not be in the cache, so the book would load from the net. In neither case would you expect a crash.
But here's where we see a possible case that Apple could have missed, because they probably haven't shipped a program that originally had caching on, but then turned it off in a later version. So what if they never really tested their cache truncation code, and it turns out when you set the disk capacity to 0 Cocoa corrupts that program's existing URL cache, and makes it so any time you get a cache hit your program crashes? New programs that set their caches to 0 would never have a problem, and neither would old programs that don't mess with their cache size. One might reasonably think that Delicious Library was one of the few programs that had an older version with a disk cache and a new version that truncated it, and thus was the first to hit test this part of Apple's code (and find the bug).
This would explain why Mike couldn't replicate the user's problems -- he was starting on a machine that had never run 1.5 or earlier, so it didn't have any truncated (eg, corrupted) items in the URL cache (or any items at all, since we disabled writing them). It also explains why certain random items crashed for certain users upon reload -- whichever items they happened to have sitting in their URL cache when they upgraded from 1.5 to 1.5.2 would become corrupted for all time, and those would forever crash instead of loading. And it explains why a lot of users never saw this bug; they didn't reload old books or they started with version 1.5.2 or greater, so they never hit Apple's corrupted cache.
We have a theory, but how do we test it? Two ways: first, Mike asks some users with this problem to send us their URL cache directories (~/Library/Caches/Delicious Library/), and he copies their directories into his home directory and tries loading the items they say to cause crashes. He is able to replicate their crashes, so we've done a positive test. Mike then asks all the users who have reported this problem to blow away that URL cache directory for Delicious Library and try again. 100% report the problem is gone (yay!), and thus we've done a negative test.
We're now pretty darn sure. All that remains it to write a little code so that future users of Delicious Library don't encounter this problem. Sure, it's Apple's bug, but users don't know that and wouldn't much care even if they did; if my program crashes it's on my head, so I'm going to work around this bug. Here's the code, for your edification:
| // Turn off NSURLCache (we sort of keep our own cache...)|
and remove the existing cache directories so we don't get strange crashers from NSURLCache
(NSArchiverArchiveInconsistency) in Apple's code. (File RADAR!)
NSArray *applicationCacheDirectories = NSSearchPathForDirectoriesInDomains(NSCachesDirectory,
unsigned int applicationCacheDirectoryIndex = [applicationCacheDirectories count];
[[NSFileManager defaultManager] removeFileAtPath:[[applicationCacheDirectories
objectAtIndex:applicationCacheDirectoryIndex] stringByAppendingPathComponent:[[NSProcessInfo processInfo]
[[NSURLCache sharedURLCache] setDiskCapacity:0];
From the time I noticed Mike's activity to my writing this workaround code took about a half hour. Mike spent the next day verifying the fix and nursing a hangover.
What are the lessons here? First off, when you're tracing down a bug, keep your mind open. Don't get married to a theory or an approach. This isn't like programming, you've got to defocus your mind, read a bunch, look at your code. Don't think "I know I did this," think, "I *thought* I did this, but maybe I'm crazy." Question EVERY LINE around the crash. Does "if" really mean what I think it does? Does "*" bind tighter than "++"? Did I really write the correct variable name there, or did I write in some different variable and I keep reading it as the correct variable because that's what I meant to write?
Second, when you've got a possible theory on the cause of a bug, don't stop until you can prove to yourself how such a bug could exist and go undetected up until now, whether it's your fault or someone else's. It's not good enough to say, for example, "Oh, must be Bob's fault." If you can't explain how and why Bob made such a mistake, and then devise a workaround or fix, you don't really know what's going on, and thus you don't really know if you've fixed the bug. And, worse, you didn't really learn anything, so you're not going to be any better at it next time.
I never, ever fix a bug, even from another user, unless I can say with some surety that I can see the circumstances that led to that person writing that bug. It may be as simple as "oh, copy and paste error" or "this person was tired and overlooked this edge case, and that didn't come up until now." But if you see a bug and think, "How in the hell did this code ever work?" then you've got a lot more work to do (assuming the code did ever work), because you aren't allowed to move on until you can answer that question.
Understanding your code, and the underlying code of the environment you're in, is more important than fixing any one bug.
Many mojitos died to bring you this information.