March 4, 2006

Pimp My Code, Part 8: Mary, Mary, why you buggin?

In today's episode, instead of harshing on some poor fellow's code, we trace the short, miserable life of a real, live bug, from commission to detection to squashing, hopefully illustrating some general techniques for would-be exterminators.

--

About a month ago Mike Lee, my bodyguard and majordomo, was trying to replicate our biggest crashing bug in Delicious Library by running it in gdb and replicating a user's actions. This bug actually wasn't even particularly common, but since we don't have very many crashing bugs in Delicious Library (compared to other projects in which I've been involved), and since I wanted to go drinking and not hang out and watch him screw around with Delicious Library, I took an interest in what he was doing.

Mike was loading certain books from Amazon.com to re-create the library of a user who reported that when he loaded or reloaded some items they would crash Delicious Library every time. Note the emphasis here. Whenever you hear "every time" in a bug report you should breath a sigh of relief, because you'll be able to find the bug. If something is reproducible, it's fixable. Period. It may be a pain, you may have to start cutting out large parts of your program and doing a binary search of your code, but you can fix it if it's reproducible.

In this bug, loading certain items would always crash DL for the user, while loading other items always worked fine. Other users had different items that would cause a crash, and the vast majority of users had no items that would cause a crash. We also couldn't get any items to crash on our machines.

To sum up: we have a crasher bug, that's bad, but it's reproducible for the user, that's good. Different users have different cases that cause crashes, that's bad again, and Mike couldn't reproduce the bug himself, and that's real bad. Also, I want to go drink, which is good in small doses but bad in large doses. What to do?

To his credit, Mike had taken the right first step (try to duplicate the conditions of the crash), but he was starting to spend too long on it (and more time debugging is less time boozing). It was time to move on.

Step two was to examine the exception that was being thrown right before the crash. NSArchiverArchiveInconsistency - *** NSUnarchiver: inconsistency between written and read data for object 0x0 Hmm, doesn't mean anything to me. Except, let's look up "NSArchiverArchiveInconsistency" on Google. A little reading informed me that it can be thrown when reading in bad data from an archive.

An archive? But I don't use NSUnarchivers in Delicious Library, do I? The next part of step two was to search my project in Xcode for NSUnarchiver and its superclass NSCoder. I found that I didn't use them anywhere NEAR the Amazon book-loading code. This made no sense! Sure, NSUnarchiver is also used in NIB loading, but, again, this wasn't anywhere near that kind of code.

Ok, move on. When you're looking for a bug, the best thing to do is keep an open mind. Play with ideas, try things out, then move on. Don't get hung up on one approach. Just try to get a feel for the whole system and the answer will come to you.

Step three was to look at the backtrace:

Backtrace
0x96323f10: _NSExceptionHandlerExceptionRaiser (in ExceptionHandling)
0x92896f3c: +[NSException raise:format:] (in Foundation)
0x92893ff0: __decodeObject_old (in Foundation)
0x928a5060: -[NSURLCache _diskCacheGet:] (in Foundation)
0x928a4b34: -[NSURLCache cachedResponseForRequest:] (in Foundation)
0x928a3f94: -[NSURLConnection(NSURLConnectionInternal) _performLoad] (in Foundation)
0x928a3a00: __resourceLoaderPerform (in Foundation)
0x9074afc4: ___CFRunLoopDoSources0 (in CoreFoundation)
0x9074a43c: ___CFRunLoopRun (in CoreFoundation)
0x90749ebc: _CFRunLoopRunSpecific (in CoreFoundation)
0x928a3760: +[NSURLConnection(NSURLConnectionInternal) _resourceLoadLoop:] (in Foundation)
0x9287c2b4: _forkThreadForFunction (in Foundation)
0x9002c3b4: __pthread_body (in libSystem.B.dylib)


Oooh, that's ugly. This isn't the main thread. This isn't even a thread I created. This is Apple's thread. I never even called any of these methods, this is something that got called as a side-effect from my code.

Now, let me state something unequivocally: 98% of the time when you think you've found a bug that is not your fault, it really is your fault. The other 2% of the time... well, it's probably your fault as well.

However, in this case... well, damn. We're deep in Apple's code. But I *know* Apple's got lots of code that uses urls and the web and stuff. I mean, hey, Safari, right? They've tested this. I don't believe their dang URL cache stuff has crashers this bad in it, unless I'm doing something odd to trigger them. So I can't just say, "Meh, Apple's fault, too bad, bitches," and be done with it.

Let's look closer at the trace. At the top, we're handling an exception. Big surprise, we knew that. Below that, we're raising the exception, which we also expected. Under that, we're decoding an object. Alright, so now we've confirmed where we suspected our "NSArchiverArchiveInconsistency" exception came from -- Apple was trying to decode an object and hit something it did not like. Check. But what was being decoded, from where did it come, and why was it corrupt? There are a lot of questions we need to answer.

The first interesting line for us is the one that starts 0x928a5060, because it says, [NSURLCache _diskCacheGet:]. Now, wait a minute. NSURLCache? That's a class I've heard of. I look that up in Apple's documentation and it does what I might expect; it's for caching stuff that is loaded off the web.

Now it's time to let you in on something that you didn't know, so you couldn't have solved this mystery up to this point: in Delicious Library we load book descriptions from Amazon.com. In August of 2005 we released our 1.5.2 version, which disabled Apple's built-in disk caching of data that is downloaded using the Cocoa frameworks. This was done because the disk cache could conceivably grow quite large (with cover images) and server no purpose: there was never any need to look at the cached data, as whenever we reloaded from Amazon.com we really wanted the latest data, not a cached version.

Back up a second. We're getting a crash in NSURLCache, in a method called _diskCacheGet:, but I just said that Delicious Library does not use a disk cache. Something's wrong here, right? We've got two assumptions that don't go together. You can smell the cognitive dissonance. This is your best friend. When you smell this smell, you know you're close to solving the whole thing.

What's the next step? Check your assumptions. Never say, "Well, I did blank, I know I did blank, blank is done, and that's the blank story!" Look at the actual code that does blank, and make DAMN SURE it really blanks.

How did we disable the disk cache in 1.5.2? We called [[NSURLCache sharedURLCache] setDiskCapacity:0].

What exactly does this do? Check Apple's documentation. "Sets the receiver’s on-disk cache capacity to diskCapacity, measured in bytes... The on-disk cache will truncate its contents to diskCapacity, if necessary."

Now it's time for a leap of logic. We know that we're not having trouble when we're writing to the disk cache, we're having trouble reading from the disk cache (we turned off writing, and the backtrace clearly shows us loading from the disk, not writing to it). And, about the cache limiting call we added in 1.5.2: it doesn't actually prevent us from reading the disk cache, it just keeps us from writing any new data to the disk cache.

So, what happens if someone has been using Delicious Library from 1.5 or before (so they already have a disk cache, because it was enabled back then), and then they upgrade to 1.5.2, and reload one of the books they loaded under 1.5?

You wouldn't expect this to crash, because either the book is still in the disk cache, in which case it'd load from the cache (and at worst just have old data), or the cache has been correctly truncated, so the book would not be in the cache, so the book would load from the net. In neither case would you expect a crash.

But here's where we see a possible case that Apple could have missed, because they probably haven't shipped a program that originally had caching on, but then turned it off in a later version. So what if they never really tested their cache truncation code, and it turns out when you set the disk capacity to 0 Cocoa corrupts that program's existing URL cache, and makes it so any time you get a cache hit your program crashes? New programs that set their caches to 0 would never have a problem, and neither would old programs that don't mess with their cache size. One might reasonably think that Delicious Library was one of the few programs that had an older version with a disk cache and a new version that truncated it, and thus was the first to hit test this part of Apple's code (and find the bug).

This would explain why Mike couldn't replicate the user's problems -- he was starting on a machine that had never run 1.5 or earlier, so it didn't have any truncated (eg, corrupted) items in the URL cache (or any items at all, since we disabled writing them). It also explains why certain random items crashed for certain users upon reload -- whichever items they happened to have sitting in their URL cache when they upgraded from 1.5 to 1.5.2 would become corrupted for all time, and those would forever crash instead of loading. And it explains why a lot of users never saw this bug; they didn't reload old books or they started with version 1.5.2 or greater, so they never hit Apple's corrupted cache.

--

We have a theory, but how do we test it? Two ways: first, Mike asks some users with this problem to send us their URL cache directories (~/Library/Caches/Delicious Library/), and he copies their directories into his home directory and tries loading the items they say to cause crashes. He is able to replicate their crashes, so we've done a positive test. Mike then asks all the users who have reported this problem to blow away that URL cache directory for Delicious Library and try again. 100% report the problem is gone (yay!), and thus we've done a negative test.

We're now pretty darn sure. All that remains it to write a little code so that future users of Delicious Library don't encounter this problem. Sure, it's Apple's bug, but users don't know that and wouldn't much care even if they did; if my program crashes it's on my head, so I'm going to work around this bug. Here's the code, for your edification:

Workaround code
// Turn off NSURLCache (we sort of keep our own cache...)
and remove the existing cache directories so we don't get strange crashers from NSURLCache
(NSArchiverArchiveInconsistency) in Apple's code. (File RADAR!)
NSArray *applicationCacheDirectories = NSSearchPathForDirectoriesInDomains(NSCachesDirectory,
NSUserDomainMask, YES);
unsigned int applicationCacheDirectoryIndex = [applicationCacheDirectories count];
while (applicationCacheDirectoryIndex--)
[[NSFileManager defaultManager] removeFileAtPath:[[applicationCacheDirectories
objectAtIndex:applicationCacheDirectoryIndex] stringByAppendingPathComponent:[[NSProcessInfo processInfo]
processName]] handler:NULL];
[[NSURLCache sharedURLCache] setDiskCapacity:0];


From the time I noticed Mike's activity to my writing this workaround code took about a half hour. Mike spent the next day verifying the fix and nursing a hangover.

--

What are the lessons here? First off, when you're tracing down a bug, keep your mind open. Don't get married to a theory or an approach. This isn't like programming, you've got to defocus your mind, read a bunch, look at your code. Don't think "I know I did this," think, "I *thought* I did this, but maybe I'm crazy." Question EVERY LINE around the crash. Does "if" really mean what I think it does? Does "*" bind tighter than "++"? Did I really write the correct variable name there, or did I write in some different variable and I keep reading it as the correct variable because that's what I meant to write?

Second, when you've got a possible theory on the cause of a bug, don't stop until you can prove to yourself how such a bug could exist and go undetected up until now, whether it's your fault or someone else's. It's not good enough to say, for example, "Oh, must be Bob's fault." If you can't explain how and why Bob made such a mistake, and then devise a workaround or fix, you don't really know what's going on, and thus you don't really know if you've fixed the bug. And, worse, you didn't really learn anything, so you're not going to be any better at it next time.

I never, ever fix a bug, even from another user, unless I can say with some surety that I can see the circumstances that led to that person writing that bug. It may be as simple as "oh, copy and paste error" or "this person was tired and overlooked this edge case, and that didn't come up until now." But if you see a bug and think, "How in the hell did this code ever work?" then you've got a lot more work to do (assuming the code did ever work), because you aren't allowed to move on until you can answer that question.

Understanding your code, and the underlying code of the environment you're in, is more important than fixing any one bug.

Many mojitos died to bring you this information.

Labels:

27 Comments:

Anonymous David said...

I've seen the same bug in my application, which ocurred because users used the application in 10.3.9 and then upgraded to 10.4. Apparently the caches folder was copied somehow and messed thing about... My solution was to irgnoe this exception in the exceptionHandler:shouldHandleException:mask: method.

..and apple has this problem too I guess, since this bug also ocurrs when you try to work with the NSFontManager directly after the user upgraded as described above.

March 04, 2006 4:27 AM

 
Anonymous Anonymous said...

Please tell me that leaving the curlies off the while statement is just to compress it for the sake of the example :(

March 04, 2006 4:27 AM

 
Anonymous Matthias Winkelmann said...

Please tell me that leaving the curlies off the while statement is just to compress it for the sake of the example :(

I'm hoping the same. Leaving them out saves you one line and two characters, but the price is inconsistency, lower readability and having to think at the inevitable time you add a second line to the body of that loop.

March 04, 2006 5:11 AM

 
Anonymous Anonymous said...

Noobies should always use curly braces around blocks. That way it's easily identifiable as noob code. ;)

It's kind of a litmus test, if you feel so religious about these kind of things that you need to comment about it on other ppls code, you are a noob.

March 04, 2006 5:20 AM

 
Anonymous Anonymous said...

> if you feel so religious about these kind of things that you need to comment about it on other ppls code, you are a noob.

Yeah, but calling someone a "noob" labels you as a troll living in mom's basement jerking off to a Sear's catalog.

March 04, 2006 5:37 AM

 
Anonymous Anonymous said...

I'd put curly braces around the statement even if it meant my breath smells bad and my momma dresses me funny.

March 04, 2006 5:50 AM

 
Anonymous Shane said...

Good post Wil! I've always felt that this ability to project yourself through the code during a debug session is the hallmark of a great developer.

March 04, 2006 7:34 AM

 
Anonymous Anonymous said...

Did you file a bug with Apple?

March 04, 2006 7:50 AM

 
Anonymous Anonymous said...

As to the Sears catalogue, it takes a certain background to come up with that thought, namely, experience?

March 04, 2006 7:50 AM

 
Blogger Mike said...

This is probably the completely wrong location to mention this, but I am rather annoyed because I have a 100% reproduceable crash (well, technically an exception; the whole app doesn't crash) in Delicious Library 1.6.1 (running on a 10.4.5 PPC) that I have reported to Delicious Monster both via email and the built-in crash reporter and have never gotten any kind of acknowledgement. I love Delicious Library and really want to switch to it, but this bug makes that, well, impossible.

So for anyone playing along at home, if you launch Delicious Library, tell it you want to import books from Booxter, and click OK, you will get an exception every time at NSInvalidUnarchiveOperationException

NSInvalidUnarchiveOperationException - *** -[NSKeyedUnarchiver decodeObjectForKey:]: cannot decode object of class (BXSmartList)

{
NSStackTraceKey = "0x9604bf10 0x9290207c 0x928ec0d4 0x928eca78 0x928ecf18 0x928ec140 0x928eca78 0x9290a1d8 0x928ec140 0x928eb924 0x9296b210 0x000a1e08 0x0005ab6c 0x00048050 0x00046f64 0x9372e270 0x9372e1a4 0x9372e084 0x9374809c 0x93747c84 0x937476a8 0x936e8eb0 0x93691ef4 0x00033238 0x93689330 0x93779e68 0x0000a308 0x0000a1b0";
}

STACK:
0x9604bf10: _NSExceptionHandlerExceptionRaiser (in ExceptionHandling)
0x9290207c: +[NSException raise:format:] (in Foundation)
0x928ec0d4: __decodeObjectBinary (in Foundation)
0x928eca78: -[NSKeyedUnarchiver _decodeArrayOfObjectsForKey:] (in Foundation)
0x928ecf18: -[NSArray initWithCoder:] (in Foundation)
0x928ec140: __decodeObjectBinary (in Foundation)
0x928eca78: -[NSKeyedUnarchiver _decodeArrayOfObjectsForKey:] (in Foundation)
0x9290a1d8: -[NSDictionary initWithCoder:] (in Foundation)
0x928ec140: __decodeObjectBinary (in Foundation)
0x928eb924: __decodeObject (in Foundation)
0x9296b210: +[NSKeyedUnarchiver unarchiveObjectWithFile:] (in Foundation)
0x000a1e08: 0x000a1e08 (in Delicious Library)
0x0005ab6c: 0x0005ab6c (in Delicious Library)
0x00048050: 0x00048050 (in Delicious Library)
0x00046f64: 0x00046f64 (in Delicious Library)
0x9372e270: -[NSApplication sendAction:to:from:] (in AppKit)
0x9372e1a4: -[NSControl sendAction:to:] (in AppKit)
0x9372e084: -[NSCell _sendActionFrom:] (in AppKit)
0x9374809c: -[NSCell trackMouse:inRect:ofView:untilMouseUp:] (in AppKit)
0x93747c84: -[NSButtonCell trackMouse:inRect:ofView:untilMouseUp:] (in AppKit)
0x937476a8: -[NSControl mouseDown:] (in AppKit)
0x936e8eb0: -[NSWindow sendEvent:] (in AppKit)
0x93691ef4: -[NSApplication sendEvent:] (in AppKit)
0x00033238: 0x00033238 (in Delicious Library)
0x93689330: -[NSApplication run] (in AppKit)
0x93779e68: _NSApplicationMain (in AppKit)
0x0000a308: 0x0000a308 (in Delicious Library)
0x0000a1b0: 0x0000a1b0 (in Delicious Library)
???:


So I have tried all the obvious stuff. Made sure haxies were disabled. Tried a "clean" user and a "clean" Delicious installation. No dice.

Sorry if this is the wrong place to post this, but hey, the subject of the blog post kinda, well, spurred my interest.

mike@mikesilverman.com

March 04, 2006 9:08 AM

 
Anonymous Anonymous said...


Yeah, but calling someone a "noob" labels you as a troll living in mom's basement jerking off to a Sear's catalog.


Yea, but wouldn't it be like a Macworld for us?

March 04, 2006 9:13 AM

 
Anonymous Anonymous said...

That bug brings shivers to my spine.

March 04, 2006 9:31 AM

 
Blogger Mike Lee said...

O_o

Mike Silverman,

I'll e-mail this to you directly, but, for anyone finding this via Google...

Delicious Library doesn't support smart playlists, so it doesn't know what to do with Booxter's. You need to delete those from your Booxter library before importing.

For the record, we don't usually respond to crash reports because they're usually automated, but we do read them, aggregate them, and work on fixing them.

March 04, 2006 9:43 AM

 
Anonymous Mark Stultz said...

"Meh, Apple's fault, too bad, bitches"

I believe I've said these same words while figuring out my own bugs with WebKit. Turned out to be my fault as well :)

March 05, 2006 12:28 AM

 
Anonymous Anonymous said...

Alcoholism Test

Below are some general questions to help you assess whether you have a problem with alcohol.

1. Do you occasionally drink heavily after a disappointment, quarrel, or when the boss gives you a hard time?

2. When you have trouble or feel under pressure, do you always drink more heavily than usual?

3. Have you noticed that you are unable to handle more liquor than when you were first drinking?

4. Did you ever wake up the "morning after" and find that you could not remember part of the past evening, even though your friends say you did not "pass out"?

5. When drinking with other people, do you try to have a few extra drinks when others will not know it?

6. Are there certain occasions when you feel uncomfortable if alcohol is not available?

7. Have you recently noticed that when you begin drinking, you are in more of a hurry to get the first drink than you used to be?

8. Do you sometimes feel a little guilty about your drinking?

9. Are you secretly irritated when your family or friends discuss your drinking?

10. Have you recently noticed an increase in the frequency of your memory "blackouts"?

11. Do you often find that you wish to continue drinking after friends say they have had enough?

12. Do you usually have a reason for the occasions when you drink heavily?

13. When sober, do you often regret things you have done or said while drinking?

14. Have you tried switching brands or following various plans to control your drinking?

15. Have you often failed to keep the promises you have made to yourself about controlling or cutting down on your drinking?

16. Have you ever tried to control your drinking by changing jobs or moving?

17. Do you try to avoid family or close friends while you are drinking?

18. Are you having an increasing number of financial and work problems?

19. Do more people seem to be treating you unfairly without good reason?

20. Do you eat very little or irregularly when you are drinking?

21. Have you recently noticed that you cannot drink as much as you once did?

22. Do you sometimes have "the shakes" in the morning and find that it helps to have a drink?

23. Do you sometimes stay drunk for several days at a time?

24. Do you sometimes feel very depressed and wonder whether life is worth living?

25. Sometimes after periods of drinking, do you see or hear things that aren’t there?

26. Do you get terribly frightened after you have been drinking heavily?

March 05, 2006 8:21 AM

 
Blogger Wil Shipley said...

Yes, yes, yes, yes, no, yes, no, yes, yes, yes, no, yes, yes, no, yes, no, no, no, no, no, yes, no, no, yes, no, no.

March 05, 2006 8:42 AM

 
Blogger David said...

is it just me, or does that title sound like it came out of a george clinton song? you've got to have a little place in your heart for some parliament, wil ;-)

March 05, 2006 12:01 PM

 
Blogger Wil Shipley said...

I've got a bit of Parliament, and also Digital Underground, but in this case I was paying homage to Run-DMC.

March 05, 2006 2:31 PM

 
Blogger MatzeLoCal said...

Hei :)

I really really really love 'Pimp My Code'. But unfortunately I know it just since 'Part 6'. Since Blogger.com is kinda search-unfriendly, I thought if it wouldn't be possible to if you could put some static links to the parts of PMC on the sidebar.

Greets,

Matthias

March 05, 2006 11:02 PM

 
Anonymous Anonymous said...

Oh, and here I thought the "yes, yes, no, no" was homage to Mel Brook's "History of the World: Part I"

March 06, 2006 12:42 PM

 
Blogger Abhi Beckert said...

Matzelocal, just click on the last item in "Previous Posts", and it'll change to the 10 posts before that one.

The first PMP is only a few pages back.

March 07, 2006 9:28 PM

 
Anonymous Anonymous said...

Has the WCTU taken up comment-spamming? What's with the 'are you a drunk' test?

-jcr

March 10, 2006 3:10 AM

 
Anonymous Jerome said...

This post is actually an interesting little nugget from a language designer's point of view. Understanding the code and the environment are all too often overlooked in language design. One of the biggest problems, for examle, with the current crop of scripting languages is that there is an over emphasis on flexibility, ease of use, learning curve, and the such, and a wholesale neglect of understanding what's going on. After aspects get old, it will be interesting to see if this issue gets addressed.

March 12, 2006 11:13 AM

 
Anonymous Anonymous said...

AH, that brings back memories of the time when I was assigned to fix weird bugs in our router software. The joy of finding a bug after two weeks of intensive and frustrating search...

To hunt down bugs, you have to think like an experimental physicist. You observe a phenomenon, develop a theory, and test it.

The hardest step is to find out how to reproduce the error.

March 15, 2006 12:46 AM

 
Anonymous charles who used to live in Seattle said...

This is very common on the hardware side, too.

"My part worked until you added your crappy crap to my part, and then my part stopped working. Meh, your fault, too bad, bitch."

I problem solve exactly the same way as Wil (that alone makes me feel a little bit smarter than I did a minute ago).

The take away: Understand, Really Understand, the problem before you start wasting a lot of time throwing around blame and band-aid fixes.

March 19, 2006 11:12 PM

 
Anonymous Anonymous said...

Once when I was debugging some embedded stuff (8052 assembler inside some sort of security system), I was getting a crasher exactly 7 seconds after the chip started up. This happened to correspond to the exact same line in the code every time, so naturally that started to come under intense scrutiny. Trouble was, there was nothing wrong with it! No matter how I tried to "see" what that code was *really* doing, could I see a way for it to crash. This had me scratching my head for a few days... I was getting nowhere. At that stage the hardware was just a wirewrap board and the only one in existence, so I couldn't run the code on another system to try it there, and that's what started me thinking, well, even though we had this running just great until the latest code version, perhaps it's a hardware problem after all? Well. To cut a long story short, that's exactly what it was. Turned out that that CPU has a pin which is pulled low to access the internal ROM, and high to access externally connected memory. We didn't have any external memory, so the pin was tied low... or was it? No it was just not connected. The pin was sensed at startup time by some internal circuitry, then set to high impedance. It would slowly gather charge from stray leakage until it reached a logic '1', at which point it tried to switch to the non-existent external ROM. This charge-up took... 7 seconds. The reason it had never been spotted was that previously the weather had been more humid and had probably not allowed the charge to build up. A change to drier conditions coincided with a new version of the code! So it had to be the code, right? I guess the point is, as Wil says, you have to keep an open mind and root out the problem whatever it is - don't assume that you know all the facts. The fact it's failing and you don't know why proves you don't.

October 30, 2006 7:10 PM

 
Blogger Nitewing98 said...

DUDE! God bless you. I have been fighting a similar error being generated by an Applescript Studio app that I'm working on. I believe this post just put me on the track of this error.

August 08, 2007 12:38 PM

 

Post a Comment

<< Home