July 7, 2008

Pimp My Code, Part 15: The Greatest Bug of All

Last week, a customer reported a bug to me in Delicious Library 2: when he first launched version two, his version one data would start to be imported, but after waiting for tens of minutes watching the annoying aqua progress bar creep along, his entire machine would crash. Every time.

His collection was so huge it took me two solid days to download his version one files onto my machine. When I ran Delicious Library 2 on my Air in debug mode, it converted his data for about thirty minutes, then crashed with an exception that said the database file was 'corrupt'. I tried to look at where it had crashed, but Xcode was reporting that all my source code files were gone. In fact, so were my iTunes songs, and everything in my Documents folder. I opened Finder and Terminal, and verified that the directories were indeed empty. No error messages, no residual files. Just empty. Panic.

A minute or so later, all these files magically came back. Everything. As if nothing had happened.

After thinking about this overnight (while staying away from the computer), I changed one word in my source code, and the customer's file loaded perfectly, in only twenty minutes.

Part 1: Triage

In my personal pantheon of bugs, this report was triaged at the top. Roughly, my triage order is:
  • Data-loss bugs
  • Unavoidable crashers
  • Functionality-blocking bugs
  • Avoidable crashers
  • Avoidable bugs
  • Misfeatures
  • Performance issues
  • Feature suggestions
  • UI feedback
This is, of course, a rough ordering: if a customer is trying to publish a single item to her website and it's taking half an hour, then I'm obviously going to bump that up the list over someone whose app crashes when she leans on the "a" key for three hours in the "Actors" field. (For the record, neither of those is a real bug.)

But, note that in the first case, the "performance" issue has made the feature essentially unusable, so it's really a functionality-blocking bug. Further, if publishing is synchronous, then this bug is blocking access to the application for an unacceptable amount of time, and could be considered a crasher of sorts.

Weighing the bug my real user reported, we see that it's The Perfect Storm, a trifecta that crashes the app (and the machine!), every time (on launch!!), blocks all functionality, and, to add insult to injury, takes a long time to do so. (The food wasn't very good, and the portions were very small.)

This bug was suddenly at the top of my list.

Part 2: Machine Crashers

How can machines crash due to user programs? The machine-crashing bug is exceedingly rare in shipping applications; in my experience, there are two primary causes: some system resource gets used up by the program, or the program confuses the window server (the graphics system) and the machine's primary interface locks up.

As an operating system matures, it's usually harder and harder to confuse the window server (I haven't seen such a bug in OS X in a long time), but resource over-allocation problems remain.

One of the goals of the operating system's designers is to not allow programs running in user space to ever crash the entire machine. Some would say it's a primary goal, even. But in real life, most operating systems can be brought to their knees (if not fully knocked-out) by user code.

Consider a simple application that just allocates memory in a tight loop. As the operating system runs this application and starts running out of free physical RAM pages, it starts throwing away the pages used by other applications, in a process that goes kind of like this: oh, you haven't used mail in a few seconds, I'll throw that away for now, and, uh, also this web browser, and, uh, damn, the window server's backing store, and, uh, this essential system font server, and, uh...

Pretty soon you've effectively locked up the system, just because it's going to take so long to page in any particular resource the windowing system needs to run that by the time it's loaded off the disk your rogue process has caused some other vital page to be unloaded.
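Here's a minimal sketch of such a rogue allocator, just to make the mechanism concrete. (This is purely illustrative, not code from Delicious Library, and please don't run it on a machine you care about.)

#include <stdlib.h>
#include <string.h>

int main(void)
{
    const size_t chunkSize = 64 * 1024 * 1024; // grab 64MB at a time
    for (;;) {
        char *chunk = malloc(chunkSize);
        if (chunk == NULL)
            continue; // out of address space? keep hammering anyway
        memset(chunk, 0xFF, chunkSize); // touch every page so it's really resident
        // we never free(chunk) -- that's the whole point
    }
    return 0;
}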

Now, the internet is full of flames about how this is completely unacceptable, and OS designers are 100% to blame and yadda yadda yadda. These are, in my opinion, probably from computer science PhD students who believe in a perfect world of provable programming and the Easter Bunny.

Here in the realm of actually making money, if running your program causes a user's computer to crash, she doesn't care if it's Apple's "fault" — she's going to post all over the interwebs that your program sucks, and ask you for her money back. Now, since you're not a PhD student, you like money, so this is bad.

And, honestly, why blame the OS designer? Your app was written incorrectly. It was going to crash or be killed by the OS anyhow. This is a bug you need to fix either way. And the user would be pissed at you even if the OS catches your app and says, "Sorry, this application has gone crazy, we're killing it."

You can see why OS designers don't spend all their time worrying about such issues. It's not as if a lot of real malware is written to crash machines — like biological viruses, successful software viruses don't kill the host, they co-exist with the host and use a minimum of resources while replicating themselves. Viruses that kill the host go extinct very quickly.

Further, to allow programmers to get the most out of the hardware, OS designers need to let us live right on the edge. The more blocks and checks they put between us and the hardware, the slower our programs are going to run. Imagine if Apple said to programmers: "Ok, in order to not let you dominate the hardware, we will only let you use 50% of the CPU at any one time." Now imagine you're a user, running World of Warcraft on your brand-new $4,000 ultra-lux machine, and you're only getting half the framerate you should be, and would be under, say, Windows.

You'd be pissed. And Apple would lose a customer.

Interlude

Look, I don't want to seem mean. I'm sure she's a very nice person in real life, and probably quite smart. And, she's a very handsome woman, I'm not denying that. But, honestly, Jessica Alba can't act. It's just not a skill she has.

It's OK. Not everyone was meant to be an actress. But you really have to admit your limitations to yourself. I know, for instance, I'll never be a supermodel. I'm OK with that. I'd make a lousy president, too. I accept it.

It's time to stop, Jessica. You're costing people money. And YOU would be happier doing something you know you're good at, that you could feel good about. There are LOTS of professions for preternaturally beautiful women out there. You could be one of my assistants, for instance.

Part 3: Diagnosis? Delicious!

How do you even begin to fix a bug where things are going great for a while, then your machine crashes and/or all your files disappear?

Remember that the very first thing you do, when looking at any bug, before you even start thinking about it, and long before you look at your code, is replicate it. You can't debug what you can't replicate, and user reports are usually lacking in some details that your trained eye will catch.

At this point, I'd already replicated the crash, and I'd seen a valuable clue the user hadn't seen: my files had disappeared. The key clue.

Next, in a case like this, you'd need to make sure you have a recent backup of your machine, because, damn. I mean, I've been programming for a long, long, long, long, long, long, long, long, long, long, long, long, long, long, long, long, long, long, long, long, long, long, long, long, long time, and I've never seen a bug like that.

Ever.

Ok: your machine is crashing, so you suspect that it's running out of some resource. Memory is the most common resource to run out of and the easiest to find and fix, so you run your program under Instruments, with the leak detection tool. You really should have done this before you released your application, but you were totally burnt-out and you'd been in beta for months and didn't seem to have any major issues, so you failed to do everything just right. Now you're paying.

Your code will run slow as hell under Instruments, but in the end, like that hot gypsy lady at the fair, it will magically tell you, "Hey, you leaked this object exactly here and at exactly this time, so cut it out." It's a small miracle, and the team that wrote it only needs two more to qualify for sainthood, bitches.

Sadly, at this point I had also already run Instruments' leaks tool, weeks before, and although it had found a major image leak in 2.0 (whoopsie!) which had kept some users from loading very large libraries, and I'd shipped the fix in 2.0.1, that fix didn't help our guy. In fact, my virtual memory usage was holding steady at 1.26 GB as I loaded his collection, which is really not bad considering every process on my system right now has between 0.5 and 1.18 GB of VM, under no load.

Not memory. Hmm. Now you think. Think think think. I think better if I distract myself from the problem and let my hind-brain take over. So I went and played Assassin's Creed, and murdered some fools who had waged war on their fellow man, or made the mistake of calling me peasant, or gotten in my way, or looked funny, or stood around yelling too much about how Salem Ali is a strong man. Honestly, once you start stabbing people in the neck it's hard to stop.

All the while, I'm swishing ideas around in my mouth to see if they taste right. Graphics memory leak? Often a window server will have its own mini-memory manager: was I overloading it as I converted 40,000 cover images from version one? Maybe, but... Instruments had actually found my leak of graphics memory before, so I kind of trusted that it would have found any others I might have had.

Stab. Stab stab. Gurgle.

Also, my files had disappeared. That doesn't seem like a graphics problem.

Stab. Man, I love jumping from a tall building and landing with my blade in someone's throat. I am so bad-ass.

What resources are related to files? Honestly, I don't know a lot about modern file systems; they are kind of a black box to me. When I went to college, we had integers which represented files. And we liked it!

But Mach had introduced a new kind of I/O, called memory-mapped I/O, that I was using extensively. A possible culprit?

Part 4: Memories of Memory Mapping

Back when NeXTstep 0.8 was first unveiled, one of its key revolutions was that it was based on not just plain UNIX, but Mach UNIX, which was a really damn fast and clever re-write of the lowest layers of UNIX, with some extra fun added in, including cheap and fast inter-process communication. Avie Tevanian had been working on Mach as a PhD student at Carnegie Mellon, and Steve Jobs recognized he was a star and hired him straight away. (Microsoft countered by hiring Avie's old advisor to work on NT, which is kind of like Microsoft hiring my mom because I'm a good programmer.)

One of the great things Mach gave us was a new way to read files, a process that was really slow by default under vanilla UNIX, for reasons I don't fully understand. Mach introduced the metaphor of memory-mapped files, which used the virtual memory system to map a file on disk onto an address somewhere in VM, without actually loading anything from the disk at first. The cool thing was your program could access any byte in this special VM region, and the correct part of the file would be paged in on-demand, using the exact same mechanism the VM system used to extend applications' physical memory by mapping their virtual memory spaces to a large reserved area on the disk.

This turned out to be very fast, because it had none of the overhead of normal UNIX I/O, which usually involved reading a few bytes at a time with the read() call, which incurred an expensive system trap and a lot of bookkeeping (checking to see if enough bytes have been read in from the disk and, if not, reading in some more) compared to just fetching a memory location. With memory-mapping, once a page was faulted in you could fetch all the other bytes on that page for free, no system calls at all, just a standard memory access. Big win.
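In straight C, the classic pattern looks roughly like this (a sketch for illustration, not my actual code; the path is hypothetical and error checking is omitted for brevity):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/tmp/example.dat", O_RDONLY); // hypothetical file
    struct stat info;
    fstat(fd, &info);
    char *bytes = mmap(NULL, info.st_size, PROT_READ, MAP_FILE | MAP_PRIVATE, fd, 0);
    // Each access below faults in just the one page it touches, on demand.
    printf("first: %c, last: %c\n", bytes[0], bytes[info.st_size - 1]);
    munmap(bytes, info.st_size); // the mapping holds the file open until this
    close(fd);
    return 0;
}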

NeXTstep 1.0 gave us memory streams, which were awesome, and NeXTStep 2.0 (new capitalization) used memory-mapped files by default when you read files using its fancy new NSData class, and all new applications ran a lot faster and life was good.

Well, mostly good. One problem with memory-mapped I/O is that if the user deletes or moves the file on the disk, the OS can't get to it any more, but the programmer never really knows when this might happen, so she doesn't know whether to just go ahead and make a copy of the entire file into memory or not. So it's not a good idea to keep memory-mapped files open for a long time.

Another, more recent, problem is that virtual memory on 32-bit systems is capped at 4GB, which means the largest file you can memory-map is a lot smaller than 4GB (since your app is going to use, like, at least 1GB of VM itself). Nowadays, it's not uncommon for users to have 6GB movie files in their iTunes. Oops.

So, for a while now, Apple has been moving gently away from memory-mapped I/O. For example, you have to set a flag (CSResourcesFileMapped) in your application's Info.plist if you want to memory-map the resource files that the AppKit and Foundation frameworks load for you; otherwise they are read using different (unknown) techniques, which may or may not be slower (see below).
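In the standard XML plist format, setting that flag looks like this:

<key>CSResourcesFileMapped</key>
<true/>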

Fortuitously, however, I'd just been part of a discussion on a developer mailing list about optimized file system access, and learned an important fact, which gave me the solution.

Part 5: Everything New is Old Again

I have exalted memory-mapped I/O for exactly twenty years now, but just a week or so ago an Apple engineer told me that the latest tests show Apple's new-fangled "uncached" access method performs at least as well as memory-mapped I/O, with neither of the two big drawbacks listed above. It was The New Way.

Here's a bit from the NSData page:

NSMappedRead
    A hint indicating the file should be mapped into virtual memory, if possible.
    Available in Mac OS X v10.4 and later.
    Declared in NSData.h

NSUncachedRead
    A hint indicating the file should not be stored in the file-system caches.
    For data being read once and discarded, this option can improve performance.
    Available in Mac OS X v10.4 and later.
    Declared in NSData.h

Sadly, there's not a lot of guidance there on when you should or shouldn't use either option, and no hint as to whether the options are mutually exclusive. Why would you use NSMappedRead, and why not? Who knows?

But now I'd heard the Word, and It Was Good. I resolved then and there to look over my code and see where I did and didn't need NSMappedRead. But I didn't want to just naively trust that NSUncachedRead was always better (APIs so rarely work like that), so I'd need to do timing tests. In fact, I was in the middle of these timing tests (on iTunes importing) when this new bug report came in, so I already had some experience with memory-mapped vs. uncached access.
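My timing harness was nothing fancy; it looked roughly like this (a sketch with made-up variable names, not my shipping code):

NSDate *start = [NSDate date];
for (NSString *path in testImagePaths) { // testImagePaths: hypothetical list of files
    NSError *error = nil;
    NSData *data = [NSData dataWithContentsOfFile:path
                                          options:NSUncachedRead // swap in NSMappedRead to compare
                                            error:&error];
    [data length]; // make sure the read actually happens
}
NSLog(@"elapsed: %g seconds", -[start timeIntervalSinceNow]);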

Part 6: I Love It When a Plan Comes Together

Ok, I'm reading in 40,000 image files as part of the conversion, and storing them in my database. At some point the filesystem goes on vacation and says, to hell with you, I'm not reading any more damn files, even if they are just directories. Hrmmm. Hrmmmmmmmmmm.

"Oh, please, spare my li-- gurgle!"

Maybe the problem was there was some huge bug in Apple's Mach, where if you opened too many files in a short period of time, the filesystem tried to, like, cache the results, and the cache blew up, and as a result the filesystem would incorrectly just fail to open any more files, instead of flushing the cache. This made a little sense, since the problem had gone away spontaneously after a minute or so, like a cache problem would when the cache auto-flushed.

One thing I knew, KNEW wasn't the problem was that I had too many files open simultaneously. Because of the way my code worked, I'd open the files using NSData's dataWithContentsOfMappedFile:, do a little massaging on the image inside, and then the files would auto-close themselves when the NSData was autoreleased, which I would correctly do every 100 files I read in.
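Reconstructed for illustration (the array and cover-object names here are stand-ins, not my real identifiers), the conversion loop looked roughly like this:

NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];
NSUInteger filesRead = 0;
for (NSString *fullImagePath in coverImagePaths) { // coverImagePaths: stand-in name
    NSData *imageData = [NSData dataWithContentsOfMappedFile:fullImagePath];
    NSManagedObject *newCover = [self newCoverForImageAtPath:fullImagePath]; // stand-in helper
    // ...massage the image, then hand the data to the new CoreData cover object...
    [newCover setValue:imageData forKey:@"imageData"]; // "imageData" key: a stand-in
    if (++filesRead % 100 == 0) { // drain the pool every 100 files -- surely enough, right?
        [pool release];
        pool = [[NSAutoreleasePool alloc] init];
    }
}
[pool release];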

I knew this. Knew! I've also been around long enough to know that whenever I know the operating system must be bugged, since my code is correct, I should take a damn close look at my code. The old adage (not mine) is that 99% of the time operating system bugs are actually bugs in your program, and the other 1% of the time they are still bugs in your program, so look harder, dammit.

So I traced through the steps. First I read the image file into the NSData. Right. It's auto-released, so I'm not leaking it. Then, I set the CoreData object's data field to point to the same NSData. Still no leak. Then, the NSData is autoreleased, correctly, and the system resources should be freed up.

Wait a minute. Wait. It's autoreleased, but that doesn't mean it's deallocated. It just means the reference count will be decremented. But someone else might still have a pointer to it... in fact, I had just assigned the same NSData to the CoreData object. Knowing it was immutable, the CoreData object didn't make a copy of my data; it just retained the exact same NSData object. The one that had an open memory-map of the image file!

CoreData's objects persist in memory until you flush them explicitly, which means anything they are holding also persists. So, I was creating 40,000 memory-mapped files and not freeing any of them. That, uh... seemed fishy. Like, five-course-sushi-dinner fishy.

Now, I could flush these cover objects explicitly using CoreData; I'd have to read in a batch of old objects, convert them, save everything, then flush the covers, then read in another batch. In fact, I was already using batches (following CoreData's best-practice guidelines; see "Importing in Batches"), but flushing would add a layer of complexity: I'd have to keep track of the objects I created separately, and loop through and flush them, as sketched below.
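For the record, the explicit-flush approach would look something like this (a sketch under my assumptions about the import code; refreshObject:mergeChanges: with NO turns a saved object back into a fault, dropping whatever it was retaining):

NSMutableArray *convertedCovers = [NSMutableArray array]; // stand-in tracking array
// ...convert a batch of covers, adding each new managed object to convertedCovers...
NSError *error = nil;
if ([managedObjectContext save:&error]) {
    for (NSManagedObject *cover in convertedCovers)
        [managedObjectContext refreshObject:cover mergeChanges:NO]; // re-fault, releasing the retained NSData
    [convertedCovers removeAllObjects];
}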

Sure, that's easy code to write. But I'm making a 2.0.2 release, and in x.x.y releases, I want to change as little code as possible to fix the bug, because my beta testers weren't going to be excited about testing a release that adds no features. I will test it myself, of course, and I'd ask anyone who reported the bug to test it, but doing something as heavy-duty as suddenly flushing a bunch of objects could have side effects, and I wouldn't have enough people to test for those. Yes, in theory flushing shouldn't have side-effects, but in that same world, no code should ever have bugs and communism is a really nifty idea.

But, I'd been told that NSUncachedRead was as good as NSMappedRead, without as many drawbacks. Would it also not tie up the same system resources? Certainly the word "uncached" seemed promising. I'd try it.

I kind of exaggerated on the "one word" in my preamble — it would have been a one-word change, but I'd used the older interface on NSData:

NSData *imageData = [NSData dataWithContentsOfMappedFile:fullImagePath];

I replaced it with this call (passing NSMappedRead here instead of NSUncachedRead would have been exactly equivalent to the line above):

NSData *imageData = [NSData dataWithContentsOfFile:fullImagePath
                                           options:NSUncachedRead
                                             error:&error];

Thus removing the memory-mapping.

Importing the large collection worked perfectly. The problem was solved, and now I could upgrade truly HUGE collections.

Part 7: Where Did We Go Wrong?

We had tested Delicious Library 2 by importing an enormous sample file of 10,000 Delicious Library 1 items. However, it took so long to build up such a library in version 1 that we ended up generating the sample library with a custom program, and as a result that test file didn't have any images in it. Our beta testers, though, were enrolled specifically because they had collections of thousands of items, and we had the largest closed beta of any program I've worked on. We also tested with our friends' Delicious Library 1 files, of about 3,300 items.

But this customer had over 4,400 items in Delicious Library 1. That, it turned out, was a few over a limit we had no idea existed. Damn it.

It's a bug we should have caught. We should have spent the time to get the images in the 10,000 item file. I messed up.

Software is written by humans. Humans get tired. Humans become discouraged. They aren't perfect beings. As developers, we want to pretend this isn't so, that our software springs from our head whole and immaculate like the goddess Athena. Customers don't want to hear us admit that we fail.

The measure of a man cannot be whether he ever makes mistakes, because he will make mistakes. It's what he does in response to his mistakes. The same is true of companies.

We have to apologize, we have to fix the problem, and we have to learn from our mistakes.

Unfortunately, Delicious Monster has so much mail from releasing version 2 that we're failing at responding to customers right now. We are a tiny company with one support person, and although we're bringing a second up to speed, it takes time to teach someone how to solve our users' problems. Our backlog of mail messages is finally going down — as of today, we're at 1,300, down from 3,000 a week or so after our release. We still get several hundred messages a day, though, and it takes time just to sort through and see which ones are bugs, which ones are customers needing help, and which ones are people just saying, "Hey, that's cool, you know what else would be cool?"

When this customer finally got through to me, I apologized to him for the bug and the crummy response time, and elevated his bug to the top. I let him know what I was doing to fix it, and got the fix done in a couple days. I also gave him a free license to Delicious Library 2, since he'd been unable to buy it yet.

I had already known that when I launched 2.0 I couldn't immediately go on vacation; that I'd have to jump into the 2.0.1 release for any urgent bugs that customers found. What I didn't realize was how much support e-mail would spike. That's the lesson I learned: I need to have extra people ready to do support when I do a major release. I'm not sure how to do this without paying people to twiddle their thumbs the rest of the time, but that's something I'll have to figure out. And I apologize to my customers for this screw-up.

The beta for 2.0.2 is out now, at http://delicious-monster.com/downloads/Delicious%20Library%202%20Beta/DeliciousLibrary2.zip.

It'll load a ton of items.

2 Comments:

Anonymous said...

Are you an actor? Acting teacher? Acting coach, even? No? If not, you are not qualified to comment on anyone else's acting abilities. You have only your own ill-considered opinions.

August 01, 2008 1:08 PM

Wil Shipley said...

[Note: In general I don't publish comments on posts, but I thought this one was funny. Don't feel bad if you read it - I have a zillion people who read this and enjoyed it.]

December 29, 2009 3:43 PM
