(This will include some C code explanations so putting this in Developer's Corner as that seems like a good fit)
It’s common these days for evangelism/advocacy purposes to have an “our system would never have this defect” thread. For instance the MAID SAFE project has an ongoing thread with a title like “This would never have happened on MAID SAFE”, which gets updated whenever a famous website gets hacked, or during more egregious censorship episodes. So I figure it would be nice to list examples of crashing bugs or security exploits that plague the IT world, and find out whether they could have occurred on Genode-based operating systems (spoiler: they probably wouldn’t).
An obvious recent example would be the NULL-indirection in a CrowdStrike kernel module that crashed user-facing systems the world over.
But instead of that, today I’m going to share a bit of TTS lore which is not (well) known by the public at large.
Most non-devs (and many devs) will probably be bored to tears by the below… But for me it’s engraved in my mind, it’s written in flaming letters in the sky. So here goes.
Once upon a time (a decade+ ago), in an OS galaxy far far away, there was an Audio CD ripping application called TunePrepper. For a time it ran on an OS called BeOS, and it was a hit with users, and things were good. Radio stations need to convert their Audio CD collection to MP3s on their broadcast computers, and TP was the cat’s meow for that purpose.
But then BeOS went belly-up (or, to be more accurate, was slain, but this is not the place for revisiting the MS anti-trust trial, come on). So we had to port our full TTS software suite (TunePrepper included) to the successor of BeOS, let’s call them “Be2”. That successor improved on BeOS in several ways, but it had an even larger “kernel land” : the Be2 designers thought that e.g. the net_server component, which was in user-space, should become a kernel-space component. “For improved performance” (that was the thinking at the time). (Spoiler alert: net_server was not the culprit in the below story, I’m citing it just for illustration purposes).
Anyway back to TTS: the TunePrepper port was fairly uneventful, it went well as expected… Except the app kept crashing the Kernel, “triggering a KDL”…
“KDL” stands for “Kernel Debugging Land”. The correct terminology would thus be something like “the app kept triggering kernel segfaults which would display the white-screen-of-death Kernel Debugging Land kernel debugger”. But that’s a mouthful, so people just say “the app KDL’ed”, that gets the message across. Anyway. Carrying on.
The KDLs were not entirely reproducible: some CDs would rip ok, you’d get a long string of successful rips, then at some point a rip would trigger a KDL right when the CD would get inserted into the CD tray. Some others would trigger a KDL when selecting another window after being done ripping… There was no definite pattern. The only definite common denominator was TunePrepper (which allowed for eventual resolution of the bug, though with lots of effort and months elapsed… unlike the rest of our problems which were not so “easy” to fix… but I’m getting ahead of myself again).
So what did the “white screen of death” say, what was the crash message exactly ?
Again, there was no definite pattern in the message, just like there wasn’t in the behavior/reproducibility. The KDLs would sometimes be related to TunePrepper, sometimes to random kernel modules. It would crash on corrupt variables in firewire (hint: we didn’t use firewire in any way, shape or form), or in VFS, in the HD Audio driver, or in the network driver, that is to say, in completely random stuff.
To put things in context : TP was a CD ripper. The way it worked was, the user would feed some Audio CDs to the CD-ROM tray, TP would automatically mount them and “rip” them.
During the mounting, the system would automatically do a CDDB lookup to find out the artist and album name of the CD and its tracks… “CDDB” stands for “Compact Disc Database” – basically it has an API where you feed it with the “digital signature” of an Audio CD, and it returns the title of the album and of all its tracks. So after using that server API, TP would rename the mounted device : it would change the original generic name (something like “Audio CD”) to its looked-up name (for instance: “Ark - Burn the Sun”). Spoiler alert: that latter name is a few bytes longer than the original name – see where this is going ?
So TP kept KDL’ing the system. So we filed tickets, lots of them, with photos of the KDL messages and backtrace. Sometimes we’d try to post to an existing ticket, but would get a response like “this KDL message looks nothing like the one you opened the ticket with, you should open a new ticket”.
Days turned into weeks. Weeks turned into months. I very much wanted to solve this, not just for the sake of having software that runs and can be sold, but also for the sake of “lessons learned” : I was looking forward to finding the bug, drawing conclusions so that it would never occur again, and engraving those in a cast-iron panel hung on the wall of my office ^^.
One episode I remember was : I’m looking closely at the KDL message, which mentions (as often) a corrupt 32-bit register, and I start thinking “this 32-bit word looks like a series of 4 ASCII bytes”. Converting the hex to ASCII, I find something like “e Su” and I realize “oooh, that could be a part of the string ‘Ark - Burn the Sun’ being used for a buffer overflow ?”. Nothing definite, but a hint. At last! After months of no hints, and given how desperate I was, I started hanging on to it for dear life. So I start going through my code : the album metadata gets fed to various BString classes, BStringView, BMenuItem… Those are used throughout the system, that can’t be the culprit ? Even if by miracle one of them had a way to trigger a kernel bug, it would manifest itself throughout the OS, not just in TunePrepper ? I even try to do a “synthetic recreation” of the bug, right in my userland code (if only I had known…) : I hardcode the name to “XXXX…XXXXXX” (a series of 64 bytes as I recall). That led me nowhere, I gave up. Not trying to find excuses for myself, but after a long time churning on it, that lead started to look weak, and since it occurred in one of the later tickets, it didn’t occur to me at that time to think of the real cause (which will be revealed below).
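For the curious, the “hex to ASCII” step can be sketched in a few lines of C. This is only an illustration: the actual register value from the KDL is not quoted here, so `0x75532065` is a value I constructed to spell “e Su”, `word_to_ascii` is a made-up helper name, and I’m assuming a little-endian machine (x86), where the lowest byte of the register is the first character in memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode a 32-bit word as four ASCII characters,
 * lowest byte first (little-endian order). Non-printable bytes are
 * shown as '.', the way a hex dump would. */
static void word_to_ascii(uint32_t word, char out[5])
{
    assert(out != NULL);

    for (int i = 0; i < 4; i++) {
        unsigned char byte = (unsigned char)((word >> (8 * i)) & 0xFF);
        out[i] = (byte >= 0x20 && byte < 0x7F) ? (char)byte : '.';
    }
    out[4] = '\0';
}
```

Feeding it `0x75532065` yields the bytes 0x65, 0x20, 0x53, 0x75, that is to say “e Su”. A register full of printable characters is a strong hint that string data got written somewhere it did not belong.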
Meanwhile the tickets kept accumulating (this whole thing went on for months), the screenshots, the back-and-forth, the Be2 team getting tired of hearing from us, etc.
What to do ? What to do ?
We finally paid a little ‘bounty’ to a member of the Be2 team, the one best suited for bug hunting, known for creating the Be2 guarded heap, the leak analyzer and other tools – really rigorous guy, great value to the team and an ass-kicker whenever he set out to fix a bug (but unfortunately that was infrequent, as he had little time to dedicate to Be2 development, it might have changed history otherwise). I remember we agreed on something like, “find the bug, receive $500”. Worth the price considering the effort it took him, and all the bugs he had fixed for us and for the community previously without receiving any compensation; it was also a way to catch up on thanking him.
So he tries out different strategies and after a while, he decides to compile the OS with guarded heap, and runs it in Qemu, emulating an Audio CD, running TunePrepper. The thing takes forever to boot up and to run the app, as the guarded heap is terribly slow and hugely memory-hungry compared to a normal system with a normal memory heap. Plus he has to wade through tons of false positives (i.e. guarded heap “hits” that were irrelevant to our bug), because, well, that’s Be2 we’re talking about.
In the end he does trigger a guarded heap “hit” that also happens to occur right before a KDL, when inserting the (virtual/fake) Audio CD in the Qemu emulator. Yay! There we go. The moment of truth. The hit is about a buffer overrun in… (drum rolls)… rootfs. The file system that gets populated with hard drives, CDs and the like.
The offending code looked something like this:
    if (strlen(newName) > strlen(fs.name))
        // no need to re-allocate, keep the old buffer
        strcpy(fs.name, newName);
    else
    {
        free(fs.name);
        fs.name = strdup(newName);
    }
Did you spot the mistake ? Yup. Exactly. The “if” logic is reversed.
And this, ladies and gents, is how you create a semi-subtle kernel crash, that will clobber adjacent memory that might or might not be used by another kernel component, which might or might not detect it and crash immediately, but will highly likely NOT hint at the real culprit (rootfs), it will crash/point the finger at another component instead.
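For contrast, here is a minimal sketch of what the corrected logic could look like. To be clear, this is not the actual Be2 commit (I’m not reproducing it here): `struct volume` and `volume_rename` are hypothetical names, and I’m assuming the name lives in its own heap buffer, as in the snippet above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the volume record. */
struct volume {
    char *name;  /* heap-allocated, NUL-terminated */
};

/* Returns 0 on success, -1 if out of memory (old name left untouched). */
static int volume_rename(struct volume *vol, const char *new_name)
{
    assert(vol != NULL && vol->name != NULL && new_name != NULL);

    size_t new_len = strlen(new_name);

    if (new_len <= strlen(vol->name)) {
        /* The new name fits in the old buffer: safe to reuse it.
         * memmove tolerates new_name pointing into vol->name. */
        memmove(vol->name, new_name, new_len + 1);
        return 0;
    }

    /* The new name is longer: allocate first, then swap buffers. */
    char *copy = malloc(new_len + 1);
    if (copy == NULL)
        return -1;
    memcpy(copy, new_name, new_len + 1);
    free(vol->name);
    vol->name = copy;
    return 0;
}
```

The only change that matters is the direction of the comparison: the old buffer is reused when the new name is shorter or equal, and a fresh buffer is allocated when it is longer. Allocating before freeing is a small extra precaution, so that a failed allocation does not leave the volume without a name.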
Why didn’t that bug get found sooner, by someone else? People don’t rename their root volumes (hard drive partitions) all that often. And probably not with names as long as “Beatles - Yellow Submarine” that are more likely to clobber important memory. Hypothetically speaking, with a small user base like Be2’s, maybe someone had filed one ticket about it in the previous years. The ticket probably had a KDL message pointing the finger at another component, and thus remained unresolved, or even dismissed as “just a fluke”. Heck, we filed tons of tickets and screenshots, because we had a somewhat reproducible case, since ripping dozens of CDs one after the other resulted in dozens of rootfs renamings, yet no-one found the cause… So how could we lay the blame on a hypothetical prior ticket not being resolved?
Anyway. Bug found. We have to pay the bounty. I remember talking to Dane (the head honcho of TTS ^^) and he was incensed, in two ways. Like “how can a single line of code bring down the whole OS ??” and “I’m not paying up that much for just a single line of fixed code !”. I seem to remember we shared the price eventually, but I almost had to pay out of my own pocket, he was so p-o’ed ^^. I can put myself in his shoes: non-developers have no idea how brittle a big kernel is, that a single NULL indirection or buffer overrun can make the whole thing a flaming wreck.
So what now. We were in a good place afterwards, right ? After finally finding this if/else logic reversal, the fix was pushed into a commit, a new nightly build was made, we installed the nightly and verified that this KDL no longer occurred. Problem solved, right ? We no longer KDL with a buffer overflow when inserting an Audio CD, we’re good to go ?
Nope. There soon was another problem. Then another. Then another. Then another.
We threw in the towel : let’s stop sales of TunePrepper completely, and instead focus on the rest of our software suite.
But then the “rest” had a problem too. Then another. Then… Ok I’ll stop here.
So what’s the “lesson learned” bottom line here, about this particular kernel and how it’s developed ?
- are they lazy-bones coders who do not measure up ? Of course not, they re-created a whole Operating System with little prior experience of OS development, in the space of a few years, without (originally) being paid anything. That’s pure heroics right there.
- maybe Donald Knuth was right, “premature optimization is the root of all evil”, especially if (speculating a bit, here) it’s the kind of if-else optimization done late at night running on (coffee) fumes ? Maybe.
- maybe it’s our own fault at TTS, we didn’t file enough tickets, and they would have found the bug faster, had we filed more tickets ? HA ! Good one. You have me in stitches here.
- maybe there exists a magic trick that can make kernel coding easy and we are all too stupid to see it ? Of course not. The only way to make big kernels stable is to be Google with its billions $$$, or have a huge user base and dev base that “work harder, not smarter”. And even if you “tick all the boxes” you might still get a “CrowdStrike”, despite all your billions of dollars and your hardworking, bloodshot-eyed droids.
The answer is of course, none of the above really matters, due to the original sin. Don’t put all your eggs in the “kernel” basket and expect to avoid Amiga-style “Guru Meditations”.
Bottom line – none of that would have happened on Genode-based operating systems.