Home    All Articles    About    carlos@bueno.org    RSS

Unusually Effective Debugging

Performance optimization in the small is a kind of debugging. You're finding something specific to fix, or fixing something specific you've found. Slowness is a bug. It's an unexpected behavior of the system and you want to remove it while disturbing as little as possible.

Everybody knows somebody who is unusually good at debugging. They're almost magical. They hear about a problem or see a stack trace and say Oh, I think I know what that is. Ten minutes later it's fixed. If you're lucky, afterwards you get a neat detective story ending about how and why the bug happened.

I love those people. Everybody loves those people. You can even train yourself to become one of those people, but it's not going to happen by imitating the behaviors that you see. The magic is hidden in those ten minutes. The people themselves may not be able to explain what it is they're doing different. It's largely a point of view, and the ability to change one's point of view at will.

Effective debugging is not about being right. It's not about knowing the answer. It's not about guessing the answer, or having a detailed mental model of the system and a backup of the codebase in your head. It's not about narratively satisfying just-so stories about the how and the why. It's not about coming up with theories and looking for evidence to confirm those theories. It's kind of the opposite.

Effective debugging is about being wrong. A lot. It's about accepting that being wrong is on the way to discovering the ground-state truth of reality. Having accepted that, you focus on strategies that allow you to be wrong in productive ways. It's about killing your darlings, looking for evidence to prove your theories false. It's about ignoring the how and why and describing, as precisely as possible, what the problem is. It's about imagining a huge multidimensional search space of possibilities and looking for ways to eliminate half or whole dimensions, recursively, until you've isolated the fault.

The Game of Twenty Breakpoints

It's an algorithm every child learns. I'm thinking of an animal and you have to guess what it is. You could do a linear search through the animal kingdom and find the answer. Is it a duck? No. Is it a badger? No. Is it a honeybadger? No. You can get some constant-factor speedup by simulating me in your head. I like dinosaurs, so start with those.

For small values of N, this algorithm works. But on even a moderately complicated system, N can be astoundingly large. If you sat down and worked out just how much volume is encompassed by a ten-dimensional space, you'll see what I mean. By the way, this is why crusty old programmers drone on about keeping things small and simple. Every additional state-value of the system opens another dimension of possibilities to reason about.

A better algorithm is to look for questions that allow you to eliminate whole kingdoms and phyla no matter what the answer is. Is it larger than a duck? Does it breathe air? Does it have fins? If you do it right, you'll find the answer in roughly log(N) time no matter the size of N.

An even better algorithm is to combine the two. Your knowledge of the system isn't useless; it guides your questions. If you suspect I like dinosaurs, one of the first questions should be whether the animal is extinct.

Yesterday I played this game with my brother-in-law. We were in bear country, and about four questions in I was pretty sure he was thinking of a bear. At that point I could have played it either way, asking outright or continuing to vector in until I was sure. In the real world there's often a difference in cost between gathering more evidence and testing a particular theory: compiling, running tests, etc. Even if you're pretty sure you've found your bear, a few extra evidential steps won't hurt. It can keep you from stopping too early on a false positive.

Two people playing the same game will start at different points in the space and take different paths. That's fine. That's style. Uniformity of the process doesn't matter, only convergence.


The day after I wrote this post I was diagnosing a problem of missing log data from a remote datacenter. A large number of things have to happen in order for the data to get from point A (the server) to point B, a dot on a graph on my computer screen. Visualizing that as a long pipeline, I started in the middle, skipping along a kind of binary search, expecting to find where the data was leaking out.

Annoyingly, all of the things that were supposed to happen seemed to be happening. And there was no data in evidence anywhere along the pipeline. It meant that the problem was likely at the source, right on the server itself. But that seemed to be working properly as well.

After four hours of experimenting, reading code, examining systems, digging up old documentation and emails, questioning of sanity, and towers of theories built up and destroyed, my colleague Zheng and I stumbled on the root cause: a law of physics.

Light travels through optical fiber at 200,000km per second, one of those handy round numbers Nature gives us once in a while. The two systems were 12,000km apart, or 24,000km round trip. I measured that by eyeballing a map of internet cables on our office wall. 24 / 200 = 0.120, which meant that an aggressive 100ms network connection timeout was always tripping before the data was theoretically able to make the round trip.

Half a day is nowhere near the magical log(N) time this post talks about, and it probably would have taken much longer if I hadn't just been reminded of this famous case. Effective debugging is very hard to pull off consistently.