carlos@bueno.org

Unusually Effective Debugging

Performance optimization in the small is a kind of debugging. You're finding something specific to fix, or fixing something specific you've found. Slowness is a bug. It's an unexpected behavior of the system and you want to remove it while disturbing as little as possible.

Everybody knows somebody who is unusually good at debugging. They're almost magical. They hear about a problem or see a stack trace and say, "Oh, I think I know what that is." Ten minutes later it's fixed. If you're lucky, afterwards you get a neat detective story ending about how and why the bug happened.

I love those people. Everybody loves those people. You can even train yourself to become one of those people, but it's not going to happen by imitating the behaviors that you see. The magic is hidden in those ten minutes. The people themselves may not be able to explain what it is they're doing differently. It's largely a point of view, and the ability to change one's point of view at will.

Effective debugging is not about being right. It's not about knowing the answer. It's not about guessing the answer, or having a detailed mental model of the system and a backup of the codebase in your head. It's not about narratively satisfying just-so stories about the how and the why. It's not about coming up with theories and looking for evidence to confirm those theories. It's kind of the opposite.

Effective debugging is about being wrong. A lot. It's about accepting that being wrong is on the way to discovering the ground-state truth of reality. Having accepted that, you focus on strategies that allow you to be wrong in productive ways. It's about killing your darlings, looking for evidence to prove your theories false. It's about ignoring the how and why and describing, as precisely as possible, what the problem is. It's about imagining a huge multidimensional search space of possibilities and looking for ways to eliminate half or whole dimensions, recursively, until you've isolated the fault.

The Game of Twenty Breakpoints

It's an algorithm every child learns. I'm thinking of an animal and you have to guess what it is. You could do a linear search through the animal kingdom and find the answer. Is it a duck? No. Is it a badger? No. Is it a honeybadger? No. You can get some constant-factor speedup by simulating me in your head. I like dinosaurs, so start with those.

For small values of N, this algorithm works. But on even a moderately complicated system, N can be astoundingly large. If you sit down and work out just how much volume a ten-dimensional space encompasses, you'll see what I mean. By the way, this is why crusty old programmers drone on about keeping things small and simple. Every additional state-value of the system opens another dimension of possibilities to reason about.

A better algorithm is to look for questions that allow you to eliminate whole kingdoms and phyla no matter what the answer is. Is it larger than a duck? Does it breathe air? Does it have fins? If you do it right, you'll find the answer in roughly log(N) time no matter the size of N.

An even better algorithm is to combine the two. Your knowledge of the system isn't useless; it guides your questions. If you suspect I like dinosaurs, one of the first questions should be whether the animal is extinct.
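The halving strategy can be sketched in a few lines of Python. The animals and their attributes below are invented for illustration; the point is that always asking whichever question splits the remaining candidates most evenly finds the answer in roughly log2(N) questions:

```python
# Toy Twenty Questions: each animal is a set of attributes, and each
# question eliminates every candidate whose answer doesn't match.
ANIMALS = {
    "duck":        {"bird", "flies", "aquatic"},
    "badger":      {"mammal", "burrows"},
    "honeybadger": {"mammal", "burrows", "fearless"},
    "bear":        {"mammal", "large"},
    "trout":       {"fish", "aquatic"},
    "t-rex":       {"reptile", "large", "extinct"},
    "triceratops": {"reptile", "large", "extinct", "horned"},
    "sparrow":     {"bird", "flies"},
}

def guess(secret, candidates=ANIMALS):
    """Find `secret` by asking the question that splits the remaining
    candidates most evenly -- roughly log2(N) questions."""
    remaining = dict(candidates)
    questions = 0
    while len(remaining) > 1:
        # Only attributes that actually divide the remaining set are useful.
        attrs = [a for a in {x for feats in remaining.values() for x in feats}
                 if 0 < sum(a in f for f in remaining.values()) < len(remaining)]
        # Ask about the attribute whose yes/no split is closest to half and half.
        attr = min(attrs, key=lambda a: abs(
            sum(a in f for f in remaining.values()) - len(remaining) / 2))
        questions += 1
        answer = attr in candidates[secret]   # "Does it have <attr>?"
        remaining = {n: f for n, f in remaining.items() if (attr in f) == answer}
    return next(iter(remaining)), questions

name, count = guess("bear")
print(name, count)   # finds "bear" in two or three questions, not eight
```

With eight candidates the even-split rule lands on the answer in two or three questions, where linear guessing could burn through all eight.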

Yesterday I played this game with my brother-in-law. We were in bear country, and about four questions in I was pretty sure he was thinking of a bear. At that point I could have played it either way, asking outright or continuing to vector in until I was sure. In the real world there's often a difference in cost between gathering more evidence and testing a particular theory: compiling, running tests, etc. Even if you're pretty sure you've found your bear, a few extra evidential steps won't hurt. It can keep you from stopping too early on a false positive.

Two people playing the same game will start at different points in the space and take different paths. That's fine. That's style. Uniformity of the process doesn't matter, only convergence.

Epilogue

The day after I wrote this post I was diagnosing a problem of missing log data from a remote datacenter. A large number of things have to happen in order for the data to get from point A (the server) to point B, a dot on a graph on my computer screen. Visualizing that as a long pipeline, I started in the middle, skipping along a kind of binary search, expecting to find where the data was leaking out.
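That pipeline bisection is an ordinary binary search. The stage names below are made up for illustration, and the probe stands in for whatever you'd really do at each hop: grep a log, run a query, check a counter:

```python
# Bisecting a pipeline: each probe answers "has the data made it this far?"
STAGES = ["server", "local-agent", "regional-relay", "wan-link",
          "ingest", "aggregator", "storage", "graph"]

def first_broken_stage(data_present_at):
    """Binary-search for the first stage the data never reaches.
    `data_present_at(i)` -> True if data is observed at STAGES[i]."""
    lo, hi = 0, len(STAGES)          # invariant: first missing stage is in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if data_present_at(mid):
            lo = mid + 1             # data got this far; look downstream
        else:
            hi = mid                 # missing here; fault is here or upstream
    return STAGES[lo] if lo < len(STAGES) else None

# Suppose the data silently dies at the WAN link (index 3):
print(first_broken_stage(lambda i: i < 3))   # wan-link, in 3 probes instead of 8
```

Note the degenerate case: if every probe comes back empty, the search walks straight back to the first stage. That's exactly what happened in this story, and it's what pointed the finger at the server itself.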

Annoyingly, all of the things that were supposed to happen seemed to be happening. Yet there was no data in evidence anywhere along the pipeline. It meant that the problem was likely at the source, right on the server itself. But that seemed to be working properly as well.

After four hours of experimenting, reading code, examining systems, digging up old documentation and emails, questioning our sanity, and building and destroying towers of theories, my colleague Zheng and I stumbled on the root cause: a law of physics.

Light travels through optical fiber at 200,000 km per second, one of those handy round numbers Nature gives us once in a while. The two systems were 12,000 km apart, or 24,000 km round trip. I measured that by eyeballing a map of internet cables on our office wall. 24,000 km / 200,000 km/s = 0.120 seconds, which meant that an aggressive 100 ms network connection timeout was always tripping before the data was theoretically able to make the round trip.
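The arithmetic, spelled out with the same numbers as above (the distance is eyeballed from a map, so this is strictly back-of-the-envelope):

```python
# Speed-of-light budget for the missing-logs mystery.
fiber_speed_km_s = 200_000           # light in optical fiber, roughly 2/3 of c
distance_km      = 12_000            # between the two datacenters, eyeballed
round_trip_s     = 2 * distance_km / fiber_speed_km_s

timeout_s = 0.100                    # the aggressive 100 ms connection timeout

print(round_trip_s)                  # 0.12
print(round_trip_s > timeout_s)      # True: physics trips the timeout every time
```

No amount of retries or buffering fixes a floor set by physics; the only moves are raising the timeout or shortening the path.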

Half a day is nowhere near the magical log(N) time this post talks about, and it probably would have taken much longer if I hadn't just been reminded of this famous case. Effective debugging is very hard to pull off consistently.