
A Dismal Guide to Concurrency

12 April 2010

Two people can paint a house faster than one can. Honeybees work independently but pass messages to each other about conditions in the field. Many forms of concurrency, so obvious and natural in the real world, are actually pretty alien to the way we write programs today. It's much easier to write a program assuming that there is one processor, one memory space, sequential execution and a God's-eye view of the internal state. Language is a tool of thought as much as a means of expression, and the mindset embedded in the languages we use can get in the way. (Originally on Facebook's Engineering blog.)



The slovenliness of our language makes it easier for us to have foolish thoughts. The point is that the process is reversible.
-- George Orwell
Politics and the English Language

That language is an instrument of human reason, and not merely a medium for the expression of thought, is a truth generally admitted.
-- George Boole
The Laws of Thought

We're going through an inversion of scale in computing which is making parallelism and concurrency much more important. Single computers are no longer fast enough to handle the amounts of data we want to process. Even within one computer the relative speeds of processors, memory, storage, and network have diverged so much that they often spend more time waiting for data than doing things with it. The processor (and by extension, any program we write) is no longer a Wizard of Oz kind of character, sole arbiter of truth, at the center of everything. It's only one of many tiny bugs crawling over mountains of data.

Many hands make light work

A few years ago Tim Bray decided to find out where things stood. He put a computer on the internet which contained over 200 million lines of text in one very large file. Then he challenged programmers to write a program to do some simple things with this file, such as finding the ten most common lines which matched certain patterns. To give you a feel for the simplicity of the task, Bray's example program employed one sequential thread of execution and had 78 lines of code, something you could hack up over lunch.
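
For concreteness, here is a minimal sequential sketch of that kind of task in Python. The filename and the regular expression are placeholders, not Bray's actual ones, but the shape of the job is the same: scan the file once, count the matches, report the top ten.

    import re
    from collections import Counter

    # Both the filename and the pattern are placeholders, not Bray's actual
    # ones; the shape of the job is the same.
    PATTERN = re.compile(r'GET /articles/(\S+) ')

    def top_ten(path):
        counts = Counter()
        with open(path, errors='replace') as f:
            for line in f:                    # one sequential pass over the file
                m = PATTERN.search(line)
                if m:
                    counts[m.group(1)] += 1
        return counts.most_common(10)

    if __name__ == '__main__':
        for key, n in top_ten('access_log'):
            print(n, key)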

The computer was unusual for the time: it had 32 independent hardware threads which could execute simultaneously. The twist of the WideFinder challenge was that your program had to use all of those threads at once to speed up the task, while adding as little code as possible. The purpose was to demonstrate how good or bad everyday programming is at splitting large jobs into parallel tracks.

How hard could it be? I thought. Very hard, as it happened. I got up to 4 parallel processes before my program collapsed under its own weight. The crux of the problem was that the file was stored on a hard drive. If you've never peeked inside a hard drive, it's like a record player with a metal disc and a magnetic head instead of a needle. Just like a record it works best when you "play" it in sequence, and not so well if you keep moving the needle around. And of course it can only play one thing at a time. So I couldn't just split the file into 32 chunks and have each thread read a chunk simultaneously. One thread had to read from the file and then dole out parts of it to the others. It was like trying to get 31 housepainters to share the same bucket.
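
Here is a rough Python sketch of the shape I was groping toward: one process reads the file in order and doles out batches of lines to a pool of workers, which count matches and send back subtotals to be merged. The pattern, batch size, and worker count are placeholders.

    import re
    from collections import Counter
    from itertools import islice
    from multiprocessing import Pool

    PATTERN = re.compile(r'GET /articles/(\S+) ')     # placeholder, as before

    def count_batch(lines):
        """Worker: tally pattern matches in one batch of lines."""
        counts = Counter()
        for line in lines:
            m = PATTERN.search(line)
            if m:
                counts[m.group(1)] += 1
        return counts

    def batches(f, size=100_000):
        """The single reader: pull sequential batches of lines off the disk."""
        while True:
            batch = list(islice(f, size))
            if not batch:
                return
            yield batch

    def top_ten(path, workers=32):
        total = Counter()
        with open(path, errors='replace') as f, Pool(workers) as pool:
            # One process reads; the pool counts; subtotals are merged here.
            for subtotal in pool.imap_unordered(count_batch, batches(f)):
                total.update(subtotal)
        return total.most_common(10)

    if __name__ == '__main__':
        for key, n in top_ten('access_log'):
            print(n, key)

Even in this shape the reader has to serialize every batch just to hand it to a worker, so the 31 housepainters are still lining up at the same bucket.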

Parallelism is the act of taking a large job, splitting it up into smaller ones, and doing them at once. People often use "parallel" and "concurrent" interchangeably, but there is a subtle difference. Concurrency is necessary for parallelism but not the other way around. If I alternate between cooking eggs and pancakes I'm doing both concurrently. If I'm cooking eggs while you are cooking pancakes, we are cooking concurrently and in parallel. Technically if I'm cooking eggs and you are mowing the lawn we are also working in parallel, but since no coordination is needed in that case there's nothing to talk about.

When I looked at other people's entries for hints I was struck by how almost all of them, good and bad, looked complicated and steampunky. Part of that was my unfamiliarity with the techniques, but another part was the lack of good support for parallelism, which forced people to roll their own abstractions. (Ask four programmers to create a new abstraction and you'll get five and a half answers.) The pithiest entry was 130 lines of OCaml, a language that makes it somewhat easier to hack together "parallel I/O". Most of the others were several hundred lines long. Many people, like me, were not able to complete the challenge at all. If it's this difficult to parallelize a trivial string-counting program, what makes us think we're doing it right in complex ones?

Ideally, concurrency shouldn't leak into the logic of programs we're trying to write. Some really smart people would figure out the right way to do it. They would write papers with lots of equations in them and fly around to conferences for a few years until some other smart people figured out what the hell they were saying. Those people would go develop libraries in our favorite programming languages. Then we could just put import concurrent; at the top of our programs and be on our way. Concurrency would be another thing we no longer worry about unless we want to, like memory management. Unfortunately there is evidence that it won't be this clean and simple. A lot of things we take for granted may have to change. The switch to automatic memory management wasn't all that easy either, come to think of it.

There are at least two concurrency problems to solve: how to get many components inside one computer to cooperate without stepping all over each other, and how to get many computers to cooperate without drowning in coordination overhead. These may be special cases of a more general problem and one solution will work for all. Or perhaps we'll have one kind of programming for the large and another for the small, just as the mechanics of life are different inside and outside of the cell.

At the far end of the spectrum are large distributed databases, such as those used by search engines, online retailers, and social networks. These things are enormous networks of computers that work together to handle thousands of writes and hundreds of thousands of reads every second. More machines in the system raise the odds that one of them will fail at any moment. There is also the chance that a link between groups of machines will fail, cutting the brain in half until it is repaired. There is a tricky balance between being able to read from such a system consistently and quickly and writing to it reliably. The situation is summed up by the CAP Theorem, which states that large systems have three desirable but conflicting properties: Consistency, Availability, and Partition-tolerance. You can only optimize for two at the expense of the third.

A Consistent/Available system means that reading and writing always works the way you expect, but requires a majority or quorum of nodes to be running in order to function. Think of a parliament that must have more than half of its members present in order to hold a vote. If too many can't make it, say because a flood washes out the bridge, a quorum can't be formed and business can't proceed. But when enough members are in communication the decision-making process is fast and unambiguous.

Consistent/Partitionable means that the system can recover from failures, but requires so much extra coordination that it collapses under heavy use. Imagine having to send and receive a status report for every decision made at your company. You'll always be current, and when you come back from vacation you will never miss a thing, but making actual progress would be very slow.

Available/Partitionable means that you can always read and write values, but the values you read might be out of date. A classic example is gossip: at any point you might not know the latest on what Judy said to Bill, but eventually word gets around. When you have new gossip to share you only have to tell one or two people and trust that in time it will reach everyone who cares. Spreading gossip among computers is a bit more reliable because they are endlessly patient and (usually) don't garble messages.

These categories are not rigidly exclusive. The status report problem is usually handled by having hierarchies of supervisors and employees, aka "reports". The gossip consistency problem can be helped by tagging data with timestamps or version numbers so you can reconcile conflicting values. The interesting thing about Amazon's Dynamo is not eventual consistency but configurable consistency.
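
As a toy illustration of the version-number idea, here is a sketch in which each replica keeps (version, value) pairs and reconciles by keeping the highest version it has heard about. Real systems such as Dynamo use vector clocks, quorums, and application-level merges; this only gives the flavor.

    # Each replica keeps (version, value) per key and gossips its knowledge
    # around. Reconciliation here is highest-version-wins -- a toy rule; real
    # systems use vector clocks or timestamps plus application-level merges.

    class Replica:
        def __init__(self):
            self.store = {}                # key -> (version, value)

        def write(self, key, value):
            version, _ = self.store.get(key, (0, None))
            self.store[key] = (version + 1, value)

        def read(self, key):
            return self.store.get(key, (0, None))[1]    # may be stale

        def merge(self, other):
            """Gossip: fold another replica's knowledge into this one."""
            for key, (version, value) in other.store.items():
                if version > self.store.get(key, (0, None))[0]:
                    self.store[key] = (version, value)

    a, b = Replica(), Replica()
    a.write('judy', 'told Bill the news')
    print(b.read('judy'))      # None -- b hasn't heard yet
    b.merge(a)                 # word gets around
    print(b.read('judy'))      # 'told Bill the news'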

After lots of groping around with billions of dollars of revenue at stake, people who build these large systems are coming to the conclusion that it's most important to always be able to write to a system quickly and to read from it even in the face of temporary failures. Stale data is a consequence of the looser coupling and greater autonomy needed to make that possible. It's uncomfortable to accept the idea that as the computing power of an Available/Partitionable system scales up, the fog of war descends on consistency, but in practice it's not the end of the world.

This was neither a whimsical nor an easy choice. Imagine Ebenezer Scrooge is making so much money that Bob Cratchit can't keep up. Scrooge needs more than one employee to receive and count it. To find out the grand total of his money at any point, he has to ask each of them for a subtotal. By the time Scrooge gets all the answers and adds them up, his employees have counted more money, and his total is already out of date. So he tells them to stop counting while he gathers subtotals. But this wastes valuable working time. And what if Scrooge adds another counting-house down the street? He'll have to pay a street boy, little Sammy Locke, to a) run to the other house and tell them to stop counting, b) gather their subtotals, c) deliver them to Scrooge, then d) run back to the other house to tell them to resume counting. What's worse, his customers can't pay him while this is happening. As his operation gets bigger Scrooge is faced with a growing tradeoff between stale information and halting everything to wait on Locke. If there's anything Scrooge likes less than old numbers, it's paying people to do nothing.
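
The same tradeoff shows up in miniature inside one program. In this small sketch each "clerk" thread counts into its own subtotal with no coordination; summing the subtotals while they are still working is cheap but already stale, and the only way to get an exact figure is to make everyone stop.

    import threading
    import time

    # Four "clerks", each counting into a private subtotal with no locks.
    subtotals = [0] * 4

    def clerk(i):
        for _ in range(100_000):
            subtotals[i] += 1          # local work, no coordination with anyone

    threads = [threading.Thread(target=clerk, args=(i,)) for i in range(4)]
    for t in threads:
        t.start()

    time.sleep(0.01)
    # Scrooge's dilemma: summing while the clerks are still counting is cheap,
    # but the answer is stale before it is even printed.
    print('total so far (stale):', sum(subtotals))

    for t in threads:
        t.join()
    print('exact total (everyone stopped):', sum(subtotals))   # 400000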

Scrooge's dilemma is forced upon him by basic physics. You can't avoid it by using electrons instead of street urchins. It's impossible for an event happening in one place (eg data changing inside one computer or process) to affect any other place (eg other computers or processes) until the information has had time to travel between them. Where those delays are small relative to performance requirements, Scrooge can get away with various forms of locking and enjoy the illusion of a shared, consistent memory space. But as programs spread out over more and more independent workers, the complexity needed to maintain that illusion begins to overwhelm everything else. This is not about speed-of-light effects or anything like that. I'm only talking about reference frames in the sense of "old news", such as when you find out your cousin had gotten married last year. Her wedding and your unawareness are both "true" relative to your reference frames until you receive news to the contrary.

Scratch that, reverse it

Shared memory can be pushed fairly far, however. Instead of explicit locks, Clojure and many newer languages use an interesting technique called software transactional memory. STM simulates a sort of post-hoc, fine-grained, implicit locking. Under this scheme semi-independent workers, called threads, read and write to a shared memory space as though they were alone. The system keeps a log of what they have read and written. When a thread is finished the system verifies that no data it read was changed by any other. If nothing it read has changed, the changes are committed. If there is a conflict the transaction is aborted, the changes are rolled back, and the thread's job is retried. As long as threads operate on non-overlapping parts of memory, or even non-overlapping parts of the same data structures, they can do whatever they want without the overhead of locking. In essence, transactional memory allows threads to ask for forgiveness instead of permission.
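
Here is a minimal sketch of that read-validate-commit-or-retry cycle, using a version number per cell and a single global commit point. It is not Clojure's actual implementation, just the idea boiled down.

    import threading

    class Ref:
        """A shared cell plus a version number that bumps on every commit."""
        def __init__(self, value):
            self.value = value
            self.version = 0

    _commit_lock = threading.Lock()        # one global commit point, for simplicity

    def atomically(refs, update):
        """Optimistically read, compute, then commit only if nothing read has
        changed in the meantime; otherwise roll back (write nothing) and retry."""
        while True:
            snapshot = [(r, r.version, r.value) for r in refs]   # read phase, no locks
            new_values = update([value for _, _, value in snapshot])
            with _commit_lock:                                   # validate and commit
                if all(r.version == version for r, version, _ in snapshot):
                    for (r, _, _), new in zip(snapshot, new_values):
                        r.value = new
                        r.version += 1
                    return new_values
            # someone else committed first: retry the whole transaction

    # Example: move money between accounts with no locks in the calling code.
    a, b = Ref(100), Ref(0)

    def transfer(amount):
        atomically([a, b], lambda values: [values[0] - amount, values[1] + amount])

    threads = [threading.Thread(target=transfer, args=(10,)) for _ in range(10)]
    for t in threads: t.start()
    for t in threads: t.join()
    print(a.value, b.value)                # 0 100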

As you might have guessed from those jolly hints about conflict and rollback, STM has its own special problems, like how to perform those abort/retry cycles efficiently on thousands of threads. It's fun to imagine pathological conflict scenarios in which long chains of transactions unravel like a cheap sweater. Also, STM can only handle actions that are undoable. You can't retry most kinds of I/O for the same reason you can't rewind a live concert. This is handled by queueing up any non-reversible actions, performing them outside of the transaction, caching the result in a buffer, and replaying as necessary. Read that sentence again. There is a recent paper about an interesting variation on this theme called HyTM, which appears to do a copy-on-write instead of performing writes to shared memory.
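
A sketch of the queueing trick for the output side: effects raised inside a transaction are buffered rather than performed, flushed only after a successful commit, and simply discarded when an attempt has to be retried. (Caching and replaying the results of reads is the harder half, and is not shown here.)

    import random

    # A sketch of the buffer-and-replay idea, not any particular STM library:
    # irreversible actions raised inside a transaction are queued rather than
    # performed, and the queue is flushed only after the attempt commits.

    def run_transaction(attempt, max_retries=10):
        for _ in range(max_retries):
            deferred = []                    # side effects queued during this attempt
            committed = attempt(deferred.append)
            if committed:
                for action in deferred:      # perform I/O only after the commit
                    action()
                return True
            # conflict: throw the queued effects away along with the attempt
        return False

    def attempt(defer):
        defer(lambda: print("receipt printed"))   # would be irreversible if run eagerly
        return random.random() > 0.5              # stand-in for real conflict detection

    run_transaction(attempt)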

Hold this thread as I walk away

Undeniably awesome and clever as STM threads are, I'm not convinced that shared memory makes sense outside of the "cell membrane" of a single computer. Throughput and latency always have the last laugh. A concurrent system is fundamentally limited by how often processes have to coordinate and the time it takes them to do so. As of this writing computer memory can be accessed in about 100 nanoseconds. Local network latency is measured in microseconds to milliseconds. Schemes that work well at local memory speeds don't fly over a channel one thousand times slower. Throughput is a problem too: memory can have one hundred times the throughput of the network, and is shared among at most a few dozen threads. A large distributed database can have tens of thousands of independent threads contending for the same bandwidth.

If we can't carry the shared-memory model outside of the computer, is there some other model we can bring inside? Are threads, ie semi-independent workers that play inside a shared memory space, absolutely necessary? In his "standard lecture" on threads, Xavier Leroy details three reasons people use them: to exploit parallelism on multiprocessor hardware, to overlap computation with slow I/O, and to support coroutine-style control structures.

Message-passing, which first appeared in Smalltalk, is the core abstraction of Joe Armstrong's programming language Erlang. Erlang programs do things that make programmers take notice, like run some of the busiest telephone switches for years without fail. It approaches concurrency with three iron rules: no shared memory even between processes on the same computer, a standard format for messages passed between processes, and a guarantee that messages are read in the order in which they were received. The first rule is meant to avoid the heartaches described above and embraces local knowledge over global state. The second and third keep programmers from endlessly reinventing schemes for passing messages between processes. Every Erlang process has sovereign control over its own memory space and can only affect others by sending well-formed messages. It's an elegant model and happens to be a cleaned-up version of the way the internet itself is constructed. Message-passing is already one of the axioms of concurrent distributed computation, and may well be universal.
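
Here is a rough imitation of the Erlang shape in Python: a worker process that shares no memory with its parent, holds its own state, and reacts only to well-formed (tag, payload) messages read in the order they arrive. Erlang gives you much more than this (selective receive, links, supervisors), but the skeleton is the same.

    from multiprocessing import Process, Queue

    def counter(inbox, outbox):
        """A process with private state; the outside world can only send it
        messages, never reach into its memory."""
        total = 0
        while True:
            tag, payload = inbox.get()         # messages arrive in order
            if tag == 'add':
                total += payload
            elif tag == 'report':
                outbox.put(('total', total))
            elif tag == 'stop':
                break

    if __name__ == '__main__':
        inbox, outbox = Queue(), Queue()
        worker = Process(target=counter, args=(inbox, outbox))
        worker.start()
        for n in range(10):
            inbox.put(('add', n))              # a standard message shape: (tag, payload)
        inbox.put(('report', None))
        print(outbox.get())                    # ('total', 45)
        inbox.put(('stop', None))
        worker.join()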

A lot of writeups repeat a "nine nines" (ie 99.9999999%) reliability claim for Erlang-based Ericsson telephone switches owned by British Telecom. This works out to 31 milliseconds of downtime per year, which hovers near the edge of measurability, not to say plausibility. I was present at a talk Armstrong gave in early 2010 during which he was asked about this. There was a little foot shuffling as he qualified it: it was actually 6 or so seconds of downtime in one device during a code update. Since BT had X devices over Y years, they calculated it as 31ms of average downtime per device per year. Or something like that. Either way it's an impressive feat.

There are probably more axioms to discover. Languages become more powerful as abstractions are made explicit and standardized. Message-passing says nothing about optimizing for locality, ie making sure that processes talk with other processes that are located nearby instead of at random. It might be cool to have a standard way to measure the locality of a function call. Languages become even more powerful when abstractions are made first-class entities. For example, languages that can pass functions as arguments to other functions can generate new types of higher-order functions without the programmer having to code them by hand. A big part of distributed computing is designing good protocols. I know of no language that allows protocols as first-class entities that can be passed around and manipulated like functions and objects are. I'm not even sure what that would look like but it might be interesting to try out.

There is a lot of sound and fury around parallelism and concurrency. I don't know what the answer will be. I personally suspect that a relaxed, shared-memory model will work well enough within the confines of one computer, in the way that Newtonian physics works well enough at certain scales. A more austere model will be needed for a small network of computers, and so on as you grow. Or perhaps there's something out there that will make all this lockwork moot.