
A Dismal Guide to Concurrency

12 April 2010

Two people can paint a house faster than one can. Honeybees work independently but pass messages to each other about conditions in the field. Many forms of concurrency, so obvious and natural in the real world, are actually pretty alien to the way we write programs today. It's much easier to write a program assuming that there is one processor, one memory space, sequential execution and a God's-eye view of the internal state. Language is a tool of thought as much as a means of expression, and the mindset embedded in the languages we use can get in the way. (Originally on Facebook's Engineering blog.)



The slovenliness of our language makes it easier for us to have foolish thoughts. The point is that the process is reversible.
-- George Orwell
Politics and the English Language

That language is an instrument of human reason, and not merely a medium for the expression of thought, is a truth generally admitted.
-- George Boole
The Laws of Thought

We're going through an inversion of scale in computing which is making parallelism and concurrency much more important. Single computers are no longer fast enough to handle the amounts of data we want to process. Even within one computer the relative speeds of processors, memory, storage, and network have diverged so much that they often spend more time waiting for data than doing things with it. The processor (and by extension, any program we write) is no longer a Wizard of Oz kind of character, sole arbiter of truth, at the center of everything. It's only one of many tiny bugs crawling over mountains of data.

Many hands make light work

A few years ago Tim Bray decided to find out where things stood. He put a computer on the internet which contained over 200 million lines of text in one very large file. Then he challenged programmers to write a program to do some simple things with this file, such as finding the ten most common lines that matched certain patterns. To give you a feel for the simplicity of the task, Bray's example program employed one sequential thread of execution and had 78 lines of code, something you could hack up over lunch.

The computer was unusual for the time: it had 32 independent hardware threads which could execute simultaneously. The twist of the WideFinder challenge was that your program had to use all of those threads at once to speed up the task, while adding as little code as possible. The purpose was to demonstrate how good or bad everyday programming is at splitting large jobs into parallel tracks.

How hard could it be? I thought. Very hard, as it happened. I got up to 4 parallel processes before my program collapsed under its own weight. The crux of the problem was that the file was stored on a hard drive. If you've never peeked inside a hard drive, it's like a record player with a metal disc and a magnetic head instead of a needle. Just like a record it works best when you "play" it in sequence, and not so well if you keep moving the needle around. And of course it can only play one thing at a time. So I couldn't just split the file into 32 chunks and have each thread read a chunk simultaneously. One thread had to read from the file and then dole out parts of it to the others. It was like trying to get 31 housepainters to share the same bucket.
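
Here is roughly that shape sketched in Python (not one of the actual entries; the file name and the pattern are placeholders, and the chunk size is an arbitrary assumption): a single reader streams the file in order and doles out chunks to a pool of worker processes, which tally matches and send back partial counts to be merged.

```python
import re
from collections import Counter
from multiprocessing import Pool

PATTERN = re.compile(r'GET /ongoing/When/\S+')   # stand-in for the real pattern

def tally(lines):
    """Runs in a worker process: count matching lines in one chunk."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[match.group()] += 1
    return counts

def chunks(path, size=100_000):
    """The lone reader: play the 'record' in sequence, yielding chunks."""
    with open(path, errors='replace') as f:
        chunk = []
        for line in f:
            chunk.append(line)
            if len(chunk) >= size:
                yield chunk
                chunk = []
        if chunk:
            yield chunk

if __name__ == '__main__':
    total = Counter()
    with Pool() as workers:                       # the 31 housepainters
        for partial in workers.imap_unordered(tally, chunks('logfile.txt')):
            total.update(partial)
    print(total.most_common(10))
```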

Parallelism is the act of taking a large job, splitting it up into smaller ones, and doing them at once. People often use "parallel" and "concurrent" interchangeably, but there is a subtle difference. Concurrency is necessary for parallelism but not the other way around. If I alternate between cooking eggs and pancakes I'm doing both concurrently. If I'm cooking eggs while you are cooking pancakes, we are cooking concurrently and in parallel. Technically if I'm cooking eggs and you are mowing the lawn we are also working in parallel, but since no coordination is needed in that case there's nothing to talk about.

When I looked at other people's entries for hints I was struck by how almost all of them, good and bad, looked complicated and steampunky. Part of that was my unfamiliarity with the techniques, but another part was the lack of good support for parallelism which forced people to roll their own abstractions. (Ask four programmers to create a new abstraction and you'll get five and a half answers.) The pithiest entry was 130 lines of OCaml, a language that makes it somewhat easier to hack together "parallel I/O". Most of the others were several hundred lines long. Many people like me were not able to complete the challenge at all. If it's this difficult to parallelize a trivial string-counting program, what makes us think we're doing it right in complex ones?

Ideally, concurrency shouldn't leak into the logic of programs we're trying to write. Some really smart people would figure out the right way to do it. They would write papers with lots of equations in them and fly around to conferences for a few years until some other smart people figured out what the hell they were saying. Those people would go develop libraries in our favorite programming languages. Then we could just put import concurrent; at the top of our programs and be on our way. Concurrency would be another thing we no longer worry about unless we want to, like memory management. Unfortunately there is evidence that it won't be this clean and simple. A lot of things we take for granted may have to change. The switch to automatic memory management wasn't all that easy either, come to think of it.

There are at least two concurrency problems to solve: how to get many components inside one computer to cooperate without stepping all over each other, and how to get many computers to cooperate without drowning in coordination overhead. These may be special cases of a more general problem and one solution will work for all. Or perhaps we'll have one kind of programming for the large and another for the small, just as the mechanics of life are different inside and outside of the cell.

At the far end of the spectrum are large distributed databases, such as those used by search engines, online retailers, and social networks. These things are enormous networks of computers that work together to handle thousands of writes and hundreds of thousands of reads every second. Adding machines to the system raises the odds that one of them will fail at any given moment. There is also the chance that a link between groups of machines will fail, cutting the brain in half until it is repaired. There is a tricky balance between being able to read from such a system consistently and quickly and writing to it reliably. The situation is summed up by the CAP Theorem, which states that large systems have three desirable but conflicting properties: Consistency, Availability, and Partition-tolerance. You can only optimize for two at the expense of the third.

A Consistent/Available system means that reading and writing always works the way you expect, but requires a majority or quorum of nodes to be running in order to function. Think of a parliament that must have more than half of its members present in order to hold a vote. If too many can't make it, say because a flood washes out the bridge, a quorum can't be formed and business can't proceed. But when enough members are in communication the decision-making process is fast and unambiguous.

Consistent/Partitionable means that the system can recover from failures, but requires so much extra coordination that it collapses under heavy use. Imagine having to send and receive a status report for every decision made at your company. You'll always be current, and when you come back from vacation you will never miss a thing, but making actual progress would be very slow.

Available/Partitionable means that you can always read and write values, but the values you read might be out of date. A classic example is gossip: at any point you might not know the latest on what Judy said to Bill, but eventually word gets around. When you have new gossip to share you only have to tell one or two people and trust that in time it will reach everyone who cares. Spreading gossip among computers is a bit more reliable because they are endlessly patient and (usually) don't garble messages.

These categories are not rigidly exclusive. The status-report problem is usually handled by having hierarchies of supervisors and employees, aka "reports". The gossip consistency problem can be helped by tagging data with timestamps or version numbers so you can reconcile conflicting values. The interesting thing about Amazon's Dynamo is not eventual consistency but configurable consistency.
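
To make the timestamp trick concrete, here is a toy sketch in Python (the node class and its methods are inventions for illustration, not any particular system's API): each node stores a (value, timestamp) pair for every key, and when rumors meet, the newer one wins.

```python
import time

class GossipNode:
    """A node that hears things late but eventually hears everything."""
    def __init__(self):
        self.data = {}  # key -> (value, timestamp)

    def write(self, key, value):
        self.data[key] = (value, time.time())

    def read(self, key):
        value, _ = self.data.get(key, (None, 0.0))
        return value    # possibly stale, possibly missing -- that's the deal

    def gossip_to(self, other):
        """Push everything we know; the receiver keeps whichever is newer."""
        for key, (value, ts) in self.data.items():
            _, other_ts = other.data.get(key, (None, 0.0))
            if ts > other_ts:
                other.data[key] = (value, ts)

a, b, c = GossipNode(), GossipNode(), GossipNode()
a.write('judy', 'said something to Bill')
a.gossip_to(b)          # word gets around...
b.gossip_to(c)          # ...eventually
print(c.read('judy'))   # 'said something to Bill'
```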

After lots of groping around with billions of dollars of revenue at stake, people who build these large systems are coming to the conclusion that it's most important to always be able to write to a system quickly and read from it even in the face of temporary failures. Stale data is a consequence of looser coupling and greater autonomy needed to make that possible. It's uncomfortable to accept the idea that as the computing power of an Available/Partitionable system scales up, the fog of war descends on consistency, but in practice it's not the end of the world.

This was neither a whimsical nor an easy choice. Imagine Ebenezer Scrooge is making so much money that Bob Cratchit can't keep up. Scrooge needs more than one employee to receive and count it. To find out the grand total of his money at any point, he has to ask each of them for a subtotal. By the time Scrooge gets all the answers and adds them up, his employees have counted more money, and his total is already out of date. So he tells them to stop counting while he gathers subtotals. But this wastes valuable working time. And what if Scrooge adds another counting-house down the street? He'll have to pay a street boy, little Sammy Locke, to a) run to the other house and tell them to stop counting, b) gather their subtotals, c) deliver them to Scrooge, then d) run back to the other house to tell them to resume counting. What's worse, his customers can't pay him while this is happening. As his operation gets bigger Scrooge is faced with a growing tradeoff between stale information and halting everything to wait on Locke. If there's anything Scrooge likes less than old numbers, it's paying people to do nothing.
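
Here is Scrooge's tradeoff as a small Python sketch (the clerks, coin amounts, and timings are all invented): each clerk keeps a local subtotal, and the grand total is either cheap but slightly stale, or exact but only obtainable by telling everyone to stop counting.

```python
import random
import threading
import time

class Clerk:
    """Counts money into a local subtotal, guarded by its own lock."""
    def __init__(self):
        self.subtotal = 0
        self.lock = threading.Lock()

    def count_coins(self, stop):
        while not stop.is_set():
            with self.lock:
                self.subtotal += random.randint(1, 10)
            time.sleep(0.001)

clerks = [Clerk() for _ in range(4)]
stop = threading.Event()
threads = [threading.Thread(target=c.count_coins, args=(stop,)) for c in clerks]
for t in threads:
    t.start()

time.sleep(0.1)   # business is booming

# Fast but stale: add up the subtotals while the clerks keep counting.
stale_total = sum(c.subtotal for c in clerks)

# Exact but disruptive: take every lock, i.e. tell everyone to stop counting.
for c in clerks:
    c.lock.acquire()
exact_total = sum(c.subtotal for c in clerks)
for c in clerks:
    c.lock.release()

print(stale_total, exact_total)   # the first figure is already out of date

stop.set()
for t in threads:
    t.join()
```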

Scrooge's dilemma is forced upon him by basic physics. You can't avoid it by using electrons instead of street urchins. It's impossible for an event happening in one place (eg data changing inside one computer or process) to affect any other place (eg other computers or processes) until the information has had time to travel between them. Where those delays are small relative to performance requirements, Scrooge can get away with various forms of locking and enjoy the illusion of a shared, consistent memory space. But as programs spread out over more and more independent workers, the complexity needed to maintain that illusion begins to overwhelm everything else. This is not about speed-of-light effects or anything like that. I'm only talking about reference frames in the sense of "old news", such as finding out that your cousin got married last year. Her wedding and your unawareness are both "true" relative to your respective reference frames until you receive news to the contrary.

Scratch that, reverse it

Shared memory can be pushed fairly far, however. Instead of explicit locks, Clojure and many newer languages use an interesting technique called software transactional memory. STM simulates a sort of post-hoc, fine-grained, implicit locking. Under this scheme semi-independent workers, called threads, read and write to a shared memory space as though they were alone. The system keeps a log of what they have read and written. When a thread is finished the system verifies that no data it read was changed by any other. If nothing it read was changed, its writes are committed. If there is a conflict the transaction is aborted, changes are rolled back and the thread's job is retried. As long as threads operate on non-overlapping parts of memory, or even non-overlapping parts of the same data structures, they can do whatever they want without the overhead of locking. In essence, transactional memory allows threads to ask for forgiveness instead of permission.
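
To make the bookkeeping concrete, here is a toy transactional memory in Python. It is a sketch of the idea only, not how Clojure or any production STM is actually implemented: every transaction logs what it reads and writes, and at commit time the system checks that nothing it read has changed underneath it. If the check fails, the work is thrown away and retried.

```python
import threading

_commit_lock = threading.Lock()

class Ref:
    """A shared memory cell with a version number."""
    def __init__(self, value):
        self.value = value
        self.version = 0

def atomically(work):
    """Run work(read, write) until it commits without conflict."""
    while True:
        reads = {}    # ref -> version observed when first read
        writes = {}   # ref -> tentative new value

        def read(ref):
            if ref in writes:
                return writes[ref]
            reads.setdefault(ref, ref.version)
            return ref.value

        def write(ref, value):
            reads.setdefault(ref, ref.version)
            writes[ref] = value

        result = work(read, write)

        with _commit_lock:
            if all(ref.version == seen for ref, seen in reads.items()):
                for ref, value in writes.items():
                    ref.value = value
                    ref.version += 1
                return result
        # Something we read has changed: forgiveness denied, try again.

# Example: a hundred threads each move one coin between two accounts.
alice, bob = Ref(100), Ref(0)

def transfer(read, write):
    write(alice, read(alice) - 1)
    write(bob, read(bob) + 1)

threads = [threading.Thread(target=atomically, args=(transfer,)) for _ in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(alice.value, bob.value)   # 0 100
```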

As you might have guessed from those jolly hints about conflict and rollback, STM has its own special problems, like how to perform those abort/retry cycles efficiently on thousands of threads. It's fun to imagine pathological conflict scenarios in which long chains of transactions unravel like a cheap sweater. Also, STM can only handle actions that are undoable. You can't retry most kinds of I/O for the same reason you can't rewind a live concert. This is handled by queueing up any non-reversible actions, performing them outside of the transaction, caching the result in a buffer, and replaying as necessary. Read that sentence again. There is a recent paper about an interesting variation on this theme called HyTM, which appears to do a copy-on-write instead of performing writes to shared memory.
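
Continuing the toy above (and reusing its Ref and atomically), here is roughly what that queueing looks like: the transaction merely records the irreversible actions it wants to perform, and the caller runs them only after the commit has actually succeeded.

```python
account = Ref(50)

def withdraw_and_notify(read, write):
    actions = []
    balance = read(account)
    write(account, balance - 10)
    # Don't print (or send the email, or play the concert) here: this function
    # may be rolled back and re-run. Queue a description of the action instead.
    actions.append(lambda: print('withdrew 10; balance was', balance))
    return actions

for action in atomically(withdraw_and_notify):
    action()   # safe now: the transaction committed exactly once
```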

Hold this thread as I walk away

Undeniably awesome and clever as STM threads are, I'm not convinced that shared memory makes sense outside of the "cell membrane" of a single computer. Throughput and latency always have the last laugh. A concurrent system is fundamentally limited by how often processes have to coordinate and the time it takes them to do so. As of this writing computer memory can be accessed in about 100 nanoseconds. Local network latency is measured in microseconds to milliseconds. Schemes that work well at local memory speeds don't fly over a channel one thousand times slower. Throughput is a problem too: memory can have one hundred times the throughput of a network, and is shared among at most a few dozen threads. A large distributed database can have tens of thousands of independent threads contending for the same bandwidth.
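
Some back-of-envelope arithmetic shows why. Assume a made-up ten-microsecond unit of useful work and one coordination round trip per unit, using rough figures like the ones above for each channel:

```python
work = 10e-6   # seconds of useful work per item (an assumption)

for channel, latency in [('shared memory', 100e-9),
                         ('fast local network', 50e-6),
                         ('slow network', 1e-3)]:
    items_per_second = 1 / (work + latency)
    overhead = latency / (work + latency)
    print(f'{channel:20s} {items_per_second:8.0f} items/s, '
          f'{overhead:.0%} of the time spent coordinating')
```

The work never changes; the waiting does, until almost nothing else is happening.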

If we can't carry the shared-memory model outside of the computer, is there some other model we can bring inside? Are threads, ie semi-independent workers that play inside a shared memory space, absolutely necessary? In his "standard lecture" on threads Xavier Leroy details three reasons people use them: to exploit the parallelism of multiprocessor hardware, to overlap computation with slow I/O, and to structure programs as cooperating routines.

Message-passing, which first appeared in Smalltalk, is the core abstraction of Joe Armstrong's programming language Erlang. Erlang programs do things that make programmers take notice, like run some of the busiest telephone switches for years without fail. It approaches concurrency with three iron rules: no shared memory even between processes on the same computer, a standard format for messages passed between processes, and a guarantee that messages are read in the order in which they were received. The first rule is meant to avoid the heartaches described above and embraces local knowledge over global state. The second and third keep programmers from endlessly reinventing schemes for passing messages between processes. Every Erlang process has sovereign control over its own memory space and can only affect others by sending well-formed messages. It's an elegant model and happens to be a cleaned-up version of the way the internet itself is constructed. Message-passing is already one of the axioms of concurrent distributed computation, and may well be universal.
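
Here is a rough imitation of those rules in Python rather than Erlang: each worker is a separate operating-system process with its own memory, its only connection to the world is a mailbox, and messages come out of the mailbox in the order they went in. The (sender, subject, payload) tuple format is just an assumption standing in for Erlang's structured terms.

```python
from multiprocessing import Process, Queue

def counter(mailbox, replies):
    """A tiny 'process': owns its own state, reacts only to messages."""
    total = 0
    while True:
        sender, subject, payload = mailbox.get()    # read in arrival order
        if subject == 'add':
            total += payload
        elif subject == 'report':
            replies.put((sender, 'total', total))
        elif subject == 'stop':
            break

if __name__ == '__main__':
    mailbox, replies = Queue(), Queue()
    worker = Process(target=counter, args=(mailbox, replies))
    worker.start()

    for n in range(10):
        mailbox.put(('main', 'add', n))
    mailbox.put(('main', 'report', None))
    print(replies.get())                # ('main', 'total', 45)

    mailbox.put(('main', 'stop', None))
    worker.join()
```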

A lot of writeups repeat a "nine nines", ie 99.9999999%, reliability claim for Erlang-based Ericsson telephone switches owned by British Telecom. This works out to about 31 milliseconds of downtime per year, which hovers near the edge of measurability, not to say plausibility. I was present at a talk Armstrong gave in early 2010 during which he was asked about this. There was a little foot shuffling as he qualified it: it was actually 6 or so seconds of downtime in one device during a code update. Since BT had X devices over Y years, they calculated it as 31ms of average downtime per device per year. Or something like that. Either way it's an impressive feat.
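
The arithmetic behind that skepticism is easy to check:

```python
seconds_per_year = 365.25 * 24 * 3600
downtime = (1 - 0.999999999) * seconds_per_year    # nine nines of availability
print(round(downtime * 1000, 1), 'ms of downtime per year')   # ~31.6
```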

There are probably more axioms to discover. Languages become more powerful as abstractions are made explicit and standardized. Message-passing says nothing about optimizing for locality, ie making sure that processes talk with other processes that are located nearby instead of at random. It might be cool to have a standard way to measure the locality of a function call. Languages become even more powerful when abstractions are made first-class entities. For example, languages that can pass functions as arguments to other functions can generate new types of higher-order functions without the programmer having to code them by hand. A big part of distributed computing is designing good protocols. I know of no language that allows protocols as first-class entities that can be passed around and manipulated like functions and objects are. I'm not even sure what that would look like but it might be interesting to try out.

There is a lot of sound and fury around parallelism and concurrency. I don't know what the answer will be. I personally suspect that a relaxed, shared-memory model will work well enough within the confines of one computer, in the way that Newtonian physics works well enough at certain scales. A more austere model will be needed for a small network of computers, and so on as you grow. Or perhaps there's something out there that will make all this lockwork moot.