
Hyperlogloglog



Back in ancient times, when individual datums were hard to come by and therefore perceived as valuable, people trusted almost all of the data that came their way. Later, data became so plentiful that people could afford to disbelieve some of it, make judgements about its quality, speculate about the possible motives of the people supplying it, get sloppy about grammatical rules regarding plurals, and so on. Much later the flood of data became so immense that all you could do was watch it rush by in a state of helpless bemusement while desperately climbing to higher ground.

This is an Adamsesque short story for Venkat's 42nd birthday.

The core problem of information is that more of it almost never makes you happier. The natural reaction to things that make you unhappy or uncomfortable is to avoid them. Most of science & technology is about helping people do just that. Thus it was that the first really successful Big Data computer, developed at the University of Quaternion Prime, was designed around the newly discovered Second Law of Large Numbers.

The First Law of Large Numbers states that, given enough data, averages over that data should work out about the way you'd expect them to.

The Second Law is a modern refinement of that idea, with exotic redefinitions of the very concepts of "average", "number", "large", "expect", "idea", and even "the". But in essence the Second Law says that if you're going to look at just the top line and the bottom line, it doesn't matter much how you actually add up all the numbers in between. Therefore you can, say, simply count the digits and squint, and arrive at an acceptable answer. This saved enormous amounts of computational energy and angst.

The Quaternion Mark I took up the entire University campus grounds and was wildly successful, allowing the regents to build a bigger campus next door. The main reason for its popularity was that the tradeoff between performance and accuracy was controlled by the people who were both asking for these numbers and paying for the hardware to compute them in the first place. Not that it was cheap. In fact it was cripplingly expensive, which meant that no one else could afford to second-guess them.

Over generations the volume of data grew until even counting the commas, let alone the digits, became a significant engineering challenge. Thousands of grad-student-years of effort were poured into solving the problem. The breakthrough, which redefined everything yet again, was achieved by a failing student named Hitherto Millwall who had been tasked with designing a faster commastore.

Taking an idea to its logical extreme can zip you right past common sense, and running with it requires a peculiar mix of courage and desperation.

His chain of thought went like this: the fundamental strategy for dealing with large amounts of data was compression. Huge streams of numbers were converted by various clever tricks into streams tiny enough to be handled by humans, who then decided what to do. If you really think about it, he argued, the entire purpose of data-driven decision-making is to compress ungodly infinitudes of numbers down to a single bit of decision: a yes or a no.

Millwall built a prototype in his dorm room and created what everyone had really wanted all along. The Hyperlogloglog was the size of a small housepet and was modeled on the human brain. It was capable of handling unlimited amounts of input data via the simple technique of immediately throwing it away. This consumed only a small portion of the computer's capacity.

The rest of the machine's intellect was dedicated to guessing what it was that the questioner really wanted to do anyway, and recommending that as the best course of action. Sophisticated-looking charts were printed out on luxuriously heavy paper to help convince others. The questioner could, for example, curl the charts into a tight roll, fold it in half, and beat doubters over the head until consensus was reached. This became known as the Millwall Cluestick and was very popular for sneaking improvised rhetorical weapons past security at the rowdier math conferences.

After billions in Hyperlogloglog sales, Hitherto Millwall and his former alma mater went through many rounds of suits, counter-suits, arbitration, sabotage, and attempted assassination over control of the patents until the case finally went before the highest court on the planet.

The presiding judge was a computer the size of a small housepet.