Let's say you have a large computer system and you want to measure it for performance or efficiency. You want metrics. You want pretty pictures that tell you what's going on. You start looking at graphing libraries and databases. I think this is exactly backward. You have to start with measurement. It's shockingly easy to fool yourself and you really have to get it right. Bad data looks very much like useful data, and it's worse than having no data at all. If some number is 500% off, maybe you'll notice it right away. But what about 5%? Or 0.5%? Or what if you aren't measuring what you think you are measuring?
What you want, ultimately, are metrics: numbers that answer questions like What is the average memory consumed by a hit to the web server? But you don't collect metrics, you collect measurements. If a particular hit to your webserver consumes 3MB of memory, and you record that fact, that's a measurement. The average memory consumed for all measurements over the last hour is a metric. The 50th percentile (aka median) is also a metric that can be derived from the same set of measurements. It's important to keep this distinction in mind.
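The distinction is easy to see in code. This sketch (hypothetical memory numbers, in MB) treats each recorded hit as a measurement and derives two different metrics from the same stored set:

```python
# Hypothetical measurements: memory (MB) consumed by individual web hits.
measurements = [3.0, 2.5, 4.1, 3.2, 2.8, 9.7, 3.1]

# Metrics are derived from stored measurements, not collected directly.
mean_mb = sum(measurements) / len(measurements)            # the average
median_mb = sorted(measurements)[len(measurements) // 2]   # 50th percentile
                                                           # (odd-length list)
```

Keeping the raw measurements around means you can compute new metrics later, say a 99th percentile, without re-instrumenting anything.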
You want to decouple five very distinct activities: a) measuring events, b) collecting the data somewhere, c) storing it for however long you need to, d) creating metrics, and e) analyzing them. I'd like to suggest that collecting, storing, and drawing graphs from a set of numbers are important, but also fairly well understood.
What isn't solved is how to get useful measurements in the first place. That's where understanding your particular system comes into play. Making the same mistakes over and again has forced me to become fairly conservative about what data I'll trust, and to adopt a small set of rules to help me cope.
This sounds kind of silly when you say it outright, but very often people will use proxies or simulations instead of direct measurement because they are easier to make. A good example is network latency: people use third-party services that deploy servers all over the world to "ping" the site and report latency metrics. This is fine, and useful to know, but it's not the same as what your actual users experience.
There is almost always a tradeoff between the accuracy & completeness of a measurement and how expensive or invasive it is to make. This can lead to people only measuring at infrequent times, and often manually. But if something is important enough to worry about, then it's probably important enough to monitor all the time. If you can't figure out how to do that, then design a measurement you can run all of the time, even if it is inferior, and use it as an early-warning system.
The mistake I make most often is to create a new measurement, say counting the number of bytes going across a wire, and not test it thoroughly. You write the code, make a few test runs, and see numbers pouring out the other end. But are they complete and correct? Did you catch all of the cases where data goes out? Are you counting bytes or characters? Payloads only, or headers too? What layer of the stack is your instrumentation in? How likely is it for new code to introduce cases where data goes out and you miss it? Can you cross-check it against a completely different way to measure, say a TCP dump?
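To make one of the questions above concrete (bytes or characters?), a two-line check shows how easily the two diverge once non-ASCII text is involved:

```python
payload = "naïve"                          # 5 characters as Python sees it
assert len(payload) == 5
assert len(payload.encode("utf-8")) == 6   # but 6 bytes on the wire
```

An instrument built on string length will drift away from an instrument built on encoded length, and neither test run with ASCII data will catch it.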
Sometimes good measurements go bad. It's a good idea to re-test them periodically. Let's say you've decided to measure “CPU time” to produce metrics about how fast your search engine is. Later on you install more servers to expand your capacity. These servers are newer, and so likely faster than the old ones. (And don't get me started about dynamic CPU speeds.) All of a sudden your average CPU time per search goes down. Coincidentally, you also deployed a new version of your search index that day. Which change explains the drop, and by how much?
There are a couple of reasonable fixes for this particular problem. You can scale CPU time by the processor speed to get "normalized CPU time". Or perhaps you can measure CPU instructions instead. Either way the idea is to remove extra factors that affect the measurement.
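A minimal sketch of the first fix, assuming you can read each server's clock speed (the 2.0 GHz baseline is an arbitrary choice of mine):

```python
def normalized_cpu_time(cpu_seconds, clock_ghz, baseline_ghz=2.0):
    """Scale raw CPU time to a baseline clock, so the same work costs
    roughly the same normalized seconds on a fast server as on a slow one."""
    return cpu_seconds * (clock_ghz / baseline_ghz)

# The same search burning 0.10s on a 2 GHz core and about 0.067s on a
# 3 GHz core normalizes to roughly the same cost.
```

The point is not the formula but the habit: strip the hardware out of the number before you trust a week-over-week comparison.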
Another mistake is to skew measurements or metrics toward a particular population. It's almost never possible to measure every single event on a large system, so you have to choose some smaller sample of them, say 1:100 or 1:1,000,000. When collecting these measurements you want to make the set of events being sampled representative of the population as a whole. When comparing two metrics, you have to make sure that the two groups of samples underlying them are similar enough for the comparison to make sense. This is trickier than it sounds.
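A minimal sampling sketch (the 1:100 rate is from the text; a per-event coin flip is one common way to keep the sample representative):

```python
import random

SAMPLE_RATE = 100  # record roughly 1 in 100 events

def should_sample():
    # A uniform random coin flip per event keeps the sampled set
    # representative. Recording "every 100th request" instead can
    # alias with periodic traffic patterns and quietly skew the sample.
    return random.randrange(SAMPLE_RATE) == 0
```

Deterministic stride sampling looks equivalent but isn't: if a cron job fires every 100 requests, you could end up measuring only cron traffic, or none of it.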
Did you know that engineers at Facebook are several centimeters taller than the average person? It's true — if you include children in the average.
Recently when preparing for IPv6 Day, we set up an experiment to measure network latency for users that supported it. Excited to draw me some pretty graphs, I compared the network performance of v6 users with v4 users, and the system told me that mobile v6 clients were 40% faster than the old protocol. Not quite believing this, I sliced the data by country, and then by datacenter. Same story. I emailed people smarter than me, proud of my discovery.
It didn't last long. They pointed out that comparing v6 clients with the average is nonsense. The mobile phones which support the new IP protocol also support the 4G radio protocol, which is expected to be faster than 3G. All I was doing was comparing fast clients to slow clients.
In machine learning circles these are called features but here I'll call them metadata. Metadata are attributes of the system or event that might influence the measurement being made. For a webserver log, useful metadata might include the URL being hit, time of day, server IP, client IP, any A/B testing or profilers that are in effect, the user's logged-in status, the present load on the server, and so on.
These are important for grouping and filtering in the analysis phase. If you had made the mistake of comparing the average height of an engineer with the average height of all people, having metadata like age in your set of measurements helps you fix it.
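As a toy sketch (made-up heights and ages), metadata is what lets you filter both groups down to a comparable population before computing the metric:

```python
def mean(xs):
    return sum(xs) / len(xs)

# Each measurement carries its value plus metadata (age, here).
population = [
    {"height_cm": 178, "age": 34}, {"height_cm": 165, "age": 29},
    {"height_cm": 112, "age": 6},  {"height_cm": 95,  "age": 4},
]
engineers = [{"height_cm": 180, "age": 31}, {"height_cm": 172, "age": 38}]

# Naive: engineers tower over "the average person" (children included).
naive_gap = mean([e["height_cm"] for e in engineers]) - mean(
    [p["height_cm"] for p in population])

# With age metadata we can compare like with like: adults vs. adults.
adult_heights = [p["height_cm"] for p in population if p["age"] >= 18]
fair_gap = mean([e["height_cm"] for e in engineers]) - mean(adult_heights)
```

If age had never been recorded alongside the heights, the naive comparison would be the only one possible.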
In general I try to collect as many details about the event as possible, because you never know what might turn out to be useful. We've had performance regressions correlated with everything from server uptime to cookie size. This becomes important when you're already aware that there is a problem and need to find the root cause. Is the regression happening on all pages, or only a few? Some servers or datacenters or all? And so on.
It took me a long time to appreciate this one. After working with the systems other people built here at Facebook for a while I've come around. You can and sometimes have to get really fancy with the kinds of measurements you make. But it's a really good thing to have one measurement system that is as simple, robust, and low-impact as possible: sampling is random, the code is dependency-free, and it's fast to execute. For a web system this can literally be the standard webserver log with extra performance measurements added to it. When code goes wrong, databases fill up, and fancier systems have bugs or go non-linear, having a ground truth to cross-check everything else against has saved my sanity time and again.
Metrics tend to be chosen based on their ease of calculation rather than informative power. Averages, aka arithmetic means, are simple to calculate and understand. They are also full of lies. If Bill Gates boards a city bus, on average everyone is a billionaire. There are many metrics that are more useful, especially when paired with a good graphing library. That is part of the reason I suggest keeping measurements separate from metrics so that you can pick & choose. But that is another post for another day.
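The bus example in numbers (hypothetical net worths): one outlier drags the mean into nonsense while the median barely moves, which is exactly why keeping raw measurements and choosing the metric afterward pays off:

```python
# Hypothetical net worths (USD) of passengers on a city bus.
passengers = [40_000, 55_000, 32_000, 61_000, 48_000]

passengers.append(100_000_000_000)   # Bill Gates boards

mean_worth = sum(passengers) / len(passengers)            # ~$16.7 billion
median_worth = sorted(passengers)[len(passengers) // 2]   # still $55,000
```

The same stored measurements support either metric; it's only the choice of summary that lies.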