Install a DNS resolver on your laptop

26 February 2009
Free hotspot internet providers (eg Meraki) can have pretty good bandwidth but still feel slow because their DNS resolvers suck and they don't know it. You'll have great response from an SSH session or webmail but clicking a link to a new site will pause or fail.

Even large ISPs get this wrong. I tried for several years to convince BellSouth that one of their DNS resolvers was down:

"No, my internet is not down. The DNS server is down. I can ping. DNS. Dee Enn Ess. Pee Eye Enn Gee. Do you understand I'm trying to tell you about a bad problem with your system? One of your DNS servers is down. It's been down since 2003 but it's still in rotation. Yes, I restarted my router. Yes, my connection is now working but that's not the poi--". Click. Good times.

Solution: install your own damned resolver. I recommend Dr Bernstein's excellent dnscache, part of djbdns (which itself runs under daemontools). Incidentally, this is also a good idea for your servers if you do any crawling, image fetching, etc. You'd be surprised how much it can help.
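If you want to check whether the resolver really is the slow part, you can time name lookups separately from everything else. Here is a rough Python sketch, not a benchmark: the function names and the half-second threshold are my own assumptions, and your operating system may cache answers, so only the first lookup of a name tells you much.

```python
import socket
import time


def time_lookup(hostname, resolve=socket.getaddrinfo):
    """Return (seconds, ok) for one name lookup.

    `resolve` defaults to the system resolver; it is a parameter so the
    timing logic can be exercised without a network connection.
    """
    start = time.perf_counter()
    try:
        resolve(hostname, 80)
        ok = True
    except OSError:
        # socket.gaierror (a failed lookup) is a subclass of OSError
        ok = False
    return time.perf_counter() - start, ok


def slow_lookups(hostnames, threshold=0.5, resolve=socket.getaddrinfo):
    """Return the hostnames whose lookup failed or took longer than `threshold` seconds."""
    slow = []
    for name in hostnames:
        seconds, ok = time_lookup(name, resolve)
        if not ok or seconds > threshold:
            slow.append(name)
    return slow


# Example (needs a network connection):
#   print(slow_lookups(["example.com", "example.org"]))
```

If names you have never visited keep showing up in that list while ping and SSH feel fine, the resolver is the likely suspect.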

Excellent installation instructions here:

http://matt.simerson.net/computing/dns/djbdns-macosx.shtml

http://matt.simerson.net/computing/dns/djbdns-freebsd.shtml

Credit to tlack, who taught me this trick back when I was still figuring out bash.


How To Save The Web

25 August 2008

As strange as it may sound, public record does not exist on the internet. Consider this: it would be impossible for, say, the New York Times to change something it printed in 1997 -- there are hardcopies all over the world. But for nytimes.com it's as simple as a mouse click. So the internet we have today is public but it's not really a record. A healthy public record is the foundation of a free and literate society.

I propose that a loosely-connected network of independent archives, running on personal computers, under the care of self-interested individuals, and sharing common data formats, can in time self-assemble into something that fits the bill.

How we got here

For historical and technical reasons, stuff published on the web lives primarily on servers controlled by the original publisher. It is "distributed" at the time it is asked for, one copy at a time, and that copy is usually discarded. This means that if the original publisher goes away the authoritative source goes too. It also means the publisher, be it a company, government agency, small club or individual, gets to choose how the data is accessed now and in the future. This new power has become very tempting [0]. The mainstream is slowly realizing that stuff frequently disappears by accident and by design [1]:

  1. sports.yahoo.com "...a New York computer security expert who found official Chinese documents that list He's age as 14 years and 220 days... The spreadsheets were taken down off the site recently and He's name had been removed..."

  2. www.talkingpointsmemo.com "When we went to the page for the photograph of President Bush and Abramoff, the page in question had disappeared from the site..."

  3. www.nytimes.com ...Said videos were posted, then mysteriously disappeared from the Edwards Web site, with officials muttering something about campaign finance rules...

  4. perezhilton.com "Pinksky's photo disappeared from the hospital's site on Friday, as the scandal story started to get more legs..."

Wouldn't people notice if things went away?

Most often they don't, and if they do notice, so what? Revisionism and neglect are low-risk. If you are caught in the act, a few determined people can stop you if they find a copy. "Orphan" works are sometimes rescued by their fans. But it's easy to forget that these cases are the exception. This leads to the common assumption that if it's digital, it lives forever.

Intelligent, influential people say things like this: "Once the Internet knows something, it never forgets. This material just doesn't disappear from the Internet if it's sufficiently interesting. Paris Hilton's genitals have joined the undead - they will live forever, stalking the Internet until the last plug is pulled on the last network router." [2]

It sounds nice, but there are serious problems with this idea. If your medium is X times less durable you need either X times as many copies or X times the care. The Rosetta Stone didn't survive thousands of years in the desert because of its intrinsic cultural value. It survived because it's made of stone. The UNIX crowd learned this lesson a few years ago to their lasting regret: there are no digital copies of the first four versions of UNIX, only some printouts.[3]

Worse is the implication that if something is not "sufficiently interesting", if it's not part of the story a society wants to tell about itself, it's worthless. The future disagrees about what is and is not important -- that's what makes it the future. No one today cares what the Rosetta Stone actually says [4]. The interesting part is its side-by-side translation we used to crack hieroglyphics. And archaeologists learn more from garbage dumps than from Genesis precisely because they collect what people no longer want.

The current situation

The current situation in archives is much like the web: uncoordinated, conflicting, changing. The most widespread problem is a paradoxical attitude: most people understand that a centralized web would be unsustainable, but few seem to carry that logic over to archiving.

There are a few public archives. Many national libraries have set up consortia to study the problem. There are search engine caches like Google's. There is the remarkable and far-sighted archive.org. All of them are welcome --in this game, the more the merrier-- but I believe they have various flaws. Google's cache exists for Google's purposes, and is not designed for the long term [5]. The LOCKSS project [6] is commendable in spirit and clever in design, but access to it appears to be limited to select universities and libraries.

Archive.org has two handicaps. First, it's not actually possible for one organization to curate the web. Second, being a non-profit sitting target, they are forced to take down stuff they do save [7] [8].

In April 2004 the public editor of The New York Times spelled out the paper's position in an article titled "Paper of Record? No Way, No Reason, No Thanks". He was speaking more against the obligation to print government notices than the idea of public record per se. But all through it he assumes that (a) someone else will do it, and (b) this information (not to mention copies of his newspaper) will always somehow be available to future historians. At the same time his newspaper was telling archive.org to remove nytimes.com from the collection [9].

What the Archive should be

  • Decentralized & Redundant
    Centralized is too expensive and too fragile. Redundancy increases the odds of survival.

  • Long-Term
    If it's not long-term there's not much point. Open, stable formats.

  • Locally curated
    There should be at least as many opinions about what should go into the archive as people using it.

  • Public & Coherent
    A cache is useless if no one can get to it. It should also contain only things that are already public.

  • Verifiable
    Whether by digital signatures or by comparing copies, or both, the archive must be resistant to tampering.

  • Respectful of privacy
    A lot can be revealed by someone's reading list, and the need to anonymize may conflict with the needs of tamper-proofing.

  • Useful for User 0
    It has to be useful even if you are the only user, otherwise there is much less incentive to use and contribute to it.
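To make the "comparing copies" half of Verifiable concrete, here is a minimal Python sketch. It assumes each independent archive can hand over the raw bytes it saved for the same page; the archive names and the function names are made up for illustration. It fingerprints each copy with SHA-256 and reports which archives disagree with the most common version.

```python
import hashlib
from collections import Counter


def fingerprint(content: bytes) -> str:
    """SHA-256 hex digest of a saved copy."""
    return hashlib.sha256(content).hexdigest()


def majority_copy(copies):
    """Given {archive_name: content_bytes}, return (fingerprint, dissenters).

    `fingerprint` is the digest held by the most archives; `dissenters`
    is the sorted list of archives holding anything else. A tie is not
    resolved here: whichever tied digest was seen first wins.
    """
    prints = {name: fingerprint(data) for name, data in copies.items()}
    winner, _count = Counter(prints.values()).most_common(1)[0]
    dissenters = sorted(name for name, p in prints.items() if p != winner)
    return winner, dissenters


# Hypothetical example: three people saved the same page, one copy was altered.
copies = {
    "alice": b"page as first published",
    "bob": b"page as first published",
    "carol": b"page as quietly edited later",
}
winner, dissenters = majority_copy(copies)
print(dissenters)
```

A majority vote is only as good as the independence of the voters, which is one more argument for many small archives over a few big ones.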

In short, we need something akin to the web itself: something that can grow without limit, yet does not require much centralized organization. It can be pulled into pieces and operate independently and merge back together. Once it reaches a certain size it, or at least the idea, will be impossible to kill.

In a sense the archive already exists, though in a low-energy state. Part of it lives in the browser cache of everyone's computer. These caches are not coherent, organized, searchable, or public. They also have a lot of stuff in there that is better left private. We have to work around that. But it's a start.

Dowser's approach

"The lost cannot be recovered; but let us save what remains: not by vaults and locks which fence them from the public eye and use in consigning them to the waste of time, but by such a multiplication of copies, as shall place them beyond the reach of accident."
Thomas Jefferson, 18 February 1791

Dowser is a program designed to run on a normal personal computer but operate exactly like a website. Its user interface is a fully-functioning website that is hosted by and only accessible to the local machine [10]. This website enables the user to add pages to their local archive, search many search engines at once (aka "metasearch"), search the local archive, add organizational tags, take notes, export and share, etc. It is not intended as a professional tool for research specialists, but as a helpmeet for "power users": journalists, students, writers, etc, who have a demonstrated need for powerful research tools but do not have the time or inclination to train on an academic-grade tool.

When a user adds a URL to the archive, Dowser will attempt to retrieve the content of this URL by itself, without reference to any private authentication or "cookies" held by the user. This helps to ensure that private data stays private. Optionally, Dowser will download linked images, videos and such, and any linked pages up to a user-defined "link depth". As time goes on, Dowser will occasionally "ping" the URLs it knows about to determine if they have changed. If so it will download the new copy and thus build up a change history.
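The change-history idea can be sketched in a few lines of Python. This is not Dowser's actual code: the class and field names are assumptions, and the fetching itself is left out. The point is only that a new snapshot is stored when the content's hash differs from the last one saved, and nothing already stored is ever overwritten.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Snapshot:
    fetched_at: str   # timestamp supplied by the caller
    digest: str       # SHA-256 of the content
    content: bytes


@dataclass
class ChangeHistory:
    url: str
    snapshots: list = field(default_factory=list)

    def record(self, content: bytes, fetched_at: str) -> bool:
        """Append a snapshot if `content` differs from the latest one.

        Returns True if a new snapshot was stored, False if unchanged.
        """
        digest = hashlib.sha256(content).hexdigest()
        if self.snapshots and self.snapshots[-1].digest == digest:
            return False
        self.snapshots.append(Snapshot(fetched_at, digest, content))
        return True


# Hypothetical example: three "pings" of one URL, the page changed once.
history = ChangeHistory("http://example.com/")
history.record(b"version one", "2008-08-25")
history.record(b"version one", "2008-08-26")
history.record(b"version two", "2008-08-27")
print(len(history.snapshots))
```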

The program will not allow the user to alter any data in his archive, though it is possible to delete anything. Since the formats are open and simple, a determined user can of course alter the archive data with other tools. We can't stop that, and should not try, but it will be an interesting obstacle.

Pulling it together

So far we have only a local archive. What User 0 reads is saved and available only to User 0. So how do we share it worldwide without inviting spam or violating privacy or running afoul of the law, etc?

The seven qualities above are somewhat in conflict with each other. Privacy may conflict with the need to verify copies, decentralization conflicts with system performance conflicts with public access, etc. Centralized archives solve it by doing it all themselves and building up a reputation for trustworthiness. LOCKSS, which is a network of caching servers installed in many universities, depends on the mutual trust of those institutions for verification and limits public access to ward off lawsuits by copyright owners.

You and I aren't smart enough to solve the sharing problem. We should start small and listen to the users, to use the much greater imagination of the group. Make tools to allow User 0 to share with User 1, then User 2, etc, but make them orthogonal to the rest. Allow people to extend Dowser and use it in ways we don't expect.

The common feature of all of the current archiving schemes is that they are hard for a mere mortal to set up. In theory anyone can download archive.org's software and use it to build their own archive. In practice it's very tricky, and not worth the effort for most people to learn the ancillary skills like programming, systems administration, etc. You can't join LOCKSS without being a major university.

Archiving today is a luxury good: complex, balky, expensive, like cars before 1908. When something is a luxury and you want more of it, what do you do? You turn it into a commodity. Make the hard parts easy, do what you can for the harder parts, get something workable into the hands of a much larger group. They will take it from there.

Notes




Network Distance

09 June 2008
When Sydney is closer than Sao Paulo

It's quite possible to make your living from the internet without really considering how it's constructed. I came across this talking with my friend Aaron. He's a bright guy but like many people he assumes the net is done "with satellites or whatever", or never thinks about it.

The general structure of the internet is pretty similar to the global air network. Picture those glowing arcs connecting cities on the back of airline magazines. Then recall the hour you spent in line before boarding, the hour spent driving to the airport, the two hours from the big hub airport to the smaller city you really want to go to, and how relieved you are to be traveling now instead of last week when a storm in Chicago somehow messed up flights to Los Angeles. That's basically how your data feels, too.

The long-haul internet is, in fact, a series of tubes. Very fragile tubes.

The long-haul cables that run the global economy are less than 2 inches in diameter, buried under railroad tracks, highways, or ocean sediment. Within densely-settled areas the network is relatively redundant. But between countries and across oceans and mountains, data flows through an uncomfortably small set of bottlenecks.

Any minor disaster can damage large portions of the 'net. In 2004 the Miami/Sao Paulo traffic was suddenly re-routed through Washington DC, then New York, then across the ocean to Brussels(!), then back to SP. As late as 1998, a train wreck or wildfire in northern Florida could cut off large parts of Latin America for days. This year Suez suffered multiple failures and re-routing flooded the already overloaded Europe-Asia network for several weeks.

Even on a good day you can see the problem. Look at how a packet of data might travel from San Francisco to Hong Kong:


(start)
Folsom St, San Francisco (1 mile)
Pine St, San Francisco (2 miles)
Pine St, San Francisco
Oakland, CA (10 miles)
Sacramento, CA (80 miles)
San Jose, CA (120 miles)
Oakland, CA (40 miles)
San Jose, CA (40 miles)
Hong Kong (7,000 miles)
(finish)


That poor little packet of data rattled all around California, looking for an uncongested cable over the Pacific. For each hop, a decision is made to send it on to some other place that may have better luck. The system works pretty well under stress which is good because stress is there all of the time.
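Adding up the hops in that route makes the point a different way. The Python below just sums the distances listed above (the second Pine St hop has no distance, so I count it as zero).

```python
# Hop distances in miles, copied from the route above.
# The second Pine St hop lists no distance, so it is counted as 0.
hops = [
    ("Folsom St, San Francisco", 1),
    ("Pine St, San Francisco", 2),
    ("Pine St, San Francisco", 0),
    ("Oakland, CA", 10),
    ("Sacramento, CA", 80),
    ("San Jose, CA", 120),
    ("Oakland, CA", 40),
    ("San Jose, CA", 40),
    ("Hong Kong", 7000),
]

total_miles = sum(miles for _, miles in hops)
california_miles = sum(miles for place, miles in hops if place != "Hong Kong")

print(total_miles, california_miles, len(hops))
```

The California wandering is 293 miles out of 7,293, about 4% of the distance. But it is eight of the nine hops, and every hop is another routing decision and another queue to wait in.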

The long-haul internet is a map of trade volume between cities that lags up to 20 years behind reality.

This is actually true for all forms of high-volume transport, so there is a lot of history to learn from. Infrastructure is insanely expensive and slow to build even though it almost always pays off in the long run. Short hops between financial/military/political/industrial hubs tend to get built up first. Just look at how many ways there are to travel between New York, Boston, and DC, for example.

Or look at area codes. At the time the precursor to the modern phone system was built, dialing a 1 took 1/9th the time of dialing a 9. Silly, but true. So there was a premium on lower numbers. New York's area code was 212, DC 202, Los Angeles 213, Chicago 312. El Paso, Texas? 915. Anchorage, Alaska? 907. Miami was definitely not a hub at the time, but it was important to the Navy and Air Force. Miami's code is 305.
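You can put rough numbers on the dialing cost. On a rotary phone each digit sends that many pulses, except 0, which sends ten. The Python sketch below counts pulses for some of the codes above, assuming pulse count is the only cost that matters.

```python
def pulses(number: str) -> int:
    """Pulses a rotary dial sends: digits 1-9 send that many, 0 sends ten."""
    return sum(10 if digit == "0" else int(digit) for digit in number)


area_codes = {
    "New York": "212",
    "Chicago": "312",
    "Washington DC": "202",
    "Miami": "305",
    "El Paso": "915",
    "Anchorage": "907",
}

# Print cities from cheapest to most expensive to dial.
for city, code in sorted(area_codes.items(), key=lambda item: pulses(item[1])):
    print(f"{code}  {pulses(code):2d} pulses  {city}")
```

New York's 212 is the cheapest code possible at 5 pulses, and Anchorage's 907 costs 26. Note that a 0 in the middle is the most expensive digit, so 202 and 305 cost more to dial than they look.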

So if several factors of demography, geography, and politics align, there may form a route of sufficient capacity between two points. If not, too bad. It takes years to build up demand, more years to begin the project, and more and more years to finish it. There are bribes, labor riots, sabotage, political chicanery, etc. The first trans-continental rail link in the US was completed in 1869. Twenty years earlier, people had been hijacking ships in Louisiana to sail around South America and crash-land ashore at San Francisco.

The same dramas play out when cables are planned and laid. The connection between Seattle and Tokyo is excellent. Ciudad Mexico and Dallas? Fairly new and really fast. New York to London? World-class. But try to get an email from Barcelona to Bangalore, and often you'll find that it routes through America. Companies are scrambling to build up Europe-Asia links. They've been scrambling since the late 1980's, and it will be some time before they get there.

The long-haul internet is not a magic leprechaun.

Even if there weren't these human problems there are still fundamental ones. Let's imagine a perfectly balanced world-wide network. You have an internet business based in San Francisco. Your people are in SF, your technology suppliers are in SF, most of your customers are in the United States. You have a small but growing customer base in Hong Kong and China. We have a perfect 'net, so there are no silly congestion problems and there is just as much bandwidth across oceans as between cities. Rack space in HK is twice the price of rack space in SF.

So what is the no-brainer place to put your next server farm, hire people to maintain it, set up office space, pay property and business taxes, etc? Correct. Hong Kong.

No matter how good the internet gets it will never be faster than a small fraction of the speed of light.

Recently I was looking at the server logs of a site located in the US. The response times had a very high variance, which indicates a severe bottleneck somewhere. After a lot of poking around, trying to find the problem within the server farm, I had the bright idea to segment the logs by source country. And there it was: the response times were all over the map because the users were all over the map. From the perspective of the 'net, Stockholm and Singapore are just down the block while India and Sao Paulo are past the moon.
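Segmenting by country is easy to try on your own logs. Here is a small Python sketch with made-up numbers: it assumes you have already turned each log line into a (country, response time in milliseconds) pair, which is the hard part and is not shown.

```python
from collections import defaultdict
from statistics import mean, pstdev


def summarize_by_country(records):
    """records: iterable of (country, response_ms).

    Returns {country: (count, mean_ms, stdev_ms)}.
    """
    groups = defaultdict(list)
    for country, ms in records:
        groups[country].append(ms)
    return {c: (len(v), mean(v), pstdev(v)) for c, v in groups.items()}


# Made-up sample data, for illustration only.
records = [
    ("US", 80), ("US", 90), ("US", 85),
    ("IN", 900), ("IN", 950),
    ("BR", 700), ("BR", 760),
]

overall_stdev = pstdev([ms for _, ms in records])
per_country = summarize_by_country(records)

print(f"all traffic: stdev {overall_stdev:.0f} ms")
for country, (count, avg, spread) in sorted(per_country.items()):
    print(f"{country}: n={count} mean={avg:.0f} ms stdev={spread:.0f} ms")
```

In this toy data the spread inside each country is small and the spread across all traffic is huge, which is the pattern I saw: the variance was coming from geography, not from the server farm.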

Rules of thumb:
- Light takes about 100 milliseconds to travel 10,000 miles and back.

- A network packet on a good route may take 3 to 10 times that long.

- A network packet has not "arrived" until a return acknowledgment has been sent all the way back.

- The longer and more complicated the route from here to there, the higher the chance (often more than 10%!) that a packet will get lost.

- The larger the file you send, the more packets it has to be chopped up into, the more likely one will be lost and have to be re-transmitted. All else being equal the transit time of a file increases more than linearly with its size.
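The first and last rules of thumb are easy to check with arithmetic. The Python below is a back-of-envelope sketch under stated assumptions: light in a vacuum at 186,282 miles per second, 1,460 bytes of payload per packet (a common figure on Ethernet, not something from my logs), and packet losses that are independent of each other, which real networks do not guarantee.

```python
SPEED_OF_LIGHT_MILES_PER_SEC = 186_282  # in a vacuum; fiber is slower


def light_round_trip_ms(miles_one_way):
    """Milliseconds for light to cover the distance and come back."""
    return 2 * miles_one_way / SPEED_OF_LIGHT_MILES_PER_SEC * 1000


def packets_needed(file_bytes, payload_bytes=1460):
    """Number of packets for a file, rounding up (payload size is an assumption)."""
    return -(-file_bytes // payload_bytes)  # ceiling division


def chance_of_any_loss(packets, loss_rate):
    """Probability that at least one packet is lost, assuming independent losses."""
    return 1 - (1 - loss_rate) ** packets


print(f"{light_round_trip_ms(10_000):.0f} ms round trip for 10,000 miles")
for size in (10_000, 1_000_000):
    n = packets_needed(size)
    print(f"{size:>9,} bytes = {n:>3} packets, "
          f"{chance_of_any_loss(n, 0.01):.0%} chance of a loss at 1% per packet")
```

The round trip comes out to about 107 ms, so "about 100 milliseconds" is a fair rule. And even at a 1% loss rate, a one-megabyte file is 685 packets and is all but certain to lose at least one of them.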


Contact: my first name at bueno dot org
