How To Save The Web

26 August 2008

There is no such thing as public record on the internet. Maybe that sounds strange -- the internet is the most public thing around. But consider this: it would be impossible for, say, the New York Times to change or obliterate something it printed in 1997. There are hard copies all over the world. But for nyt.com it's as simple as a mouse click, and it happens all the time. The internet we have today is public, but it's not really a record.

This is pretty important. Public record, the ability to prove that someone said something on a certain date, is the basis of a free and literate society.

It doesn't have to be this way

Some disappearances that got noticed:

sports.yahoo.com "...a New York computer security expert who found official Chinese documents that list He's age as 14 years and 220 days... The spreadsheets were taken down off the site recently and He's name had been removed..."

www.talkingpointsmemo.com "When we went to the page for the photograph of President Bush and Abramoff, the page in question had disappeared from the site..."

perezhilton.com "Pinksky's photo disappeared from the hospital's site on Friday, as the scandal story started to get more legs..."

lots more...

For historical reasons, stuff published on the web lives primarily on servers controlled by the original publisher. This means that if the publisher goes away, the data does too. It also means that the publisher gets to choose how it is accessed. Every time some politician is disgraced, his or her name is removed from the website of any organization that can't afford the embarrassment. When a publisher announces that all back issues will go behind the "pay wall", behind the wall they go.

Revisionism is sometimes risky. If you are caught in the act it only takes a few people to make sure copies pop up like dandelions. But these cases are the exception and I think Cory Doctorow is far too smug about this point. The daily loss of data is much greater than what bloggers can hope to rescue. And what is important to us now may not be the bits our descendants care about. Archaeologists learn more from garbage dumps than from Genesis.

There are a few coherent public archives: Yahoo's and Google's caches, for example, and archive.org. But Google's cache exists for Google's purposes and it is not designed for the long term. Archive.org has two handicaps: a centralized cache of the web is very expensive to maintain, and they are forced to take stuff down all the time. We need something that can scale, something both lawyer- and disaster-resistant.

The good news

Isn't it interesting how many of these problems are solved by having lots of copies in lots of places?

In a sense this archive already exists, though in a low-energy state. Part of it is in your browser's cache right now. These caches are not coherent, organized, searchable, or public. They also have a lot of stuff in there that is better left private. So we have to work around that. But it's a start.

What the Archive should be

Decentralized & Redundant
Centralized is too expensive and too fragile. Redundancy increases the odds of survival.
Long-Term
If it's not long-term there's not much point. That means open, stable formats.
Locally curated
There should be at least as many opinions about what should go into the archive as people using it.
Public & Coherent
A cache is useless if no one can get to it. It should also contain only things that are already public.
Verifiable
Whether by digital signatures or by comparing copies, or both, the archive must be resistant to tampering.
Useful for User 0
It has to be useful even if you are the only user, otherwise there is no incentive to keep the flame alive.
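
The "comparing copies" half of Verifiable can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not how Dowser or any existing archive works: the function names and the holder names are hypothetical, and I assume each holder can hand over the raw bytes of its copy of a page. Each copy is reduced to a SHA-256 fingerprint, the fingerprint held by the most copies is treated as the likely original, and any holder that disagrees is flagged.

```python
import hashlib
from collections import Counter


def fingerprint(content: bytes) -> str:
    """Reduce a saved copy to a short, comparable SHA-256 hex digest."""
    return hashlib.sha256(content).hexdigest()


def compare_copies(copies: dict[str, bytes]) -> tuple[str, list[str]]:
    """Return the most common fingerprint and the holders that disagree with it.

    `copies` maps a (hypothetical) holder name to the bytes that holder saved.
    """
    if not copies:
        raise ValueError("need at least one copy to compare")
    prints = {holder: fingerprint(body) for holder, body in copies.items()}
    majority, _count = Counter(prints.values()).most_common(1)[0]
    dissenters = sorted(h for h, p in prints.items() if p != majority)
    return majority, dissenters


# Three independent copies of the same page; one has been altered.
original = b"<p>He's age is listed as 14 years and 220 days.</p>"
copies = {
    "alice": original,
    "bob": original,
    "carol": b"<p>This page has been removed.</p>",
}
majority, dissenters = compare_copies(copies)
print(majority == fingerprint(original), dissenters)
```

A majority vote like this only resists tampering when the copies really are independent, which is one more argument for lots of copies in lots of places. Digital signatures would cover the other half: proving who published something and when.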

Fellow Travelers

One way to do this is with desktop software. My old project Dowser was an attempt to make personal archiving easy for the average user under the guise of a "research tool". I'm working on a new version called Dowser2. If you want to help, drop me a line at my first name at bueno dot org.