There is no such thing as public record on the internet. Maybe that sounds strange -- the internet is the most public thing around. But consider this: it would be impossible for, say, the New York Times to change or obliterate something it printed in 1997. There are hard copies all over the world. But for nyt.com it's as simple as a mouse click, and it happens all the time. The internet we have today is public, but it's not really a record.
This is pretty important. Public record, the ability to prove that someone said something on a certain date, is the basis of a free and literate society.
For historical reasons, stuff published on the web lives primarily on servers controlled by the original publisher. This means that if the publisher goes away, the data does too. It also means that the publisher gets to choose how it is accessed. Every time some politician is disgraced, his or her name is removed from the website of any organization that can't afford the embarrassment. When a publisher announces that all back issues will go behind the "pay wall", behind the wall they go.
Revisionism is sometimes risky. If you are caught in the act, it only takes a few people to make sure copies pop up like dandelions. But these cases are the exception, and I think Cory Doctorow is far too smug about this point. The daily loss of data is much greater than what bloggers can hope to rescue. And what is important to us now may not be the bits our descendants care about. Archaeologists learn more from garbage dumps than from Genesis.
There are a few coherent public archives: for example, Yahoo's and Google's caches and archive.org. But Google's cache exists for Google's purposes, and it is not designed for the long term. Archive.org has two handicaps: a centralized cache of the web is very expensive to maintain, and they are forced to take down stuff all the time. We need something that can scale, something both lawyer- and disaster-resistant.
Isn't it interesting how many of these problems are solved by having lots of copies in lots of places?
In a sense this archive already exists, though in a low-energy state. Part of it is in your browser's cache right now. These caches are not coherent, organized, searchable, or public. They also have a lot of stuff in there that is better left private. So we have to work around that. But it's a start.
One way to do this is with desktop software. My old project Dowser was an attempt to make personal archiving easy for the average user under the guise of a "research tool". I'm working on a new version called Dowser2. If you want to help, drop me a line at my first name at bueno dot org.
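To make the idea of a personal archive a little more concrete, here is a minimal sketch in Python of the bookkeeping such a tool might do. It is not how Dowser works; the names make_record and revisions and the record fields are invented for illustration. Each saved copy of a page gets a small record with the URL, the time it was fetched, and a SHA-256 digest of the bytes. Comparing the digests of successive copies shows when a page was quietly changed.

```python
import hashlib
from datetime import datetime, timezone


def make_record(url, body, fetched_at=None):
    """Describe one saved copy of a page: where, when, and a digest of its bytes.

    Hypothetical record layout, chosen for this sketch only.
    """
    if fetched_at is None:
        fetched_at = datetime.now(timezone.utc)
    return {
        "url": url,
        "fetched_at": fetched_at.isoformat(),
        "sha256": hashlib.sha256(body).hexdigest(),
        "size": len(body),
    }


def revisions(records):
    """Return (older, newer) pairs of consecutive copies of a URL whose content differs.

    Assumes every timestamp is in UTC, so the ISO strings sort chronologically.
    """
    changed = []
    last_seen = {}
    for rec in sorted(records, key=lambda r: r["fetched_at"]):
        prev = last_seen.get(rec["url"])
        if prev is not None and prev["sha256"] != rec["sha256"]:
            changed.append((prev, rec))
        last_seen[rec["url"]] = rec
    return changed
```

A digest like this only tells you that two copies in your own hands differ. It does not prove to anyone else what was published on a given date, which is exactly why it matters to have lots of copies in lots of places.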