
How To Save The Web

25 August 2008

As strange as it may sound, public record does not exist on the internet. Consider this: it would be impossible for, say, the New York Times to change something it printed in 1997 -- there are hardcopies all over the world. But for nytimes.com it's as simple as a mouse click. So the internet we have today is public but it's not really a record. A healthy public record is the foundation of a free and literate society.

I pro­pose that a loosely-connected net­work of in­depen­dent archives, runn­ing on per­son­al com­put­ers, under the care of self-interested in­dividu­als, and shar­ing com­mon data for­mats, can in time self-assemble into some­th­ing that fits the bill.

How we got here

For his­tor­ical and tech­n­ical rea­sons, stuff pub­lished on the web lives primari­ly on serv­ers con­trol­led by the origin­al pub­lish­er. It is "dis­tributed" at the time it is asked for, one copy at a time, and that copy is usual­ly dis­car­ded. This means that if the origin­al pub­lish­er goes away the aut­horitative sour­ce goes too. It also means the pub­lish­er, be it a com­pany, govern­ment agen­cy, small club or in­dividu­al, gets to choose how the data is ac­cessed now and in the fu­ture. This new power has be­come very tempt­ing [0]. The mainstream is slow­ly rea­liz­ing that stuff frequent­ly dis­ap­pears by ac­cident and by de­sign [1]:

  1. sports.yahoo.­com "...a New York com­put­er secur­ity ex­pert who found of­fici­al Chinese docu­ments that list He's age as 14 years and 220 days... The spreadsheets were taken down off the site re­cent­ly and He's name had been re­moved..."

  2. www.tal­kingpointsmemo.­com "When we went to the page for the photog­raph of Pre­sident Bush and Ab­ramoff, the page in ques­tion had dis­ap­peared from the site..."

  3. www.nytimes.com "...Said videos were posted, then mysteriously disappeared from the Edwards Web site, with officials muttering something about campaign finance rules..."

  4. per­ez­hilton.com "Pinksky's photo dis­ap­peared from the hos­pital's site on Friday, as the scand­al story star­ted to get more legs..."

Would­n't peo­ple notice if th­ings went away?

Most often they don't, and if they do notice, so what? Revisionism and neglect are low-risk. "Orphan" works are sometimes rescued by their fans. But it's easy to forget that these cases are the exception. Intelligent, influential people say things like this: "Once the Internet knows something, it never forgets. This material just doesn't disappear from the Internet if it's sufficiently interesting." [2]

There are seri­ous pro­blems with this idea. The Roset­ta Stone didn't sur­vive thousands of years in the de­sert be­cause of some in­trin­sic cul­tur­al value. It sur­vived be­cause it's made of stone. The UNIX crowd lear­ned this les­son a few years ago to their last­ing re­gret: there are no di­git­al co­p­ies of the first four vers­ions of UNIX, only some prin­touts.[3]

Worse is the implication that if something is not "sufficiently interesting", if it's not part of the story a society wants to tell about itself, it's worthless. The future disagrees about what is and is not important, and why. That's the defining characteristic of the future. No one today cares what the Rosetta Stone actually says [4], yet it is more important to us (as the key to hieroglyphics) than it was to the society that made it.

The cur­rent situa­tion

The cur­rent situa­tion in archives is much like the web: un­coor­dinated, con­flict­ing, chang­ing. The most widespread pro­blem is a para­dox­ical at­titude: most peo­ple un­derstand that a centralized web would be un­sus­tain­able, but few seem to carry that logic over to archiv­ing.

There are a few public archives. Many national libraries have set up consortia to study the problem. There are search engine caches like Google's. There is the remarkable and far-sighted archive.org. All of them are welcome --in this game, the more the merrier-- but I believe they have various flaws. Google's cache exists for Google's purposes, and is not designed for the long term [5]. The LOCKSS project [6] is commendable in spirit and clever in design, but access to it appears to be limited to select universities and libraries.

Archive.org has two handicaps. First, it's not actually possible for one organization to curate the web. Second, being a non-profit sitting target, they are forced to take down stuff they do save [7] [8].

In April 2004 the public editor of The New York Times spelled out the paper's position in an article titled "Paper of Record? No Way, No Reason, No Thanks". He was speaking more against the obligation to print government notices than the idea of public record per se. But all through it he assumes that (a) someone else will do it, and (b) this information (not to mention copies of his newspaper) will always somehow be available to future historians. At the same time his newspaper was telling archive.org to remove nytimes.com from the collection [9].

What the Archive should be

In short, we need some­th­ing akin to the web it­self: some­th­ing that can grow with­out limit, yet does not re­quire much centralized or­ganiza­tion. It can be pul­led into pieces and op­erate in­depen­dent­ly and merge back togeth­er. Once it rea­ches a cer­tain size it, or at least the idea, will be im­pos­sible to kill.

In a sense the archive al­ready ex­ists, though in a low-energy state. Part of it lives in the brows­er cache of every­one's com­put­er. These cac­hes are not co­herent, or­ganized, search­able, or pub­lic. They also have a lot of stuff in there that is bet­t­er left private. We have to work around that. But it's a start.

Dow­ser's approach

"The lost can­not be re­covered; but let us save what re­mains: not by vaults and locks which fence them from the pub­lic eye and use in con­sign­ing them to the waste of time, but by such a multi­plica­tion of co­p­ies, as shall place them be­yond the reach of ac­cident."
Thomas Jef­ferson, 18 Feb­rua­ry 1791

Dows­er is a pro­gram de­sig­ned to run on a norm­al per­son­al com­put­er but op­erate ex­act­ly like a web­site. Its user in­ter­face is a fully-functioning web­site that is hos­ted by and only ac­cessib­le to the local mac­hine [10]. This web­site en­ables the user to add pages to their local archive, search many search en­gines at once (aka "metasearch"), search the local archive, add or­ganization­al tags, take notes, ex­port and share, etc. It is not in­ten­ded as a pro­fes­sion­al tool for re­search special­ists, but as a helpmeet for "power users": jour­nal­ists, students, writ­ers, etc, who have a de­monstrated need for power­ful re­search tools but do not have the time or in­clina­tion to train on an academic-grade tool.

When a user adds a URL to the archive, Dows­er will at­tempt to re­trieve the con­tent of this URL by it­self, with­out re­fer­ence to any private aut­hentica­tion or "co­ok­ies" held by the user. This helps to en­sure that private data stays private. Opt­ional­ly, Dows­er will download li­nked im­ages, videos and such, and any li­nked pages up to a user-defined "link depth". As time goes on, Dows­er will oc­casional­ly "ping" the URLs it knows about to de­ter­mine if they have chan­ged. If so it will download the new copy and thus build up a chan­ge his­to­ry.
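The retrieve-then-ping cycle described above might look something like the following. This is only a minimal sketch, not Dowser's actual code: the names `LocalArchive`, `Snapshot`, `fetch_without_cookies` and `ping_all` are invented for illustration, the archive lives in memory rather than in an open file format, and linked images and "link depth" are left out entirely. It assumes that comparing a SHA-256 digest of the content is an acceptable way to decide whether a page has changed.

```python
import hashlib
import time
import urllib.request
from dataclasses import dataclass, field
from typing import Callable, Dict, List


def fetch_without_cookies(url: str) -> bytes:
    """Fetch a URL with a bare opener: no cookie jar, no stored credentials."""
    # build_opener() with no arguments installs only the default handlers,
    # so no HTTPCookieProcessor is present and no cookies are sent.
    opener = urllib.request.build_opener()
    with opener.open(url, timeout=30) as response:
        return response.read()


@dataclass
class Snapshot:
    fetched_at: float  # Unix timestamp of retrieval
    sha256: str        # hex digest of the content
    content: bytes     # the bytes exactly as retrieved


@dataclass
class LocalArchive:
    # The fetcher is injectable so the archive can be exercised offline.
    fetch: Callable[[str], bytes] = fetch_without_cookies
    history: Dict[str, List[Snapshot]] = field(default_factory=dict)

    def add(self, url: str) -> bool:
        """Fetch url and store a snapshot only if it differs from the latest.

        Returns True when a new snapshot was stored.
        """
        content = self.fetch(url)
        digest = hashlib.sha256(content).hexdigest()
        snapshots = self.history.setdefault(url, [])
        if snapshots and snapshots[-1].sha256 == digest:
            return False  # unchanged since the last visit
        snapshots.append(Snapshot(time.time(), digest, content))
        return True

    def ping_all(self) -> List[str]:
        """Re-fetch every known URL; return the ones whose content changed."""
        return [url for url in list(self.history) if self.add(url)]
```

Calling `archive.add(url)` once records the first copy; running `archive.ping_all()` on a schedule then appends a new snapshot whenever a page changes, so `archive.history[url]` becomes the change history. Old snapshots are never overwritten, only added to.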

The program will not allow the user to alter any data in his archive, though it is possible to delete anything. Since the formats are open and simple, a determined user can of course alter the archive data with other tools. We can't stop that, and should not try, but it will be an interesting obstacle.

Pull­ing it togeth­er

So far we have only a local archive. What User 0 reads is saved and avail­able only to User 0. So how do we share it worldwide with­out in­vit­ing spam or violat­ing priva­cy or runn­ing afoul of the law, etc?

The seven qualities above are somewhat in conflict with each other. Privacy may conflict with the need to verify copies, decentralization conflicts with system performance, which conflicts with public access, etc. Centralized archives solve it by doing it all themselves and building up a reputation for trustworthiness. LOCKSS, which is a network of caching servers installed in many universities, depends on the mutual trust of those institutions for verification and limits public access to ward off lawsuits by copyright owners.

You and I aren't smart en­ough to solve the shar­ing pro­blem. We should start small and li­st­en to the users, to use the much great­er im­agina­tion of the group. Make tools to allow User 0 to share with User 1, then User 2, etc, but make them orthogon­al to the rest. Allow peo­ple to ex­tend Dows­er and use it in ways we don't ex­pect.

The com­mon fea­ture of all of the cur­rent archiv­ing schemes is that they are hard for a mere mort­al to set up. In theo­ry an­yone can download archive.org's software and use it to build their own archive. In prac­tice it's very tri­cky, and not worth the ef­fort for most peo­ple to learn the an­cil­la­ry skills like pro­gramm­ing, sys­tems ad­ministra­tion, etc. You can't join LOC­KSS with­out being a major uni­vers­ity.

Archiv­ing today is a luxu­ry good: com­plex, balky, ex­pen­sive, like cars be­fore 1908. When some­th­ing is a luxu­ry and you want more of it, what do you do? You turn it into a com­mod­ity. Make the hard parts easy, do what you can for the hard­er parts, get some­th­ing work­able into the hands of a much larg­er group. They will take it from there.

Notes