Preserving Web Content for the Future

Friday, May 23, 2003

“Graduates of the School of Journalism, you will write the first draft of history. Or, if the deadline’s really tight, you can download various pieces of the first draft of history from Google and then put your name on it. As a member of the fourth estate, you are the nation’s watchdog. Corrupt businessmen and politicians will fear your mighty CTRL-C, CTRL-V skills. You will be scrupulously fair and unbiased and follow the story wherever it leads, as long as it won’t end up giving ammo to right-wing weirdos. While journalism jobs continue to decline, take heart in the fact that yours is a noble calling in which nobody really checks your resume, especially at the New York Times.”
– David Burge, Hail to Thee, [Insert College Name Here], (CNSNews.com), May 23, 2003

by Philipp Lenssen

Investigative online journalism: Microdoc(s)-News set up an artificial shadow domain to drop their PageRank from 6 to 0 — and is now checking how much PR really matters (and what it takes to boost it again).

by Philipp Lenssen

I would like to review what data is preserved from my grandparents by taking an imaginary walk through their attic.
What always stands out are the now-historical magazines, like one covering the first moon landing. Who back in the late 1960s would have thought this worth preserving? OK, maybe some. My grandmother marked it with a thick black pen on the cover: “Keep this”.

When you go through the issue, interestingly enough, the actual photos of the moon landing, and the story behind it, are the least thrilling. I wasn’t alive when Armstrong took his first steps, and yet I have seen the pictures a thousand times, watched the event in endless TV documentaries, and read about the background story.

No. What I found much more interesting than all the news stories were the advertisements. A linguist might say the interpersonal and textual, as opposed to the experiential, caught my attention. Somehow, they told me much more about how people were thinking, and seeing the world, back then and there.

Let’s Keep It All

This poses an interesting challenge: we don’t know what will be interesting enough to preserve for future generations. And we don’t know what is worth preserving, because no one else would have thought of preserving it.
One approach: preserve everything, and let the people of the future dig through it. This, however, creates a huge need for searching and finding. We can scan and file everything these days, creating libraries of digital information.

Digital Storage Secrets

Now then, I would argue the following are ways to preserve digital information:

Leave out abbreviations
Be verbose
Be specific
Ignore meta-data
Use the simplest implementation
Separate data from functionality
Separate content from layout
Talk about one thing at a time
Don’t compress data
Don’t trust anyone

These are the issues I learned from my programming experience, and they apply to Web content as well, because:

Abbreviations might not be understood out of context, or in the future.
If you’re not verbose, you don’t use enough word variations, and the interesting bits will sink for lack of a foundation.
If you’re not very specific, the text rots faster, and you and others will be forced to add to it and modify it all the time.
Meta-data at best describes the actual information at one point in time. You cannot trust meta-data to be accurate in the future — search engines don’t even trust it in the present.
The simpler and more open the document format, the higher the chance it will survive the coming years. Adding complexity means more complex rendering mechanisms are needed.
Data should not be directly mixed with functionality/behavior, since this decreases its accessibility. (People might want to attach different behavior to the same data at different times.)
If you’re relying on certain output media, or certain output media settings, you’re risking survival of the content you produce, because you add yet another factor out of your control.
You need to talk about one thing at a time. (In programming, this would be refactoring and modularization — in the world of SEO, it’s giving each page a descriptive title.) Mixing topics means one might be shadowed by the other when one wants to retrieve it during search.
Compressing data into binary form is yet another hurdle for future access.
The more you rely on external tools (or other people) to handle your data, the more it is at risk of getting lost. TinyURL.com, GeoCities.com, even Blogger.com, might be doomed, or completely change their service. You simply don’t know. All you can trust is your own storage and publishing mechanisms. Fewer layers between you and your reader mean a higher chance that the content will be preserved.
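
The points about separating content from layout and from behavior can be illustrated with a minimal Python sketch. It assumes the page is available as a string and relies only on the standard library's html.parser module; the names VisibleTextExtractor and extract_text are hypothetical and chosen just for this example. It keeps the text and discards script and style blocks, which is roughly what remains readable once the rendering software is gone.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text content, skipping <script> and <style> blocks."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # greater than 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside script and style elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def extract_text(markup):
    """Return the text of an HTML string without scripts and styles."""
    parser = VisibleTextExtractor()
    parser.feed(markup)
    parser.close()
    return parser.text()
```

Calling extract_text on a page string returns the words a reader would see, joined by single spaces. The less a page depends on its scripts and styles to make sense, the less is lost in this step.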

The Web’s Lingua Franca

In the current set of digital communication media, I would point to the Web as the most interesting. And in the current implementation of Web languages, I would point to XHTML Strict. (XHTML is the eXtensible Hyper-Text Markup Language; it is based on XML, whereas HTML is based on SGML.)
The idea here is that one separates content from layout, making it accessible for a multitude of devices, today and in the future (who knows how people will browse during the next decade).

At the same time, people don’t use (X)HTML as intended, or at least not as intended by Tim Berners-Lee and the World Wide Web Consortium: HTML separated from layout issues. So the problem is pushed to the browser side. Browser manufacturers have to be commercial to carry on writing their code, which means their implementations are fault-tolerant. This in turn makes people even lazier when writing HTML.

So we could see a future where HTML is so fuzzy, so far from any standard, and so undefined in its actual use, that it comes to rely completely on certain browsers and their implementations. However, software, especially the compiled and unreadable binary versions of such (but even the huge and complex libraries of open source), dies the fastest. With it, older information on the web could die.

It happens quite fast. If you download Netscape 4 these days, it will start to load a certain Netscape.com webpage which will cause it to spit out errors. Not even the one party having control over both the webpage and the browser manages to keep the content accessible.

Makes one wonder if digital storage is superior to more traditional means of preserving data.
While we can still read epitaphs hammered into thousand-year-old stones, we have a hard time finding the floppy drive to access the source code that came with a 1980s C-64 BASIC programming book. (Even if you could type the code from the book, you likely don’t have the right interpreter to run it. Here, a solution would be to be broad in concept and unspecific in implementation, i.e. to use pseudo-code.)

HTML as intended is very simple, unspecific, and doesn’t need complex interpretations. However, the way most people (or software tools) write it today, at a time when a lot of older material is recognized, copied, uploaded, indexed, and thus intended to be conserved, resembles that 1980s BASIC more than pseudo-code (excuse the programming metaphors, since HTML is not a programming language).

Escaping the Dilemma

There is simply no realistic way to educate people into writing “stricter” HTML. And most tools for the input and output of HTML just go along the same way, either because they are forced to by expectations, or because the developers themselves don’t understand the web. (I wrote a Content Management System myself over the course of some years working for a company, and you simply cannot expect a customer working with it to understand the underlying issues completely, such as the separation of content and layout. And yet it’s amazingly easy to implement them anyway and make them useful.)

Imagine a best-selling tool (say, Microsoft FrontPage) that would suddenly start to produce only (X)HTML Strict. And imagine a popular browser (like Netscape) that would now start to render only HTML Strict*. Suddenly, people and tools would start to adapt, because they have to. The blog-producing applications, and all other sorts of Content Management Systems, account for a large part (maybe the largest part) of content produced online today. If people followed this road, there would be much less of a chance that future generations will be completely lost in the Web.

*Modern browsers already have so-called doctype-sniffing mechanisms, but the website won’t be “punished” for syntax inconsistencies or proprietary, vendor-specific extensions; the browser will simply switch into an “old-fashioned” fault-tolerant rendering mode.
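
What a strict consumer would look like can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the XML parser from the standard library and checks well-formedness, which is a weaker test than validity against the XHTML Strict DTD, and the function name is_well_formed and the two sample pages are made up for this example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if the markup parses as XML, False otherwise."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


# Every element is closed and properly nested.
strict_page = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>Attic</title></head>"
    "<body><p>Keep this.</p></body>"
    "</html>"
)

# The <br> and <p> elements are never closed; a fault-tolerant browser
# would render this anyway, an XML parser refuses it.
sloppy_page = "<html><body><p>Keep this.<br></body></html>"
```

Here is_well_formed(strict_page) is True and is_well_formed(sloppy_page) is False. A browser built on such a parser would “punish” the second page by showing nothing at all, which is exactly the pressure the paragraph above imagines.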

Back in my grandparents’ attic, I’m now dusting off a back-issue of the German MAD from the 1970s, a somewhat costly rarity over three decades later. Don Martin, Al Jaffee, William M. Gaines, Mort Drucker, to name a few. I’m truly glad their genius wasn’t published as a webpage. Or else, I might never have discovered it.

Quite a good storage secret: Print it on paper. Put it in the attic.

Preserving Web Content fo ... by Philipp Lenssen

Googling Paranoia

As Microdoc points out, most Google stories in the media are about googling, the loss of privacy, and possible merits; see “She found her father’s story and herself” (today in the Kansas City Star), “Is Search Privacy an Issue?” (by Danny Sullivan, today at InternetNews.com), and “The World According to Google” (yesterday in the San Francisco Chronicle).

How much can really be found publicly online? I’d almost argue: not more than what you once intended to communicate. Your office might publish your phone number, email and job title. Hardly anything dubious. Google Groups archives Usenet, which is a public place to begin with. Newspapers will feature your name if you talked to them before. If you talk to a reporter, you already know it’ll be publicized. University dissertations. Isn’t that called a publication?

What actually happens is that people don’t understand that computers are really bad at forgetting.
People tend to forget. You did something bad, embarrassing, stupid, maybe illegal? Just wait some months. Maybe a year. Forgiven and forgotten. Nobody will talk about it.
Not so with search engines and their data storage. The same goes for most websites (think deserted GeoCities accounts), the Google Cache, or the Wayback Machine (yes, you can even see how modest Google’s layout was back in 1998). Online, people endlessly mirror, cache, duplicate, quote, and republish. To counter information you don’t like, all you can do is react to it, creating noise, flirting with search engines to prefer your side of the story.

People aren’t really afraid of the Internet in general, the World Wide Web, Search Engines, or Google in particular; no, they’re suffering from egophobia.

Or at least, that’s the image the media tries to project.

Most people I know are actually happy when they find out more about themselves when googling their name. Cheaper than a shrink, and much more revealing.

“Axiom 1 for the world we’ve begun:

Your reputation used to depend on
What you concealed
Now it depends on what you reveal

The age of secretive mandarins who creep on heels of tact
Is dead: we are all players now in the great game of fact instead
So since you can’t keep your cards to your chest
I’d suggest you think a few moves ahead
As one does when playing a game of chess

Axiom 2 to make the world new:

Paranoia’s simply a word for seeing things as they are
Act as you wish to be seen to act
Or leave for some other star

Somebody is prying through your files, probably
Somebody’s hand is in your tin of Netscape magic cookies
But relax:
If you’re an interesting person
Morally good in your acts
You have nothing to fear from facts

Axiom 3 for transparency:

In the age of information the only way to hide facts
Is with interpretations
There is no way to stop the free exchange
Of idle speculations

In the days before communication
Privacy meant staying at home
Sitting in the dark with the curtains shut
Unsure whether to answer the phone
But these are different times, now the bottom line
Is that everyone should prepare to be known
Most of your friends will still like you fine”
– Momus: The Age of Information (from the 1997 album Ping Pong)

Googling Paranoia by Philipp Lenssen

Memomarker is out of Beta, since it will now ignore scripts and styles of a page. Also, you can now directly submit text.

by Philipp Lenssen

Bookmarks for the Next Millennium

"While I had some fun with your Memomarker tool, and the underlying concept, why don't you put your manual (and philosophical) effort into ways of finding content long after a Google bookmark or a Google search engine result is gone?"
– Jim, May 22 2003

"Ten thousand years (...) is about as long as the history of human technology. We have fragments of pots that old. Geologically, it's a blink of an eye. When you start thinking about building something that lasts that long, the real problem is not decay and corrosion, or even the power source. The real problem is people. If something becomes unimportant to people, it gets scrapped for parts; if it becomes important, it turns into a symbol and must eventually be destroyed. The only way to survive over the long run is to be made of materials large and worthless, like Stonehenge and the Pyramids, or to become lost. The Dead Sea Scrolls managed to survive by remaining lost for a couple millennia. Now that they've been located and preserved in a museum, they're probably doomed. I give them two centuries — tops."
– Danny Hillis, The Millennium Clock, the 1990s

"As a means of recording and providing access to our cultural memory, digital technology has numerous advantages and may help relieve the traditional conflict between preservation and access. (...)

Digital technology, however, poses new threats and problems as well as new opportunities. Its functionality comes with complexity. Anyone with a compass (or a clear night to view the position of the stars in relation to true north) could theoretically set up or repair a sundial. A digital watch is more useful and accurate for telling time than a sundial, but few people can repair it or even understand how it works. Reading and understanding information in digital form requires equipment and software, which is changing constantly and may not be available within a decade of its introduction. Who today has a punched card reader, a Dectape drive, or a working copy of FORTRAN II?"
– The Commission on Preservation and Access and The Research Libraries Group, Inc., Preserving Digital Information, May 1, 1996

Epitaph: 'Document Not Found'

What will survive in the very long run? Google won't survive. Search engines as known today won't survive. But as long as there are ideas, the concept of finding them will survive — in whatever technological implementation.

It's hard to predict a way to preserve a Bookmark into the next Millennium.
A collection of memowords is not enough. This bookmark would have to be such a complete description of the idea, that it would qualify as a copy. And indeed, I found making an idea somewhat popular the easiest way to preserve it, since it will get copied — that's increasing its virtual "truck factor". No special search to find it; it's all around you anyway, because people keep talking about it, and if one is lost, there's still plenty. Just as in biological evolution, you won't survive; it's up to the offspring to carry on the flame of life.
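
The effect of copying on the "truck factor" can be put into a small worked example. The sketch below rests on a strong simplifying assumption that is not part of the argument above: every copy survives independently of the others with the same chance. The function name chance_any_copy_survives is hypothetical.

```python
def chance_any_copy_survives(survival_chance, copies):
    """Probability that at least one of `copies` independent copies survives."""
    if not 0.0 <= survival_chance <= 1.0:
        raise ValueError("survival_chance must be between 0 and 1")
    if copies < 0:
        raise ValueError("copies must not be negative")
    # All copies are lost with probability (1 - p) ** n; anything else
    # means at least one copy is still around.
    return 1.0 - (1.0 - survival_chance) ** copies
```

Under that assumption, a single copy with a 1 percent chance of surviving is almost certainly lost, while 100 such copies leave roughly a 63 percent chance that at least one is still around.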


"I want to build
a clock that ticks
once a year.

  The century hand
  advances once every
  100 years, and the
  cuckoo comes out
  on the millennium.

    I want the cuckoo
    to come out every
    millennium for the
    next 10,000 years.

      If I hurry, I should
      finish the clock
      in time to see the
      cuckoo come out
      for the first time."

– Danny Hillis, The Millennium Clock

Bookmarks for the Next Mi ... by Philipp Lenssen

Google Answers Researcher Interview: Justaskscott-ga

Today’s Google Answers Researcher in the spotlight is JustAskScott. Scott has answered 447 questions so far, from “Is Is .9995 the SAME as 99.95%” (yes) on June 22, 2002, to his most recent, “Name of company and address”.

Where are you from?
Originally: Long Island (pronounced Lawngylind by natives), NY
Currently: Buckeye Country (central Ohio)

What’s your profession?
Hmmm, good question. Google Answers Researcher, now and hopefully to some extent always. Previously and in the future, a lawyer. Also in the future, a librarian or other sort of information professional.

Justaskscott-ga on Google Answers

What kind of questions do you like to answer most?
Any question for which I can easily generate good search terms. Also, questions written by someone with a good sense of humor (badabing is a good example).

What were some of the most interesting discoveries you made during your research activity (please include the questions)?
I feel like I’ve absorbed so much information that it’s hard to pinpoint a few interesting discoveries. I suppose my favorite discovery was that I could translate a phrase without knowing the language! (It may not work all the time, but at least it’s a possibility.)

English to gaelic tranlation

Hypothetical Musings by Justaskscott-ga

You have one minute to convince a potential customer who doesn’t know about Google Answers to start using it. What do you say?
First of all, just browse the site. It can’t hurt. The questions are about almost every conceivable topic; there’s bound to be some that interest you. If you click on these questions, you’ll see, for the most part, good answers and comments.

There must be some questions that you’ve always wondered about — there might even be something you wondered about for the first time today. So go ahead, post a question for whatever you think it’s worth. You might get a great answer!

If an omniscient deity would be a Google Answers Researcher, which question would you ask?
This sounds like a sequel to Bruce Almighty! OK, let’s see ... how about: “Why is the world as it is?” I hope that the answer isn’t: “Because!”

Which famous person, dead or alive, would make a great addition to the Researcher team?
Ben Franklin would be great!

Time magazine features Google Answers; what can be seen on the cover?
Wouldn’t that be something? Hmmm ... I suppose screenshots of the opening lines of a few good answers.

Justaskscott-ga’s Favorites

What are your favorite research tools, on- and offline?
Wait a second, there’s an offline too? Since I don’t usually work in a library, I don’t have much access to offline research tools. Occasionally I consult the Oxford English Dictionary — I have the first edition, but it’s still helpful. As for online tools, Google is #1 — and I’m not just saying that because I’m a Google Answers Researcher. Ixquick is a nice metasearch engine.

What are some remarkable less known websites?
I don’t think that I visit any less known websites on a consistent basis. (I’ll probably think of one later.) If you happen not to know it, Arts & Letters Daily is quite remarkable. However, I spend too much time on Google Answers to visit it often.

What are some of your favorite Google Answers by fellow Researchers?
Here’s an awe-inspiring answer that I read today:

“The Music called “DUB” I want/need some on 12 inch Vinal. For clouseau-ga only!”

Every week, I see at least one amazing answer. (I don’t read most answers — there’s not enough time — so my belief is that there are amazing answers every day!) At those moments, Google Answers reaffirms my faith in humanity.

Justaskscott-ga’s Spare-time

Got any weird hobbies?
Is trying to predict the winners of the Triple Crown races each year weird? No? Then I suppose I don’t have any weird hobbies.

What are some of your favorite books, movies, and music albums?

Books: The Scarlet Letter (which I read many years ago, and will have to reread to make sure it was really as good as I remember); Foucault’s Pendulum; Truman; and First In His Class: A Biography Of Bill Clinton
Movies: Citizen Kane; Schindler’s List; Trois Couleurs: Rouge; and Testament.
Music: Liz Phair’s Exile in Guyville (I’m a guy, yet I think it’s the only perfect album I’ve ever heard); The Beatles’ entire catalog (if I had to choose, I’d say Abbey Road); Värttinä’s Oi Dai or Seleniko (the Dixie Chicks of Finland); and My Bloody Valentine’s Loveless (where have you gone, Kevin Shields?).

Final words by Justaskscott-ga

What would be the title of your autobiography?
“Just Ask Scott”. No, just kidding. I think it’s too early for an autobiography; I feel that I’m coming close to finding my purpose in life, and then I’ll have a better sense of what to call it.

Anything else you might want to say?
Howard Dean for President!

Google Answers Researcher ... by Philipp Lenssen

