Saturday, July 12, 2003

Why URLs Don’t Die

URLs should never have been part of the web user interface, one might think:

“It is likely that domain names only have 3-5 years left as a major way of finding sites on the Web. In the long term, it is not appropriate to require unique words to identify every single entity in the world. That’s not how human language works.”
– Jakob Nielsen, URL as UI (Alertbox), March 21, 1999

However, URLs don’t die out. In fact, they have become incredibly helpful. When I scan a Google search result page for relevant pages, I always scan the URLs too. When I browse a site, I also glance at the URL. URLs give you a sense of location, time, and meaning.

What’s in a URL?

What’s in a URL? Or rather, what useful information about the target page can be in a URL?

When you create a web site, how can you make use of this knowledge? You can structure your folders and choose well-considered names. If done right, with all these attributes a user can instantly sense whether or not she has been on the page before; whether she considers it trustworthy; how she can jump one level up in the hierarchy; what the file type is; and, at times, whether the page is static or dynamic. Your site will be more useful.
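To make this a bit more concrete, here is a minimal sketch in Python of the hints a reader can pull out of a bare address. The function name `describe_url` and the example.com addresses are my own, made up for illustration, and treating a query string as a sign of a dynamic page is only a rule of thumb, not a reliable test.

```python
import posixpath
from urllib.parse import urlsplit


def describe_url(url):
    """Collect the hints a reader can pick up from a URL alone."""
    parts = urlsplit(url)
    path = parts.path or "/"

    # File type: the extension of the last path segment, if there is one.
    file_type = posixpath.splitext(path)[1].lstrip(".").lower()

    # One level up: drop the last path segment and keep a trailing slash.
    parent_path = posixpath.dirname(path.rstrip("/"))
    if not parent_path.endswith("/"):
        parent_path += "/"

    return {
        "host": parts.netloc,
        # Folder names often carry location, time and meaning.
        "folders": [segment for segment in path.split("/")[:-1] if segment],
        "file_type": file_type,
        "one_level_up": f"{parts.scheme}://{parts.netloc}{parent_path}",
        # Assumption: a query string merely suggests a generated page.
        "looks_dynamic": bool(parts.query),
    }


if __name__ == "__main__":
    # Hypothetical address, used only to show the output shape.
    for key, value in describe_url("http://www.example.com/archive/2003/07/urls.html").items():
        print(f"{key}: {value}")
```

A folder structure like `/archive/2003/07/` already tells the reader when the page was written and where to go to find its neighbors, before the page itself has even loaded.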

So a bare-naked web address is very useful. Sometimes, more useful than the meta-data contained within the page. URLs are tech-lingo that people got used to; tech-lingo that works. They’re more than a street-sign. To the frequent web browser they are the shape of the house, the noise of the environment, weather, fragrance; the face of a person, the colors of his clothes from a distance. Nothing to make a final judgment, but more than enough to make an educated guess about what to expect. An exposed web address gives us an instant feeling of where we are and whether or not we want to move on to somewhere else.

A side-note: URL or URI?

“The key advantage of a web page Uniform Resource Locator (URL) is its universality – the address is the same no matter where in the world it is used. This is why Tim Berners-Lee proposed (...) that it be called a Universal Resource Identifier (URI) to suggest his vision of a network where anything could be linked to anything. However, he experienced some philosophical resistance to the ideas of universality from the IETF team working on the web standards, and so the address became named the now familiar Uniform Resource Locator (URL).

The word URL can be pronounced either ‘U-R-L’ or ‘earl’.”
– Web Addresses

More on this can be found in Tim Berners-Lee’s Weaving the Web, which details the invention of the World Wide Web and its standards. One of Tim Berners-Lee’s few regrets about URLs: “If I had the chance to revisit [the web protocol] (...) I think I would have found a way to do without the double-slash!”

Google Headquarters Moves

“Fast-growing Internet search firm Google is moving its Mountain View headquarters about a half mile away to a four-building complex being vacated by struggling Silicon Graphics.”
– Steve Johnson, Google moving in Mtn. View (Mercury News), Jul. 12, 2003

Google Finds Most Relevant German Pages, AllTheWeb Second

There have been some blog posts recently saying that AllTheWeb covers more pages of non-English content than the Google Inc. machines. I was curious to find out if that’s the case, or whether AllTheWeb just covers more irrelevant pages.

The tests undertaken previously were restricted to common language-neutral words with a language filter applied (Microdoc Info), or to common language-specific words with no language filter applied (German Abakus SEO Blog).

Google vs AllTheWeb Round II

[The Battle]

I am more interested in less common, language-specific words (with no language filter applied), and tried the following:

“Gegoogelt” (meaning “googled” in German, as in “Ich wurde gegoogelt”, “I was googled”):

Google finds the most. AllTheWeb’s second, and AltaVista third.

Now for a phrase I used in the German translation for this blog, Google Großaufnahme, talking about the Google Query Language, the “Google-Abfragesprache”:

Google finds my page, as the only result. AllTheWeb finds another page... or does it? Checking that other page, it becomes clear it doesn’t contain the phrase! AllTheWeb ignored my quotes and returned a page containing both terms, separated.
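The difference between the two behaviors can be sketched in a few lines of Python. This is a toy illustration only, not how Google or AllTheWeb actually work; the function names and the two sample page texts are made up for the example.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def matches_all_terms(page_text, query):
    """True if every query term occurs somewhere on the page, in any order."""
    page_tokens = set(tokenize(page_text))
    return all(term in page_tokens for term in tokenize(query))


def matches_phrase(page_text, query):
    """True only if the query terms occur next to each other, in order."""
    page_tokens = tokenize(page_text)
    terms = tokenize(query)
    n = len(terms)
    if n == 0:
        return False
    # Slide a window of n tokens over the page and compare it to the query.
    return any(page_tokens[i:i + n] == terms for i in range(len(page_tokens) - n + 1))


if __name__ == "__main__":
    # Hypothetical page texts, invented for this sketch.
    phrase_page = "Die Google-Abfragesprache im Überblick"
    separated_page = "Die Abfragesprache von Google ist einfach"
    query = "Google-Abfragesprache"

    for page in (phrase_page, separated_page):
        print(page, "| all terms:", matches_all_terms(page, query),
              "| phrase:", matches_phrase(page, query))
```

A page that only passes the all-terms check is the kind of hit AllTheWeb returned here: both words are present, but not as the quoted phrase.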

Next in line is the unquoted keyword combination quantenphysik erklärung (German, meaning: “quantum physics explanation”):

Let’s take a word that would fit better on typical product advertisement pages (quantum physics and “to google” are hardly found on those). I chose Marlboro cigarettes, in German: “marlboro zigaretten” (no quotes used in query):

Again, AllTheWeb comes in as second. It does seem though as if it’s getting closer. As is AltaVista when it comes to these “advertisement” keywords.

And finally, let’s repeat the test with one of those common words, like “Das Haus” (meaning “the house” in German). The quoted query should find a lot of German pages:

Now it looks as if AllTheWeb’s the heavy-weight champion (keep in mind it doesn’t respect quotes to force phrase-only hits). And even AltaVista knocks out Google for this very common phrase. But so far (play the theme of Rocky) I must conclude Google has the most relevant pages in its index.
Google often filters out pages for abuse; some of these anti-spam techniques might not be implemented in other search engines*. Google has got the right focus, and not only in English.

*E.g. the German search engine Fireball often returns definite spam pages (like those containing white keywords on white background, in other words, hidden spam).


