Google Blogoscoped

Forum

Google URLs break the web

Brian M. [PersonRank 10]

Tuesday, May 22, 2007

[0] http://blogoscoped.com/forum/14069.html
[1] http://www.w3.org/Consortium/Member/List
[2] http://www.w3.org/Provider/Style/URI
[3] http://en.wikipedia.org/wiki/Uniform_Resource_Identifier#Relationship_to_URL_and_URN

Number 10 on my list of annoying things about Google was Google Print URLs [0]. Google's dynamic URLs are the nastiest thing most surfers encounter on a given day on the Internet. If you think you are going to right-click on that search result and paste the URL into another application, you are mistaken. That URL is not the simple address you expect. It's a behemoth carrying dynamic session code intended both to track your every move and, in the case of Google Books, to help Google do everything they can to stop you from seeing too much information.

I'm not usually one to rant on such a thing. I think this kind of information collection is mostly harmless. Click data allows Google to objectively measure their search performance and tune their algorithms at a minimal cost to your privacy. In return for this data, Google provides personalized search services for every user who wants them.

But Google Books is the worst. People come to Google Books with the mindset that it is something like a library. Libraries are well organized, and the exact location of every book is well known. You can use an online catalog or a card catalog and be given directions to the same place. This is why it is so frustrating that old Google Books URLs no longer work. It's not fair to say that Google doesn't know about the old ones and thus can't write a few regular expressions to redirect them. They can just look at their logs and see thousands of failed access attempts.
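To show what I mean by "a few regular expressions", here is a minimal sketch in Python. Both the old and the new URL formats below are made up for illustration (they are not Google's real ones), and `REDIRECT_RULES` and `redirect_target` are hypothetical names. The idea is only that each retired format gets one pattern and one template for where it should point now.

```python
import re

# Hypothetical rules: each pairs an old URL pattern with a new path template.
# Neither format is Google's; both are invented for this example.
REDIRECT_RULES = [
    (re.compile(r"^/print\?id=(?P<book>[\w-]+)&pg=(?P<page>\d+)$"),
     r"/books/\g<book>/page/\g<page>"),
    (re.compile(r"^/print\?id=(?P<book>[\w-]+)$"),
     r"/books/\g<book>"),
]


def redirect_target(old_path):
    """Return the new path for an old URL, or None if no rule matches."""
    for pattern, template in REDIRECT_RULES:
        match = pattern.match(old_path)
        if match:
            # Fill the captured book and page values into the new template.
            return match.expand(template)
    return None


print(redirect_target("/print?id=abc123&pg=42"))
```

A server would answer a matching request with a permanent redirect (HTTP 301) to the returned path, and anything that returns `None` would fall through to the normal 404 handling.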

When Google Books first came out, I encouraged Wikipedians to implant references to it all throughout the wiki. I even created a slick JavaScript form that created the wiki syntax for you automatically. This was a mistake. Not only is every single one of these URLs now broken, but there is not enough information contained in the URL to uniquely identify the book, page and line in cases where that wasn't manually added as plain text. I can't write an algorithm to go fix them all, and a human being probably couldn't do it either.

You might think, hey, stop your whining. Google is a publicly traded company that can and should do everything it can to stop your prying eyes from reading that extra page on Book search. But did you know that Google is a member of the W3C [1]? What does the W3C think about this nefarious practice? Nothing kind. Here are a couple of excerpts from their document explaining why "Cool URIs don't change" [2]:

"In theory, the domain name space owner owns the domain name space and therefore all URIs in it. Except insolvency, nothing prevents the domain name owner from keeping the name. And in theory the URI space under your domain name is totally under your control, so you can make it as stable as you like. Pretty much the only good reason for a document to disappear from the Web is that the company which owned the domain name went out of business or can no longer afford to keep the server running."

"Do you really feel that the old URIs cannot be kept running? If so, you chose them very badly. Think of your new ones so that you will be able to keep them running after the next redesign."

Dynamic URLs are certainly more prevalent than when this doc was written, but that doesn't change the guidelines. There have been cases where entire services have been removed, and every URL thus returns a 404. This breaks the Internet for future surfers and is a shortsighted move. I would recommend that Google and other providers appoint someone, or even a team, to scour their logs looking for failed access attempts. This is a one-time cost that is a precursor to then implementing a new policy: If you change a URL format, don't be lazy, redirect all the old ones to point to useful information.
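Scouring the logs does not have to be a big project. As a rough sketch, assuming the server writes access logs in the Common Log Format (I have no idea what Google's logs actually look like), something like this counts which missing URLs people keep asking for. The function name `failed_paths` and the sample log lines are made up for the example.

```python
import re
from collections import Counter

# Picks the request path and status code out of a Common Log Format line.
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" (?P<status>\d{3})(?:\s|$)'
)


def failed_paths(log_lines, top=10):
    """Count the paths that returned 404, most requested first."""
    counts = Counter()
    for line in log_lines:
        match = LOG_LINE.search(line)
        if match and match.group("status") == "404":
            counts[match.group("path")] += 1
    return counts.most_common(top)


# Invented sample lines, only to show the expected input shape.
sample = [
    '1.2.3.4 - - [22/May/2007:10:00:00 +0000] "GET /old/book?id=1 HTTP/1.1" 404 512',
    '1.2.3.5 - - [22/May/2007:10:00:01 +0000] "GET /index.html HTTP/1.1" 200 1024',
    '1.2.3.6 - - [22/May/2007:10:00:02 +0000] "GET /old/book?id=1 HTTP/1.1" 404 512',
    '1.2.3.7 - - [22/May/2007:10:00:03 +0000] "GET /gone HTTP/1.1" 404 512',
]

for path, hits in failed_paths(sample):
    print(hits, path)
```

The paths at the top of that list are the old URL formats most worth writing a redirect for.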

(Note: If you are confused about URI/URL, see [3])

Philipp Lenssen [PersonRank 10]

17 years ago #

But if you click on the "About this book" link below the result snippet, you're taken to a (relatively) stable "permalink" page, right?

Brian M. [PersonRank 10]

17 years ago #

Philipp, so how do you explain the Google Reader Shared Items URLs? This is several pages in, and neither the pages nor the posts are even numbered.
http://www.google.com/reader/shared/14187970455121264404?c=CKGUiIOhpYwC

Colin Colehour [PersonRank 10]

17 years ago #

That's a lot of complaining just about Google Books. I think it's important for them to be able to lock down the content of some books. It follows their fair use practice when they are able to show you only snippets of a few pages of a particular book. If you were able to just alter the URL to view pages 5 – 8 and then alter the URL again to view pages 9 – 12, then the service would most likely be stopped because every major book publisher would be suing Google Books out of existence.

Brian M. [PersonRank 10]

17 years ago #

What about YouTube URLs? How do you explain these?

http://www.youtube.com/watch?v=hGiIkceewRA

hGiIkceewRA? That's supposed to be web-friendly? How long is that URL going to last?
