Google Blogoscoped

Wednesday, January 21, 2009

WhiteHouse.gov’s New Robots.txt

Now that Barack Obama is US president, the White House website has been overhauled. And as Jason Kottke noticed, so has the whitehouse.gov robots.txt file, which tells search engine crawlers like the Googlebot which content is OK to crawl. According to Jason, here's the before and after:

Before:

User-agent: *
Disallow: /cgi-bin
Disallow: /search
Disallow: /query.html
Disallow: /omb/search
Disallow: /omb/query.html
Disallow: /expectmore/search
Disallow: /expectmore/query.html
Disallow: /results/search
Disallow: /results/query.html
Disallow: /earmarks/search
Disallow: /earmarks/query.html
Disallow: /help
Disallow: /360pics/text
Disallow: /911/911day/text
Disallow: /911/heroes/text
Disallow: /911/messages/text
Disallow: /911/patriotism/text
Disallow: /911/patriotism2/text
Disallow: /911/progress/text
Disallow: /911/remembrance/text
Disallow: /911/response/text
Disallow: /911/sept112002/text
Disallow: /911/text
Disallow: /ConferenceAmericas/text
Disallow: /GOVERNMENT/text
Disallow: /QA-test/text
Disallow: /aci/text
Disallow: /afac/text
Disallow: /africanamerican/text
Disallow: /africanamericanhistory/text
Disallow: /agencycontact/text
Disallow: /americancompetitiveness/text
Disallow: /apec/2003/text
Disallow: /apec/2004-summit/text
Disallow: /apec/2004/text
Disallow: /apec/2005/text
Disallow: /apec/2006/photoessay/text
Disallow: /apec/2006/text
Disallow: /apec/2007/photoessays/2/text
Disallow: /apec/2007/photoessays/text
Disallow: /apec/2007/text
Disallow: /apec/2008/photos/text
Disallow: /apec/2008/text
Disallow: /apec/text
Disallow: /appointments/text
... continues for over 2000 more lines ...

After:

User-agent: *
Disallow: /includes/
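
To see what the two files mean for a crawler in practice, here is a minimal sketch using Python's standard urllib.robotparser module. It only loads a handful of the old rules quoted above (not the full 2000+ lines), and the helper name may_crawl and the example paths are made up for illustration; they are not taken from the actual site.

```python
from urllib.robotparser import RobotFileParser

# A few rules from the old file as quoted above (the real file was far longer).
OLD_RULES = """\
User-agent: *
Disallow: /cgi-bin
Disallow: /search
Disallow: /omb/search
Disallow: /911/text
"""

# The complete new file as quoted above.
NEW_RULES = """\
User-agent: *
Disallow: /includes/
"""


def may_crawl(rules: str, path: str, agent: str = "Googlebot") -> bool:
    """Return True if `agent` may fetch `path` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, "http://www.whitehouse.gov" + path)


# Hypothetical paths, chosen only to exercise the rules above.
for path in ("/search", "/911/text/index.html", "/includes/header.html"):
    print(path, "old:", may_crawl(OLD_RULES, path), "new:", may_crawl(NEW_RULES, path))
```

Under these assumptions, the search path is blocked by the old rules and open to crawlers under the new ones, while anything under /includes/ is only blocked by the new file.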

PS: WhiteHouse.gov validates as XHTML. [Update: Not anymore, now it shows 1 error. Thanks Veky!]

[Via Andy.]

Update: Kevin Fox (ex-Google employee, now at FriendFeed) says:

This is a bit silly. The old robots.txt excludes internal search result pages and redundant text versions of html pages. This is exactly what robots.txt is for. Google's Webmaster Guidelines state "Use robots.txt to prevent crawling of search results pages or other auto-generated pages that don't add much value for users coming from search engines."

It's understandable that the robots.txt of an 8-year-old site is longer than that of a 1-day-old site, and it's not as if '/secrets/top' or '/katrina/response/' were put in the robots file.

Fun as it may be, this is a non-story.

(Consider this – the search results at WhiteHouse.gov are now indexable, which goes against the Google Webmaster Guidelines...)

[Thanks Kevin!]

