
| Before | After |
|---|---|
| User-agent: *<br>Disallow: /cgi-bin<br>Disallow: /search<br>Disallow: /query.html<br>Disallow: /omb/search<br>Disallow: /omb/query.html<br>Disallow: /expectmore/search<br>Disallow: /expectmore/query.html<br>Disallow: /results/search<br>Disallow: /results/query.html<br>Disallow: /earmarks/search<br>Disallow: /earmarks/query.html<br>Disallow: /help<br>Disallow: /360pics/text<br>Disallow: /911/911day/text<br>Disallow: /911/heroes/text<br>Disallow: /911/messages/text<br>Disallow: /911/patriotism/text<br>Disallow: /911/patriotism2/text<br>Disallow: /911/progress/text<br>Disallow: /911/remembrance/text<br>Disallow: /911/response/text<br>Disallow: /911/sept112002/text<br>Disallow: /911/text<br>Disallow: /ConferenceAmericas/text<br>Disallow: /GOVERNMENT/text<br>Disallow: /QA-test/text<br>Disallow: /aci/text<br>Disallow: /afac/text<br>Disallow: /africanamerican/text<br>Disallow: /africanamericanhistory/text<br>Disallow: /agencycontact/text<br>Disallow: /americancompetitiveness/text<br>Disallow: /apec/2003/text<br>Disallow: /apec/2004-summit/text<br>Disallow: /apec/2004/text<br>Disallow: /apec/2005/text<br>Disallow: /apec/2006/photoessay/text<br>Disallow: /apec/2006/text<br>Disallow: /apec/2007/photoessays/2/text<br>Disallow: /apec/2007/photoessays/text<br>Disallow: /apec/2007/text<br>Disallow: /apec/2008/photos/text<br>Disallow: /apec/2008/text<br>Disallow: /apec/text<br>Disallow: /appointments/text<br>... continues for over 2000 more lines ... | User-agent: *<br>Disallow: /includes/ |
PS: WhiteHouse.gov validates as XHTML. [Update: Not anymore, now it shows 1 error. Thanks Veky!]
[Via Andy.]
Update: Kevin Fox (ex-Google employee and now at FriendFeed) says:
This is a bit silly. The old robots.txt excludes internal search result pages and redundant text versions of html pages. This is exactly what robots.txt is for. Google's Webmaster Guidelines state "Use robots.txt to prevent crawling of search results pages or other auto-generated pages that don't add much value for users coming from search engines."
It's understandable that the robots.txt of an 8-year-old site is longer than that of a 1-day-old site, and it's not as if '/secrets/top' or '/katrina/response/' were put in the robots file.
Fun as it may be, this is a non-story.
(Consider this – the search results at WhiteHouse.gov are now indexable, which goes against the Google Webmaster Guidelines...)
[Thanks Kevin!]
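To make Kevin's point concrete, here is a minimal sketch that checks a few paths against both files using Python's standard `urllib.robotparser`. The `OLD_RULES` string is only a short excerpt of the old file shown above, not the full list, and the `can_crawl` helper and the `ExampleBot` agent name are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Short excerpt of the old file (the real one ran to over 2000 lines).
OLD_RULES = """\
User-agent: *
Disallow: /cgi-bin
Disallow: /search
Disallow: /query.html
Disallow: /help
Disallow: /911/text
"""

# The new file, as quoted in the table above.
NEW_RULES = """\
User-agent: *
Disallow: /includes/
"""


def can_crawl(rules: str, path: str, agent: str = "ExampleBot") -> bool:
    """Return True if `agent` may fetch `path` under the given robots.txt text.

    Hypothetical helper: it parses the rules from a string rather than
    fetching them from a live site.
    """
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    for path in ("/search", "/911/text", "/includes/"):
        old = "allowed" if can_crawl(OLD_RULES, path) else "blocked"
        new = "allowed" if can_crawl(NEW_RULES, path) else "blocked"
        print(f"{path}: old file = {old}, new file = {new}")
```

Under these assumptions, `/search` is blocked by the old rules and allowed by the new ones, which is the indexable-search-results situation mentioned above. Python's parser does simple prefix matching, so real crawlers may interpret edge cases differently.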