Google Blogoscoped

Forum

WhiteHouse.gov's New Robots.txt  (View post)

Kevin Fox [PersonRank 4]

Wednesday, January 21, 2009
11 years ago · 5,298 views

This is a bit silly. The old robots.txt excludes internal search result pages and redundant text versions of html pages. This is exactly what robots.txt is for. Google's Webmaster Guidelines state "Use robots.txt to prevent crawling of search results pages or other auto-generated pages that don't add much value for users coming from search engines."
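For illustration, a minimal robots.txt of the kind those guidelines describe might look like the following (the paths here are made up for the example, not the actual old whitehouse.gov entries):

```
User-agent: *
Disallow: /search/
Disallow: /somepage/text
```

Each `Disallow` line blocks any URL whose path begins with that prefix, which is why it works well for fencing off whole directories of search results or duplicate text renderings.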

It's understandable that the robots.txt of an 8-year-old site is longer than that of a 1-day-old site, and it's not as if '/secrets/top' or '/katrina/response/' were put in the robots file.

Fun as it may be, this is a non-story.

Alex Ksikes [PersonRank 10]

11 years ago #

Interesting! I wonder if this is a sign... that is, Google having tighter links with the government now.

Mauricio Pastrana [PersonRank 0]

11 years ago #

I'd say. Perhaps they just shoved all those files under /includes/hide_this/

Philipp Lenssen [PersonRank 10]

11 years ago #

(Thanks Kevin, I added your comment as an update.)

Veky [PersonRank 10]

11 years ago #

Of course not... if Google had tighter links with the government, they could bypass robots.txt all they wished. :-)

Besides, validator now reports one error. ;-P

Philipp Lenssen [PersonRank 10]

11 years ago #

(Added an inline update in regards to that, thanks Veky!)

Wesley de Souza [PersonRank 1]

11 years ago #

That's *NOT* silly, as the whole website has changed. I tested some of the blocked URLs and they don't exist anymore, so nothing wrong there.

Kevin Fox [PersonRank 4]

11 years ago #

Sorry, to clarify, I meant the number of twitters and blog posts about the old huge robots.txt file was silly. It makes total sense that they start with a fresh file after completely remaking the site.

Wesley de Souza [PersonRank 1]

11 years ago #

I really got it wrong, sorry. =)

But I guess we can agree on something: it's good to know they changed to a more natural, centralized structure, so fewer things need to go on that disallow list.

Mirus [PersonRank 0]

11 years ago #

About that added comment... These don't look like search result pages to me...

Disallow: /afac/text
Disallow: /africanamerican/text
Disallow: /africanamericanhistory/text
Disallow: /agencycontact/text
Disallow: /911/response/text
Disallow: /911/sept112002/text
Disallow: /911/text

Thus I think the implied point made in posts about this is valid (that more things are open). Although it is interesting that searches are indexable on the new site :)

George R [PersonRank 10]

11 years ago #

Whitehouse.gov seems to have eliminated its archives and permalinks.
I imagine that this produced broken links on many pages throughout the web. Surprisingly, a search of [link:whitehouse.gov] indicates only 21,100 pages.

Was it the Bush administration or the Obama administration that removed or moved these files? The Google cache still has some of them. Does the National Archives or Bush library have the rights to these? Shouldn't they be in the public domain?

The Google cache does not seem to have pages from the Clinton administration at whitehouse.gov, so there may be a precedent for this action.

Perhaps the new administration intends to do this properly and as an expedient chose a quick and dirty change. Hopefully someone will have an opportunity to properly integrate the new and old pages.

David Mulder [PersonRank 10]

11 years ago #

Small note, it now once again passes the validator :P

Tõnu Samuel [PersonRank 0]

11 years ago #

Just wanted to share my find: /911 was added there exactly one year after the actual events. You can check this on archive.org.

Kevin Fox [PersonRank 4]

11 years ago #

Mirus: As I mentioned, the directories that ended in /text contained text versions of articles that appeared elsewhere in HTML. Hence, in keeping with Google's webmaster guidelines, it's appropriate to disallow the /text directories, because they represent redundant content presented in a way inferior to the HTML versions.

Ionut Alex. Chitu [PersonRank 10]

11 years ago #

I think that Google should solve the duplication issue on its own. If my site offers multiple versions of each page (text, HTML, PDF, DOC), the HTML version will probably be the most popular (most linked to), and Google should figure out that it's the preferred version (an improvement would be to show next to the snippet that there are alternative versions: TXT, PDF).

The same goes for search results: Google should automatically detect search results pages and lower their importance.

Brandon [PersonRank 0]

11 years ago #

Yes, it would be nice if Google solved such issues on its own. It would lead to a lot less extra work for webmasters ... but alas, that hasn't completely happened yet.

Powder Lover [PersonRank 0]

11 years ago #

Agree with Ionut Alex. It's always pissed me off that Google tries to tell me how I can name pages, what format they can be in, even to some extent what content can be on them and which HTML tags I can use.

Quite a bit of what they "suggest" is actually counter to good UI and good readability.

I mean, WTF? It's not my job to make things easy for Google.

Philipp Lenssen [PersonRank 10]

11 years ago #

Ionut says
> I think that Google should solve the duplication issue on its own.

Powder Lover says
> It's always pissed me off that Google tries to tell me how I
> can name pages, what format they can be in, even to some
> extent what content can be on them and which HTML tags I can use.

Hmm. Google has a duplicate checker already... it will omit certain similar results. Is there a specific example of where having the same file in both HTML and say PDF format would hurt your rankings?

Robin [PersonRank 0]

11 years ago #

The BBC has picked this up and got it completely and utterly wrong:
jamiedigi.com/2009/01/bbc-gets ...

reg4c [PersonRank 0]

11 years ago #

Validates again as Transitional; validated as Strict, it gives 5 errors and 1 warning.

Philipp Lenssen [PersonRank 10]

11 years ago #

There you go, search results are now excluded in a once more updated whitehouse.gov robots.txt:

User-agent: *
Disallow: /includes/
Disallow: /search/
Disallow: /omb/search/
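As a minimal sketch of how a well-behaved crawler would apply these rules, here is Python's standard-library robots.txt parser fed with the lines quoted above (the example URLs are illustrative):

```python
# Check the new whitehouse.gov robots.txt rules with Python's
# standard-library parser (urllib.robotparser).
from urllib.robotparser import RobotFileParser

rules = """User-agent: *
Disallow: /includes/
Disallow: /search/
Disallow: /omb/search/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Search result pages are now blocked for every crawler...
print(parser.can_fetch("*", "http://www.whitehouse.gov/search/"))  # False
# ...while ordinary content pages remain crawlable.
print(parser.can_fetch("*", "http://www.whitehouse.gov/blog/"))    # True
```

Each `Disallow` entry is a simple path prefix, so `/search/` catches every internal search result URL while leaving the rest of the site open.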

This thread is locked as it's old... but you can create a new thread in the forum. 
