Google Blogoscoped

Forum

WhiteHouse.gov's New Robots.txt  (View post)

Kevin Fox [PersonRank 4]

Wednesday, January 21, 2009
11 years ago · 5,298 views

This is a bit silly. The old robots.txt excludes internal search result pages and redundant text versions of html pages. This is exactly what robots.txt is for. Google's Webmaster Guidelines state "Use robots.txt to prevent crawling of search results pages or other auto-generated pages that don't add much value for users coming from search engines."
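For illustration, a minimal robots.txt of the kind those guidelines describe might look like the following (the paths here are made up for the example, not the actual old whitehouse.gov entries):

```
User-agent: *
Disallow: /search/
Disallow: /somepage/text
```

Each `Disallow` line blocks any URL whose path begins with that prefix, which is why it works well for fencing off whole directories of search results or duplicate text renderings.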

It's understandable that the robots.txt of an 8-year-old site is longer than that of a 1-day-old site, and it's not as if '/secrets/top' or '/katrina/response/' were put in the robots file.

Fun as it may be, this is a non-story.

Alex Ksikes [PersonRank 10]

11 years ago #

Interesting! I wonder if this is a sign... that is, Google having tighter links with the government now.

Mauricio Pastrana [PersonRank 0]

11 years ago #

I'd say. Perhaps they just shoved all those files under /includes/hide_this/

Philipp Lenssen [PersonRank 10]

11 years ago #

(Thanks Kevin, I added your comment as an update.)

Veky [PersonRank 10]

11 years ago #

Of course not... if Google had tighter links with the government, they could bypass robots.txt all they wished. :-)

Besides, validator now reports one error. ;-P

Philipp Lenssen [PersonRank 10]

11 years ago #

(Added an inline update in regards to that, thanks Veky!)

Wesley de Souza [PersonRank 1]

11 years ago #

That's *NOT* silly, as the whole website has changed. I tested some of the blocked URLs and they don't exist anymore, so nothing wrong there.

Kevin Fox [PersonRank 4]

11 years ago #

Sorry, to clarify, I meant the number of twitters and blog posts about the old huge robots.txt file was silly. It makes total sense that they start with a fresh file after completely remaking the site.

Wesley de Souza [PersonRank 1]

11 years ago #

I really got it wrong, sorry. =)

But I guess we can agree on something: it's good to know they changed to a more natural, centralized structure, so fewer things need to go on that disallow list.

Mirus [PersonRank 0]

11 years ago #

About that added comment... These don't look like search result pages to me...

Disallow: /afac/text
Disallow: /africanamerican/text
Disallow: /africanamericanhistory/text
Disallow: /agencycontact/text
Disallow: /911/response/text
Disallow: /911/sept112002/text
Disallow: /911/text

Thus I think the implied point made in posts about this is valid (that more things are open). Although it is interesting that searches are indexable on the new site :)

George R [PersonRank 10]

11 years ago #

Whitehouse.gov seems to have eliminated its archives and permalinks.
I imagine that this produced broken links on many pages throughout the web. Surprisingly, a search of [link:whitehouse.gov] indicates only 21,100 pages.

Was it the Bush administration or the Obama administration that removed or moved these files? The Google cache still has some of them. Does the National Archives or Bush library have the rights to these? Shouldn't they be in the public domain?

The Google cache does not seem to have pages from the Clinton administration at whitehouse.gov, so there may be a precedent for this action.

Perhaps the new administration intends to do this properly and as an expedient chose a quick and dirty change. Hopefully someone will have an opportunity to properly integrate the new and old pages.

David Mulder [PersonRank 10]

11 years ago #

Small note, it now once again passes the validator :P

Tõnu Samuel [PersonRank 0]

11 years ago #

Just wanted to share my find: /911 was added there exactly one year after the actual events. You can check this on archive.org.

Kevin Fox [PersonRank 4]

11 years ago #

Mirus: As I mentioned, the directories that ended in /text contained text versions of articles that appeared elsewhere in HTML. Hence, in keeping with Google's webmaster guidelines, it's appropriate to disallow the /text directories, because they represent redundant content presented in a way inferior to the HTML versions.

Ionut Alex. Chitu [PersonRank 10]

11 years ago #

I think that Google should solve the duplication issue on its own. If my site offers multiple versions of each page (text, HTML, PDF, DOC), the HTML version will probably be the most popular (most linked to), and Google should figure out that it's the preferred version (an improvement would be to show next to the snippet that there are alternative versions: TXT, PDF).

The same goes for search results: Google should automatically detect search results pages and lower their importance.

Brandon [PersonRank 0]

11 years ago #

Yes, it would be nice if Google solved such issues on its own. It would lead to a lot less extra work for webmasters ... but alas, that hasn't completely happened yet.

Powder Lover [PersonRank 0]

11 years ago #

Agree with Ionut Alex. It's always pissed me off that Google tries to tell me how I can name pages, what format they can be in, even to some extent what content can be on them and which HTML tags I can use.

Quite a bit of what they "suggest" is actually counter to good UI and good readability.

I mean, WTF? It's not my job to make things easy for Google.

Philipp Lenssen [PersonRank 10]

11 years ago #

Ionut says
> I think that Google should solve the duplication issue on its own.

Powder Lover says
> It's always pissed me off that Google tries to tell me how I
> can name pages, what format they can be in, even to some
> extent what content can be on them and which HTML tags I can use.

Hmm. Google has a duplicate checker already... it will omit certain similar results. Is there a specific example of where having the same file in both HTML and say PDF format would hurt your rankings?

Robin [PersonRank 0]

11 years ago #

The BBC has picked this up and got it completely and utterly wrong:
jamiedigi.com/2009/01/bbc-gets ...

reg4c [PersonRank 0]

11 years ago #

Validates again as Transitional; validated as Strict, it gives 5 errors and 1 warning.

Philipp Lenssen [PersonRank 10]

11 years ago #

There you go, search results are now excluded in a once more updated whitehouse.gov robots.txt:

User-agent: *
Disallow: /includes/
Disallow: /search/
Disallow: /omb/search/
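As a minimal sketch of how a well-behaved crawler would apply these rules, here is Python's standard-library robots.txt parser fed with the lines quoted above (the example URLs are illustrative):

```python
# Check the new whitehouse.gov robots.txt rules with Python's
# standard-library parser (urllib.robotparser).
from urllib.robotparser import RobotFileParser

rules = """User-agent: *
Disallow: /includes/
Disallow: /search/
Disallow: /omb/search/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Search result pages are now blocked for every crawler...
print(parser.can_fetch("*", "http://www.whitehouse.gov/search/"))  # False
# ...while ordinary content pages remain crawlable.
print(parser.can_fetch("*", "http://www.whitehouse.gov/blog/"))    # True
```

Each `Disallow` entry is a simple path prefix, so `/search/` catches every internal search result URL while leaving the rest of the site open.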

This thread is locked as it's old... but you can create a new thread in the forum. 
