Google Blogoscoped

Saturday, February 24, 2007

How Google Finds Out About Some Deep Websites

Brian Mingus emails this:

Some webmasters seem to be convinced that Google spies on them with the Google Toolbar, tries random directories in an effort to dig content out of the deep web, and uses other such tactics. I recently tried to keep a web server private to a small group of people, without any authentication, and quickly realized one way that Google finds such sites. Perhaps you knew about it, but I think it’s non-obvious.

The problem is with HTTP referrers and the millions of people who publish their web server logs. Google these phrases to see such logs:

“Generated by Webalizer”
“Created by awstats”

If you link to someone’s website, they can search their logs and find out, because a visitor’s browser sends the address of your page as the referrer when it follows the link. You might not care, as long as they don’t link to you. But in my case they inadvertently linked to me through their published web server logs, and Google then came along and spidered those logs. Game over: I’ve been found out!
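To make the leak concrete, here is a minimal sketch of what a log-statistics tool roughly does. It assumes the server writes Apache's "combined" log format, and the host names, sample log lines, and the function names `top_referrers` and `render_report` are all made up for illustration; this is not how Webalizer or AWStats are actually implemented.

```python
import html
import re
from collections import Counter

# Apache "combined" log format:
# host ident user [time] "request" status size "referrer" "user-agent"
COMBINED = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"$'
)

# Hypothetical log lines on the *linked-to* site. Two visitors arrived
# by clicking a link on the supposedly private wiki.
SAMPLE_LOG = [
    '203.0.113.7 - - [24/Feb/2007:10:15:32 +0000] "GET / HTTP/1.1" 200 5120 '
    '"http://private-wiki.example/Links" "Mozilla/5.0"',
    '203.0.113.9 - - [24/Feb/2007:10:20:01 +0000] "GET /about HTTP/1.1" 200 2048 '
    '"-" "Mozilla/5.0"',
    '203.0.113.7 - - [24/Feb/2007:10:21:44 +0000] "GET /faq HTTP/1.1" 200 1024 '
    '"http://private-wiki.example/Links" "Mozilla/5.0"',
]


def top_referrers(lines):
    """Count how often each referrer URL appears in the log lines."""
    counts = Counter()
    for line in lines:
        match = COMBINED.match(line.strip())
        # A referrer of "-" means the browser sent none.
        if match and match.group("referrer") not in ("", "-"):
            counts[match.group("referrer")] += 1
    return counts


def render_report(counts):
    """Render the referrer table as HTML with clickable links."""
    rows = [
        '<li><a href="{0}">{0}</a> ({1})</li>'.format(html.escape(url), n)
        for url, n in counts.most_common()
    ]
    return "<ul>\n" + "\n".join(rows) + "\n</ul>"


if __name__ == "__main__":
    print(render_report(top_referrers(SAMPLE_LOG)))
```

If a report like this is published on the open web, the private wiki's address becomes an ordinary hyperlink that a crawler can follow.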

So if you want to do what I am doing (this is a wiki), you have to instruct people to insert plain-text URLs only[*], because the instant someone clicks on a link on your site, there is a chance you have exposed yourself.

*Or you can password-protect the site, of course.
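As a rough illustration of the plain-text advice, the sketch below rewrites HTML anchors into bare URLs so that readers have to copy and paste the address, which sends no referrer. The function name `delink` is hypothetical, the regular expression only handles simple double-quoted `href` attributes, and it assumes the wiki software does not turn bare URLs back into links automatically.

```python
import re

# Matches a simple anchor such as <a href="http://example.org/">text</a>.
# Only double-quoted href values are handled in this sketch.
ANCHOR = re.compile(
    r'<a\s[^>]*?href="([^"]+)"[^>]*>.*?</a>',
    re.IGNORECASE | re.DOTALL,
)


def delink(page_html):
    """Replace each anchor element with its bare target URL."""
    return ANCHOR.sub(lambda match: match.group(1), page_html)


if __name__ == "__main__":
    print(delink('See <a href="http://example.org/">this site</a> for details.'))
```

The link text is discarded on purpose here; a gentler variant could keep it and append the URL in parentheses.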



