Google Blogoscoped


Greedy Google Ad Bot Debunked

Matt Cutts [PersonRank 10]

Monday, April 24, 2006
16 years ago, 3,820 views

Yup, you've got it exactly.

Elias KAI [PersonRank 10]

16 years ago

Yep, it is all about crawling.

Thanks for the recommendations.

dpneal [PersonRank 10]

16 years ago

I think this is a great way of doing it.

Tony Ruscoe [PersonRank 10]

16 years ago

(Just posted this on, but I'll put it here for discussion too...)

There's one thing I'm not 100% clear about:

> Also, note that robots.txt rules still apply to each crawl service appropriately. If service X was allowed to fetch a page, but a robots.txt file prevents service Y from fetching the page, service Y wouldn’t get the page from the caching proxy.

What if the robots.txt file prevents service X from fetching the page but not service Y? Would service X fetch the page and cache it anyway (whilst presumably not indexing it) just in case service Y needed it at a later date? Or would service Y crawl at a later date to cache any pages that service X was not allowed to crawl? (If so, how would service Y know which pages it needed, and would it just end up crawling the whole site again? Does the cache know which pages the bot that filled the cache wasn't allowed to index?)

For example, say this was my robots.txt:

# BOF #

User-agent: *
Disallow:

User-agent: Mediapartners-Google
Disallow: /not-for-adsense/

User-agent: Googlebot
Disallow: /not-for-googlebot/

# EOF #

Would the Mediapartners-Google bot fetch and cache any pages in the /not-for-adsense/ directory to save Googlebot from having to fetch them at a later date? Likewise, would Googlebot fetch any pages from the /not-for-googlebot/ directory so that the Mediapartners-Google bot could retrieve them from the cache and index them?

Could you confirm which approach Google's bots take and confirm whether adding exclusions to the robots.txt file prevents Google bots from crawling, caching or indexing?
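
For what it's worth, the per-bot rules in the example above can be checked locally. The following is only a sketch using Python's standard `urllib.robotparser`, and it says nothing about how Google's crawlers actually evaluate robots.txt or share a cache. The `allowed` helper and the sample paths are made up for illustration, and the empty `Disallow:` line under `User-agent: *` is an assumption added so that the catch-all group is well-formed.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt from the post. The empty "Disallow:" under "User-agent: *"
# is an assumption added here so the catch-all group is well-formed.
ROBOTS_TXT = """\
User-agent: *
Disallow:

User-agent: Mediapartners-Google
Disallow: /not-for-adsense/

User-agent: Googlebot
Disallow: /not-for-googlebot/
"""


def allowed(agent: str, path: str) -> bool:
    """Return True if `agent` may fetch `path` under ROBOTS_TXT (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    # Sample paths are illustrative only.
    for agent in ("Mediapartners-Google", "Googlebot"):
        for path in ("/not-for-adsense/a.html", "/not-for-googlebot/b.html"):
            print(f"{agent:22} {path:28} allowed={allowed(agent, path)}")
```

Under this parser each bot is blocked only from its own directory, which is what makes the question meaningful: every excluded page is still fetchable by the other bot.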

Ludwik Trammer [PersonRank 10]

16 years ago

Tony, think of it like a classic w3cache for normal users. Users and bots don't even have to know that their ISP uses web caching. Their actions are exactly the same as before. John still uses his favourite sites, and Mary uses hers. If they both use the same site, it is downloaded only once.
Their actions don't change: they do the same things as when their ISP doesn't use web caching, only less data is transferred. The same goes for bots. GoogleBot and the MediaBots make the same requests as before. It's the proxy cache that merges those requests whenever that is possible.
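
A rough sketch of that idea, assuming a simple in-memory cache shared by two crawl services. None of this is Google's actual implementation: `CachingProxy`, `crawl`, and the `DISALLOWED` table are hypothetical names, and the exclusion table simply mirrors the robots.txt example earlier in the thread. Each service applies its own rules before it asks the proxy, so the proxy only saves a second download when both services were allowed to request the page anyway.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical per-service exclusions, mirroring the robots.txt example in the thread.
DISALLOWED: Dict[str, Tuple[str, ...]] = {
    "Mediapartners-Google": ("/not-for-adsense/",),
    "Googlebot": ("/not-for-googlebot/",),
}


class CachingProxy:
    """Shared cache sitting between several crawl services and the origin site."""

    def __init__(self, origin_fetch: Callable[[str], str]) -> None:
        self._origin_fetch = origin_fetch
        self._cache: Dict[str, str] = {}
        self.origin_hits = 0  # number of requests that reached the origin site

    def get(self, path: str) -> str:
        # Only a cache miss causes a download from the origin.
        if path not in self._cache:
            self.origin_hits += 1
            self._cache[path] = self._origin_fetch(path)
        return self._cache[path]


def crawl(service: str, path: str, proxy: CachingProxy) -> Optional[str]:
    """Apply the service's own exclusions first, then ask the shared proxy."""
    if path.startswith(DISALLOWED.get(service, ())):
        return None  # excluded for this service, even if the page is cached
    return proxy.get(path)


if __name__ == "__main__":
    proxy = CachingProxy(lambda path: f"<html>{path}</html>")
    crawl("Googlebot", "/index.html", proxy)             # cache miss, one origin fetch
    crawl("Mediapartners-Google", "/index.html", proxy)  # served from the cache
    print(proxy.origin_hits)  # 1
```

In this sketch a service never fetches a page on behalf of another service. A page that one bot is excluded from reaches the cache only if a bot that is allowed to fetch it asks for it.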
