Yup, you've got it exactly. |
Yep, it's all about crawling.
Thanks for the recommendations. |
I think this is a great way of doing it. |
(Just posted this on MattCutts.com, but I'll put it here for discussion too...)
There's one thing I'm not 100% clear about:
> Also, note that robots.txt rules still apply to each crawl service appropriately. If service X was allowed to fetch a page, but a robots.txt file prevents service Y from fetching the page, <b>service Y wouldn’t get the page from the caching proxy</b>.
What if the robots.txt file prevents service X from fetching the page but not service Y? Would service X fetch the page and cache it anyway (whilst presumably not indexing it) just in case service Y needed it at a later date? Or would service Y crawl at a later date to cache any pages that service X was disallowed to crawl? (If so, how would service Y know which pages it needed – and would it just end up crawling the whole site again? Does the cache know which pages the bot that provided the cache wasn't allowed to index?)
For example, say this was my robots.txt:
# BOF #
User-agent: *
Disallow:

User-agent: Mediapartners-Google
Disallow: /not-for-adsense/

User-agent: Googlebot
Disallow: /not-for-googlebot/
# EOF #
Would the Mediapartners-Google bot fetch and cache any pages in the /not-for-adsense/ directory to save Googlebot from having to fetch them at a later date? Likewise, would Googlebot fetch any pages from the /not-for-googlebot/ directory so that the Mediapartners-Google bot could retrieve them from the cache and index them?
Could you confirm which approach Google's bots take, and whether adding exclusions to the robots.txt file prevents Google's bots from crawling, caching, or indexing? |
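For what it's worth, the per-agent rules in a robots.txt file like that can be checked mechanically. Here's a minimal sketch using Python's standard urllib.robotparser; the example.com URLs and page names are made up purely for illustration:

```python
from urllib import robotparser

# The example robots.txt from the comment above.
rules = """\
User-agent: *
Disallow:

User-agent: Mediapartners-Google
Disallow: /not-for-adsense/

User-agent: Googlebot
Disallow: /not-for-googlebot/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# Googlebot is blocked from /not-for-googlebot/ but not /not-for-adsense/.
print(rp.can_fetch("Googlebot", "http://example.com/not-for-googlebot/page"))  # False
print(rp.can_fetch("Googlebot", "http://example.com/not-for-adsense/page"))    # True

# Mediapartners-Google is blocked from /not-for-adsense/ only.
print(rp.can_fetch("Mediapartners-Google", "http://example.com/not-for-adsense/page"))  # False

# Any other agent falls through to the "*" record, which allows everything.
print(rp.can_fetch("SomeOtherBot", "http://example.com/not-for-adsense/page"))  # True
```

That only tells us what each bot is *allowed* to fetch, of course; it doesn't answer the cache question, which is about what happens to pages one service fetched and another is disallowed from.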
Tony – think about it like a classic web cache (w3cache) for ordinary users. Users and bots don't even have to know that their ISP uses web caching; their actions are exactly the same as before. John still visits his favourite sites, and Mary hers, and if they both use the same site it gets downloaded only once. Their behaviour doesn't change – they do the same things they did before their ISP started caching; only the transfer is smaller. The same goes for bots: Googlebot and the Mediapartners bot make the same requests as before. It's the caching proxy that merges them whenever possible. |
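To make that picture concrete, here's a toy model – a sketch under my own assumptions, not Google's actual proxy. The hypothetical disallow table mirrors the robots.txt example earlier in the thread; the point is that the proxy fetches each URL at most once, but still checks the requesting service's robots.txt rules before serving anything, so a disallowed service never gets the page even when it's already cached:

```python
# Toy crawl-caching proxy (illustrative only, not Google's implementation).
# Hypothetical per-service disallow prefixes from the robots.txt example.
DISALLOW = {
    "Googlebot": ["/not-for-googlebot/"],
    "Mediapartners-Google": ["/not-for-adsense/"],
}

def allowed(service, path):
    """True if robots.txt permits this service to receive this path."""
    return not any(path.startswith(prefix) for prefix in DISALLOW.get(service, []))

class CachingProxy:
    def __init__(self, fetch):
        self.fetch = fetch  # real network fetch; called at most once per path
        self.cache = {}

    def get(self, service, path):
        if not allowed(service, path):
            return None     # blocked by robots.txt, cached or not
        if path not in self.cache:
            self.cache[path] = self.fetch(path)
        return self.cache[path]

# Demo: two services request the same page; only one network fetch happens.
fetch_count = {"n": 0}
def fake_fetch(path):
    fetch_count["n"] += 1
    return "contents of " + path

proxy = CachingProxy(fake_fetch)
proxy.get("Googlebot", "/news/today")             # network fetch
proxy.get("Mediapartners-Google", "/news/today")  # served from cache
print(fetch_count["n"])                           # 1: page fetched only once
print(proxy.get("Googlebot", "/not-for-googlebot/x"))  # None: blocked
```

Note the ordering: the robots.txt check happens on every request, before the cache lookup, which matches the quoted claim that "service Y wouldn't get the page from the caching proxy" even when service X was allowed to fetch it.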