Google News' Cloaking Policy - Google Blogoscoped Forum

Colin Colehour	Thursday, July 19, 2007
The "First Click Free" option is a nice idea because the user benefits by actually being able to read the content and the website/company benefits by allowing more users to find that content. While I don't normally pay subscription fees for web content, it's only fair to offer that as an option. Think of all of the 'News Archive' content that was opened up to users when that feature was released. Companies want to be able to make money off their content but also want people to find it. Before the cloaking policy, those sites wouldn't be included in searches. So for News sites, I think it's a good policy to have.
Ludwik Trammer	17 years ago #
Am I getting this right? They are promoting cloaking not for a special Google News bot, but cloaking for the normal Googlebot, which would also affect normal search results? Presenting different content to the bot than to the users? That's a serious lack of consistency. And I just hate it when I get paid sites in the results.
John Honeck	17 years ago #
When the regular index is filled with subscription required or first page free sites a new economic model will be developed: Selling targeted email lists.
1) Write site about a subject
2) Work it up to a few thousand visits a day
3) Slap on a subscription requirement
4) Sell targeted email list interested in subject
5) Profit
Thanks Google!
Tony Ruscoe	17 years ago #
So you'll probably be included in Google News but get dropped from the main index for cloaking? BTW, their advice for identifying the GoogleBot isn't consistent with what the Webmaster Central Blog says either: << You can verify that the request is actually from our robot by making sure the IP address is within the range of 66.249.64.0/20. >> vs. << The common request we hear is to post a list of Googlebot IP addresses in some public place. The problem with that is that if/when the IP ranges of our crawlers change, not everyone will know to check. In fact, the crawl team migrated Googlebot IPs a couple years ago and it was a real hassle alerting webmasters who had hard-coded an IP range. So the crawl folks have provided another way to authenticate Googlebot. >> http://googlewebmastercentral.blogspot.com/2006/09/how-to-verify-googlebot.html
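A rough Python sketch of the two verification approaches contrasted above. The /20 range is the one quoted from the Google News text; the reverse-then-forward DNS check follows the procedure described in the linked Webmaster Central post. The function names and the accepted host suffixes are assumptions made for illustration and should be checked against Google's current documentation.

```python
import ipaddress
import socket

# Range quoted in the Google News help text; it may be out of date.
QUOTED_RANGE = ipaddress.ip_network("66.249.64.0/20")

# Assumed hostname suffixes for genuine crawler hosts.
TRUSTED_SUFFIXES = (".googlebot.com", ".google.com")


def in_quoted_range(ip):
    """Hard-coded check: is the address inside 66.249.64.0/20?"""
    return ipaddress.ip_address(ip) in QUOTED_RANGE


def verified_by_dns(ip, reverse=socket.gethostbyaddr,
                    forward=socket.gethostbyname_ex):
    """Reverse-resolve the IP, check the host suffix, then forward-resolve
    the host name and confirm it maps back to the same IP."""
    try:
        host = reverse(ip)[0]
        if not host.endswith(TRUSTED_SUFFIXES):
            return False
        return ip in forward(host)[2]
    except OSError:  # covers socket.herror and socket.gaierror
        return False
```

The hard-coded range breaks silently if the crawler IPs move, which is the complaint in the Webmaster Central quote; the DNS version keeps working as long as the host names stay stable. The resolver functions are passed in as parameters only so the logic can be exercised without network access.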
Colin Colehour	17 years ago #
Ludwik – Googlebot is used for the web index and the news index. List of Google crawlers: http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=40364 I see what you mean on how Google says cloaking is bad (web search) but then they say it's totally ok (news content). But how else do you index subscription content that is behind a login box or some other type of form? Is it better to not know about content that you might have to pay for?
Tony Ruscoe	17 years ago #
<< But how else do you index subscription content that is behind a login box or some other type of form? >> Since Google News sources are hand picked, they could easily provide (or ask each webmaster of a subscription-only site to provide) a GUID or other identifier which they're going to send through in the URL or HTTP request header, but only when they're scraping for Google News content. That way, they're not encouraging cloaking based on the GoogleBot, which would also affect standard web results. Alternatively, they could start adding "Subscription Only" labels to standard web search results too so that it's clear the page you click through to might not necessarily contain the snippet you've just read, but a registration form instead. Either way, the webmaster should be forced to do something that explicitly tells Google that the content provided is being "cloaked".
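A minimal sketch of the identifier idea, assuming the crawler sends a per-site token in a request header. The header name `X-News-Crawler-Token` and both function names are hypothetical; nothing in this thread says Google offers such a mechanism.

```python
import hmac

# Hypothetical header name; not part of any real Google product.
TOKEN_HEADER = "X-News-Crawler-Token"


def is_news_crawl(headers, site_token):
    """True only if the request carries the token agreed for this site.
    Assumes header names in the dict are already normalized."""
    sent = headers.get(TOKEN_HEADER, "")
    # Constant-time comparison so the token cannot be guessed by timing.
    return hmac.compare_digest(sent.encode(), site_token.encode())


def choose_body(headers, site_token, full_article, signup_form):
    """Serve the full text to the news crawl only, the form to everyone else."""
    return full_article if is_news_crawl(headers, site_token) else signup_form
```

Note that the decision never looks at the User-Agent, so a request that merely claims to be Googlebot gets the same registration form as any other visitor.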
Philipp Lenssen	17 years ago #
> But how else do you index subscription content that is behind a login box or some other type of form?
Colin, one thing to consider: Google is not an invisible observer – their action of observing leads to a reaction of the observed, the Heisenberg phenomenon. In other words: a whole lot of news sites might only start this registration-only revenue model because Google gives them a chance to do so by supporting cloaking... because without Google News, they might not be used by enough people, and thus would be forced to make other decisions with regard to their content. So while for some sites you might be right and say "this content would be completely invisible if Google didn't participate in cloaking"... for other sites, the exact opposite could be true: "this content would be completely registration-free, and thus visible and indexable, if not for Google."
> Is it better to not know about content that you might have to pay for?
Yeah, I think Google News would be better off without any subscription-only sites or "link traps"... they just add noise, and they're confusing to handle because they don't have stable URLs (you blog about some news report and your readers won't be able to follow it, because the page in question does referrer and bot-identification hacks to determine its visibility... well, there's a reason why cloaking is not allowed in Google web search, because it's confusing and deceiving). But again, we can't simply argue "if Google News doesn't support it, we wouldn't see the WSJ anymore". Because the WSJ is relying on crawlers playing along with this weird visible/invisible content stuff, and they might have to find other ways of selling their content when crawlers stop participating in this.
If anything, Google could offer an extra feature where people specify they want subscription-based content (but not make it default). And if so, Google would need to implement it in a different way, because they can't tell people to "cloak for the Googlebot". They would need to actually send out two different bots, and also not re-use content from the crawl proxy for that...
Melanie Phung	17 years ago #
Interesting stuff. No surprise that different branches of the company can give slightly conflicting statements, since the organization is so big. But you'd think that they'd have the message down pat when it comes to cloaking. At SMX in June, every search engine rep EXCEPT Google said that some kinds of cloaking are okay. The example had to do with showing the bots only non-tracking URLs whereas real visitors would see URLs with query strings attached (essentially duplicate content). Like I said, every SE rep said that intent mattered and that in a case where the objective is to avoid confusing the spider, cloaking is okay. Seems a similar principle would apply to registrations and logins. But Matt Cutts made it pretty clear that he (and one assumes therefore Google) does NOT approve of cloaking for any reason. His argument, which makes sense, is that techniques are easy to identify, intent is not. So where does that leave us?
Michael Martinez	17 years ago #
Can't say how much they still use it, but Hilltop was developed for and implemented in Google News in 2002. People who want to improve their rankings in Google News should take a look at how Hilltop works.
Colin Colehour	17 years ago #
Philipp, When it comes to linking to a first-click-free news article, couldn't you just redirect to the article by going through Google? Does anyone know of a site where you can read the one article you clicked on only through the link in Google News?
Philipp Lenssen	17 years ago #
Hmm... when you right-click a Google News result link to copy the URL it will actually copy a redirect URL. I don't know how stable those are (or will be), but if we start using hacks like those I think we really start polluting the web (just as when we use services like TinyURL, IMO). Besides, if Google continues to allow that they're ridiculing their own solution (= they could also check if the referrer is Google News, and only allow redirects in those cases, if they're serious about the "first click free when it's Google News" solution).
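The referrer check mentioned here could look roughly like the sketch below. The allowed host `news.google.com` and the function name are assumptions for illustration; the thread does not say which referrers any real site or Google itself accepts.

```python
from urllib.parse import urlparse

# Assumed referring host; the hosts a real site accepts are not given here.
ALLOWED_REFERRER_HOSTS = {"news.google.com"}


def first_click_free(referer):
    """Return True when the Referer header points at an allowed host."""
    if not referer:
        return False
    # hostname is None for values that are not absolute URLs.
    host = urlparse(referer).hostname
    return host in ALLOWED_REFERRER_HOSTS
```

The Referer header is supplied by the client and is trivial to forge, so a check like this only steers ordinary visitors; it is not access control.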
Colin Colehour	17 years ago #
A workaround would definitely make things more complicated when it came to URLs. So what policy change would you prefer? Make cloaking allowable in both the News index and the Web index? Make cloaking illegal for all search indexes? How do you index content that search engines can't find (because of forms or cookies)?
Philipp Lenssen	17 years ago #
I've asked Matt for a statement, so I'm currently waiting for that to get some clarification of what Google actually means by what they say, because I find it confusing (two opposing statements on cloaking for Googlebot).
As I mentioned above I think outsourcing subscription-content to a different part of Google News (not the default one) might make sense – e.g. "some subscription-only results are hidden, _show all..._", but I don't think having referrer or cloaking deceptiveness is a good solution there either. WSJ (if they believe in the reg-model, I think it cuts them off the conversation in the long term, lowering their revenues) could simply offer the first two paragraphs or so for free, to anybody (Googlebot, unregistered users), and offer subsequent content only when you register. This way, the story could be indexed by title and keywords available in the first two paragraphs (in both Google News "premium" area as well as default web search). If you absolutely want to see all without registering you're better off choosing a competitor's news source, and no deceptiveness should make you think otherwise. The WSJ (for instance) is simply no good source if you prefer reg-free content.
If, on the other hand, Google believes they must have two different cloaking guidelines, it seems they would still need to be forced to also send out two different bots – how else do they want to manage two guidelines?
On a side-note, any kind of "do something special for Googlebot" solutions suggested by Google also strengthen a Google "info monopoly". If you are a smart kid who wants to write your own news crawler to compete with Google News, you'd first have to find ways to do all kinds of deals with all kinds of big organizations, asking them to please do the nice cloaking for your crawler which they already do for Google News. Chances are, they'll ignore you, because Google at the time has many more users. For these reasons, we should be wary of any "Google-specific settings" we're doing with our server (the Sitemaps protocol is at least carried by a couple of search engines, and it's pretty open so you could connect to it). A monopoly hurts innovation. And I'd rather keep to the W3C when I do my website.
And again, Google seems to agree with all that: <<Make pages for users, not for search engines. Don’t deceive your users or present different content to search engines than you display to users>>
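The "first two paragraphs for free, to anybody" suggestion needs no crawler detection at all. A minimal sketch, assuming articles are plain text with blank lines between paragraphs (the function names and that storage format are assumptions):

```python
def teaser(article, free_paragraphs=2):
    """Return the opening paragraphs that every visitor may read."""
    paragraphs = [p.strip() for p in article.split("\n\n") if p.strip()]
    return "\n\n".join(paragraphs[:free_paragraphs])


def render(article, registered):
    """Registered readers get everything; everyone else, crawlers
    included, gets the same teaser."""
    if registered:
        return article
    return teaser(article)
```

Because the only input is whether the reader is registered, a crawler and an anonymous visitor always receive identical content, so nothing here counts as presenting different content to search engines.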
JohnMu	17 years ago #
Google News has been listing subscription and paid content for a while now – these items are clearly marked as such in the search results. For an example, try http://news.google.com/archivesearch?hl=en&sa=N&q=princess+diana (as far as I remember, the NYT moves content to their "Select" – paid – section after a certain amount of time, so you will have to use queries that find old content to see the tags). This query brings up two different kinds of paid items:
"… ; Entertains at North Egremont for Mrs. George Rockwood and... $4.95 – New York Times – Oct 17, 1933"
and
"THE 100 WASTED MINUTES THAT COST DIANA HER LIFE.(Features) Subscription – The People – HighBeam Research – Feb 15, 1998"
If you page along, you'll also find
"Response Is Monumental for Tickets to Princess Diana's Burial Site Pay-Per-View – Washington Post – ProQuest Archiver – Jan 6, 1998"
In the advanced Google news archive search – http://news.google.com/archivesearch/advanced_search – you can also specify the price you're willing to pay, anywhere from free to "at least $50". Some of those items are certainly NOT expensive: http://news.google.com/archivesearch?as_q=iraq&as_price=p5 – but perhaps they're relevant and important enough to merit being listed. I don't know if something like this would make sense for the normal web search results (I doubt it – let them use Adwords).
Regarding using multiple bots – Google News does crawl very differently than the normal Googlebot. Additionally, Adsense can now also crawl behind subscription barriers (with the authentication possibilities). I suppose this makes it a bit hard for the crawler proxy, but it's not impossible :-).
