Google Blogoscoped


Page Count Rise for "Copyright 2007"

Haochi [PersonRank 10]

Monday, January 29, 2007
17 years ago

What's with Yahoo on the first chart? Were they overestimating the results?

Ionut Alex. Chitu [PersonRank 10]

17 years ago #

What about "Copyright 2003-2007"? Or "Copyright © CBC 2007"?

Matt Cutts [PersonRank 10]

17 years ago #

Ugh, there are lots of reasons why these numbers wouldn't be comparable.

I'll tell a quick story. I had a friend at Google who did a bunch of searches at a different search engine, looked at the number of estimated results, and grew alarmed because she thought that the non-Google engine was larger, based purely on their results estimates.

So I showed her how to pick rare words and go to the end of the search results. She was shocked to see results estimates go from thousands of results down to a hundred or so. If you're doing a search with >1000 results, it's very hard to verify how accurate a search engine's results are.

Raw numbers of results also don't take into account how much spam an engine has. For example, I'm looking around my desk and I see screen wipes sitting there, so I type ["copyright 2007" "screen wipes"] and click toward the end of the results on Google and Yahoo. Google reports fewer results, but right now the Yahoo index has urls like broken-lcd-screen.php computer-monitor.php cleaning-and-dusting... clean-up-kit.php office-suspended.php office-cleaning.php cleaner-32-1-1-cd-key.php sun-room/...

Out of the 10 Yahoo results I clicked to see (#91 through #100), 8 were spammy. The other two pages were Russian and Swedish, so I won't go out on a limb and speculate if they were spam.

I really feel like graphing results estimates for a common phrase like "copyright 2007" doesn't yield any genuinely useful data to compare between engines. It can do more harm than good, because it pushes engines toward crawling spammy urls and keeping them around, or reporting higher estimates when they're hard to verify.
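
As a rough illustration of the spam point above, here is a minimal sketch of how a hand-checked sample could be used to discount a raw results estimate. The function name and the reported count of 1000 are hypothetical; only the sample counts (8 spammy, 2 undetermined, out of 10) come from the comment. It also assumes the sampled results are representative, which results #91 through #100 may well not be.

```python
def non_spam_bounds(reported, spam, unknown, sample_size):
    """Return (low, high) bounds on the number of non-spam results.

    Scales a hand-checked sample up to the engine's reported estimate.
    Results of unknown status count as spam for the low bound and as
    legitimate for the high bound.
    """
    if sample_size <= 0 or spam < 0 or unknown < 0 or spam + unknown > sample_size:
        raise ValueError("inconsistent sample counts")
    clean = sample_size - spam - unknown
    low = reported * clean / sample_size
    high = reported * (clean + unknown) / sample_size
    return low, high


# Hypothetical reported estimate of 1000; sample counts from the comment above.
low, high = non_spam_bounds(reported=1000, spam=8, unknown=2, sample_size=10)
print(low, high)  # prints "0.0 200.0"
```

Even under these generous assumptions, a sample of ten only gives a very wide range, which is part of why raw estimates are hard to compare.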

Anton [PersonRank 0]

17 years ago #

"The other two pages were Russian and Swedish, so I won't go out on a limb and speculate if they were spam."

The Swedish page is kosher. It's an IDG (International Data Group) forum discussing this very topic.

Hong Xiaowan [PersonRank 10]

17 years ago #

Yes. Spam wastes our money and time, and it makes us stop using a search engine if there is another choice.

In 2005 Google seemed to gather only old pages, maybe because it was afraid of spam. But since 2006 I have found that Google does well: I can find the newest information on Google too, and I find useful new articles there instead of what the spammy search engines return.

Anyway, for "copyright 2007" I think the perfect search result would be just one page. "Copyright 2007" says nothing about a page; only on this blog post does it have some importance.

But search for "Copyright 2006" on any search engine, and the first result is still not this blog's post about "Copyright 2006" from last year.

So search engines are still in beta, still children, with a long way to go.

Philipp Lenssen [PersonRank 10]

17 years ago #

I agree with you, Matt. I tried to make that important distinction between results quality and purported index size clear in my disclaimer...

Steve Magruder [PersonRank 1]

17 years ago #

In addition to what Ionut Alex. Chitu said...

Many sites with "Copyright 2006" could be stating "Copyright 2006-2007" or an equivalent.

So you could be comparing 2007-updated sites in both data sets.

TOMHTML [PersonRank 10]

17 years ago #

Steve is right, what about "© 2003-2007"?

Tony Ruscoe [PersonRank 10]

17 years ago #

I think some of you are trying to read too much into this and don't really get what Philipp's trying to show here...

These stats are just comparing how many pages each search engine has indexed that contain the phrase "copyright 2007" and in doing so, we are able to make comparisons to see how quickly each one has been able to refresh their index this year.

The key point is that hardly any websites would have contained the text "copyright 2007" unless it was the year 2007 – and that number is now constantly on the increase – meaning that's a pretty good arbitrary phrase to see which search engine is best at refreshing its index to include new content.

If a site or page does use "Copyright 2006-2007" instead of "Copyright 2007" for example, this will be the same for all search engines anyway, so it's kind of irrelevant.

Did I understand right, Philipp? ;-)
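
To make the comparison described above concrete, here is a minimal sketch that looks at how much each engine's count for the phrase grew between two snapshots, rather than at the raw counts. The function name is hypothetical and no real figures are used; the idea is only that a ratio cancels anything that inflates an engine's count by a constant factor. It assumes the reported estimates are at least consistent over time for a given engine, which is not guaranteed.

```python
def relative_growth(earlier, later):
    """Return later/earlier for each engine present in both snapshots.

    Engines with a missing later count or a non-positive earlier count
    are skipped, since no meaningful ratio exists for them.
    """
    growth = {}
    for engine, before in earlier.items():
        after = later.get(engine)
        if after is None or before <= 0:
            continue
        growth[engine] = after / before
    return growth
```

Called with two dictionaries mapping engine names to result counts taken some days apart, it returns one growth factor per engine; a larger factor would suggest a faster refresh, within the limits of how approximate the counts are.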

Ionut Alex. Chitu [PersonRank 10]

17 years ago #

Yes, but search engines don't have the same sites in their index, or the same pages from a site, and the number of search results is just an approximation (which sometimes is way off).
