Google Blogoscoped

Tuesday, January 13, 2004

The Google Spam Problem

Sad but true: Google has a major spam problem. What’s filling up the results with garbage lately are pages that are themselves search results. These pages all contain the term you are looking for, but mostly within listed search results or a fake introduction. Let’s say I’m looking for the works of Shakespeare to read online. Never mind what I’m looking for – the introduction of a page Google lists in its top 10 will always read something like the following:

“Welcome to this page on buying cars in New York. We provide all you need to know on buying cars in New York. We know buying cars in New York is difficult. We can help you for free. Below you can find a link list with all the best topics on buying cars in New York:

1. Shakespeare’s Complete Works, Part I
... he jests at scars that never felt a wound ...

2. Shakespeare Online”
– Mock-up page

You get the idea. The targeted search phrase here (repeated ad nauseam) is “Buying cars in New York” (or whatever else has the chance to make a buck – real examples are “properties needing renovation in Huxley”, “office space to lease in issy les moulineaux”, or “Cheapest Flight To Teneriff”).
And the whole caboodle is set up as a giant, dumb & automated link farm. I heard the same story about those search-result link farms (let’s call them search farms) from Usenet postings, and from colleagues. It ruins many results.

Checking One of Them

To test what’s behind such a search farm, I went to the website (they link to my website, because it can be found in Google results – I suspect the Google Web API is at work here). Then, on the front page, I entered the search string “thisisagoogletest” (which returned no results at Google). The search result:

“thisisagoogletest information is readily available from as we have an extensive information base derived from many thisisagoogletest service providers.”
– Spam

I suspect they will do something with my term. They might also just query words from a dictionary, or another word database.
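The blurb above looks like a fixed template with my term dropped in. Here is a minimal sketch of how one could check for that, assuming you have the blurbs returned for two different probe terms. The function names and the second probe term are made up for illustration, and the second blurb is simulated rather than fetched from the site.

```python
import re

def mask_query(text: str, query: str) -> str:
    """Replace every occurrence of the query with a placeholder (case-insensitive)."""
    return re.sub(re.escape(query), "{q}", text, flags=re.IGNORECASE)

def looks_templated(page_a: str, query_a: str, page_b: str, query_b: str) -> bool:
    """True if the two texts are identical once each one's own query is masked out."""
    return mask_query(page_a, query_a) == mask_query(page_b, query_b)

# The blurb the search farm returned for my test string.
blurb = ("thisisagoogletest information is readily available from as we have "
         "an extensive information base derived from many thisisagoogletest "
         "service providers.")

# Hypothetical second probe: what the same template would produce for another term.
other = blurb.replace("thisisagoogletest", "anothermadeupterm")

print(looks_templated(blurb, "thisisagoogletest", other, "anothermadeupterm"))
```

A real page would of course differ in more than the query term, so an exact match is only a strong signal, not a definition of spam.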

The domain was registered in August last year, so it’s just a few months old. And there are around 28,000 pages from this domain stored in the Google index. (Just for fun, I reported the domain using Google’s spam report page.)

So how is the site doing for “properties needing renovation in Huxley”?
It’s number one at Google, of course. And it’s also in the top 10 for properties renovation (no quotes). So this spam-SEO scheme is working. They might get the death penalty from Google, who knows, but the rewards might be coming in right now. They can set up a new domain easily. The problem is that Google lists them in the first place.

This URL, and the thousands or millions of others like it, are becoming a threat to Google Inc. and have the power to change our searching habits. I’m not saying other search engines are much smarter about it. But I bet they are attacked much less, and in ways much less specific to their algorithms – simply because they are less popular than Google (and have less potential to bring in customers, and money).

For Example: Cars

Let’s take “cars” for example. The following number one spots are occupied by what looks like spam to me (spam in the sense that the information provided is not helpful to the query):

And so on... many of the cities I tried have bogus pages waiting for the Google user. I didn’t do an extensive study, but quickly checked if cars “new york” has spam like this at MSN Search, and it didn’t.


I don’t know if there is a perfect solution to the problem. Link farms are attacking the core of Google: making links count as votes. Once you remove this, you might get rid of the spam, but also of much of what makes Google good. And even if you reevaluate the vote given by links, any mean-spirited Search Engine Optimizer can include a keyword in the title, meta-tag, headings, and content, with not much effort. And then there’s the problem of automated search results or any other approach to fake real content. You end up with meaningful-looking blurbs, never mind on which topic.

Maybe it’s time for Google to categorize a page and analyze if its content makes sense in the context of the domain? To heighten the value of DMOZ, creating “authority hubs”? To analyze more strongly what people are clicking on – and to punish sites which are regularly clicked on by users who then return to Google to click on the next link? Or to hire a lot of people to evaluate sites*? Or just more heavily analyze how those search farms are built?
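To make the click idea a bit more concrete, here is a minimal sketch, assuming a search engine logged for every result click whether the user came straight back to the results page. The log format, the example domains, and both thresholds are my own assumptions for illustration, not anything Google is known to use.

```python
from collections import defaultdict

def bounce_rates(clicks):
    """clicks: iterable of (domain, returned_to_results) pairs.
    Returns {domain: (share of clicks that bounced back, total clicks)}."""
    totals = defaultdict(int)
    bounces = defaultdict(int)
    for domain, returned in clicks:
        totals[domain] += 1
        if returned:
            bounces[domain] += 1
    return {d: (bounces[d] / totals[d], totals[d]) for d in totals}

def suspicious(clicks, min_clicks=3, rate_threshold=0.8):
    """Domains with enough clicks whose visitors almost always return to the results."""
    return sorted(
        d for d, (rate, n) in bounce_rates(clicks).items()
        if n >= min_clicks and rate >= rate_threshold
    )

# Made-up click log: (domain, did the user return to the results page?)
log = [
    ("searchfarm.example", True), ("searchfarm.example", True),
    ("searchfarm.example", True), ("searchfarm.example", True),
    ("shakespeare.example", False), ("shakespeare.example", True),
    ("shakespeare.example", False),
]

print(suspicious(log))
```

The minimum click count is there so that a site isn’t punished for one or two unlucky visits. A spammer could still try to fake the clicks, so this would be one signal among several at best.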

*I’m sure a single skilled person is able to give the death penalty to about a dozen domains per hour. Take 1,000 people. Take one month with 21 working days of 8 hours each. That’s about 2 million domains, with thousands of pages each – let’s see if spammers can keep up with that pace.
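The back-of-the-envelope numbers in the footnote work out as follows. The variable names are mine, and “a dozen” is taken as exactly 12.

```python
domains_per_hour = 12      # "about a dozen" per reviewer
reviewers = 1_000
working_days = 21          # one month
hours_per_day = 8

total_domains = domains_per_hour * reviewers * working_days * hours_per_day
print(f"{total_domains:,} domains per month")  # 2,016,000 domains per month
```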

It’s far from easy. For every fix, there might be new tricks by SEO spammers. But whoever comes up with the best solution to the problem might be the one winning the search engine wars.

More Google Search By Numbers

Aaron Swartz of Google Weblog and others report that Google has added more search by number features (in the meantime, their “whois” feature seems to be gone). The Google help page’s full number feature set now lists:

Google Guides

Nancy Blachman pointed me to her detailed and illustrated online Google Guide. Nancy is co-author of the following book...

And while at Amazon, I saw this Google book as well:

I finally also got around to looking at the book Google Hacks (by Tara Calishain, Editor Rael Dornfest), and there were some inspiring things in it that were new to me!



