Google Blogoscoped

Spam Detector

Mark Draughn [PersonRank 5]

Wednesday, May 24, 2006

The author points out that the spam detection algorithms aren't powerful enough to catch sophisticated spam, so I doubt spammers would gain much by running this tool against their sites. I do think a tool like this could help web authors avoid creating pages that accidentally resemble spam.

To that end, I found a few areas where it reports false positives:

First, it thinks some of my writing is unnatural text that resembles keyword lists. I don't think any blogger wants to hear that!

Second, it flagged my date-based archive pages as a doorway farm. I assume most real search engines don't make this mistake; otherwise they'd be missing a lot of blogs. I imagine this is one of those tedious programming tasks that the author has avoided in the first pass.

Third, the tool doesn't seem to understand CSS media qualifiers. It flagged a whole bunch of things on my blog as hidden text based on "display:none". Most of those are for the print version of my page, which hides all the links, notes, and ads.
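To illustrate the kind of check I mean, here is a rough Python sketch that collects only the display:none rules that apply on screen and skips rules inside print-only @media blocks. This is not the tool's code. The function names are made up for the example, and it ignores CSS comments, inline styles, and the media attribute on link and style elements.

```python
import re


def applies_to_screen(media_query_list: str) -> bool:
    """Return True if any query in a comma-separated media query list can match a screen."""
    for query in media_query_list.split(","):
        words = query.strip().lower().split()
        if not words:
            return True  # a bare "@media" with no type applies everywhere
        negated = words[0] == "not"
        if words[0] in ("not", "only"):
            words = words[1:]
        if not words:
            continue
        media_type = words[0]
        # A query that starts with "(" has no media type, so it applies to all media.
        matches = media_type in ("screen", "all") or media_type.startswith("(")
        if matches != negated:
            return True
    return False


def screen_hidden_selectors(css: str) -> list[str]:
    """Return selectors whose display:none declaration applies to screen media."""
    hidden = []
    pos = 0
    while pos < len(css):
        open_brace = css.find("{", pos)
        if open_brace == -1:
            break
        prelude = css[pos:open_brace].strip()
        # Walk forward to the brace that closes this block, tracking nesting.
        depth = 1
        i = open_brace + 1
        while i < len(css) and depth:
            if css[i] == "{":
                depth += 1
            elif css[i] == "}":
                depth -= 1
            i += 1
        end = i - 1 if depth == 0 else i
        body = css[open_brace + 1:end]
        if prelude.lower().startswith("@media"):
            # Only descend into blocks that can apply on screen.
            if applies_to_screen(prelude[len("@media"):]):
                hidden.extend(screen_hidden_selectors(body))
        elif not prelude.startswith("@"):
            if re.search(r"display\s*:\s*none", body, re.IGNORECASE):
                hidden.append(prelude)
        pos = i
    return hidden
```

On a stylesheet like mine, the print-only rules would drop out, and only rules that really hide something on screen would remain as hidden-text candidates.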

I wonder if the author could get around some of the problems he describes in his FAQ by building this as a Firefox plug-in. This would give him access to a good implementation of the box model and a JavaScript engine.

LowLevel [PersonRank 0]

13 years ago

Hi Mark, I'm the author of that tool. :-)

Doorway farm detection is a really tough job. To achieve the best results, the tool would need to download ALL the pages linked by a block of keyword-rich links and then analyze their contents.

This approach would slow down the entire analysis and it would not be possible to produce a fast response. So I chose a compromise: the tool currently downloads and analyzes just two of the linked pages. Unfortunately this choice leads to less precise results and to more false positives. I hope to fix this soon.
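To make the trade-off concrete, here is a minimal Python sketch of sampling a few linked pages instead of fetching them all. It is only an illustration, not the tool's actual code: `fetch` and `looks_like_doorway` are hypothetical callables supplied by the caller, and the function name is invented for the example.

```python
import random
from typing import Callable, Optional, Sequence


def estimate_doorway_fraction(
    links: Sequence[str],
    fetch: Callable[[str], str],
    looks_like_doorway: Callable[[str], bool],
    sample_size: int = 2,
    seed: Optional[int] = None,
) -> float:
    """Estimate the share of linked pages that look like doorway pages.

    Only `sample_size` pages are downloaded, so the cost stays fixed
    no matter how many links the block contains.
    """
    if not links or sample_size < 1:
        return 0.0
    rng = random.Random(seed)
    # Pick a random subset of links without repetition.
    chosen = rng.sample(list(links), min(sample_size, len(links)))
    flagged = sum(1 for url in chosen if looks_like_doorway(fetch(url)))
    return flagged / len(chosen)
```

With a sample of two pages the estimate can only be 0, 0.5, or 1, which is one way to see why a small sample gives coarse results and more false positives than analyzing every linked page.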

Thanks for the bug report about the CSS media qualifiers! I'll take a look at the routine that filters out all the non-screen qualifiers.

As for a Firefox plug-in, I believe it would be very slow. The tool makes many calculations and runs many CPU-intensive tasks; it even downloads all background images in order to calculate a "mean color".
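As a rough idea of what a "mean color" calculation involves, here is a minimal Python sketch that averages the RGB channels over every pixel of an already decoded image. It is an assumption about the general technique, not the tool's implementation; the `mean_color` name and the plain list-of-tuples pixel format are chosen only for the example.

```python
from typing import Iterable, Tuple

RGB = Tuple[int, int, int]


def mean_color(pixels: Iterable[RGB]) -> RGB:
    """Average each RGB channel over all pixels, rounding down."""
    total_r = total_g = total_b = count = 0
    for r, g, b in pixels:
        total_r += r
        total_g += g
        total_b += b
        count += 1
    if count == 0:
        raise ValueError("image has no pixels")
    return (total_r // count, total_g // count, total_b // count)
```

The work grows with the number of pixels in every background image, on top of the download itself, which gives a sense of why this kind of step is expensive.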
