Google Blogoscoped

Forum

Checking the Google Page Count of a Comment to Identify Spam

Ross Dargan [PersonRank 0]

Thursday, April 8, 2010
14 years ago · 13,437 views

Interesting blog, thank you, I will look for more on this topic... :)

James Xuan [PersonRank 10]

14 years ago #

Isn't it kind of funny how the first comment appears to be Spam? :P

The only problem I can see with this is that there would be hundreds of comments like "Win." and "lol epic fail" and such, but those are not necessarily desired comments on a blog anyway, so this could be a way to provoke more thoughtful comments. One downside is that people may be discouraged by a message telling them that their comment wasn't worthy of the discussion, or may become frustrated if the system incorrectly flags a comment as spam when it is simply boring and generic.

Ryan [PersonRank 0]

14 years ago #

I think the false positive rate here would be super high.

Tony Ruscoe [PersonRank 10]

14 years ago #

This is exactly how I used to check whether suspicious comments were spam when I was moderating this forum. I thought of automating it too, but I guess I never got around to suggesting it to you... :)

Another thing you might want to do is check which other form fields are submitted along with the comment. When you see comments with no URL that appear to be completely pointless, it's possible a bot was filling a form field named "URL" in the hope that its contents would be automatically linked when published.
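Tony's field-inspection idea can be sketched server-side: compare the submitted field names against the fields the form actually renders and flag anything extra. This is a minimal sketch under assumptions — the expected field names and the helper names are hypothetical, not taken from this forum's software:

```python
# Hedged sketch: flag form submissions that include fields the real
# form never renders (e.g. a bot blindly filling in a "URL" field
# in the hope it gets linked when published).

EXPECTED_FIELDS = {"name", "email", "comment"}  # hypothetical form layout

def unexpected_fields(submitted: dict) -> set:
    """Return the set of submitted field names the form does not have."""
    return set(submitted) - EXPECTED_FIELDS

def is_suspicious_submission(submitted: dict) -> bool:
    # Any extra field (like "URL" or "homepage") suggests an automated
    # poster guessing at common field names rather than a human filling
    # in the rendered form.
    return bool(unexpected_fields(submitted))
```

A submission like `{"name": "x", "comment": "hi", "URL": "http://..."}` would be flagged, while one containing only the rendered fields would pass.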

David Mulder [PersonRank 10]

14 years ago #

The only problem with this is that it would stop working within a few months, once spammers realise it is happening. On top of that, quite a number of spammers already randomize comments, as e.g. WordPress doesn't allow duplicate literal comments if I remember correctly (except when they are shorter than a certain number of characters).

Tony Ruscoe [PersonRank 10]

14 years ago #

David, in my experience here, spammers are already randomizing comments, but they're posting the same few randomly generated comments to thousands and thousands of websites, so each exact phrase will inevitably still appear in search results thousands of times. Only if they start to post pretty much unique comments would this not work.
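Tony's observation suggests the automation is straightforward: build the same kind of exact-phrase query quoted elsewhere in this thread and compare the hit count against a threshold. A minimal sketch, with the hit-count lookup injected as a callable, since the actual search API and the threshold value are assumptions:

```python
# Hedged sketch of the thread's idea: if an exact-phrase web search for
# a comment returns a large hit count, the comment is probably a
# mass-posted spam template. The hit-count function is passed in so the
# sketch stays independent of any particular search API.

from urllib.parse import quote_plus

SPAM_HIT_THRESHOLD = 1000  # tuning parameter, not from the original post

def exact_phrase_query_url(comment: str) -> str:
    """Build the kind of as_epq= URL used in this thread's examples."""
    return "http://www.google.com/search?as_epq=" + quote_plus(comment)

def looks_like_spam(comment: str, hit_count, threshold=SPAM_HIT_THRESHOLD) -> bool:
    """hit_count: a callable taking a phrase and returning its search hit count."""
    return hit_count(comment) >= threshold
```

With George R's numbers, a template seen 198,000 times is flagged while Ross Dargan's 4-hit comment passes — which matches his point that it would not have been identified as spam.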

George R [PersonRank 10]

14 years ago #

My original comment was in another thread, which at the time had a number of spam entries. Unfortunately my original comment seems to have been removed. I gave a number of examples from that thread. All but one had more than 1000 hits.

Searching for the above ["Interesting blog, ... topic"] http://www.google.com/search?as_epq=Interesting+blog%2C+thank+you%2C+I+will+look+for+more+on+this+topic currently produces only 4 hits initially. Google offers to include omitted similar results, producing a total of 18 hits. The first two are from this blog and this forum. The other 16 are from other sites which seem to have included copies of what is here. Ross Dargan's above comment would not have been identified as spam. http://blogoscoped.com/forum/169714.html

desalvionjr [PersonRank 2]

14 years ago #

Combined with a number of other methods, this could be VERY effective.

Philipp Lenssen [PersonRank 10]

14 years ago #

George, the examples you posted are below (when deleting spam threads we usually also delete the replies to the spam, just so the thread will be "normal" again, though there will always be backups):

------------------------------------

198000 ["ok you just have different approach"] http://www.google.com/search?as_epq=ok+you+just+have+different+approach

1349 ["Not bad article, ... approach"] http://www.google.com/search?as_epq=Not+bad+article%2C+but+I+really+miss+that+you+didn't+express+your+opinion%2C+but+ok+you+just+have+different+approach

2860 ["I read about ... similar"] http://www.google.com/search?as_epq=I+read+about+it+some+days+ago+in+another+blog+and+the+main+things+that+you+mention+here+are+very+similar

11600 ["I am not ... skills"] http://www.google.com/search?as_epq=I+am+not+going+to+be+original+this+time%2C+so+all+I+am+going+to+say+that+your+blog+rocks%2C+sad+that+I+don't+have+suck+a+writing+skills

only 325 ["i am new ... keybord"](sic) http://www.google.com/search?as_epq=i+am+new%2C+happy+to+be+here..+need+fix+my+keybord

3270 ["need fix my keybord"](sic) http://www.google.com/search?as_epq=need+fix+my+keybord

------------------------------------

WatchSteveDrum [PersonRank 0]

14 years ago #

This sounds very similar to the concept of an anti-plagiarism program like TurnItIn.com. Many of the same issues dealing with false positives, quoted text, and the like would probably come up in this context too. Basically, in both cases you're just looking for identical text that has no legitimate reason to be identical (unlike, say, a quote). I imagine this case might be even easier than anti-plagiarism, though, because here we're talking about hundreds or thousands of repeats as opposed to just a few with anti-plagiarism solutions.

So no real point there, just an interesting thought about it...

mak [PersonRank 5]

14 years ago #

Philipp, why don't you implement a simple captcha in this form? I guess it would eliminate a lot of spam.
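mak's suggestion can be sketched without any image generation: a plain question-and-answer challenge rendered with the form and checked on submit. The questions, answers, and session handling here are purely illustrative, not from any real captcha library:

```python
# Hedged sketch of a minimal text captcha: the server picks a question
# to render with the form, stores the expected answer (e.g. in the
# session), and compares it against the submitted value.

import random

# Illustrative challenge pool; a real deployment would want many more.
QUESTIONS = [
    ("What is three plus four?", "7"),
    ("Type the word 'forum' backwards.", "murof"),
]

def new_challenge():
    """Pick a random (question, expected_answer) pair for the form."""
    return random.choice(QUESTIONS)

def check_answer(expected: str, submitted: str) -> bool:
    # Tolerate surrounding whitespace and case differences.
    return submitted.strip().lower() == expected.lower()
```

A fixed pool like this stops only the dumbest bots, of course; the point is just how little code the check itself needs.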

bugstomper [PersonRank 0]

14 years ago #

George R's examples are all short comments apparently posted by a bot, with the spam link being in the poster's home page URL. Is that really a high enough percentage of spam comments to be worth looking up an exact match for every comment shorter than some minimum length?

In my experience, the single most effective block against bots that submit web forms is to require JavaScript. Not by checking whether JavaScript is enabled, but by requiring JavaScript to be executed before the form can be submitted, such as an onload function that fills in the form element's action field. It may be annoying to people who prefer to use something like NoScript, but they should be used to having to whitelist certain sites.

For some reason this does not seem to be a popular anti-bot measure, but there are quite a few hashcash plugins for various blog and forum software. Hashcash uses JavaScript on the client side to force the browser to spend some arbitrary amount of time calculating a value that has to be submitted in a hidden field added to the form. It works as well as any simpler way of requiring JavaScript, but I guess people think that having a cryptographically sound proof-of-work function makes it a stronger, more technically sophisticated anti-bot method.
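The hashcash scheme described above can be sketched as an asymmetric proof of work: the client brute-forces a nonce whose hash has enough leading zero bits, and the server verifies with a single hash. The token format and the choice of difficulty are assumptions for illustration, not taken from any particular plugin:

```python
# Hedged sketch of a hashcash-style proof of work. The client spends
# time in solve(); the server spends one hash in verify().

import hashlib

def leading_zero_bits(digest: bytes) -> int:
    """Count how many bits at the start of the digest are zero."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        # Count zero bits at the top of the first nonzero byte.
        while byte < 0x80:
            bits += 1
            byte <<= 1
        break
    return bits

def verify(token: str, nonce: int, difficulty: int = 16) -> bool:
    """Server side: a single SHA-1 hash, so checking is cheap."""
    digest = hashlib.sha1(f"{token}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty

def solve(token: str, difficulty: int = 16) -> int:
    """Client side: brute-force a nonce (the deliberately slow part)."""
    nonce = 0
    while not verify(token, nonce, difficulty):
        nonce += 1
    return nonce
```

At 16 bits of difficulty the client tries on the order of 65,000 hashes per submission, which is negligible for one human comment but adds up for a bot posting to thousands of sites.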

The only advantage I've found for hashcash over adding the few lines of an onload function to enable the form is when someone else has written it as a plugin for the software I'm using so I can click and install it without having to change any code.
