Finding Hidden Links

Tuesday, December 5, 2006

Blaues-Haus-Duesseldorf.de is a classic case of hidden links meant to fool search engines: the HTML defines a background color of “#F7F0FB” and a font color of “#F7F0FB”, thus rendering the keyword-stuffed text in the footer area invisible.
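To make the trick concrete, here is a minimal sketch of the most naive check a crawler could run. It assumes the colors are set through the old inline `bgcolor` attribute on `<body>` and the `color` attribute on `<font>`; the actual markup of the site isn't shown here, and the class and function names are made up for illustration. It ignores CSS, images and scripts entirely.

```python
from html.parser import HTMLParser


class HiddenTextFinder(HTMLParser):
    """Collects text whose <font color> equals the <body bgcolor>.

    Hypothetical helper for illustration only: it looks at inline
    HTML attributes and nothing else (no CSS, no JavaScript).
    """

    def __init__(self):
        super().__init__()
        self.bgcolor = None      # page background from <body bgcolor="...">
        self.color_stack = []    # colors of the currently open <font> tags
        self.hidden_text = []    # text printed in the background color

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "body":
            self.bgcolor = (attrs.get("bgcolor") or "").lower() or None
        elif tag == "font":
            self.color_stack.append((attrs.get("color") or "").lower() or None)

    def handle_endtag(self, tag):
        if tag == "font" and self.color_stack:
            self.color_stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text or self.bgcolor is None:
            return
        # The innermost <font> that actually sets a color wins.
        current = next((c for c in reversed(self.color_stack) if c), None)
        if current == self.bgcolor:
            self.hidden_text.append(text)


def find_hidden_text(html):
    """Return the text chunks that share their color with the background."""
    finder = HiddenTextFinder()
    finder.feed(html)
    return finder.hidden_text


page = (
    '<html><body bgcolor="#F7F0FB"><p>Welcome to our hotel</p>'
    '<font color="#f7f0fb">hotel cheap hotel best hotel</font></body></html>'
)
print(find_hidden_text(page))
```

A check like this only catches the laziest variant of the trick: a stylesheet, a background image or a script can produce the same invisible text without these two attributes ever matching.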

Now why does Google not simply find ways to ban such tricks? The site isn’t new; it’s classic SEO spam that was already at the same location last year, as Siggi Becker tells me, and it still has a PageRank of 3 (and surely it’s not a special case, but one of many). Is it because, to understand which text is showing on the page, or might be showing on the page, you need a layout rendering engine which understands CSS, deprecated inline HTML styles, JavaScript... and maybe even a bit of “human” common sense to understand realistic modes of user behavior?

After all, a page using a dynamic navigation menu that opens on mouse hover also “hides” links, albeit in a way that you wouldn’t think of as fishy. Even static text printed in the background color of a page might be positioned on top of a differently colored background image, so that it becomes visible again; and maybe its positioning depends on an exotic CSS hack that happens to display correctly in typical browsers, but isn’t valid according to the W3C rulebook.

Thus, for Google or other search engines to truly understand which text is hidden for the wrong reasons, they’d have to do a lot more than compare background colors with font colors... thanks to DHTML, they’d even have to do a lot more than generate a screenshot of the site using e.g. a Mozilla rendering engine and then apply OCR (even though this procedure would likely already be much too time-consuming). And even if they were able to write code that generates all possible dynamic versions of a page to check which links can actually be reached, and which can’t, who’s to say (except an advanced, “common sense” AI) that the dynamic menus are not simply positioned in places where normal users won’t ever hover over them?

Is there any way in which Google and others can truly differentiate a spam link from a normal link? It’s hard to tell. It might be more realistic that Google’s algorithms assign negative “spam points” to every suspicious technique used within the HTML, CSS and so on. It’s as fuzzy as it’s pragmatic: three negative points might be assigned to a page that has the same font color as background color. Another point if text is using a tiny font size. Another point for every keyword in the title beyond the first dozen. Another point for every non-nofollowed link pointing to a shady neighborhood. Another point if there’s a meta redirect in the page. Another point if you’ve got duplicate content. Another 0.1 point for every keyword repetition inside the text, and so on. Now while every point (taken on its own) may be assigned for something that has a harmless, non-spammy explanation, chances are that when you cross a threshold of -N points, your page is indeed spam, and can then be automatically banned. (Positive points, on the other hand, might be assigned for such things as having backlinks or a high PageRank, to make sure that a site like CNN that’s linked from all over the place does not suddenly drop into googleaxed oblivion.)
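The point system guessed at above is easy to sketch in a few lines. To be clear, this is pure speculation turned into code, not anything Google has confirmed: the penalty weights are the ones from the paragraph, while the `PageSignals` fields, the one-point-per-PageRank bonus and the threshold of -5 (standing in for the unknown -N) are assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class PageSignals:
    """Hypothetical per-page signals a crawler might have collected."""
    same_font_and_background_color: bool = False
    tiny_font: bool = False
    title_keywords: int = 0        # number of keywords in the <title>
    shady_followed_links: int = 0  # non-nofollowed links to bad neighborhoods
    meta_redirect: bool = False
    duplicate_content: bool = False
    keyword_repetitions: int = 0   # repeated keywords in the body text
    pagerank: int = 0              # stand-in for backlinks / PageRank


def spam_score(page, pagerank_bonus=1.0):
    """Sum the negative "spam points", then add a bonus for reputation."""
    score = 0.0
    if page.same_font_and_background_color:
        score -= 3
    if page.tiny_font:
        score -= 1
    score -= max(0, page.title_keywords - 12)  # 1 point per keyword past a dozen
    score -= page.shady_followed_links
    if page.meta_redirect:
        score -= 1
    if page.duplicate_content:
        score -= 1
    score -= 0.1 * page.keyword_repetitions
    score += pagerank_bonus * page.pagerank    # assumed weight, not from any source
    return score


def is_spam(page, threshold=-5.0):
    """Flag the page once it reaches the (assumed) threshold of -N points."""
    return spam_score(page) <= threshold


footer_spam = PageSignals(
    same_font_and_background_color=True, keyword_repetitions=40, pagerank=3
)
print(spam_score(footer_spam), is_spam(footer_spam))
```

Playing with the numbers shows how sensitive such a scheme would be to its weights: a generous bonus for reputation can keep a page with hidden text above the threshold, while a stingy one risks banning well-linked sites over a single harmless meta redirect.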

Google, on the other hand – to protect their anti-spam algos from spammers in particular – is not likely to share with us what’s really happening behind the scenes... though, whatever it is, we do see that it’s not working in all cases.

By Philipp Lenssen
