Google Blogoscoped

Forum

Finding Hidden Links

Michael Keukert [PersonRank 1]

Tuesday, December 5, 2006
17 years ago · 7,039 views

When Google hired a few Firefox developers late last/early this year, speculation ran high that they were working on (in that order) a Google browser, a Google OS or a Googlebot which can "see". Webmasters also repeatedly report seeing bot behaviour patterns from clients with a standard Firefox identification. But all the techniques you mention show that it is quite hard, and probably won't work without human intervention in the foreseeable future. Still, I'd be surprised if they were NOT working on something like this.

John Resig [PersonRank 1]

17 years ago #

I think you're overthinking this a bit. You don't actually have to render the page in order to determine if some text, or some links, are not visible to the user. All you have to do is parse the HTML of the page (which Google already does), then load in the appropriate CSS files and build their inheritance tree. You'll then be able to tell (with a good degree of certainty) if some text or links are not visible – all without ever actually rendering the page. This is the type of thing that Google can throw on their massive cluster and be done with quite quickly.

Is there any record of Google spidering CSS files for a site? I don't think there is, and until they do they're probably just using some implicit heuristics to determine if a page is "spammy" or not.
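As a rough illustration (not anything Google has published), a static check along those lines might look like the sketch below. It only handles inline styles and a few obvious hiding tricks, and assumes BeautifulSoup is available; a fuller version would also fetch the linked stylesheets and resolve the cascade.

```python
import re
from bs4 import BeautifulSoup  # assumption: available on the indexing side

# Inline-style patterns that commonly hide text or links.
HIDING_PATTERNS = [
    r"display\s*:\s*none",
    r"visibility\s*:\s*hidden",
    r"font-size\s*:\s*0",
    r"text-indent\s*:\s*-\d{3,}",  # large negative indent pushes text off-screen
]

def find_suspicious_links(html):
    """Return (href, anchor text) pairs for links hidden by their own
    or an ancestor's inline style."""
    soup = BeautifulSoup(html, "html.parser")
    suspicious = []
    for a in soup.find_all("a", href=True):
        node = a
        while node is not None and hasattr(node, "attrs"):
            style = node.attrs.get("style", "")
            if any(re.search(p, style, re.I) for p in HIDING_PATTERNS):
                suspicious.append((a["href"], a.get_text(strip=True)))
                break
            node = node.parent
    return suspicious
```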

Sebastian [PersonRank 0]

17 years ago #

I will never understand why they don't load the CSS files.

Philipp Lenssen [PersonRank 10]

17 years ago #

John, it's not as simple as that – e.g. refer to the background image example I mentioned in the post. A background image can consist of two different colors (to use a simplified example): one area that content is positioned onto, and one that matches the page's background color. If you now create text in that background color but position it over the content area of the background image using CSS, your text is perfectly visible and fine... however, you will not be able to determine this unless you actually parse/view the graphic (like the JPG file referenced in the CSS). By just parsing the CSS, the approach you mention would incorrectly label this page as spam. And that's all not even considering dynamic menus (or dynamically applying CSS styles via JS!), which seem to be a crucial part of the problem... and that means you need to "render" the JS (and even if you did so, I've listed remaining problems).
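A tiny sketch of that point: whether "background-colored" text is readable depends on the pixels of the background image under it, which you only learn by opening the image. The file name, coordinates and threshold below are made up for illustration, and Pillow is assumed to be installed.

```python
from PIL import Image  # assumption: Pillow is installed

def text_is_readable(bg_image_path, text_rgb, box, threshold=60):
    """True if the region of the background image under `box`
    (left, upper, right, lower) differs enough from the text color."""
    region = Image.open(bg_image_path).convert("RGB").crop(box)
    pixels = list(region.getdata())
    # Average color of the area the text is positioned over.
    avg = [sum(channel) / len(pixels) for channel in zip(*pixels)]
    return sum(abs(a - t) for a, t in zip(avg, text_rgb)) > threshold

# White text positioned over the dark band of a (hypothetical) background.jpg:
# readable, even though it matches the page's background-color and a
# CSS-only check would call it hidden.
# text_is_readable("background.jpg", (255, 255, 255), (0, 120, 800, 200))
```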

Fred [PersonRank 0]

17 years ago #

Forget all the rendering. Just monitor Google Web Accelerator traffic for clicks generated by real users. Any links actually clicked by real users can be flagged as visible. Links never followed by any real users can (given enough users to be statistically relevant) be flagged as hidden.
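A back-of-the-envelope version of that idea, with made-up inputs – purely illustrative, and nothing here reflects how Google actually uses (or doesn't use) Web Accelerator data:

```python
def classify_links(page_views, clicks_per_link, min_views=10_000):
    """clicks_per_link maps link URL -> observed click count from real users."""
    if page_views < min_views:
        return {}  # not enough traffic to be statistically relevant
    return {url: ("visible" if clicks > 0 else "possibly hidden")
            for url, clicks in clicks_per_link.items()}

# classify_links(50_000, {"/about": 1200, "/partner-casino": 0})
# -> {'/about': 'visible', '/partner-casino': 'possibly hidden'}
```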

Katinka Hesselink [PersonRank 2]

17 years ago #

That's assuming they use the traffic data they have through AdSense and stuff like that for their search engine. We've been assured that they don't. It's also tricky: it puts sites at a disadvantage whose hidden links only show up for users who browse with the keyboard or by voice. Not all hidden links are spam links.

Ryan [PersonRank 0]

17 years ago #

Converting the page to an image and detecting text in it is not a hard thing for Google to do, especially given a cluster of machines and the fact that spidering + updating the cache + updating the rankings are not instantaneous.

I don't think they're doing it, but they could and we wouldn't notice.
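A crude sketch of that render-and-compare idea: take the words in the page's extracted text and the words an OCR pass recovers from a screenshot of the rendered page; words that are in the markup but never show up in the image are hidden-text candidates. The rendering and OCR steps are assumed to be done elsewhere with whatever tools are at hand.

```python
import re

def hidden_word_candidates(page_text, ocr_text):
    """Words present in the page's extracted text but missing from the
    OCR of its rendered screenshot."""
    in_markup = set(re.findall(r"[a-z]{4,}", page_text.lower()))
    in_image = set(re.findall(r"[a-z]{4,}", ocr_text.lower()))
    return in_markup - in_image
```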

The same goes for their human evaluators. We really don't know how they're using them either.

Your best bet, though, is to file a spam report when you encounter a spammy site. I've filed quite a few, and seen action taken on most of the sites.

JohnMu [PersonRank 10]

17 years ago #

Google can never automatically detect all misleading hidden text / links. There's just about no way. CSS accessibility features, CSS dropdown menus, JavaScript texts, AJAX usage, etc. etc. – it's impossible to find automatically and sometimes even very hard to detect manually. The only thing that could be done is to let certain "features" trigger a red flag to queue the page up for a manual review – it can't be automatically penalized. (Of course there are some features which are easy to detect, and I'm sure Google does do that: 1 pixel font with the font-tag comes to mind.)
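One of those "easy" red flags as a toy check – the point being only to queue a page for manual review, never to penalize it automatically. The patterns are guesses at what "1 pixel font with the font-tag" tricks tend to look like:

```python
import re

def red_flags(html):
    """Cheap checks that only queue a page for human review."""
    flags = []
    if re.search(r"font-size\s*:\s*[01]px", html, re.I):
        flags.append("0px/1px font-size")
    if re.search(r"""<font[^>]+size\s*=\s*["']?1\b""", html, re.I):
        flags.append("<font size=1>")
    return flags  # non-empty -> send to a human reviewer, don't auto-penalize
```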

The *real* issue with hidden text is IMHO not when a webmaster or SEO adds hidden text / links to his own site, it's when a hacker goes in and adds hidden text / links to other sites, subtle enough to pass under the rough radar of the search engines. Adding hidden text and links to your own site will generally not get you much more than good on-page SEO, no big deal. Adding hidden text / links to someone else's site in a way that they don't notice is something completely different.

I hate examples, but stuff like http://www.unesco.org/cgi-bin/webworld/portalsforum/gforum.cgi and http://www.unesco.org/webworld/portal_bib/pages/Cool/ just gets on my nerves... In addition to the hidden links, the webmasters are unresponsive – the site is too large for anyone to feel responsible. Follow the site being linked to around a bit and you'll find lots of similar linking structures pointing to all sorts of sites. Crazy.

John Honeck [PersonRank 10]

17 years ago #

Why penalize? Why not just ignore?

A penalty or deindexing is just confirmation that a technique went over the line. After detecting the hidden text simply ignore it and let the visible text be the basis to judge the page.

If anything, the person who puts up the hidden text will think it's working and waste their time adding more – better that than having them find more creative ways to go undetected.

If someone is lying to you, they have the power.
If you know they are lying, you have the power.
If they know you know, no one has the power.

JohnMu [PersonRank 10]

17 years ago #

Which links would you ignore, Johnweb? All of the links on the page? If all of them, then that would end up with the same result as a penalty / deindexing: with no value being transferred to the linked pages, they would quickly disappear from the index (or more likely end up in the supplementals). Perhaps that is already happening more than we think :-).

Which is worse?

In my opinion a full deindexing is much less of a problem than a subtle devaluation: you'll easily spot a full deindexing and go searching for the reasons. You usually wouldn't notice a subtle devaluation, you'd just accept it and perhaps rant about it somewhere (or not).

Philipp Lenssen [PersonRank 10]

17 years ago #

> I hate examples but stuff like unesco.org/cgi-bin/webworld/po ...
> and unesco.org/webworld/portal_bib ... just gets on my nerves...

Can you describe what is wrong with those pages JohnMu?

> A penalty or deindexing is just confirmation that a
> technique went over the line. After detecting the hidden
> text simply ignore it and let the visible text be the basis
> to judge the page.

Interesting approach. If Google really does this...

JohnMu [PersonRank 10]

17 years ago #

If you have Firefox + the Web Developer extension, use Information / View Link Information to get a listing of links. You'll spot one that kind of falls out of place :-). Use view source to find where it is. Sneaky... and certainly not placed there by the webmaster.
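Roughly what "View Link Information" gives you, as a quick script: print every link's target and anchor text so the one that falls out of place is easy to spot. requests and BeautifulSoup are assumed to be installed; the URL is just the example page discussed above.

```python
import requests  # assumption: requests is installed
from bs4 import BeautifulSoup

def list_links(url):
    html = requests.get(url, timeout=30).text
    for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
        print(a["href"], "|", a.get_text(strip=True) or "(no anchor text)")

# list_links("http://www.unesco.org/webworld/portal_bib/pages/Cool/")
```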

JohnMu [PersonRank 10]

17 years ago #

There is also a problem with just ignoring known and recognized abuse: you can assume that they won't be able to recognize all forms – and if the abuse, when it is recognized, is merely ignored, you're in a situation where it can't harm you but might bring a potential gain (and if not on Google, at least on a different engine). There would be no reason not to use hidden text / links if the worst that could happen is that it's ignored.

It would be like not handing out fines to people who knowingly try to cheat on taxes. If you don't get caught, you can end up paying less taxes. If you get caught, you just pay the normal taxes, no problem: let's all try to cheat.

Tadeusz Szewczyk [PersonRank 10]

17 years ago #

There is no "SEO spam". Either it is SEO or spam. Likewise there is no hot ice or liquid stone. Search engine OPTIMIZATION means making a site to suit the search engine. If you trick the search engine it is search engine spam (SES?)

Philipp Lenssen [PersonRank 10]

17 years ago #

I know we respectfully disagree on this point, Tadeusz: I believe yes, there is spam of many forms – blog spam, email spam, SEO spam, etc. That doesn't mean all blogs are bad! Or all emails! But if a company employs blackhat SEO tactics, it doesn't suddenly stop being an SEO company, it just stops being a *good* SEO company – but not all SEO is good, no matter how much you like to protect your trade (which I understand, of course).

So, IMO, saying "SEO spam doesn't exist" is as wrong as saying "all SEO is spam" (and you rightfully oppose people who say the latter). You really need to invent a new word if you want to give it new meaning, so maybe you can switch to "ESEO" (ethical search engine optimization) or something. I called it EUO (end user optimization):

http://blogoscoped.com/archive/2003_06_06_index.html#200393098

Philipp Lenssen [PersonRank 10]

17 years ago #

Thanks for the followup JohnMu, I've been able to view the links now (without the extension actually – in Firefox you can right-click -> Page Info -> Links). The one who placed the 44w.de link is sneaky indeed.

adame [PersonRank 1]

17 years ago #

Very cool trick.

Mat [PersonRank 1]

17 years ago #

So, maybe I am wrong, but how is it possible to add the links as mentioned in the example (unesco) above? There is no way of doing this if you do not have access to the webspace, right? So it is up to the webmaster of the site...

Philipp Lenssen [PersonRank 10]

17 years ago #

Yes, or someone selling a template system, or a message board, or someone who hacked the site... several possibilities...

JohnMu [PersonRank 10]

17 years ago #

Most definitely someone who hacked the site – it's all within the same sub-section of the site. They use several open-source tools on a site like that, and perhaps they forgot to upgrade one of them on time and left room for a cracker to get in.

Cracking your way in and breaking the website is a sure way to get your links taken out again. Cracking your way in and adding your links in a way that is really hard to recognize (a single, hidden link here and there) is something that will remain online for a while and might never get noticed by the webmasters. It might not even get a real penalty, since it's not overdone and that site links out to a lot of other sites naturally.

Would *you* notice it if someone did this to your site? How would you go about checking your site to see if this happened? To tell you the truth, I probably wouldn't notice it, not even on a tiny site.
