Google Blogoscoped

Friday, April 11, 2008

Googlebot Submitting Forms to Find More Pages

The deep web is a term for all the pages that are live on the web but, for one reason or another, not indexed by search engines. For instance, search engines traditionally follow links in HTML, but from what we know they don’t understand JavaScript (yet) or submit forms. Now Google has announced that they have started to experiment with submitting forms on some “high quality sites” by entering words picked from the site into the form’s text boxes, and by selecting different radio buttons, select boxes or checkboxes. When the Googlebot then determines that the pages returned for such a submission are valid, “interesting” and unique, it may add them to the search index.

Google notes that they only do this form submission for “GET” forms. A form using GET results in a parametrized URL like example.com/show?foo=bar. The guidelines for webmasters are that a GET request should never actually change data on the server, like triggering a user registration; for such things, webmasters should use POST, which the Googlebot will not submit. Google also notes that they “omit any forms that have a password input or that use terms commonly associated with personal information such as logins, userids, contacts, etc.” Plus, Google says that pages they find this way will not reduce the PageRank of other pages on the site.
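
For illustration, here’s a minimal sketch of the kind of GET form this applies to (the action URL and field name are made up to match the example URL above); typing “bar” into the text box and submitting requests exactly that kind of parametrized URL:

    <!-- Hypothetical site search form using method="get" -->
    <form action="http://example.com/show" method="get">
      <input type="text" name="foo">
      <input type="submit" value="Search">
    </form>
    <!-- Submitting "bar" requests example.com/show?foo=bar, a URL Googlebot
         can crawl and index. The same form with method="post" would be skipped. -->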

With this move, Google digs a bit deeper than before, which may result in more relevant results for searchers and a smaller “deep web.” If webmasters misconfigure their scripts or robots.txt files so that their site goes against web standards, it may also cause a bit of new confusion for some. On the other hand, the move has the potential to help webmasters who have such misconfigurations, especially those who aren’t very knowledgeable about web accessibility or SEO and who don’t put up crawlable links to all their sub-pages (and in reverse, if Googlebot keeps getting smarter about what it crawls, in the long run some web developers may see less incentive to remove small inaccessibilities from their sites).

[Thanks Miss Universe! Sketch drawn by MMOArt.]

Update: A correction: Google does parse some JavaScript to find URLs, as TomHTML told me. Google’s Matt Cutts confirmed to me that “Google has the ability to scan JS to discover some very clearly provided links.” Whether or not Google parses more complicated JS, Matt leaves open; he says, “Primarily we use it as a way to discover new links to possibly crawl.” [Thanks Tom and Matt!]
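
To illustrate (my own sketch, not something Google documented): a “very clearly provided” link in JavaScript might be one where the URL sits in plain sight as a string literal, such as the following, which a crawler could pick out without even executing the script:

    <script type="text/javascript">
      // The URL appears as a plain string literal, so a crawler can spot it
      // without running the script; example.com/archive.html is made up.
      document.write('<a href="http://example.com/archive.html">Archive</a>');
    </script>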
