Google Blogoscoped

Wednesday, June 4, 2003


Patent 6,526,440

Did you ever hear about Patent 6,526,440? This is the summary of the invention:

“Systems and methods consistent with the present invention address this and other needs by providing an improved search engine that refines a document’s relevance score based on inter-connectivity of the document within a set of relevant documents.

In one aspect, the present invention is directed to a method of identifying documents relevant to a search query. The method includes generating an initial set of relevant documents from a corpus based on a matching of terms in a search query to the corpus. Further, the method ranks the generated set of documents to obtain a relevance score for each document and calculates a local score value for the documents in the generated set, the local score value quantifying an amount that the documents are referenced by other documents in the generated set of documents. Finally, the method refines the relevance scores for the documents in the generated set based on the local score values.”
United States Patent 6,526,440, Bharat, February 25, 2003
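Read as an algorithm, the claimed method can be sketched in a few lines. This is a rough, assumed interpretation — the names, data structures, and the exact way local scores are combined with relevance scores are my own guesses, not taken from the patent text:

```python
def refine_scores(relevance, links):
    """Sketch of the patent's claimed method (assumed interpretation).

    relevance: {doc: initial relevance score from term matching}
    links:     {doc: set of docs it links to}
    Returns relevance scores refined by inter-connectivity
    within the result set.
    """
    doc_set = set(relevance)
    # Local score: how often a document is referenced by *other documents
    # in the generated result set* (links from outside the set don't count).
    local = {d: 0 for d in doc_set}
    for src in doc_set:
        for dst in links.get(src, ()):
            if dst in doc_set and dst != src:
                local[dst] += 1
    # Refinement step: boost relevance by in-set popularity.
    # The multiplicative combination here is an assumption.
    return {d: relevance[d] * (1 + local[d]) for d in doc_set}
```

The key difference from plain PageRank is that only links among the documents already matching the query contribute, not links from the whole web.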

Using Links to Measure Importance

Yes, this whole thing is more commonly known as PageRank*. Or, at IBM, as CLEVER (CLient-side EigenVector Enhanced Retrieval). And others have thought along the same lines before, as I showed in a recent post quoting Umberto Eco in 1995 (at a time when he himself didn’t use the Internet).

*This is one of two Google Inc. patents, the other being Methods and apparatus for using a modified index to provide search results in response to an ambiguous search query (March 4, 2003).

Two concerns might be raised:

  1. What are the problems of using references to determine the authority of a webpage?

  2. Could there be a realistic alternative for determining it?

First of all, systems such as PageRank do not settle the subjective issue of quality. The idea, of course, is that the more people link to a site, the more people like it. In the eyes of Google, which mentions the “democratic nature of the web”, each link casts a vote. Google says “Important, high-quality sites receive” higher PageRank. Important, maybe. But high-quality? The vote a net citizen casts is not intentional as such, because not everybody linking to something is praising it. Instead of a positive vote, one might want to cast a negative one.

Negative Reviews Increase Publicity

Take this simple constructed example. A site promising free toys has a form where you can put in your address. Instead of toys, however, the kids get dangerous little chemistry sets, and many of them blow up their pet hamsters. A big discussion evolves online, especially in the blogologue of the highly linked, up-to-date world of online journals. People are raving mad, and everyone links to the site.

What happens now? Naturally, the site’s ranking, if based solely on links and relevant keywords, is boosted: everyone is linking to it, and everyone is talking about “free toys” at the same time. And everyone else, not knowledgeable about all those politics, simply entering “free toys” into a search engine, will suddenly be presented with the worst site of all.

There are a number of other factors determining ranking, and more common toy manufacturers would certainly still rank better than this imagined chemistry-set producer. However, the example points out the inherent problem of misreading the votes of the certainly democratic web as entirely positive ones.

Tour Guides Should Know Better

[Stop] [Walk] [Don’t Walk] [One Way]

To the example, we add a metaphor. Imagine a tourist arriving in a strange city, following the most popular tour guide.
Being cautious, he asks the guide how he decides what is relevant for tourists to see.
And the guide replies: “I just follow big flashy signs. If a sign is put up in front of an alley, I will bring you there, because obviously that place must be important.”
The guide would probably have no reply if the tourist asked: “What about signs like ’Mind the Dog’, ’Dead End’, ’Please Don’t Enter’, or ’No Tourists Allowed’? How do you know what is relevant to me and where I want to go?”

What is Relevant?

What determines page relevancy, based on a certain search query?

To continue the prior example: what if someone entering “free toys” is actually a journalist who wants to research the whole hamster scandal? Unless the search engine knows more about the user, it cannot make an educated choice on this issue. However, it’s a good guess that most people want actual free toys, as opposed to killing machinery. Most of the time, a search engine fares better delivering the promised high-quality sites and disregarding the few people intentionally looking for bad ones.

Sorting Out Votes

This, to me, is the crucial part. The first thing a guide dog for a blind person should know is the difference between “Walk” and “Don’t Walk”. When it comes down to it, though, your typical traffic light is a one-bit communication channel. How can you algorithmically determine what is a positive and what is a negative vote within natural written language?

I could imagine the following approaches:

  1. Web authors assign a vote via mark-up (rather unrealistic, since SEs don’t control the authors).

  2. Search engine developers sort it out via text analysis* (a workaround, but a more pragmatic one, since SEs control their own algorithms).

*My attempts at automated movie-quality determination with Moviebot proved to me that it takes more than a five-minute hack to implement this “text analysis”, especially given Google’s current lack of fuzzy OR keyword intersection.

Analyzing Text

First of all, the scope has to be understood (the range a negative or positive word or phrase “covers”), because a single page can talk about different things at once. This in itself is tough. The second hurdle is figuring out what’s positive and what’s not.
Let’s take a website review example:

“I checked two different sites today. This one is really fantastic. This other one, however, isn’t good at all. The first one is great, indeed, but the second site really has bad service.”

Here, the scope of the good review and of the bad review would be one sentence each, plus a completely separate sentence with mixed scope (talking about both sites at once). Hard to distinguish. OK, let’s assume that in big numbers, i.e. thousands or millions of reviews, this will even out: pages mentioning the bad site would, in general, be more negative than those talking about the good one.

In addition to the scope, we have to actually figure out what’s good and what’s bad. For that, one might analyze the words and phrases. (Simply analyzing single words, like “good”, is not really enough: the word “good” could be part of the phrase “not good”.)
All of these are risky assumptions, but maybe in large numbers, and with a lot of text processing against a big dictionary, it could work. (A simple ten-keywords-limited Google query is by far not enough to handle this.)
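To make the phrase problem concrete, here is a toy vote classifier. It is a deliberately minimal sketch, not a workable approach at web scale: the word lists and the one-word negation window are my own assumptions, and real text would need a big dictionary and proper scope analysis, as argued above:

```python
# Hypothetical word lists -- a real system would need a large dictionary.
POSITIVE = {"good", "great", "fantastic"}
NEGATIVE = {"bad", "awful", "terrible"}
NEGATORS = {"not", "isn't", "never", "no"}

def vote(text):
    """Classify a snippet as a positive (+1), negative (-1), or
    neutral (0) vote, flipping polarity right after a negator so
    that "not good" counts as negative rather than positive."""
    score = 0
    words = text.lower().replace(",", " ").replace(".", " ").split()
    for i, word in enumerate(words):
        polarity = (word in POSITIVE) - (word in NEGATIVE)
        # Naive scope handling: only look one word back for a negator.
        if polarity and i > 0 and words[i - 1] in NEGATORS:
            polarity = -polarity
        score += polarity
    return (score > 0) - (score < 0)
```

Even this tiny example already needs negation handling to avoid miscounting “isn’t good” as praise; the mixed-scope sentences from the review example above would defeat it entirely.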

Page Ranking Future

I don’t know what Google, or others, are already doing or planning to implement, though it seems to me they don’t currently attempt to differentiate between positive and negative votes. And it might just be that even trying to implement this “VoteRank” is much too unreliable. But maybe future search engines will actually understand text content, instead of just its markup and isolated words. There would be a whole lot of potential for improvement here in serving what a search engine user most wants: relevant results.

