Google Blogoscoped

Tuesday, May 27, 2003


More fun with the Google Web API: Moviebot tries to give a movie rating from 1 to 10 stars to any title you enter. Unfortunately, it doesn’t work all that well. The algorithm, in short: count the pages Google finds for the title together with a word typical of positive reviews, count those for the title together with a negative word, and turn the ratio into a star rating.

What’s most important here is the choice of the words that define a positive review, and those for a negative one. “Terrific” versus “lame” gave the best results. (I tried “good” and “bad”, combinations of “(brilliant | great | superb)” etc., but this never even remotely worked.)

Too bad for this algorithm that movie reviews tend to compare one movie with another, so we get a mix of titles (especially with classics).
Also, reviews negate words like “terrific” in a variety of ways (“clearly not a terrific movie”, “not as terrific as ...”, “would have been lame if not for ...”, and so on). Phrases like “this movie is good”, however, are far too rare to be useful. On top of all that, sequels cause problems too, even if this can be limited in the original query (-“[movie title] 2”, -“II”, and so on).
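For illustration, here’s a minimal sketch of the scheme in Python (my reconstruction, not Moviebot’s actual code; the two hit counts would come from Google Web API queries like “[title] terrific” and “[title] lame”, and the 1-to-10 scaling is an assumption):

```python
# Sketch of the Moviebot idea described above (a reconstruction, not the
# original code). The Google Web API would supply the two hit counts; here
# they are passed in directly so the scoring logic stays testable.

def movie_rating(terrific_hits, lame_hits):
    """Map the ratio of positive to negative review hits onto 1-10 stars."""
    total = terrific_hits + lame_hits
    if total == 0:
        return None  # no signal either way
    # Scale the "terrific" share onto a 1..10 star range.
    return round(1 + 9 * terrific_hits / total)

print(movie_rating(900, 100))  # mostly "terrific" pages -> 9 stars
print(movie_rating(100, 900))  # mostly "lame" pages -> 2 stars
```

The negation problem described above hits exactly this counting step: a page saying “clearly not a terrific movie” still counts toward the positive bucket.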

Search Engines Put to the Test

Somebody told me he’s using Google for the simple reason that the results are always relevant; that the sites listed are indeed what one was looking for.

I constructed a small test with relatively obvious targets, but even for those the results in the major search engines Google, AllTheWeb, and AltaVista differed greatly.
It’s interesting that searching for “Official Coca Cola Homepage” (no quotes) won’t return the official site at all in AltaVista. And the simple keyword “Yahoo” lists the Yahoo! homepage only at number 4 in AllTheWeb.

To wrap it up, Google is the big winner, AllTheWeb comes in second, and AltaVista third.

Small Result Relevancy Test (query → expected target):

“Absolute Ftp Download” → vandyke.com/download/absoluteftp/
“Google Blogoscoped” → blog.outer-court.com
“White House” → www.whitehouse.gov
“Google Image Search” → images.google.com
“Yahoo” → www.yahoo.com
“Official Coca Cola Homepage” → www.coca-cola.com
“That popular online auction site” → www.ebay.com
“What’s the capital of italy please” → “Capital city is Rome” in description

Average “off-target” factor (rounded): Google 0, AllTheWeb 6, AltaVista 13

To calculate the results I simply add up each target’s distance from the number one position; if it can’t be found at all, I assign a default of 20. (People mostly won’t look past the 20th-or-so result anyway, so it doesn’t matter much whether it’s listed later or not.)
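As a sketch (assuming 1-based result positions, with None standing in for “not found”):

```python
# The per-engine score described above: sum each target's distance from the
# number one position, with a flat penalty of 20 for targets that never
# show up in the results at all.
NOT_FOUND_PENALTY = 20

def off_target_factor(positions):
    """positions: 1-based rank of each target, or None if absent."""
    return sum(NOT_FOUND_PENALTY if p is None else p - 1 for p in positions)

# Three queries: target at #1, target at #4, target missing entirely.
print(off_target_factor([1, 4, None]))  # 0 + 3 + 20 = 23
```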

Update: Thanks to a reader comment, I realized I included some of the AllTheWeb “sponsored results” in the above calculation. I will leave the test as it is for now, since one cannot clearly and intuitively distinguish between sponsored and “normal” results in AllTheWeb, and they also push the relevant sites much further down the list. Once they clearly separate the result list, I should redo the test.

BlogRank, Extrapolation, Adaptive PageRank

Wired News reports from the 12th International World Wide Web Conference in Budapest:

“Tweaking existing search engines and developing new ways to find specialized data were the subjects of two dozen papers presented (...) Google could increase its search speed using new techniques developed by Stanford University computer-science researchers, who presented a paper on their findings at the conference. “We’ve gotten a lot of press attention today from this work,” said Stanford researcher Sepandar Kamvar. “Some of it is not quite correct. One major caveat: Google will not run five times faster if our research is implemented, but we do expect a 30 percent speed-up.” According to Kamvar, three simple-to-implement tweaks to Google would measurably boost search speed, “freshen” the results and — if the research is further refined — eventually allow for personalized searches. Google’s owners haven’t said whether they will adopt the Stanford research, but Google co-founder Sergey Brin was in the audience.”
– Michelle Delio: Big Changes for Search Engines (Wired News), May 27 2003

The three tweaks, from the quite technical abstracts of the Stanford University papers:


“The web link graph has a nested block structure: the vast majority of hyperlinks link pages on a host to other pages on the same host, and many of those that do not link pages within the same domain. We show how to exploit this structure to speed up the computation of PageRank by a 3-stage algorithm whereby (1) the local PageRanks of pages for each host are computed independently using the link structure of that host, (2) these local PageRanks are then weighted by the “importance” of the corresponding host, and (3) the standard PageRank algorithm is then run using as its starting vector the weighted aggregate of the local PageRanks. Empirically, this algorithm speeds up the computation of PageRank by a factor of 2 in realistic scenarios. Further, we develop a variant of this algorithm that efficiently computes many different “personalized” PageRanks, and a variant that efficiently recomputes PageRank after node updates.”
– Kamvar, Haveliwala, Manning, Golub from Stanford University, Exploiting the Block Structure of the Web for Computing PageRank [PDF]
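To make the three stages concrete, here’s a toy Python rendering (my sketch, not the paper’s implementation; in particular, the host “importance” below is just each host’s share of pages, a stand-in for the host-graph PageRank the paper uses):

```python
# Toy sketch of the 3-stage BlockRank idea quoted above. Pages are grouped
# by host; local ranks are computed per host, weighted by host importance,
# and used as the starting vector for the global power iteration.

def power_iteration(links, nodes, start, damping=0.85, iters=50):
    """Plain damped PageRank power iteration over an adjacency dict."""
    rank = dict(start)
    n = len(nodes)
    for _ in range(iters):
        new = {v: (1 - damping) / n for v in nodes}
        for u, outs in links.items():
            if outs:
                share = damping * rank[u] / len(outs)
                for v in outs:
                    new[v] += share
            else:  # dangling page: spread its rank evenly
                for v in nodes:
                    new[v] += damping * rank[u] / n
        rank = new
    return rank

def blockrank(links, host_of):
    nodes = list(host_of)
    hosts = set(host_of.values())
    # Stage 1: local PageRank per host, using only intra-host links.
    local = {}
    for h in hosts:
        members = [v for v in nodes if host_of[v] == h]
        intra = {u: [v for v in links.get(u, []) if host_of[v] == h]
                 for u in members}
        local.update(power_iteration(intra, members,
                                     {v: 1 / len(members) for v in members}))
    # Stage 2: weight local ranks by host "importance" (here simply the
    # host's share of all pages, an assumption standing in for the paper's
    # host-graph PageRank).
    weight = {h: sum(1 for v in nodes if host_of[v] == h) / len(nodes)
              for h in hosts}
    start = {v: local[v] * weight[host_of[v]] for v in nodes}
    # Stage 3: run standard PageRank from that informed starting vector.
    return power_iteration({u: links.get(u, []) for u in nodes}, nodes, start)

# Tiny made-up graph: two pages on host "a", one on host "b".
links = {"a1": ["a2"], "a2": ["a1", "b1"], "b1": ["a1"]}
host_of = {"a1": "a", "a2": "a", "b1": "b"}
rank = blockrank(links, host_of)
```

The speed-up comes from stage 3 starting much closer to the answer than a uniform vector would, so fewer global iterations are needed.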


“We present a novel algorithm for the fast computation of PageRank (...)
The original PageRank algorithm uses the Power Method to compute successive iterates that converge to the principal eigenvector of the Markov matrix representing the Web link graph. The algorithm presented here, called Quadratic Extrapolation, accelerates the convergence of the Power Method by periodically subtracting off estimates of the nonprincipal eigenvectors from the current iterate of the Power Method. In Quadratic Extrapolation, we take advantage of the fact that the first eigenvalue of a Markov matrix is known to be 1 to compute the nonprincipal eigenvectors using successive iterates of the Power Method. Empirically, we show that using Quadratic Extrapolation speeds up PageRank computation by 25-300% on a Web graph of 80 million nodes.”
– Kamvar, Haveliwala, Manning, Golub from Stanford University, Extrapolation Methods for Accelerating PageRank Computations [PDF]
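For reference, the plain Power Method the abstract starts from looks like this in Python (the 3-page matrix is made up, and the actual Quadratic Extrapolation step, which periodically subtracts estimates of the nonprincipal eigenvectors, is omitted):

```python
# The Power Method on a tiny, made-up Markov matrix: repeatedly apply the
# matrix and renormalize, converging to the principal eigenvector (which
# for a Markov matrix has eigenvalue 1). This is the baseline that
# Quadratic Extrapolation accelerates.

def power_method(matrix, iters=100):
    n = len(matrix)
    x = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(iters):
        x = [sum(matrix[i][j] * x[j] for j in range(n)) for i in range(n)]
        s = sum(x)
        x = [v / s for v in x]  # keep it a probability vector
    return x

# Column-stochastic link matrix: page j hands its rank to pages i.
M = [
    [0.0, 0.5, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 0.5, 0.0],
]
rank = power_method(M)  # stationary distribution over the three pages
```

Convergence speed depends on the magnitude of the second eigenvalue, which is why subtracting off the nonprincipal components can buy such a large speed-up.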

Adaptive PageRank

“We observe that the convergence patterns of pages in the PageRank algorithm have a nonuniform distribution. Specifically, many pages converge to their true PageRank quickly, while relatively few pages take a much longer time to converge. Furthermore, we observe that these slow-converging pages are generally those pages with high PageRank. We use this observation to devise a simple algorithm to speed up the computation of PageRank, in which the PageRank of pages that have converged are not recomputed at each iteration after convergence. This algorithm, which we call Adaptive PageRank, speeds up the computation of PageRank by nearly 30%.”
– Kamvar, Haveliwala, Golub from Stanford University, Adaptive Methods for the Computation of PageRank [PDF]
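The core trick can be sketched in a few lines of Python (a minimal sketch of the idea, assuming a column-stochastic link matrix; not the paper’s implementation):

```python
# Minimal sketch of Adaptive PageRank as described above: power iteration
# that freezes a page's value once it stops moving, so converged pages are
# no longer recomputed on each pass.

def adaptive_pagerank(matrix, tol=1e-8, max_iters=200):
    n = len(matrix)
    x = [1.0 / n] * n
    converged = [False] * n
    for _ in range(max_iters):
        new = list(x)
        for i in range(n):
            if converged[i]:
                continue  # reuse the frozen value, skip the dot product
            new[i] = sum(matrix[i][j] * x[j] for j in range(n))
        for i in range(n):
            if abs(new[i] - x[i]) < tol:
                converged[i] = True
        x = new
        if all(converged):
            break
    return x

# Same made-up column-stochastic matrix as a small demo.
M = [
    [0.0, 0.5, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 0.5, 0.0],
]
rank = adaptive_pagerank(M)
```

Since the abstract notes that the slow convergers tend to be the few high-PageRank pages, most pages hit the tolerance early and drop out of the per-iteration work.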



