More fun with the Google Web API: Moviebot tries to give a movie rating from 1 to 10 stars to any title you enter. Unfortunately, it doesn’t work that well. The algorithm, roughly: compare the number of Google results for the title combined with a positive review word against the number for the title combined with a negative one, and map that ratio onto a star scale.
What’s most important here is the choice of the words that define a positive review, and those for a negative one. “Terrific” versus “lame” gave the best results. (I tried “good” and “bad”, combinations of “(brilliant | great | superb)” etc., but this never even remotely worked.)
Too bad for this algorithm that movie reviews tend to compare one movie with another, so the pages mix several titles (especially with classics).
Reviews also negate words like “terrific” in a variety of ways (“clearly not a terrific movie”, “not as terrific as ...”, “would have been lame if not for ...” and so on), while unambiguous phrases like “this movie is good” are far too rare to be useful. On top of all that, sequels cause problems, even though this can be limited somewhat in the query (-“[movie title] 2”, -“II”, and so on).
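The core of such a rating scheme is tiny. The sketch below is my guess at the approach, not Moviebot’s actual code: the hit counts would come from the Google Web API’s estimated result totals, and the mapping onto 1–10 stars is invented for illustration.

```python
def star_rating(terrific_hits, lame_hits):
    """Rate a movie from the estimated Google result counts for
    '"<title>" terrific' vs. '"<title>" lame'. The scaling onto
    1-10 stars is a made-up example, not Moviebot's real formula."""
    total = terrific_hits + lame_hits
    if total == 0:
        return None  # no signal either way
    positive_share = terrific_hits / total
    return 1 + round(positive_share * 9)  # 0..1 -> 1..10 stars
```

A title that co-occurs with “terrific” nine times as often as with “lame” gets 9 stars; one that only ever appears next to “lame” gets 1 — and the negation and sequel problems above corrupt exactly these counts.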
Somebody told me he’s using Google for the simple reason that the result is always relevant; that the sites listed are indeed what one was looking for.
I constructed a small test with relatively obvious targets, but even for those, the results in the major search engines Google, AllTheWeb, and AltaVista differed greatly.
It’s interesting that searching for “Official Coca Cola Homepage” (no quotes) will not return coca-cola.com at all in AltaVista. And the simple keyword “Yahoo” will list the Yahoo! homepage at number 4 in AllTheWeb.
To wrap it up, Google is the big winner, AllTheWeb comes in second, and AltaVista third.
Positions give the rank of the target site in each engine’s results; [20] means the site did not show up in the top results at all.

Query | Google | AllTheWeb | AltaVista
---|---|---|---
Absolute Ftp Download (vandyke.com/download/absoluteftp/) | 2 | 1 | [20]
Google Blogoscoped (blog.outer-court.com) | 1 | [20] | [20]
White House (www.whitehouse.gov) | 1 | 4 | 5
Google Image Search (images.google.com) | 1 | 1 | 2
Yahoo (www.yahoo.com) | 1 | 4 | 5
Official Coca Cola Homepage (www.coca-cola.com) | 1 | 1 | [20]
That popular online auction site (www.ebay.com) | 1 | [20] | [20]
What’s the capital of italy please (“Capital city is Rome” in description) | 1 | 1 | [20]
Average “off-target” factor (rounded) | 0 | 6 | 13
To calculate the results I simply add up each result’s distance from the number one position; if the site can’t be found at all, I assign a default position of 20. (People mostly won’t look past the 20th-or-so result anyway, so it doesn’t matter much whether it’s listed later at all.)
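As a sketch of that scoring, with `None` standing for a site that never showed up (counted at the default position of 20):

```python
def off_target_factor(positions, default=20):
    """Average distance from the #1 spot over all test queries.
    positions: rank of the target site per query; None means the
    site was absent and is counted at the default position of 20."""
    distances = [(p if p is not None else default) - 1 for p in positions]
    return round(sum(distances) / len(distances))
```

Feeding in the per-engine columns from the table above reproduces the off-target factors of 0, 6, and 13.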
Update: Thanks to a reader comment, I realized I included some of AllTheWeb’s “sponsored results” in the above calculation. I will leave the test as it is for now, since one cannot clearly and intuitively distinguish between sponsored and “normal” results in AllTheWeb, and the sponsored entries also push the relevant sites much further down the list. Once AllTheWeb clearly separates the two in its result list, I should redo the test.
Wired News reports from the 12th International World Wide Web Conference in Budapest:
“Tweaking existing search engines and developing new ways to find specialized data were the subjects of two dozen papers presented (...)
Google could increase its search speed using new techniques developed by Stanford University computer-science researchers, who presented a paper on their findings at the conference.
“We’ve gotten a lot of press attention today from this work,” said Stanford researcher Sepandar Kamvar. “Some of it is not quite correct. One major caveat: Google will not run five times faster if our research is implemented, but we do expect a 30 percent speed-up.”
According to Kamvar, three simple-to-implement tweaks to Google would measurably boost search speed, “freshen” the results and — if the research is further refined — eventually allow for personalized searches.
Google’s owners haven’t said whether they will adopt the Stanford research, but Google co-founder Sergey Brin was in the audience.”
– Michelle Delio, Big Changes for Search Engines (Wired News), May 27, 2003
“The web link graph has a nested block structure: the vast majority of hyperlinks link pages on a host to other pages on the same host, and many of those that do not link pages within the same domain. We show how to exploit this structure to speed up the computation of PageRank by a 3-stage algorithm whereby (1) the local PageRanks of pages for each host are computed independently using the link structure of that host, (2) these local PageRanks are then weighted by the “importance” of the corresponding host, and (3) the standard PageRank algorithm is then run using as its starting vector the weighted aggregate of the local PageRanks. Empirically, this algorithm speeds up the computation of PageRank by a factor of 2 in realistic scenarios. Further, we develop a variant of this algorithm that efficiently computes many different “personalized” PageRanks, and a variant that efficiently recomputes PageRank after node updates.”
– Kamvar, Haveliwala, Manning, Golub from Stanford University, Exploiting the Block Structure of the Web for Computing PageRank [PDF]
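The three stages translate directly into code. Below is a toy sketch in Python, not the paper’s implementation: graphs are dicts of out-links, host “importance” is simplified to plain PageRank on the collapsed host graph, and the damping factor and tolerances are arbitrary.

```python
def pagerank(out_links, nodes, start=None, d=0.85, tol=1e-10, iters=10000):
    """Power-method PageRank over {node: [linked-to nodes]}.
    Returns (scores, iterations until convergence)."""
    n = len(nodes)
    v = start if start is not None else {u: 1.0 / n for u in nodes}
    for k in range(1, iters + 1):
        nxt = {u: (1 - d) / n for u in nodes}
        for u in nodes:
            targets = [t for t in out_links.get(u, []) if t in nxt]
            if targets:                      # spread rank along out-links
                for t in targets:
                    nxt[t] += d * v[u] / len(targets)
            else:                            # dangling node: spread uniformly
                for t in nodes:
                    nxt[t] += d * v[u] / n
        if sum(abs(nxt[u] - v[u]) for u in nodes) < tol:
            return nxt, k
        v = nxt
    return v, iters

def blockrank_start(out_links, host_of):
    """Stages 1 and 2 of the sketch: local PageRanks per host, weighted
    by the PageRank of the collapsed host graph."""
    hosts = {}
    for u, h in host_of.items():
        hosts.setdefault(h, []).append(u)
    # Stage 1: local PageRank inside each host, intra-host links only
    local = {}
    for h, members in hosts.items():
        intra = {u: [t for t in out_links.get(u, []) if host_of.get(t) == h]
                 for u in members}
        local[h], _ = pagerank(intra, members)
    # Stage 2: host importance via the host-level link graph
    host_links = {h: [] for h in hosts}
    for u, targets in out_links.items():
        for t in targets:
            host_links[host_of[u]].append(host_of[t])
    weights, _ = pagerank(host_links, list(hosts))
    # Weighted aggregate, used as the starting vector for stage 3
    return {u: local[h][u] * weights[h] for h, ms in hosts.items() for u in ms}
```

Stage 3 is then just `pagerank(out_links, nodes, start=blockrank_start(out_links, host_of))`; the claimed speed-up comes from the better starting vector needing fewer power iterations.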
“We present a novel algorithm for the fast computation of PageRank (...) The original PageRank algorithm uses the Power Method to compute successive iterates that converge to the principal eigenvector of the Markov matrix representing the Web link graph. The algorithm presented here, called Quadratic Extrapolation, accelerates the convergence of the Power Method by periodically subtracting off estimates of the nonprincipal eigenvectors from the current iterate of the Power Method. In Quadratic Extrapolation, we take advantage of the fact that the first eigenvalue of a Markov matrix is known to be 1 to compute the nonprincipal eigenvectors using successive iterates of the Power Method. Empirically, we show that using Quadratic Extrapolation speeds up PageRank computation by 25-300% on a Web graph of 80 million nodes.”
– Kamvar, Haveliwala, Manning, Golub from Stanford University, Extrapolation Methods for Accelerating PageRank Computations [PDF]
“We observe that the convergence patterns of pages in the PageRank algorithm have a nonuniform distribution. Specifically, many pages converge to their true PageRank quickly, while relatively few pages take a much longer time to converge. Furthermore, we observe that these slow-converging pages are generally those pages with high PageRank. We use this observation to devise a simple algorithm to speed up the computation of PageRank, in which the PageRank of pages that have converged are not recomputed at each iteration after convergence. This algorithm, which we call Adaptive PageRank, speeds up the computation of PageRank by nearly 30%.”
– Kamvar, Haveliwala, Golub from Stanford University, Adaptive Methods for the Computation of PageRank [PDF]