Google Blogoscoped

Forum

Wikipedia, the Google way

milivella [PersonRank 10]

Wednesday, January 24, 2007
17 years ago · 3,869 views

One of the flaws of Wikipedia (a project that I think is the best all around) is its ideal: given a certain topic, you should write everything about it that you can find in reliable sources. Because reliable sources contain so much information, the selection of what to write is in practice left to the editors, without (correct me if I'm wrong) any guideline comparable to notability (which rules which topics you should write about, but *not* what to write about those topics). It's probably not a big problem, because people (nearly always) know which information is useful and which is not, but it could be avoided with the help of Google.

The basic idea is that if many web pages that mention the topic also mention another word (or expression), the latter should be mentioned in an encyclopedic article about that topic. It's the good old Googleshare (http://blogoscoped.com/archive/2003_05_25_index.html). But it needs to be refined because, as it stands, the words "the", "and", "of" would be the most important for every topic. How to change the algorithm? Here's an example:

Pages indexed by Google: 25,700,000,000 (http://blogoscoped.com/forum/80826.html)
Pages with "real": 1,370,000,000 = 5.42% of the total
Pages with "Real Madrid": 6,520,000 = 0.03% of the total
Pages with "Ronaldo": 29,300,000
Pages with both "Ronaldo" and "real": 2,020,000 = 6.89% of the pages with Ronaldo (= Googleshare)
Pages with both "Ronaldo" and "Real Madrid": 1,630,000 = 5.56% of the pages with Ronaldo (= Googleshare)
Notability of "real" in an article about Ronaldo: 6.89 – 5.42 = 1.47
Notability of "Real Madrid" in an article about Ronaldo: 5.56 – 0.03 = 5.53
I.e., if you are writing an encyclopedic article about Ronaldo, what you should mention is "Real Madrid" (the team he plays for), not "real", even though the latter (necessarily!) has a higher Googleshare.
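The arithmetic above can be reproduced from the raw page counts. This is a minimal sketch, not part of the original post: the names `percent` and `notability_difference` are made up for illustration, and the counts are the ones quoted above. Recomputing from the unrounded counts gives slightly different values from the rounded ones in the post (about 5.33% rather than 5.42% for "real" when the total is 25,700,000,000), but the ranking of the two terms stays the same.

```python
# Page counts quoted in the post (Google, January 2007).
TOTAL_PAGES = 25_700_000_000
PAGES_TOPIC = 29_300_000  # pages with "Ronaldo"

# term -> (pages with the term, pages with both the term and "Ronaldo")
COUNTS = {
    "real": (1_370_000_000, 2_020_000),
    "Real Madrid": (6_520_000, 1_630_000),
}


def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


def notability_difference(pages_term: int, pages_both: int) -> float:
    """Googleshare within the topic minus the term's share of the whole web."""
    googleshare = percent(pages_both, PAGES_TOPIC)
    general = percent(pages_term, TOTAL_PAGES)
    return googleshare - general


for term, (pages_term, pages_both) in COUNTS.items():
    print(f"{term}: {notability_difference(pages_term, pages_both):.2f}")
```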

Wikipedians could agree on a notability threshold (for searches not restricted to a given language, something like 2.5%?) to decide what to include in a given entry and what to leave out. This, I think, is a possible use of the "lowercase semantic web" in the sense Philipp uses the expression (http://blogoscoped.com/archive/2005-01-27-n48.html , not microformats...).
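A threshold like this could be applied mechanically. The sketch below is only an illustration under stated assumptions: the helper names `notability` and `terms_to_include` are invented, the 2.5 cutoff is the tentative value suggested above, and notability is taken to be the Googleshare minus the general share, both in percent, computed from the counts quoted in the post.

```python
TOTAL_PAGES = 25_700_000_000   # pages indexed by Google, as quoted in the post
PAGES_RONALDO = 29_300_000     # pages with "Ronaldo"
THRESHOLD = 2.5                # tentative cutoff suggested in the post

# term -> (pages with the term, pages with both the term and "Ronaldo")
CANDIDATES = {
    "real": (1_370_000_000, 2_020_000),
    "Real Madrid": (6_520_000, 1_630_000),
}


def notability(pages_term: int, pages_both: int) -> float:
    """Googleshare in the topic minus general share, in percentage points."""
    return 100.0 * pages_both / PAGES_RONALDO - 100.0 * pages_term / TOTAL_PAGES


def terms_to_include(candidates: dict, threshold: float) -> list:
    """Return the candidate terms whose notability reaches the threshold."""
    return [
        term
        for term, (pages_term, pages_both) in candidates.items()
        if notability(pages_term, pages_both) >= threshold
    ]


print(terms_to_include(CANDIDATES, THRESHOLD))
```

With these counts only "Real Madrid" clears the cutoff, which matches the conclusion of the example.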

milivella [PersonRank 10]

17 years ago #

A better formula:
square of the "popularity in topic" divided by the "general popularity"

Applied to the previous example:
Notability of "real" in an article about Ronaldo: 6.89^2 / 5.42 = 9
Notability of "Real Madrid" in an article about Ronaldo: 5.56^2 / 0.03 = 1,070
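The revised formula is easy to check in code. This is a rough sketch with an invented function name (`improved_googleshare`), using the page counts from the first post. Because it works from the unrounded counts, the results differ somewhat from the figures above: "real" still comes out near 9, while "Real Madrid" comes out higher than 1,070, presumably because that score is very sensitive to how the 0.03% was rounded.

```python
def improved_googleshare(
    pages_both: int, pages_topic: int, pages_term: int, total_pages: int
) -> float:
    """Square of the popularity in topic divided by the general popularity.

    Both popularities are expressed as percentages, as in the post.
    """
    popularity_in_topic = 100.0 * pages_both / pages_topic
    general_popularity = 100.0 * pages_term / total_pages
    return popularity_in_topic ** 2 / general_popularity


TOTAL_PAGES = 25_700_000_000
PAGES_RONALDO = 29_300_000

real = improved_googleshare(2_020_000, PAGES_RONALDO, 1_370_000_000, TOTAL_PAGES)
real_madrid = improved_googleshare(1_630_000, PAGES_RONALDO, 6_520_000, TOTAL_PAGES)

print(f"real: {real:.1f}")
print(f"Real Madrid: {real_madrid:.0f}")
```

Compared with the subtraction, the ratio separates the two terms by two orders of magnitude instead of a few percentage points.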

milivella [PersonRank 10]

17 years ago #

I think that this idea could have some implementations (all trivial...). Here, I'd like to pose a challenge: find the "definition" of "Google" (a word present in 1,120,000,000 web pages), i.e. the search query (you can use quotation marks, OR, NOT, etc.) that gets the best score for "Google".

Remember, the result is:
((pt/1,120,000,000)*100)^2 / ((pg/25,700,000,000)*100)
where:
pt = number of pages where your search query appears *together* with Google
pg = number of pages where your search query appears in *general*

My shot (easy):
"search engine"
pt = 90,200,000
pg = 203,000,000
result = 81

Over to you.
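For anyone who wants to score their own query, the formula can be wrapped in a small function. This is a sketch only: `definition_score` is a made-up name, and the two constants are the page counts quoted in this thread. With the 25,700,000,000 total, the "search engine" example comes out at about 82 rather than 81, so the quoted result may have been computed with a slightly different total.

```python
TOTAL_PAGES = 25_700_000_000   # pages indexed by Google, as quoted in the thread
GOOGLE_PAGES = 1_120_000_000   # pages containing "Google"


def definition_score(pt: int, pg: int) -> float:
    """Score a candidate query as a "definition" of Google.

    pt: pages where the query appears together with "Google"
    pg: pages where the query appears in general
    """
    topic_share = 100.0 * pt / GOOGLE_PAGES      # percent of "Google" pages
    general_share = 100.0 * pg / TOTAL_PAGES     # percent of all pages
    return topic_share ** 2 / general_share


# The "search engine" attempt from the post.
score = definition_score(pt=90_200_000, pg=203_000_000)
print(f"search engine: {score:.1f}")
```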

Philipp Lenssen [PersonRank 10]

17 years ago #

By the way, when you enter [ronaldo] into Google, and click on the "Similar pages" link of the first (Wikipedia page) result, you'll get some related stuff too...
http://www.google.com/search?hl=en&lr=&q=related:en.wikipedia.org/wiki/Ronaldo

But here's an approach I used for http://www.CoverBrowser.com to find "related words". It's also related to Googleshare. Take a look at http://blogoscoped.com/archive/2006-10-09-n22.html and scroll down to "To add a search engine to the site".

milivella [PersonRank 10]

17 years ago #

Thanks for the feedback, Philipp!

I agree: Yahoo Term Extraction (which you use for CoverBrowser) is similar to my "Improved Google Share": both tell you whether something is relevant. But there are some differences:
- You give YTE a text and it selects the words or expressions (i.e. you give it "Clark Kent who" and it "understands" that the meaningful string is "Clark Kent", not "Clark Kent who" or "Clark"). I think that, with a few more searches, even my "Improved Google Share" could do the same.
- YTE parses a text and finds which words or expressions are relevant; you can specify a context, but you don't have to. IGS needs a context. Again, IGS could do the same work by considering every word (or expression) both as context and as information to be valued in a context. It wouldn't need any more searches, because you only swap context and information (i.e. from the data above it's possible to calculate the relevance of Ronaldo for Real Madrid too).
- (These first two differences could be set aside by giving YTE an input composed of single expressions separated by periods, plus a context. I'm curious to see how YTE and IGS would score on the same input: I'll try soon. If someone wants to propose a context and some expressions, I'll be glad to use them.)
- IGS is a simple and public formula.
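The point about swapping context and information can be illustrated with the counts from the first post. This is a minimal sketch under assumptions: the function name `igs` and its argument names are invented here, and the formula is the "square of popularity in topic divided by general popularity" version from earlier in the thread. The same co-occurrence count (1,630,000 pages) serves both directions, so no extra search is needed.

```python
TOTAL_PAGES = 25_700_000_000
PAGES_RONALDO = 29_300_000
PAGES_REAL_MADRID = 6_520_000
PAGES_BOTH = 1_630_000  # pages with both "Ronaldo" and "Real Madrid"


def igs(pages_both: int, pages_context: int, pages_info: int) -> float:
    """Relevance of the information term within the given context."""
    popularity_in_context = 100.0 * pages_both / pages_context
    general_popularity = 100.0 * pages_info / TOTAL_PAGES
    return popularity_in_context ** 2 / general_popularity


# "Real Madrid" as information in the context "Ronaldo".
madrid_for_ronaldo = igs(PAGES_BOTH, PAGES_RONALDO, PAGES_REAL_MADRID)

# Roles swapped: "Ronaldo" as information in the context "Real Madrid".
ronaldo_for_madrid = igs(PAGES_BOTH, PAGES_REAL_MADRID, PAGES_RONALDO)

print(f"Real Madrid for Ronaldo: {madrid_for_ronaldo:.0f}")
print(f"Ronaldo for Real Madrid: {ronaldo_for_madrid:.0f}")
```

The two scores differ, so the measure is directional: with these counts, Ronaldo is more relevant to Real Madrid than Real Madrid is to Ronaldo.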

Sidenote: IGS lets you construct networks like this one, http://blogoscoped.com/archive/2005-12-02-n25.html , which was made using YTE; with IGS you can even choose how many links per node to show or how much relevance to require (needless to say, you don't have to use Wikipedia or any other reference, because the entire web would be the reference).
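One way such a network could be assembled is sketched below. Everything about the structure is an assumption made for illustration (the names `igs` and `build_network`, the dictionary layout, the parameters `max_links` and `min_relevance`); only the page counts come from this thread. Pairs without a known co-occurrence count are simply skipped.

```python
TOTAL_PAGES = 25_700_000_000

# Single-term page counts quoted earlier in the thread.
PAGES = {"Ronaldo": 29_300_000, "Real Madrid": 6_520_000, "real": 1_370_000_000}

# Co-occurrence counts quoted earlier in the thread (unordered pairs).
PAGES_BOTH = {
    frozenset(("Ronaldo", "Real Madrid")): 1_630_000,
    frozenset(("Ronaldo", "real")): 2_020_000,
}


def igs(context: str, term: str):
    """IGS of `term` in `context`, or None if the pair count is unknown."""
    both = PAGES_BOTH.get(frozenset((context, term)))
    if both is None:
        return None
    in_context = 100.0 * both / PAGES[context]
    general = 100.0 * PAGES[term] / TOTAL_PAGES
    return in_context ** 2 / general


def build_network(max_links: int, min_relevance: float) -> dict:
    """Map each node to its strongest outgoing links, best first."""
    network = {}
    for context in PAGES:
        scored = []
        for term in PAGES:
            if term == context:
                continue
            score = igs(context, term)
            if score is not None and score >= min_relevance:
                scored.append((score, term))
        scored.sort(reverse=True)  # highest relevance first
        network[context] = [term for _, term in scored[:max_links]]
    return network


print(build_network(max_links=1, min_relevance=1.0))
```

Raising `max_links` or lowering `min_relevance` makes the graph denser; with one link per node and a minimum relevance of 1, only the Ronaldo/Real Madrid link survives in each direction.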

milivella [PersonRank 10]

17 years ago #

I am developing this idea here:
zhouzhuang . livejournal . com / 1458 . html
(I'm not after Google juice, only interested readers, if any.)
