Google Blogoscoped

Wednesday, May 28, 2003

Reader Eric tells me that, as of today, AlltheWeb spell-checks search queries.

“One reason Excite and so many other Internet businesses from the 1990s stumbled was that they saw their original business idea not as an end but as a means. Excite had the best searching technology of its day, but the company saw searching as a steppingstone on the way to becoming the Internet equivalent of a television network. Searching would attract users, but what would keep them was to turn the search engine into a portal on the Web. At least that was the idea. So Excite, Yahoo, and others added staff and increased expenditures, driven by the idea that searching alone wouldn’t be enough to sustain a significant Internet enterprise. They were wrong.

All Google does is searching. (...)
Every Internet product or service is utterly dependent on searching.”
– Robert X. Cringely - What’s Next: Do One Thing Right, May 29, 2003

Google US Puzzle Championship

Now you can flex your muscles at the Google US Puzzle Championship. (Thanks Emily.)

GQL Search Agents Unveiling Truth

It’s been only a short time since I discovered how useful the Google Web API can be.
More and more, I begin to see the Web as today’s biggest database, and search engine queries as its SQL*. That is, collecting data in a bigger scope than “where’s the most relevant single page”, much like giant Web surveys to get closer to the truth (or at least the world’s accepted and public truths).

*If you’re not familiar with SQL, it’s the Structured Query Language for databases.

But because the Semantic Web isn’t too realistic, this is a natural language processing effort, and it only works with big numbers; small numbers aren’t as fault-tolerant. Luckily for us, the Web’s content is growing every day.

Fuzzy Queries

For those “GQL” efforts (the Google Query Language, or SEQL, the Search Engine Query Language), it would be immensely helpful if Google, or another big search engine which hands out an API to developers, supported a “best of” OR operator. That is, if I enter:

(cat | cats | dogs | birds | cat-food)

... I would get a ranking based on the most relevant intersection of all keywords, rather than a strict and simple “must contain at least one” algorithm. I don’t think this is what’s currently happening: as far as I can tell, a page containing “cats” and “dogs” is treated as just as relevant as one containing just “cats”. And if we used the AND operator (the default search, that is), the page would have to contain both “cats” and “dogs”. But we need a bit of fuzziness here.
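To make the “best of” OR idea concrete, here is a minimal sketch of how such a ranking could work on the client side. It is only an illustration under my own assumptions: the function names `best_of_score` and `rank_pages`, the simple word tokenizer, and the sample pages are all made up, and nothing here reflects how Google actually ranks results.

```python
import re

def best_of_score(text, keywords):
    """Return the fraction of keywords found in the text (0.0 to 1.0).

    Hypothetical scoring rule: every keyword counts equally.
    """
    words = set(re.findall(r"[a-z0-9-]+", text.lower()))
    hits = sum(1 for kw in keywords if kw.lower() in words)
    return hits / len(keywords)

def rank_pages(pages, keywords):
    """Sort (url, text) pairs by descending score, dropping pages with no hit."""
    scored = [(best_of_score(text, keywords), url) for url, text in pages]
    matching = [item for item in scored if item[0] > 0]
    return sorted(matching, key=lambda item: -item[0])

# Invented sample data, for illustration only.
keywords = ["cat", "cats", "dogs", "birds", "cat-food"]
pages = [
    ("a.example", "Cats are great."),
    ("b.example", "Cats and dogs chase birds."),
    ("c.example", "Nothing relevant here."),
]

for score, url in rank_pages(pages, keywords):
    print(f"{score:.0%} {url}")
```

With a plain OR, the first two pages would simply both match; here the page mentioning three of the five keywords is ranked above the one mentioning a single keyword, and the page with no match is dropped.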

Just think of the task of quantifying movie quality, i.e. separating positive reviews from negative ones. The following would be optimal:

“[movie title x]” (good | excellent | superb | brilliant | great | nice)

... vs ...

“[movie title x]” (bad | worst | boring | monotonous | lame | disappointing)

... each query would return its “best” results. To use these results, I would have to give a relevancy limit; for example, I would be able to express “return only pages of relevancy 75% and above”. Then, the returned page-count would be a good estimate of how popular the movie is. (Of course, this could be extended to a variety of other topics and examinations, and not just opinion polls.)
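As a rough sketch of the relevancy limit described above: assuming a search engine returned a relevancy score between 0 and 1 for every result (which the Google Web API does not do today), the positive and negative page-counts could be compared like this. The function names and the scores are invented for illustration.

```python
def count_relevant(scores, threshold=0.75):
    """Count results whose relevancy score is at or above the threshold."""
    return sum(1 for score in scores if score >= threshold)

def positive_share(positive_scores, negative_scores, threshold=0.75):
    """Return the share of sufficiently relevant results that are positive.

    Returns None when no result on either side passes the threshold.
    """
    pos = count_relevant(positive_scores, threshold)
    neg = count_relevant(negative_scores, threshold)
    total = pos + neg
    return pos / total if total else None

# Invented relevancy scores for the two queries (positive and negative words).
positive_scores = [0.9, 0.8, 0.75, 0.4]
negative_scores = [0.95, 0.5, 0.3]

print(positive_share(positive_scores, negative_scores))  # 0.75
```

Three positive results and one negative result pass the 75% limit here, so three quarters of the relevant pages speak well of the movie.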

There are other features a search engine like Google should have to give developers a better chance to use it for data mining (such as a true wildcard operator, no keyword limit, and so on).

Search Agents of the Future

A Google Query Language would be nothing less or more than a wrapper around Google, handling the behind-the-scenes NLP assumptions and “survey” data. What I did with the bots so far is nothing else. (Though obviously on a much simpler level.)
In the end, what lies behind Egobot, for example, is nothing but turning the question “Where is X located ...” into “X is located at ...”. And Moviebot does nothing but count pages with certain words. Maybe the two approaches can somehow be combined. But I think it’s time we got more powerful Google query options.
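The question-to-statement trick can be sketched in a few lines. This is not Egobot’s actual code, just a guess at the general idea; the function name `question_to_phrase` and the single pattern it handles are my own assumptions.

```python
import re

# Hypothetical pattern covering only the "Where is X located?" question form.
_WHERE_IS = re.compile(r"^\s*where\s+is\s+(.+?)\s+located\s*\??\s*$", re.IGNORECASE)

def question_to_phrase(question):
    """Turn 'Where is X located?' into the quoted phrase '"X is located at"'.

    Returns None for questions this simple sketch does not understand.
    """
    match = _WHERE_IS.match(question)
    if match is None:
        return None
    return f'"{match.group(1)} is located at"'

print(question_to_phrase("Where is the Eiffel Tower located?"))
# "the Eiffel Tower is located at"
```

The resulting quoted phrase would then be sent to the search engine, and whatever follows the phrase in the result snippets is a candidate answer.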

It would be quite fascinating if one day in the near future, we could set free a search agent (a kind of second-generation search engine bot), and it would discover truths, relationships, opinions and tendencies for us. All on its own, just by analyzing the webpages of you and me. Pages which, taken as individual ones, contain errors, spelling mistakes, completely subjective views, broken links, misconceptions, plain lies... and certainly no semantic markup. But pages that, taken as a whole, contain humanity’s “truth” of what the world was, what it is, and what it’s going to be.
The result here would indeed be much larger than the sum of its parts, once we have the methodology at hand to effectively query the sum.

The Google Century

I played around with querying Google for numbers and checking how many pages exist online for each of them. Of the numbers 1 to 10, three are occupied by Netscape. While “4” and “6” are Netscape territory, “5” is conspicuously missing and taken by Microsoft Internet Explorer... probably because there was no Netscape 5 browser!

Now what’s more interesting: if you enter the years 1900-2015 into Google and check the page-count, you get the following result:

The peak here is 2002, i.e. 16% of all webpages today contain this number. (The peak seems to be quite natural, since 2002 is the most recent year that has completely passed since the invention of the World Wide Web.)
Assuming that a high percentage of pages containing “2002” and so on are simply those with a copyright notice of that year, the sudden rise at the beginning of the World Wide Web and up to current years is natural. (Add to that daily news events recorded online.)
But we can get more “representative” measurements by checking pre-WWW decades.

Let’s do that for the Beatles:

To compensate for the overall number of pages containing a given year (e.g. 1960 and 1970 are always relatively high), we normalize the graph by calculating the inverse correlation (the page-count for our keyword together with a year, relative to the page-count for that year alone) and its peak, a sort of “Zeitgeist mindshare” showing tendencies:

As we can see, the peak for Beatles “fame” (or, strictly speaking, the number of pages containing both a year and “Beatles”, as opposed to just the year) is 1963, which does make sense (the first Beatles US single was released that year, and they were already big in the UK).
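The normalization can be sketched as follows. The numbers below are invented placeholders, not the page-counts I actually got from Google, and the function names `googleshare` and `peak_year` are made up for this example.

```python
def googleshare(keyword_year_counts, year_counts):
    """Divide each 'keyword + year' page-count by the page-count for the year alone."""
    shares = {}
    for year, total in year_counts.items():
        if total > 0:
            shares[year] = keyword_year_counts.get(year, 0) / total
    return shares

def peak_year(shares):
    """Return the year with the highest share."""
    return max(shares, key=shares.get)

# Invented placeholder counts, chosen only to show the effect of normalizing.
year_counts = {1962: 1000, 1963: 1000, 1970: 4000}       # pages containing the year
keyword_year_counts = {1962: 20, 1963: 50, 1970: 80}      # pages with year and keyword

shares = googleshare(keyword_year_counts, year_counts)
print(peak_year(shares))  # 1963
```

In this toy data, 1970 has the highest raw count for the keyword simply because so many pages mention 1970 at all; after dividing by the per-year totals, 1963 comes out as the peak.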

In contrast, the Disco movement has a much later peak at 1978. Considering that John Badham’s “Saturday Night Fever” with John Travolta is from ’77, that also makes sense (note that I zoomed into this “Googleshare” diagram, so it cannot be absolutely compared with the above “Beatles” one — only the relative peak can be compared):




This site unofficially covers Google™ and more with some rights reserved.