Google Blogoscoped

Thursday, June 12, 2003

From Search Engine to Find Engine

Google Answers is a human-to-human search engine for those who aren’t interested in actually searching. Could this answer system ever be implemented algorithmically, so that programs, not people, stand ready to help? One would think it would make for the ultimate search engine. What would need to happen for this to come true? If nothing else, today’s Web is a bit of a mess. First of all, standards would be nice.

Standardization Efforts

It’s inspiring not to analyze what data is online and what can be done with it today, but to think about what could be done with it if it were formatted in better ways. (A Googler’s Web Dream, so to speak.)
The World Wide Web Consortium, brought to life by the man who invented the World Wide Web (and many of the technologies that come with it), puts forth recommendations for online publishing. Those happen to be search engine friendly. But not really for the sake of creating websites for searchbots; rather, to optimize for any device, on any medium, for any kind of visitor.

Media Independence

The device might be a mobile phone, the medium Text-to-Speech, the visitor a driver in a car. Concepts like “green link”, “Flash animation to the right”, or “title displayed in the browser title bar” completely fail to cover this situation.
However, HTML, the common language of webpages, was never intended to talk about these things in the first place; rather, it is a structural mark-up language. And yet presentation is exactly what a large part of web developers use it for. The simple and elegant solution for talking about colors, positioning and so on is stylesheets (CSS, Cascading Style Sheets). Though they’re gaining popularity, the majority of the web (including the big G itself) does not put them to good use.

So what does HTML talk about? It really should only say: this is important, this is a quote, this thing here is an address, here’s the title, and this table heading belongs to this table data. But if only a minority of webmasters mark up their data in such ways, there’s simply no commercial need to analyze these meanings. It just wouldn’t be pragmatic to focus on them, because across large numbers of pages there’s no significant bonus to be gained from the analysis.
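To make that concrete, here’s a minimal sketch (my own invented fragment, using nothing but plain HTML elements) of content marked up structurally rather than visually:

```html
<head>
  <title>Quarterly Report</title>                      <!-- here's the title -->
</head>
<body>
  <p>This is <strong>important</strong>.</p>           <!-- this is important -->
  <blockquote cite="http://example.com/source">
    <p>This is a quote.</p>                            <!-- this is a quote -->
  </blockquote>
  <address>Example Inc., 1 Example Street</address>    <!-- this thing is an address -->
  <table summary="Visitors per device">
    <tr><th>Device</th><th>Visitors</th></tr>          <!-- these table headings... -->
    <tr><td>Phone</td><td>120</td></tr>                <!-- ...are for this table data -->
  </table>
</body>
```

Nothing in there says a word about colors or positioning; that’s exactly what leaves a stylesheet (or a searchbot) free to interpret the structure for its own medium.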

However, accessing the web from portable devices will become more mainstream in the future. Web creators will have to decide whether they have the time (or get paid for the time) to mark up the same data in a variety of ways for a variety of devices (print, screen, handheld, TTS (Text-to-Speech), to name a few; there might even be some we don’t know about yet, because the device hasn’t been invented yet)... or to simply switch to logical structuring with media-specific stylesheets on top.
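The “stylesheets on top” part needs nothing that doesn’t already exist: the same structural document can simply link one stylesheet per medium. A minimal sketch (the file names are made up):

```html
<link rel="stylesheet" type="text/css" media="screen"   href="screen.css">    <!-- regular browsers -->
<link rel="stylesheet" type="text/css" media="print"    href="print.css">     <!-- paper -->
<link rel="stylesheet" type="text/css" media="handheld" href="handheld.css">  <!-- phones, PDAs -->
<link rel="stylesheet" type="text/css" media="aural"    href="speech.css">    <!-- Text-to-Speech -->
```

The structural mark-up is written once; only the thin presentation layer is multiplied per device.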

Find Engines

How could a search engine actually profit if future developers were following standards*, thus turning the majority of online content into perfectly structured mark-up?

*I’m not talking about the Semantic Web here, by the way, which is an effort going one step further than simply structuring a page logically: it’s intended to add specific meaning. (The effort is interesting, though slightly unrealistic until the Web has taken the first step of structuring its content.)

Well, it could turn into a Find Engine, or Conclusion Engine.
To explain: a typical SERP (Search Engine Result Page) is a listing of title links with short (more or less relevant) description snippets. It’s an assistance tool in searching, but it doesn’t do your job of finding. Currently, you can ask a question of a real live human being over at Google Answers. This person is a Google research expert who will do the searching for you, giving you only the answer to your question. If this could be even remotely automated (I’m not talking about a replacement of human research, or of Google Answers), it would be the next generation of search engines. The post-Google era.

Imagine all the logical conclusions a search engine could yield if it had actual connection* data like the following (a small markup sketch follows below the footnotes):

A is an abbreviation for B.
C said D.
E is the title of F.
G is an address.**

*The only real connection a search engine can trust today is that of a pointer from here to there, the good old hyperlink. And already, very relevant conclusions (like those drawn by Google’s PageRank algorithm) stem from it.

**I’m almost thinking that any mark-up which is not a connection between two sets of data (like “G is an address”) is useless in terms of information processing and automated conclusions.
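Here is the small markup sketch promised above: two of those connections, expressed in plain HTML as it exists today, with both ends of each connection sitting right in the markup (the content is invented):

```html
<!-- "A is an abbreviation for B" -->
<p><abbr title="Search Engine Result Page">SERP</abbr></p>

<!-- "C said D": the quote is tied to its source via the cite attribute -->
<blockquote cite="http://example.com/einstein-interview">
  <p>God does not play dice.</p>
</blockquote>
```

A program that trusts this markup no longer has to guess what the abbreviation stands for, or where the quote comes from; the pairs are machine-readable.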

We would quickly see SERPs like this:

The following are quotes by Albert Einstein: “God does not play dice”, ...

This is the table-of-contents for the page ...

The company is located here ...

The Web is not standardized; maybe it’s impossible to ever get to that point, and so the job will fall to the much harder implementation of intelligent algorithms for natural language text processing. (Though collectively, we could help speed that up a bit with structural mark-up.) But even then, this Find Engine might be created, once the algorithms are perfected and once the data pool is large enough for us to say “the Web does not play dice”.

Then, Creation Engines

Looking further ahead: after the search engine that can reach conclusions on its own, we would be a short distance away from the Creation Engine. Instead of only serving a result to a query on the state of what is, such an engine would be brave enough to create new theories. Maybe soon enough we’d even see Creation Engines arriving at approximations of what data is missing to reach the needed conclusion; and in busy preparation for a future user request of the same sort, Creation Engine Bots would be sent out to post questions in relevant newsgroups, help chats, and online discussion forums.
Or Google Answers, for that matter.

GoogleGuy on SEO

GoogleGuy works for Google Inc. and answers SEO questions at the WebmasterWorld forum; topics include the future of the Open Directory Project, reporting spam, Google and feedback (“the recent ’extreme geolocation’ thread was a good prompt to make sure that we went back and made sure everything worked better”), infinite penalties, Google’s view of SEOs, and more.

Google Pen on Google Blanket

Slate and the Googlephone

I can’t tell how good the portable device discussed in “Robin, to the Googlephone! Need a Web browser to settle bar bets? Try the Sidekick.”* (by Paul Boutin, June 12, 2003) really is. Trying Google on my cell phone around two years ago was a failure; slow, small, and even though Google’s instant HTML-to-WAP conversion is a great idea, they cut off the page after a certain number of characters. But is a color screen really the most important thing in quickly retrieving online information? Luckily for purist cross-media developers, WAP2 adopts XHTML (in the form of the XHTML Mobile Profile) and gets rid of WML, and some cell phones already support it, along with CSS tailored to handhelds. Now all we need is webmasters sticking to standards, or on-the-fly webpage conversion from cluttered to clean (which is harder than it may sound if structural markup is missing from a page).
And I guess once we get all this working... bar bets will lose their appeal. If you can instantly know the right answer, why get into a big argument?

*Ironically, when going through Paul Boutin’s article I had to fight two separate Flash animations to the right — while one was blinking for my attention, the other was busy throwing shadows all over the page — and came across his line: “That’s not to say [the Sidekick device is] perfect. The browser doesn’t do JavaScript, animation, or pop-up windows, which makes many sites unusable. Most embarrassingly, it fails to load some sites at all, including [this site] Slate.”
Embarrassing to Slate, I suppose?
And when did a pop-up window ever make a site usable?

Google in the Pit Lane?

The previously discussed Google indexing limit rumour is now covered at Google-Watch.org as the Y2K+3 theory:

“Let’s speculate. Most of Google’s core software was written in 1998-2000. It was written in C and C++ to run under Linux. As of July 2000, Google was claiming one billion web pages indexed. By November 2002, they were claiming 3 billion. At this rate of increase, they would now be at 3.5 billion, even though the count hasn’t changed on their home page since November. If you search for the word “the” you get a count of 3.76 billion. It’s unclear what role other languages would have, if any, in producing this count. Perhaps each language has it’s own lexicon and it’s own web page IDs. But any way you cut it, we’re approaching 4 billion very soon, at least for English. With some numbers presumably set aside for the freshbot, it would appear that they are running out available web page IDs.

If you use an ID number to identify each new page on the web, there is a problem once you get to 4.2 billion. Numbers higher than that require more processing power and different coding.”
Is Google broken?, June 9, 2003
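For context: 4.2 billion is presumably the ceiling of an unsigned 32-bit integer, 2^32 = 4,294,967,296, i.e. roughly 4.29 billion distinct values; page IDs beyond that point would need wider numbers, and thus different code, to represent.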

How to Beat Google

Arnaud Fischer, AltaVista information retrieval and search engine technology product marketing manager from 1999-2001, discusses “What’s it Going to Take to Beat Google” (June 12, 2003):

An interesting article, but somehow, nothing of the above looks as if it could be the next Next Big Thing in search engine technology.

Yes, engines might be trained, but please let it happen collectively, to judge which sites are more relevant. And do we really want personalized search? It might be nice when the next Google suddenly assumes this and that about you, but then again, we want different things at different times. Maybe today I want to research a human virus, and tomorrow a computer virus.

Vertical search engines do make sense, but splitting up the interface into ever more niche results is not what people want, I believe. One interface covering all of the web (flat and deep) seems more reasonable.

Query disambiguation is also nice, maybe in a way that doesn’t interfere with a straightforward result list. Keep it simple. Isn’t that exactly what AltaVista forgot around 1999-2001, when it was completely put out of the race by Google?

It seems to me relevancy still rules (and can still be optimized, even by Google). Otherwise, trust users to enter a second query to refine a search. Trust them to be able to enter their hometown plus “pizza” to find a nearby restaurant. But please, do serve up the best pizza.

The Centuryshare calculator has been debugged; there was a bug preventing it from showing statistics from 1950-1990.

And here’s the Centuryshare for “Hippies”:

Google Answers Qpet

Google Answers customer Qpet-ga is writing a book on “Quality of Life”, about 300 pages long, covering philosophy, anthropology, and psychology. Those subjects are also the focus of Qpet’s well-priced questions at Google Answers, which are really some of the most interesting posted. Here’s a chronological selection, and you can see this should make for a truly great book:

And the following one, which surely would receive many votes for the Answer of the Year Award (if there were such a thing):

  • 24 hours (Answered by missy-ga)