Google Blogoscoped

Thursday, September 11, 2003

Nutch, Open Source Search Engine

“Meet Nutch, the open-source search engine. Open-source applications are unusual in that the code upon which the software runs is not owned by a private, commercial company but rather bound by a simple license that allows anyone to use, modify, and even profit from it free of charge, as long as they pledge to contribute their own innovations back into the code base. Because of this, anyone will be able to access Nutch’s code and use it to their own ends, without paying licensing fees or hewing to a particular company’s set of rules.”
– John Battelle, SearchDay, 11 Sep 2003

A Spider Trap

Via email in reply to “I’ve built a spider trap”:

“There’s something very much like a spider trap at Except that I’ve been careful to disallow it in the robots.txt file and in the meta tags, because I only mean to catch Evil robots (those that don’t follow directives, that is), not gentle Google. Actually, the plan mostly backfired, because we’ve mainly had problems with server overload coming from mad robots trying to download millions of pages from the Book of Infinity.

The important thing to note is that pages in the Book of Infinity, although in (virtually) infinite number, are all static: they don’t change when you reload the page or come back some time later. Otherwise it would be too easy to tell that it isn’t a real Web site.

Source code for the trap is available in”
David A. Madore
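
To make the two points in that email concrete, here is a minimal sketch of how such a trap might work. It is not David's actual code. The `/trap/` prefix, the word list, and the names `render_page` and `allowed` are all made up for illustration. Each page is derived from a hash of its own path, so the site is virtually infinite yet every page looks the same on reload. A robots.txt rule keeps well-behaved crawlers out.

```python
import hashlib
import random
from urllib.robotparser import RobotFileParser

# Hypothetical URL prefix for the trap; not taken from the original email.
TRAP_PREFIX = "/trap/"

# Hypothetical filler vocabulary for page titles.
WORDS = ["book", "infinity", "page", "chapter", "verse", "index", "leaf"]

# robots.txt rules that tell every compliant robot to stay out of the trap.
ROBOTS_TXT = ["User-agent: *", "Disallow: " + TRAP_PREFIX]


def render_page(path, n_links=5):
    """Return the HTML for a trap page.

    The random generator is seeded from a hash of the path, so the same
    path always yields the same page: nothing changes on reload.
    """
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    title = " ".join(rng.choice(WORDS) for _ in range(3))
    # Every page links to further pages that are generated the same way.
    links = [TRAP_PREFIX + format(rng.getrandbits(64), "016x")
             for _ in range(n_links)]
    anchors = "\n".join('<a href="%s">%s</a>' % (href, href) for href in links)
    return (
        "<html><head><title>" + title + "</title>\n"
        '<meta name="robots" content="noindex,nofollow">\n'
        "</head><body>\n" + anchors + "\n</body></html>"
    )


def allowed(url, agent="ExampleBot"):
    """Return True if a robot that obeys ROBOTS_TXT may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT)
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    print(render_page(TRAP_PREFIX + "start"))
    print(allowed("http://example.com/trap/start"))
    print(allowed("http://example.com/index.html"))
```

Seeding from the path rather than from the clock is what keeps the pages static. Only a robot that ignores both the robots.txt rule and the meta tag ends up following the links.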

Google Filters Thomas Jefferson and Legislative Information

“Set your Google filtering to strict and search for the Library of Congress. Now, open another window, and turn filtering off in your option area and search for the same thing. You will notice that in the non filtered browser, the THOMAS Legislative Information on the Internet website is the third or fourth entry, but on your filtered browser it doesn’t show up at all. (...)

I’m sure that Google has nothing against Thomas Jefferson or even Legislative Information, but, it does bring to mind what else isn’t being seen”
– WebAdept, and Google Filtering, September 10, 2003

Google Frequent Searchers

Spotted this at Webmasterworld: the Google Frequent Searchers. The page describes a Google counter that keeps track of how many searches you do. It doesn’t say how to activate the counter, even though it says how to deactivate it...

“Do you search with Google a hundred times a day? Do you reach for Google before the phonebook, the dictionary or the newspaper? Do you think, just maybe, you’re a Google frequent searcher?

Now you can know for sure. The Google search counter is accurate, easy to administer and precisely calibrated for your computing environment. It provides clear and instantaneous results showing exactly how often you use Google. For information on how the search counter works, read on.”
– Google Inc, Google Frequent Searchers

Blogger Pro Almost Free

Blogger Pro users like me were charged for all the extra features. Most of those are now available in free Blogger too. Blogger Pro users will now get a free t-shirt to make up for the confusion. But certain extras, like automatic RSS feed generation (an absolute must, I’d say), are still reserved for paying bloggers.

By the way, Blogger Archive generation is still too buggy to be useful. For my second blog I gave it another try and it still didn’t work, so I had to use my self-made calendar tool again.

Niche Search

“Fast-Talk Communications, an Atlanta-based start-up, said its technology could scour 30 hours of audio recordings in one second and pull out specific phrases. ’We are the Google of audio-video content,’ the chief executive, Ray Naeini, said.”
– Chris Gaither, Niche players can charge premium to find what Google can’t (The Boston Globe), September 10, 2003

Google Domains

Here you can see the Google domains, grabbed with the Domain Count application.

"I’ve Built a Spider Trap"

I’ve heard this before: someone posting, “I’ve built a spider trap”. And the post got closed down at as far as I can tell. So I’d like to know, is it possible to build a “spider trap” – which, if I understand correctly, would automatically generate random pages so that a search bot keeps crawling one’s site forever?

What it would take:

Is all this even possible? Well, I certainly wouldn’t try it on my own server (four reasons for that: I doubt it works; it’d be unethical; it’d waste my bandwidth; it’d get me penalized).




This site unofficially covers Google™ and more with some rights reserved.