Google Blogoscoped

Sunday, November 2, 2003

Google and AllTheWeb in Opera

Capt. Cornelius shows that by pressing F2 or F8 and then typing “s term”, Opera displays both Google and AllTheWeb search results (11/2/03, in German); also see the screenshot. Just typing “g term” in the address bar will show Google-only results.

Building a Word Frequency List

I’m currently building an English word frequency list by querying the Google Web API for about 27,000 words. For each word, I store the number of web pages containing it in a database. Once finished, the list will show, for example, the top 100 words by usage online, or the 100 least used words.

*I started out with a much bigger word database, but now use a smaller one to get faster results in this project.
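
The collection loop described above could be sketched roughly like this in Python. This is only an illustration under assumptions: `count_pages` is a hypothetical stand-in for whatever actually queries the Google Web API, and the table and column names are made up for the example, not taken from the real project.

```python
import sqlite3
from typing import Callable, Iterable


def build_frequency_table(
    words: Iterable[str],
    count_pages: Callable[[str], int],  # hypothetical: returns the page count for one word
    conn: sqlite3.Connection,
) -> None:
    """Store one (word, page count) row per word."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS word_frequency ("
        "word TEXT PRIMARY KEY, pages INTEGER NOT NULL)"
    )
    for word in words:
        # INSERT OR REPLACE means a restarted run simply overwrites earlier rows.
        conn.execute(
            "INSERT OR REPLACE INTO word_frequency (word, pages) VALUES (?, ?)",
            (word, count_pages(word)),
        )
    conn.commit()


def top_words(conn: sqlite3.Connection, n: int, least_used: bool = False) -> list[str]:
    """Return the n most used words, or the n least used if least_used is set."""
    order = "ASC" if least_used else "DESC"
    rows = conn.execute(
        f"SELECT word FROM word_frequency ORDER BY pages {order}, word LIMIT ?",
        (n,),
    )
    return [row[0] for row in rows]
```

With the table filled, `top_words(conn, 100)` would give the top 100 and `top_words(conn, 100, least_used=True)` the 100 least used words.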

Similar lists have been created (but as far as I know they were never based on the data contained within the WWW). “The Write Way” says:

“Shakespeare, who was one of our most prolific and enduring writers, used approximately 22,000 different words in his published works. Well-educated people today, use about 5,000 different words when speaking and about 10,000 in their writing. Most of us have a ’working vocabulary’ of 2,000 (which means that there are over 788, 000 words that are gathering dust on the shelves of our minds). Of those 2,000 words, the most commonly used are: the, of, and, to, a, in, that, is, I, it.

Those ten little words (...) account for 25% of all speech.

“There are fifty words, which make up 60% of everything we say – and only two of these have more than one syllable

Many people wanted to know what those fifty words were, so here is a list from the British National Corpus. The BNC is a 100 million word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of current British English (...)”
– Jennifer, The Write Way (Write101.com), 28 April 2000

Listed below are the Top 50 as compiled by the BNC:

1-10: the, at, of, and, a, in, to, it, is, was
11-20: I, for, you, he, be, with, on, that, by, are
21-30: not, this, but, ’s, they, his, from, had, she, which
31-40: or, we, an, n’t, were, been, have, their, has, would
41-50: what, will, there, if, can, all, as, who, have, do

What to Do With a Frequency List

Besides simple Top 10 lists, I have several ideas for the data once it is gathered.
One is improved memomarking (currently, I take the N longest words of a given page to create a “safe” bookmark, as opposed to the N rarest words).
Also, for a given text I might be able to calculate its comparative complexity by analyzing how frequent its words are.
On top of that, I might be able to analyze how complexity (or how unusual the words are) varies within the structure of a text.
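
To make the first two ideas more concrete, here is a rough Python sketch. It assumes the frequency list is available as a plain dictionary `page_counts` mapping each word to its page count. The tokenizer, the choice to skip unknown words, and the scoring formula (mean negative log of a word's share of all page counts) are my own assumptions for illustration, not a settled method.

```python
import math
import re

WORD_RE = re.compile(r"[a-z']+")


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into simple word tokens."""
    return WORD_RE.findall(text.lower())


def rarest_words(text: str, page_counts: dict[str, int], n: int) -> list[str]:
    """Pick the n distinct words of the text with the lowest page counts."""
    # Assumption: words missing from the frequency list are skipped.
    known = {w for w in tokenize(text) if w in page_counts}
    return sorted(known, key=lambda w: (page_counts[w], w))[:n]


def complexity(text: str, page_counts: dict[str, int]) -> float:
    """Mean of -log(share of all page counts) per known word; higher means rarer words."""
    total = sum(page_counts.values())
    scores = [
        -math.log(page_counts[w] / total)
        for w in tokenize(text)
        if page_counts.get(w, 0) > 0
    ]
    return sum(scores) / len(scores) if scores else 0.0
```

Running `complexity` over each paragraph of a text separately would be one simple way to look at complexity within its structure.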

Plus, I might give another try to a past idea that didn’t work out (back then I used the longest-words approach). You enter a description of something and the algorithm tries to guess what you are talking about. To do that, it would google word groups taken from the input; it would then read the Google cache of the resulting pages and find the least common words; finally, it would check which of the otherwise rare words are not rare within the context of the search results.
To illustrate, if your input is “Nihilist philosopher from German, who was rather misogynic, and wrote ’Thus Spoke Zarathustra’”, the tool would google “Nihilist German”, “German misogynic”, “Nihilist Zarathustra” etc., and try to spot rare shared words. (Now that I have the dictionary data, I might even just check for words which are not in this dictionary.) If this works out, the rarest shared word for this example would then be “Nietzsche”.
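
The last step, spotting words that are rare in general but shared across the result pages, might look something like the following Python sketch. It assumes the cached result pages have already been fetched as plain strings. The `rare_below` threshold and the rule that a word must appear on more than one page are arbitrary assumptions. Words missing from the frequency list count as rarest, in line with the dictionary check mentioned above.

```python
import re
from collections import Counter


def guess_topic(
    result_pages: list[str],
    page_counts: dict[str, int],
    rare_below: int = 1000,  # assumed cut-off for "rare"
) -> list[str]:
    """Rank words that are rare overall but shared by several result pages."""
    shared = Counter()
    for page in result_pages:
        # Count each word once per page, so one long page cannot dominate.
        shared.update(set(re.findall(r"[a-z]+", page.lower())))
    candidates = [
        (pages_with_word, word)
        for word, pages_with_word in shared.items()
        if page_counts.get(word, 0) < rare_below and pages_with_word > 1
    ]
    # Most widely shared first; ties broken alphabetically.
    return [word for _, word in sorted(candidates, key=lambda c: (-c[0], c[1]))]
```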

Nestlé Buys Google, or Google News Indexes Humor

This interesting find shows that Google News also indexes humorous news sites (but I bet it wouldn’t if it knew):

“Friday, reports surfaced that Microsoft was interested in buying Google. Instead, food giant Nestlé announced that its sweetened offer to buy the Internet search engine company has been accepted.

Google will complement Nestlé’s long list of well-known brands, including Perrier water, Stouffer’s, LifeSavers, and Nesquick. Nestlé says Google will be renamed NesGoogle and have a recipe section added to its main page.”
– David Graham, Nestlé to buy Google, November 02, 2003

Microdoc: Why Google Will Never Partner with Microsoft

[Googlesoft]

Microdoc discusses Why Google Will Never Partner with Microsoft (10/31/2003). Agreed; if they did, Google users like me would run away en masse.

