Google Blogoscoped


Google to Release N-gram Data

/pd [PersonRank 10]

Thursday, August 3, 2006
17 years ago, 5,029 views

"13 million different words."

In all languages that have been published on the web?

"cat", "Katze", "القطه ", "gato"

Would this consist of a data subset of one word [cat]? Is my understanding correct?

Andreas S [PersonRank 1]

17 years ago #

Finally. That will make creating spam blogs that look realistic so much easier ;-).

JohnMoo [PersonRank 1]

17 years ago #

hey, the perfect tool to create more search engine spam! I've been using WordNet + Wikipedia mirrors for too long, I need something better.

Randall [PersonRank 0]

17 years ago #

This could be used for the random sentence generation algorithm. Spammers rejoice!
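A minimal sketch of what such a generator might look like, assuming bigram counts are available as a plain dictionary. The counts below are made up for illustration and are not taken from the Google data set; `build_successors` and `generate` are hypothetical helper names.

```python
import random
from collections import defaultdict

# Toy bigram counts (made-up numbers, not from the Google data set).
BIGRAM_COUNTS = {
    ("the", "cat"): 5,
    ("the", "dog"): 3,
    ("cat", "sat"): 4,
    ("dog", "sat"): 2,
    ("sat", "down"): 6,
}

def build_successors(bigram_counts):
    """Map each word to the words that can follow it, with their counts."""
    successors = defaultdict(list)
    for (first, second), count in bigram_counts.items():
        successors[first].append((second, count))
    return successors

def generate(successors, start, max_words=10, rng=None):
    """Walk the bigram table, picking each next word in proportion to its count."""
    rng = rng or random.Random()
    words = [start]
    # Stop at the length limit or when the last word has no known successor.
    while len(words) < max_words and words[-1] in successors:
        candidates, weights = zip(*successors[words[-1]])
        words.append(rng.choices(candidates, weights=weights, k=1)[0])
    return " ".join(words)

successors = build_successors(BIGRAM_COUNTS)
sentence = generate(successors, "the", rng=random.Random(0))
print(sentence)
```

With longer n-grams the same walk would condition on the last few words instead of only the last one, which is what makes the output look more realistic.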

Haochi [PersonRank 10]

17 years ago #

*after discarding words that appear less than 200 times.*

Caleb E [PersonRank 10]

17 years ago #

@/pd: I'm guessing this is simply the unique tokens present in Google's index, so even, for instance, "cat" and "cats" are different words.
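A rough sketch of that reading, assuming tokens are just whitespace-separated strings with no stemming. How Google actually tokenizes is not stated in this thread, and `count_tokens` and `vocabulary` are hypothetical names. The cutoff mirrors the "appear less than 200 times" rule quoted above, scaled down to 2 for the toy sample.

```python
from collections import Counter

def count_tokens(text):
    """Split on whitespace only; no stemming, so 'cat' and 'cats' stay distinct."""
    return Counter(text.lower().split())

def vocabulary(counts, min_count):
    """Keep only tokens seen at least min_count times."""
    return {token for token, n in counts.items() if n >= min_count}

sample = "cat cats cat dog cats cat bird"
counts = count_tokens(sample)
vocab = vocabulary(counts, min_count=2)  # the post quotes 200 for the real data
print(sorted(vocab))
```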

/pd [PersonRank 10]

17 years ago #

Caleb: Correct!!

A word would be hashed into a token, but what I am trying to work out is whether there is any rational way in which tokens that represent the same word [in different languages] are linked in some form?

In formal Lisp-style list notation this looks like [Head / RestofHead]:

["cat" / "Katze", "القطه ", "gato"]. Then I could take [a / [b, [RestofHead]]] and call a function seek(b, lingo) to find out which lingo the word (b) is derived from.

Remember, the KnowledgeBase would all be in hashed tokens, so by using a technique like this one could translate one-to-one and then run the output through an NLP interface for grammar checking, paraphrasing and all that yada yada nice stuff...
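To make the idea concrete, here is a small sketch under heavy assumptions: each entry pairs a head word with its counterparts, and every counterpart carries a language tag. The tags ("de", "ar", "es") and the helpers `seek` and `translate` are illustrative only. Nothing in the announced n-gram data links words across languages, and plain strings stand in for hashed tokens.

```python
# Each entry is (head, rest): an English head word and its counterparts.
# The language tags are illustrative assumptions, not part of the n-gram data.
ENTRIES = [
    ("cat", [("Katze", "de"), ("القطه", "ar"), ("gato", "es")]),
]

def seek(word, entries):
    """Return the language tag of `word`, or None if it is not listed."""
    for head, rest in entries:
        if word == head:
            return "en"
        for other, lingo in rest:
            if word == other:
                return lingo
    return None

def translate(word, target, entries):
    """One-to-one lookup: find the entry containing `word`, return its `target` form."""
    for head, rest in entries:
        forms = {"en": head, **{lingo: other for other, lingo in rest}}
        if word in forms.values():
            return forms.get(target)
    return None

print(seek("Katze", ENTRIES))
print(translate("gato", "de", ENTRIES))
```

A lookup like this only covers words that have a single counterpart per language; anything ambiguous would still need the NLP pass mentioned above.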

The worldwide web words by themselves imply no meaning in the data itself, so I am going back to my original question, asked before:

"Seriously... what value is created by publishing all the words on the web?"

Note: "/" = pipe char; the forum editor does not permit the pipe char.

Josue R. [PersonRank 10]

17 years ago #

I don't think worrying about search engine spam will be the problem, if you look at it my way... When speech recognition software is implemented on all devices (email, voice mail, etc.), next you'll have to worry about voice spam, and someone will have to create an anti-spam voice-pattern filtering application, as it will be hard for good filters to recognize false messages. So it's not really the search engine or email we should worry about, but voice data in the next few years. Just my two cents on the future of using the N-gram datasets (but I'm probably going too far with this; you just never know, like we didn't know about email spam in the early years of the internet).

mrbene [PersonRank 10]

17 years ago #

You mean you haven't gotten voice spam already?

Bane of the answering service. I mean, answering machines, answering services were handy – but when I hear "you have new messages" and the message starts out with "Hi, my name is..."

It's called "Voice Mail Broadcasting", and it's why I don't have an answering machine.

Company that offers Voice Mail Broadcasting in addition to Bulk Email:

Company that has primary focus on Voice Mail Broadcasting:

Software that allows you to perform your own Voice Mail Broadcasting

Up there with fax-spam on my least favourite things!
