Google to Release N-gram Data - Google Blogoscoped Forum

/pd	Thursday, August 3, 2006 20 years ago • 5,676 views
"13 million different words." In all languages that have been published on the web?? Would "cat", "Katze", "القطه", "gato" make up a data subset of one word [cat]?? Is this understanding of mine correct??
Andreas S	20 years ago #
Finally. That will make creating spam blogs that look realistic so much easier ;-).
JohnMoo	20 years ago #
hey, the perfect tool to create more search engine spam! I've been using WordNet + Wikipedia mirrors for too long, I need something better.
Randall	20 years ago #
This could be used for the http://en.wikipedia.org/wiki/Dissociated_press random sentence generation algorithm. Spammers rejoice!
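For readers unfamiliar with the idea, here is a minimal sketch of Dissociated Press-style generation from n-gram data. It assumes the n-grams are rebuilt from a plain token list rather than read from Google's release (whose file format is not described in this thread); the names `build_model` and `generate` are made up for illustration.

```python
import random
from collections import defaultdict

def build_model(tokens, n=2):
    """Map each (n-1)-token context to the list of tokens that followed it."""
    if n < 2:
        raise ValueError("n must be at least 2")
    model = defaultdict(list)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        model[context].append(tokens[i + n - 1])
    return model

def generate(model, start, length, rng=None):
    """Extend `start` (a tuple of n-1 tokens) by sampling observed followers."""
    rng = rng or random.Random()
    out = list(start)
    k = len(start)
    while len(out) < length:
        followers = model.get(tuple(out[-k:]))
        if not followers:
            break  # dead end: this context was never followed by anything
        out.append(rng.choice(followers))
    return out

if __name__ == "__main__":
    words = "the cat sat on the mat and the cat ran".split()
    print(" ".join(generate(build_model(words, 2), ("the",), 8)))
```

Because followers are stored with repetition, frequent continuations are picked more often, which is roughly what makes the output look locally plausible.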
Haochi	20 years ago #
/pd, after discarding words that appear less than 200 times.
Caleb E	20 years ago #
/pd – I'm guessing this is simply the unique tokens present in Google's index. So, for instance, even cat and cats are different words.
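A rough sketch of what "unique tokens above a frequency cutoff" could look like, assuming simple pre-split tokens and no stemming. How Google actually tokenizes is not stated in this thread; `vocabulary` and `min_count` are hypothetical names, and the cutoff of 200 is the figure quoted above.

```python
from collections import Counter

def vocabulary(tokens, min_count=200):
    """Count each distinct token and drop those seen fewer than min_count times."""
    counts = Counter(tokens)
    return {tok: c for tok, c in counts.items() if c >= min_count}

if __name__ == "__main__":
    # Tiny corpus with a lowered cutoff so the effect is visible.
    sample = ["cat"] * 3 + ["cats"] * 2 + ["Katze"]
    print(vocabulary(sample, min_count=2))
```

With no stemming or translation step, "cat", "cats" and "Katze" are counted as three unrelated entries.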
/pd	20 years ago #
Caleb: Correct!! A word would be hashed into a token, but what I am trying to infer is whether there is any rational way in which tokens that represent a similar word [in different languages] are linked in some form?? In formal Lisp this looks like [Head / RestofHead]: ["cat" / "Katze", "القطه", "gato"]. Then I could actually take [a / [b, [RestofHead]]] and use a function seek(b, lingo) to know which lingo that word (b) is derived from.. Remember, the KnowledgeBase would all be in hashed tokens.. so by using a technique like this one can translate one-to-one and then run the outputs through an NLP interface for grammar checking, paraphrasing and all that yada yada nice stuff.... Just the worldwide web words by themselves imply no meaning in the data itself.. so I am going back to my original question asked before: "Seriously.. what value is created by publishing all the words on the web?" http://blogoscoped.com/forum/60269.html#id60273 Note: "/" = pipe char; the forum editor does not permit the pipe char.
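One way to picture the head/rest idea in this post is the sketch below. It is purely illustrative: the thread suggests the n-gram data holds only tokens and counts, so the `CONCEPTS` table, the language codes and the `translate` helper are all assumptions, and `seek` here takes a reverse index rather than the `lingo` argument mentioned in the post.

```python
# Hypothetical hand-built table: head word -> its forms in other languages.
# Nothing like this is described as part of the n-gram release.
CONCEPTS = {
    "cat": {"en": "cat", "de": "Katze", "ar": "القطه", "es": "gato"},
}

def build_reverse_index(concepts):
    """Map every surface word back to (head, language code)."""
    index = {}
    for head, forms in concepts.items():
        for lang, word in forms.items():
            index[word] = (head, lang)
    return index

def seek(word, index):
    """Return (head, language) for a known word, or None if it is unlinked."""
    return index.get(word)

def translate(word, target_lang, concepts, index):
    """One-to-one lookup via the shared head; None when no link exists."""
    hit = seek(word, index)
    if hit is None:
        return None
    head, _source_lang = hit
    return concepts[head].get(target_lang)

if __name__ == "__main__":
    idx = build_reverse_index(CONCEPTS)
    print(seek("Katze", idx))
    print(translate("gato", "de", CONCEPTS, idx))
```

The lookup only works because the links were entered by hand, which is the gap this post is pointing at.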
Josue R.	20 years ago #
I don't think worrying about search engine spam will be the problem, if you look at it my way... When speech recognition software is implemented on all devices (email, voice mail, etc.), next you'll have to worry about voice spam, and someone will have to create an anti-spam application that filters voice patterns, as it will be hard for good filters to recognize false messages. So it's not really search engines or email we should worry about, but voice data in the next few years. Just my two cents on the future of using the N-gram datasets (but I'm probably going too far with this; you just never know, like we didn't know about email spam in the early years of the internet).
mrbene	20 years ago #
You mean you haven't gotten voice spam already? Bane of the answering service. I mean, answering machines and answering services were handy – but when I hear "you have new messages" and the message starts out with "Hi, my name is..." It's called "Voice Mail Broadcasting", and it's why I don't have an answering machine. Company that offers Voice Mail Broadcasting in addition to Bulk Email: 5star-email-marketing.com/ voice-broadcasting.html Company that has a primary focus on Voice Mail Broadcasting: www.voiceshot.com/public/ outboundcalls.asp Software that allows you to perform your own Voice Mail Broadcasting: phonetree.com/business/ products/3500.htm Up there with fax spam on my least favourite things!
