Thursday, August 3, 2006

Google to Release N-gram Data

Google announced that you’ll soon be able to order a set of 6 DVDs containing “N-gram” datasets for research and development. If I understand this right, this means you’ll be getting a database of over a billion 5-word sequences that you can sort by popularity (the number of times the text appears online), so that, for example, you’d be able to continue the phrase “cats like to eat” with the words “mice,” “milk” and so on. This might be useful for speech recognition, OCR, machine translation, spelling suggestions, suggest-like features and more. The dataset is supposed to contain over 13 million different words.
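
To make the "continue the sentence" idea concrete, here is a minimal Python sketch of how such a lookup might work. Everything in it is an assumption for illustration: the counts are made up, the names `TOY_COUNTS`, `build_index` and `continuations` are hypothetical, and the announcement doesn't say what format the real data will come in.

```python
from collections import defaultdict

# Made-up 5-gram counts for illustration only; not taken from Google's data.
TOY_COUNTS = {
    ("cats", "like", "to", "eat", "mice"): 120,
    ("cats", "like", "to", "eat", "milk"): 45,
    ("cats", "like", "to", "eat", "fish"): 300,
    ("dogs", "like", "to", "eat", "bones"): 210,
}


def build_index(counts):
    """Map each 4-word prefix to its (last word, count) pairs, most frequent first."""
    index = defaultdict(list)
    for ngram, count in counts.items():
        prefix, last = ngram[:-1], ngram[-1]
        index[prefix].append((last, count))
    for prefix in index:
        index[prefix].sort(key=lambda pair: pair[1], reverse=True)
    return index


def continuations(index, text, limit=3):
    """Return up to `limit` likely next words for a 4-word phrase."""
    prefix = tuple(text.lower().split())
    return [word for word, _ in index.get(prefix, [])[:limit]]


INDEX = build_index(TOY_COUNTS)
print(continuations(INDEX, "cats like to eat"))
```

With the toy counts above, the most frequent continuation comes first. A spelling or suggest-like feature could presumably rank its candidates the same way, just with far more data behind it.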

