Google Blogoscoped

Monday, February 5, 2007

Googleshare Translations

Wired recently ran a pretty interesting article on the machine translation efforts of Jaime Carbonell of Meaningful Machines. From what we know, Google’s translation efforts so far work on the idea of parallel text: when you have existing texts human-translated into different languages, you can use statistical analysis to automatically figure out translations for other texts. These approaches may be the best in town, but if Google’s Arabic machine translation is any indicator, they’re still not good enough to preserve the meaning of every text you translate, let alone render text into anything native-sounding. To quote Wired on Jaime’s approach:

[T]he Meaningful Machines system uses a large collection of text in the target language (in the initial case it’s 150 Gbytes of English text derived from the Web), a small amount of text in the source language, and a massive bilingual dictionary. Given a passage to translate from Spanish, the system looks at each sentence in consecutive five- to eight-word chunks. (...)

The options spit out by the dictionary for each chunk of text can number in the thousands, many of which are gibberish. To determine the most coherent candidates, the system scans the 150 Gbytes of English text, ranking candidates by how many times they appear. The more often they’ve actually been used by an English speaker, the more likely they are to be a correct translation. “We declare our responsibility for what has occurred” is more likely to appear than, say, “responsibility of which it has happened.”
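The core of that ranking step is simple enough to sketch. The following is only a minimal illustration, not Meaningful Machines’ actual code, and it assumes the corpus is just a list of English strings rather than 150 Gbytes of indexed text:

    # A rough sketch of the ranking idea: count how often each candidate
    # phrase appears verbatim in the target-language corpus, and prefer
    # the most frequent one. 'corpus' is assumed to be a list of strings.
    def rank_candidates(candidates, corpus):
        counts = dict((c, sum(c in line for line in corpus)) for c in candidates)
        return sorted(candidates, key=lambda c: counts[c], reverse=True)

    # rank_candidates(["we declare our responsibility for what has occurred",
    #                  "responsibility of which it has happened"], corpus)
    # ... should put the first, more natural-sounding phrasing on top.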

I was wondering how well this algorithm would fare not with 150 Gbytes of data, but with the corpus of the whole world wide web – seen through the eyes of, say, Google.

Let’s take an example. If I want to translate the German sentence “Ich hab heftige Kopfschmerzen heut morgen” into English, Google (via Systran) returns “I have violent headache heut tomorrow.” A better translation would be “Got a major headache this morning.” If we split the sentence into words and look up plain single-word translations, we get this table:

Ich: me, I, self
hab’: have, had, having, got, belongings
heftige: boisterous, fervid, fierce, fiercely, forcible, hard, heavy, impetuous, impetuously, intense, keen, keenly, severe, short-tempered, tempered, tempestuous, tempestuously, testily, vehement, vehemently, vigorous, violent, violently
Kopfschmerzen: headache
heut’: today
morgen: tomorrow, morning
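(For illustration, here’s that same table written down as a Python dictionary. It’s a hand-compiled candidate list just for this example; a real system would of course pull it from a proper bilingual dictionary:)

    # Candidate translations from the table above, hand-compiled for this
    # example; a real system would derive this from a bilingual dictionary.
    candidates = {
        "Ich":           ["me", "I", "self"],
        "hab'":          ["have", "had", "having", "got", "belongings"],
        "heftige":       ["boisterous", "fervid", "fierce", "fiercely",
                          "forcible", "hard", "heavy", "impetuous",
                          "impetuously", "intense", "keen", "keenly",
                          "severe", "short-tempered", "tempered",
                          "tempestuous", "tempestuously", "testily",
                          "vehement", "vehemently", "vigorous", "violent",
                          "violently"],
        "Kopfschmerzen": ["headache"],
        "heut'":         ["today"],
        "morgen":        ["tomorrow", "morning"],
    }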

A simple script can iterate through all possible permutations of the word translations in chunks, let’s say 3 words at a time, and grab the Google page count for each chunk. So we’d try the phrase search...

... and we realize that “I have intense” is likely the most common way to put this. (This is a theoretical example; I didn’t compute all permutations, nor did I compile the word translation list fully automatically.) We then start over from “intense ...” and try out further combinations until we hit the end.
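Here’s roughly what such a script could look like. This is only a sketch: get_page_count() is a hypothetical placeholder for whatever returns the result count of a quoted Google phrase search (there’s no official API call by that name), and candidates is the dictionary from above.

    from itertools import product

    # A sketch of the greedy chunk search: try every permutation of
    # candidate translations for the next few words, keep the phrase
    # with the highest page count, then start over from the last chosen
    # word. get_page_count() is a hypothetical placeholder for whatever
    # returns the result count of a quoted Google phrase search.
    def translate(words, candidates, get_page_count, chunk_size=3):
        output = []
        pos, fixed = 0, None  # 'fixed' carries the overlap word's choice
        while pos < len(words):
            option_lists = [candidates[w] for w in words[pos:pos + chunk_size]]
            if fixed is not None:
                option_lists[0] = [fixed]  # keep the previous chunk's last pick
            best, best_count = None, -1
            for combo in product(*option_lists):
                count = get_page_count('"%s"' % " ".join(combo))
                if count > best_count:
                    best, best_count = combo, count
            if pos + chunk_size >= len(words):  # final chunk: emit everything
                output.extend(best)
                break
            output.extend(best[:-1])   # emit all but the overlap word
            fixed = best[-1]
            pos += chunk_size - 1      # start over from the overlap word
        return " ".join(output)

    # translate("Ich hab' heftige Kopfschmerzen heut' morgen".split(),
    #           candidates, get_page_count)

Starting each new chunk from the previously fixed word keeps adjacent chunks consistent with each other, at the cost of a purely greedy – and thus possibly suboptimal – search.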

What happens is that we’re calculating a kind of “googleshare” of words... the likelihood that some given words appear together on a page (and, since we’re doing a phrase search, the likelihood that they appear right next to each other). I don’t know how well the above would actually work in “real life,” though Meaningful Machines’ results indicate that the algorithm can be quite successful, and I find it hard to imagine that a larger corpus hurts the feasibility of this.
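If you wanted an actual share rather than a raw count, one conceivable normalization – my assumption, not anything from the Wired article – would be the share of pages containing the words at all that contain them as an exact phrase:

    # One possible normalization: pages matching the exact quoted phrase
    # divided by pages matching the unquoted (any-position) query.
    # Again relies on the hypothetical get_page_count() helper.
    def phrase_share(words, get_page_count):
        phrase_hits = get_page_count('"%s"' % " ".join(words))
        anywhere_hits = get_page_count(" ".join(words))
        return float(phrase_hits) / anywhere_hits if anywhere_hits else 0.0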

Now, while the script is trivial, it’s only semi-trivial to get a good dictionary that contains a lot of these words (including synonyms). But that’s no magic either, at least if you have the money to license a dictionary. There are a lot of other crucial details to get right, too, of course (e.g. the German “heut morgen” of the sample sentence above cannot be correctly translated if you translate each of the words separately – “heut” is colloquial writing for “heute,” which means “today,” but “heut morgen” means “this morning”). What may be really tough, though – unless you have direct access to Google’s web corpus and don’t have to resort to screen-scraping their servers – is to query the page count for all permutations within a reasonable time limit (say, a second).
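To get a feeling for the query volume, here’s a back-of-the-envelope count for the sample sentence, using the candidate numbers from the table above:

    # Phrase searches needed for the sample sentence with 3-word chunks:
    first_chunk  = 3 * 5 * 23  # Ich (3) x hab' (5) x heftige (23) = 345
    second_chunk = 1 * 1 * 1   # "intense" fixed x headache x today =  1
    third_chunk  = 1 * 2       # "today" fixed x morgen (2)        =   2
    print(first_chunk + second_chunk + third_chunk)  # 348 phrase searches

Even this toy sentence needs hundreds of queries, and a richer dictionary with more synonyms per word would multiply that quickly – hence the time-limit problem.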
