Google Blogoscoped

Wednesday, March 10, 2010

No, we can’t translate "Yes we can"
By Roger Browne

Ten years ago, the best-available translation software analysed the source text to determine its structure: subject, object, nouns, verbs, phrases, etc. From the structure tree, a new text could be generated in the target language.

The precise details of Google’s translation algorithms are not published, but the structure tree is not the main mechanism. Instead, there is a corpus – an enormous database of parallel works. These are works available in more than one language as a result of a previous human translation.

Based on equivalents found in the corpus, Google obtains translations for various multi-word fragments from the source text, then blends those together into what is usually a coherent sentence in the target language.

The system works less well on fragments that appear untranslated in the corpus. For example, the phrase “Yes we can” was used prominently in Barack Obama’s election campaign, and was therefore quoted in the original English in many foreign-language news reports. You can see this in a search for [obama “yes we can”] on google.de.

As a result, Google Translate is not always able to translate that phrase, even when used in a context unrelated to Barack Obama.
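To make the mechanism concrete, here is a minimal sketch in Python. It is not Google's algorithm, which is unpublished; it is a toy phrase table that assumes each source fragment is simply given the target it is most often paired with in the corpus. The names `build_phrase_table`, `translate` and `PAIRS`, and the example pairs themselves, are invented for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical aligned fragments (English source, German-side text).
# Three of the four "yes we can" pairs keep the slogan in English,
# mimicking news reports that quote it untranslated.
PAIRS = [
    ("yes we can", "Yes we can"),
    ("yes we can", "Yes we can"),
    ("yes we can", "Yes we can"),
    ("yes we can", "Ja, wir schaffen das"),
    ("help you", "Ihnen helfen"),
    ("thank you", "danke"),
]

def build_phrase_table(aligned_pairs):
    """Map each source fragment to its most frequent target in the corpus."""
    counts = defaultdict(Counter)
    for source, target in aligned_pairs:
        counts[source.lower()][target] += 1
    return {src: c.most_common(1)[0][0] for src, c in counts.items()}

def translate(text, table, max_len=4):
    """Greedy longest-match lookup; unknown words pass through unchanged."""
    words = text.lower().split()
    out = []
    i = 0
    while i < len(words):
        # Try the longest fragment starting at position i first.
        for n in range(min(max_len, len(words) - i), 0, -1):
            fragment = " ".join(words[i:i + n])
            if fragment in table:
                out.append(table[fragment])
                i += n
                break
        else:
            # No fragment matched, so keep the source word as-is.
            out.append(words[i])
            i += 1
    return " ".join(out)

TABLE = build_phrase_table(PAIRS)
print(translate("Yes we can help you", TABLE))
```

Because the untranslated pairing outnumbers the translated one in this made-up corpus, the slogan comes through in English even though the rest of the sentence is translated, and even though the sentence has nothing to do with Obama.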

In a test I performed today, I found that the phrase “Yes we can” was not translated into these languages:

Catalan, Czech, Dutch, Finnish, French, German, Hungarian, Italian, Polish, Portuguese, Slovak, Spanish, Turkish.

It was translated into these languages:

Afrikaans, Albanian, Arabic, Belarusian, Bulgarian, Chinese, Croatian, Danish, Estonian, Filipino, Galician, Greek, Haitian, Hebrew, Hindi, Icelandic, Indonesian, Irish, Japanese, Korean, Latvian, Lithuanian, Macedonian, Malay, Maltese, Norwegian, Persian, Romanian, Russian, Serbian, Slovenian, Swahili, Thai, Ukrainian, Vietnamese, Welsh, Yiddish.

What can we conclude from this? Probably that the corpus for each language on the first list includes a higher proportion of Obama campaign reports quoting the slogan in English than the corpus for any language on the second list.

[Thanks Ilan and Philipp!]

