Google Blogoscoped

Monday, May 24, 2004

Separating Words

Last year I had the chance to visit the Byteburg, a German castle with hi-tech inside.
Besides Kai Krause (inventor of Kai's Power Tools, Goo, Bryce ...) and other fascinating people, I talked to a designer named Athena. She's a native English speaker, and while struggling with our language she wondered whether there is a dictionary for those German words which are made up of several other words (like "Weltgesundheitsorganisation", World Health Organization).

One idea came to my mind back then: a simple algorithm which would cut a compound word down to different lengths. Using the Google Web API, all one would need to do is measure the page count for each fragment, from the full length down to one letter.
For example (this is also possible in English):

 1. swordmaster     24700
 2. swordmaste         46
 3. swordmast         101
 4. swordmas           13
 5. swordma            27
 6. swordm             75
 7. sword         7510000
 8. swor            15900
 9. swo            145000
10. sw           20300000
11. s           730000000

I will assume that fragments of one or two letters appear too often to be meaningful, which puts the highest increment at entry 7, "sword". We can now also cut off the remaining "master" and know the two separate words the merged word is made up of.
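
As a minimal sketch of that "highest increment" rule, the following runs over the page counts from the table above. The function name `largest_jump` and the `min_len` cutoff are my own illustrative choices; the cutoff simply encodes the assumption that very short fragments are ignored.

```python
# Page counts copied from the table above, longest fragment first.
PREFIX_COUNTS = [
    ("swordmaster", 24_700),
    ("swordmaste", 46),
    ("swordmast", 101),
    ("swordmas", 13),
    ("swordma", 27),
    ("swordm", 75),
    ("sword", 7_510_000),
    ("swor", 15_900),
    ("swo", 145_000),
    ("sw", 20_300_000),
    ("s", 730_000_000),
]


def largest_jump(counts, min_len=3):
    """Return the fragment whose count rises most over the previous, longer one.

    Fragments shorter than min_len are skipped (an assumption of this sketch),
    since very short strings match far too many pages.
    """
    best_fragment, best_increase = None, 0
    for (_, longer_count), (fragment, count) in zip(counts, counts[1:]):
        if len(fragment) < min_len:
            continue
        increase = count - longer_count
        if increase > best_increase:
            best_fragment, best_increase = fragment, increase
    return best_fragment


head = largest_jump(PREFIX_COUNTS)
tail = "swordmaster"[len(head):]  # whatever is left after the first word
print(head, tail)
```

Measuring the increase as a plain difference is one possible reading of "increment"; a ratio between neighbouring counts would be another, and picks the same fragment for these figures.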
Let's reverse the above and strip letters from the front instead:

 1. swordmaster      24700
 2.  wordmaster       9170
 3.   ordmaster         22
 4.    rdmaster        161
 5.     dmaster       6850
 6.      master   52400000
 7.       aster     809000
 8.        ster    1030000
 9.         ter    6440000
10.          er   16800000
11.           r  291000000

This time the highest increment is from "dmaster" to "master", making it the main word part.
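
Both directions can be combined into one routine. This is only a sketch under stated assumptions: `split_compound` and its `page_count` parameter are hypothetical names, and `page_count` stands in for whatever search API call returns a result count. Here it is backed by the figures quoted in this post rather than live queries. The final check that the two halves add up to the original word is an extra sanity test of mine, not part of the idea as described.

```python
from typing import Callable, List, Optional, Tuple

# Figures quoted in this post; in real use these would come from a search API.
QUOTED_COUNTS = {
    "swordmaster": 24_700, "swordmaste": 46, "swordmast": 101,
    "swordmas": 13, "swordma": 27, "swordm": 75, "sword": 7_510_000,
    "swor": 15_900, "swo": 145_000, "sw": 20_300_000, "s": 730_000_000,
    "wordmaster": 9_170, "ordmaster": 22, "rdmaster": 161,
    "dmaster": 6_850, "master": 52_400_000, "aster": 809_000,
    "ster": 1_030_000, "ter": 6_440_000, "er": 16_800_000,
    "r": 291_000_000,
}


def split_compound(
    word: str,
    page_count: Callable[[str], int],
    min_len: int = 3,  # assumption: ignore one- and two-letter fragments
) -> Optional[Tuple[str, str]]:
    """Split word where the page count jumps most, from the front and the back."""

    def largest_jump(fragments: List[str]) -> Optional[str]:
        best, best_increase = None, 0
        previous = page_count(fragments[0])
        for fragment in fragments[1:]:
            current = page_count(fragment)
            if len(fragment) >= min_len and current - previous > best_increase:
                best, best_increase = fragment, current - previous
            previous = current
        return best

    prefixes = [word[:n] for n in range(len(word), 0, -1)]  # drop letters at the end
    suffixes = [word[n:] for n in range(len(word))]         # drop letters at the front
    head, tail = largest_jump(prefixes), largest_jump(suffixes)
    if head is not None and tail is not None and head + tail == word:
        return head, tail
    return None  # the two directions disagree, so no clean split was found


result = split_compound("swordmaster", lambda f: QUOTED_COUNTS.get(f, 0))
print(result)
```

Note that a single word costs roughly two queries per letter this way, which matters given the daily request limit.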

But what about merging two words to build Athena's German dictionary? For that task, we would need a medium-sized dictionary of say 30,000 words. We would then google the page-count for all word combinations (swordmaster, swordmister, swordmustard ...) to find out which ones are actually used.
However, 30,000² is 900,000,000 single queries. At 1,000 requests per day using Google's developer interface, it would take us roughly 2,466 years... and I'm not sure today's languages will still be in existence by then.
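
The back-of-the-envelope estimate can be checked in a few lines. This assumes every ordered pair of dictionary words is queried exactly once and a 365-day year; the constant names are just for illustration.

```python
DICTIONARY_SIZE = 30_000   # words in the medium-sized dictionary
REQUESTS_PER_DAY = 1_000   # daily quota mentioned in the post
DAYS_PER_YEAR = 365        # assumption: leap days ignored

combinations = DICTIONARY_SIZE ** 2        # every word paired with every word
days = combinations / REQUESTS_PER_DAY
years = days / DAYS_PER_YEAR

print(f"{combinations:,} queries, {days:,.0f} days, about {years:,.0f} years")
```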

