Last year I had the chance to visit the Byteburg, a German castle filled with hi-tech.
Besides Kai Krause (inventor of Kai's Power Tools, Goo, Bryce ...) and other fascinating people, I talked to a designer named Athena. She's a native English speaker struggling with our language, and she wondered whether there is a dictionary for those German words which are made up of several other words (like "Weltgesundheitsorganisation", World Health Organization).
One idea came to my mind back then: a simple algorithm which would split a compound word at every possible position. Using the Google Web API, all one would need to do is measure the page count for the word as it is cut down, one letter at a time, from full length to a single letter.
For example (this works in English, too):
1. swordmaster: 24,700
2. swordmaste: 46
3. swordmast: 101
4. swordmas: 13
5. swordma: 27
6. swordm: 75
7. sword: 7,510,000
8. swor: 15,900
9. swo: 145,000
10. sw: 20,300,000
11. s: 730,000,000
If we assume that one- and two-letter fragments appear on far too many pages to be meaningful, the highest increment lands on word 7, "sword". We can now also cut off the remaining "master", and we know the two separate words the compound is made up of.
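Here's a quick Python sketch of that scan. Note that page_count() is just a stand-in assumption for the actual Google Web API request (the real interface was SOAP-based); the rest simply looks for the biggest jump between neighboring truncations:

```python
def find_first_word(word, page_count, min_len=3):
    """Truncate `word` from the end, one letter at a time, and return
    the truncation at which the page count jumps the most.

    `page_count` is assumed to be a callable mapping a query string
    to Google's result count. Fragments shorter than `min_len` are
    skipped, since one- and two-letter strings match too many pages.
    """
    # Full word down to a single letter: swordmaster, swordmaste, ..., s
    truncations = [word[:i] for i in range(len(word), 0, -1)]
    counts = [(w, page_count(w)) for w in truncations]

    best, best_jump = None, 0
    for (_, longer_count), (shorter, shorter_count) in zip(counts, counts[1:]):
        if len(shorter) < min_len:
            continue  # ignore overly common short fragments
        jump = shorter_count - longer_count  # increment when one more letter is cut
        if jump > best_jump:
            best, best_jump = shorter, jump
    return best
```

Fed the counts above, this returns "sword": the jump from "swordm" (75 pages) to "sword" (7,510,000 pages) dwarfs every other step once "sw" and "s" are excluded.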
Let's reverse the process and cut letters from the front instead:
1. swordmaster: 24,700
2. wordmaster: 9,170
3. ordmaster: 22
4. rdmaster: 161
5. dmaster: 6,850
6. master: 52,400,000
7. aster: 809,000
8. ster: 1,030,000
9. ter: 6,440,000
10. er: 16,800,000
11. r: 291,000,000
This time the highest increment is from "dmaster" to "master", making it the main word part.
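The mirror-image scan is the same sketch with the word truncated from the front instead, again using the assumed page_count() stand-in:

```python
def find_second_word(word, page_count, min_len=3):
    """Truncate `word` from the front, one letter at a time, and return
    the suffix at which the page count jumps the most."""
    # Full word down to the last letter: swordmaster, wordmaster, ..., r
    truncations = [word[i:] for i in range(len(word))]
    counts = [(w, page_count(w)) for w in truncations]

    best, best_jump = None, 0
    for (_, longer_count), (shorter, shorter_count) in zip(counts, counts[1:]):
        if len(shorter) < min_len:
            continue  # ignore overly common short fragments
        jump = shorter_count - longer_count
        if jump > best_jump:
            best, best_jump = shorter, jump
    return best
```

Run over the sample counts above, find_first_word() returns "sword" and find_second_word() returns "master", so together the two scans recover both halves of the compound.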
But what about merging two words to build Athena's German dictionary? For that task, we would need a medium-sized dictionary of, say, 30,000 words. We would then google the page count for every word combination (swordmaster, swordmister, swordmustard ...) to find out which ones are actually in use.
However, 30,000² combinations amount to 900,000,000 single queries. At 1,000 requests per day using Google's developer interface, it would take us roughly 2,464 years... and I'm not sure today's languages will still be in existence by then.
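For the record, here's the back-of-the-envelope arithmetic (the 30,000-word dictionary is, of course, just the assumed size from above):

```python
words = 30_000
queries = words ** 2    # every ordered pair of dictionary words
days = queries / 1_000  # Google Web API quota: 1,000 requests per day
years = days / 365.25
print(f"{queries:,} queries, {years:,.0f} years")
# -> 900,000,000 queries, 2,464 years
```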