Director of Google Research
Quote: "It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts." – Sherlock Holmes
More data vs. better algorithms. AI/ML research has typically focused on optimizing algorithms on small data sets. Researchers should instead collect more data, run the same algorithms on it again, and see which ones come out on top. With enough data, many algorithms (think NN, SVM, etc.) perform about equally well. References Banko and Brill (2001) of Microsoft Research.
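A minimal sketch of that kind of experiment, to make the protocol concrete: hold the algorithms fixed, grow the training set, and compare accuracy at each size. Everything here is an assumption for illustration (synthetic 2-D Gaussian data, a nearest-centroid and a 1-nearest-neighbour classifier, the function names); it is not the task or the learners from Banko and Brill.

```python
import random

def make_data(n, rng):
    """Draw n labelled 2-D points: class 0 around (-1, -1), class 1 around (+1, +1)."""
    data = []
    for _ in range(n):
        label = rng.randrange(2)
        center = 1.0 if label else -1.0
        data.append(((rng.gauss(center, 1.0), rng.gauss(center, 1.0)), label))
    return data

def nearest_centroid(train):
    """Return a predictor that assigns a point to the closer class mean."""
    sums = {0: [0.0, 0.0, 0], 1: [0.0, 0.0, 0]}
    for (x, y), label in train:
        s = sums[label]
        s[0] += x
        s[1] += y
        s[2] += 1
    # Skip a class that happens to be absent from a very small training set.
    centroids = {c: (s[0] / s[2], s[1] / s[2]) for c, s in sums.items() if s[2]}

    def predict(p):
        return min(centroids, key=lambda c: (p[0] - centroids[c][0]) ** 2
                                            + (p[1] - centroids[c][1]) ** 2)
    return predict

def one_nearest_neighbour(train):
    """Return a predictor that copies the label of the closest training point."""
    def predict(p):
        _, label = min(train, key=lambda ex: (p[0] - ex[0][0]) ** 2
                                             + (p[1] - ex[0][1]) ** 2)
        return label
    return predict

def accuracy(predict, test):
    """Fraction of test points whose predicted label matches the true label."""
    return sum(predict(p) == label for p, label in test) / len(test)

def learning_curves(sizes=(10, 100, 1000), seed=0):
    """Score both classifiers on one fixed test set at each training size."""
    rng = random.Random(seed)
    test = make_data(300, rng)
    pool = make_data(max(sizes), rng)
    curves = {}
    for n in sizes:
        train = pool[:n]  # smaller training sets are prefixes of the larger ones
        curves[n] = {
            "nearest_centroid": accuracy(nearest_centroid(train), test),
            "1-NN": accuracy(one_nearest_neighbour(train), test),
        }
    return curves

if __name__ == "__main__":
    for n, scores in learning_curves().items():
        print(n, scores)
```

Whether the gap between the two learners actually closes depends on the data and the algorithms; the sketch only shows how one would measure it.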
Linux humor: Enter "cat" and "dog" into Google Sets. Now enter "cat" and "more". labs.google.com/sets
Uses Windows XP on an IBM Thinkpad.
Google Translate (GT) is based on the principles of Statistical Machine Translation. See Peter F. Brown, John Cocke, Stephen A. Della Pietra, Vincent J. Della Pietra, Frederick Jelinek, John D. Lafferty, Robert L. Mercer, and Paul S. Roossin. A statistical approach to machine translation. Computational Linguistics, 16(2):79–85, June 1990.
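For reference, the core idea of statistical machine translation is usually written as a noisy-channel decision rule. The notation below (f for the foreign sentence, e for a candidate English sentence) is the common textbook convention and is an assumption here, not necessarily the symbols used in the paper or anything specific to GT:

```latex
\hat{e} \;=\; \arg\max_{e} P(e \mid f) \;=\; \arg\max_{e} P(e)\, P(f \mid e)
```

P(e) is a language model and P(f | e) is a translation model; both are estimated from text, which is why more data helps directly.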
GT now gets 55% accuracy on English-to-Arabic. Human agreement on human translations is 60%. Once GT passes that point, they have no standard left by which to measure their progress!
GT got bett resu by prun all engl word to 4 lett. Also saved a lot of spac.
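A rough sketch of what that truncation does, assuming plain whitespace tokenization and lowercasing for the vocabulary count (both are my assumptions, as are the function names; the talk gave no implementation details):

```python
def truncate_tokens(text, max_len=4):
    """Cut every whitespace-separated word down to its first max_len characters."""
    return " ".join(word[:max_len] for word in text.split())

def vocabulary_size(sentences, max_len=None):
    """Count distinct lowercased words, optionally after truncation."""
    vocab = set()
    for sentence in sentences:
        for word in sentence.lower().split():
            vocab.add(word if max_len is None else word[:max_len])
    return len(vocab)

if __name__ == "__main__":
    corpus = ["translate translated translation translating",
              "results resulting"]
    print(truncate_tokens("GT got better results"))   # GT got bett resu
    print(vocabulary_size(corpus))                    # 6
    print(vocabulary_size(corpus, max_len=4))         # 2
```

Truncation works as a crude stemmer: inflected forms collapse into one entry, so the tables get smaller and each entry is seen more often.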
That was a strange talk. The take home msg was that current algorithms are better if you train them on more data. Hello? What is the point of that?
GT seems to do OK, but how exciting is it to use an algorithm that does not build on the structure of language? It would be so much better to train a model to speak two languages and then let it translate. Of course it won't work right away, but that would be called nice science work.
Concerning the ceiling effect, reading the literature about this should tell them that a common criterion is to translate something forward and backward between two languages until the performance goes down.
Poor Peter Norvig, he has to come to Boulder, give a boring talk to hire people that ask questions like: "Is Google into hardware?"
> The take home msg was that current algorithms are
> better if you train them on more data. Hello? What is the
> point of that?
I suppose it's a matter of focus – focus on getting more data rather than worrying about finding the perfect algorithm.
> GT seems to do OK, but how exciting is it to use
> an algorithm that does not build on the structure of language?
I think Google is very pragmatic – what works works, and the user's happy.
> Concerning the ceiling effect, reading the literature
> about this should tell them that a common criterion is
> to translate something forward and backward between
> two languages until the performance goes down.
I wonder how well human translators fare in this test.
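For anyone who wants to try the round-trip test discussed above, here is a minimal sketch. The `forward` and `backward` arguments are hypothetical stand-ins for any translator (machine or human) that maps a string to a string, and the word-overlap score from `difflib` and the 0.8 threshold are my own assumptions, not a standard from the literature.

```python
from difflib import SequenceMatcher

def round_trips_until_degraded(text, forward, backward,
                               threshold=0.8, max_rounds=10):
    """Translate forward and back repeatedly.

    Returns (rounds, final_text), where rounds is the number of round trips
    completed when word-level similarity to the original first drops below
    threshold, or None if that never happens within max_rounds.
    """
    original_words = text.split()
    current = text
    for rounds in range(1, max_rounds + 1):
        current = backward(forward(current))
        score = SequenceMatcher(None, original_words, current.split()).ratio()
        if score < threshold:
            return rounds, current
    return None, current

# Toy word-for-word "translators" (made up for the example). The forward
# table merges "big" and "large", so the round trip loses information.
EN_TO_XX = {"big": "gross", "large": "gross", "house": "haus"}
XX_TO_EN = {"gross": "large", "haus": "house"}

def toy_forward(text):
    return " ".join(EN_TO_XX.get(w, w) for w in text.split())

def toy_backward(text):
    return " ".join(XX_TO_EN.get(w, w) for w in text.split())

if __name__ == "__main__":
    print(round_trips_until_degraded("the big house", toy_forward, toy_backward))
    # (1, 'the large house')
```

Plugging in two human translators for `forward` and `backward` would give a baseline for the question above.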