The quality of the optical character recognition Google runs over the scanned pages varies, depending on the font the book was printed in. Take this piece from Shakespeare’s Hamlet:
Pol. I, fafhion you may call it, go to, go to.
Ophe. And hath giuen countenance to his fpeech
My Lord, with almoft all the holy vowes of heauen.
Pol. I, fprings to catch wood-cockes, I doe knowe
When the blood burnes, how prodigall the foule
It’s hard to make out a word (hath Google no spellchecker?), and if you see the original, you can understand why automation had some trouble here.
A better ASCII transcript could use asterisks to indicate italics and spaces to indent, and it would not mistake the long “s” (ſ) of old typefaces for an “f”... it might perhaps even cross-check the words against an old English dictionary. Project Gutenberg also transcribes public domain works to text, and they’re doing a better job. Then again, they’re doing it semi-manually, with helpers formatting and proofreading the OCR’d books one by one, whereas Google’s approach is more machine-based and possibly easier to scale (though Google still employs humans to turn the pages of the book, at least if we go by some of the fingers that we saw scanned before!).
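To make the dictionary cross-check idea concrete, here is a minimal sketch in Python. It is only an illustration, not how Google or Project Gutenberg actually do it: the word list `KNOWN_WORDS` and the helper names `candidates`, `fix_word` and `fix_line` are made up for this example, and a real version would need a proper period dictionary. For every word that is not in the list, it tries swapping each “f” for an “s” and keeps the result only if exactly one variant turns out to be a known word.

```python
import re
from itertools import product

# Hypothetical, tiny word list for illustration only; a real cross-check
# would load a dictionary of period spellings instead.
KNOWN_WORDS = {"fashion", "speech", "almost", "springs", "soule", "of"}


def candidates(word):
    """Yield every variant of `word` with some subset of its 'f's turned into 's'."""
    options = [("f", "s") if ch == "f" else (ch,) for ch in word]
    for combo in product(*options):
        yield "".join(combo)


def fix_word(word, known=KNOWN_WORDS):
    """Return the corrected word, or the word unchanged if there is no clear fix."""
    if word.lower() in known:
        return word  # already a known word, e.g. "of"
    matches = [c for c in candidates(word) if c.lower() in known]
    # Only correct when the choice is unambiguous.
    return matches[0] if len(matches) == 1 else word


def fix_line(line, known=KNOWN_WORDS):
    """Apply fix_word to every alphabetic run in the line, leaving punctuation alone."""
    return re.sub(r"[A-Za-z]+", lambda m: fix_word(m.group(0), known), line)


print(fix_line("Pol. I, fafhion you may call it, go to, go to."))
```

Words the list does not know (like “giuen” or “heauen”) are simply left as they are, which is the cautious choice: a wrong “correction” would be worse than an honest OCR error.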
[Via Inside Google Book Search.]