Google Books Text Versions

Wednesday, July 4, 2007

Google Books Text Versions

Some of Google’s full view books can now be seen in a plain text version, additional to the existing images of text. This will allow you to copy the text for personal use, or to create a website of it, or have it be read by a screen reader, or do anything else you want (commercial and non-commercial). To search for full view books, use the advanced search page of Google Books and check the “full view” option. Now when you’re in the book view of a result, click the “view plain text” link at the top right. (Not all full view books were showing this link when I tried.)

The quality of the optical character regocnition Google runs over the scanned pages varies, depending on the book font used. Take this piece from Shakespeare’s Hamlet:

Pol. I, fafhion you may call it, go to, go to.
Ophe. And hath giuen countenance to his fpeech
My Lord, with almoft all the holy vowes of heauen.
Pol. I, fprings to catch wood-cockes, I doe knowe
When the blood burnes, how prodigall the foule
...

It’s hard to make out a word (hath Google no spellchecker?), and if you see the original, you can understand why automation had some troubles here:

A better ASCII transcript could add asterisk characters to indicate italics, add spaces to indent, and would not confuse “f” with an “s”... and it might perhaps cross-check the words with an old English dictionary. Project Gutenberg also transcribes public domain works to text, but they’re doing a better job – then again, they’re also doing it semi-manually, formatting and proofreading the OCR’d books one by one with helpers, whereas Google’s approach is more machine-based and possibly easier to scale (though Google still employs humans to turn the pages of the book, at least if we go by some of the fingers that we saw scanned before!).

[Via Inside Google Book Search.]

Google Books Text Version ... by Philipp Lenssen | Comments (11)

>> More posts

Blog | Forum more >> Archive | Feed | Google's blogs | About

This site unofficially covers Google™ and more with some rights reserved. Join our forum!