Google Blogoscoped

Madrid’s Complutense University Library in Google Book Search

Ionut Alex. Chitu [PersonRank 10]

Tuesday, September 26, 2006
14 years ago, 2,519 views

"Out-of-copyright books previously only available to people with access to Madrid’s Complutense University Library, or the money to travel, will now be accessible to everyone with an Internet connection, wherever they live [through Google Book Search]."

booksearch.blogspot.com/2006/0 ...

Philipp Lenssen [PersonRank 10]

14 years ago #

On a related note, I managed to get a lot of you into Google Books, though it didn't seem to work for everyone!
books.google.com/books?vid=ISB ...

E.g. names that work:
[brinke guthrie]
[tony ruscoe]

Names that don't work even though they appear in the book:
[ionut alex. chitu]
[tomhtml]

Ionut Alex. Chitu [PersonRank 10]

14 years ago #

This works:
books.google.com/books?vid=ISB ...

Tony Ruscoe [PersonRank 10]

14 years ago #

And [alex chitu] works too:

books.google.com/books?vid=ISB ...

Does Google use OCR to index these books? It looks like it has misread "ionut" as something else, since it doesn't highlight it.

Tony Ruscoe [PersonRank 10]

14 years ago #

Yup... [lonut alex chitu] (note the lowercase "L") works:

books.google.com/books?q=lonut ...

:-)

Ionut Alex. Chitu [PersonRank 10]

14 years ago #

>>Does Google use OCR to index these books?

What else?

Maybe they should accept already-digitized books. You just submit a PDF, for example.

Tony Ruscoe [PersonRank 10]

14 years ago #

Well, I thought they might just extract the text from the PDF (where possible).

Philipp Lenssen [PersonRank 10]

14 years ago #

> Yup... [lonut alex chitu] (note the lowercase "L") works:

Nice find. It never occurred to me that you can optimize some Google Book searches by using common OCR (0<R?) errors.
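To illustrate the idea, here is a minimal sketch that expands a search term into the spellings an OCR pass might plausibly have produced. The confusion table and the name `ocr_variants` are made up for this example; nothing here reflects how Google Book Search actually works.

```python
from itertools import product

# Hypothetical confusion table: each character maps to the glyphs an OCR
# engine might output for it. The pairs are illustrative assumptions only.
CONFUSIONS = {
    "i": ["i", "l", "1"],
    "l": ["l", "i", "1"],
    "o": ["o", "0"],
}


def ocr_variants(word, max_variants=50):
    """Return spellings of `word` that OCR might have produced, original first."""
    # For every character, list the candidate glyphs (itself if unlisted).
    choices = [CONFUSIONS.get(ch, [ch]) for ch in word.lower()]
    variants = []
    for combo in product(*choices):
        variants.append("".join(combo))
        if len(variants) >= max_variants:
            break  # cap the output, since combinations grow quickly
    return variants


print(ocr_variants("ionut"))
```

Each variant can then be tried as its own query, which is how [lonut alex chitu] turns up when [ionut alex chitu] does not.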

> Well, I thought they might just extract the text from
> the PDF (where possible).

True, Lulu.com could just send the PDF to Google Book Search. But this is one of those Dilbert cases where you type something into the computer, print it out, send the printout to a company, and they scan it and OCR it :) I suppose it's easier for Google to have a single well-working process than to optimize for small fish like Lulu. So I guess in this case they have more trouble because the words are in italics and not found in an English dictionary.

zmarties [PersonRank 10]

14 years ago #

It's not just the text that Google want, it's information about where on the page the words are found so that they can highlight them. Extracting text from PDF is rarely straightforward; extracting positional information is much harder still, so the simplest way is, as you say, to print the PDF out and OCR it.
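As a rough sketch of why the positions matter, assume the OCR step yields each recognised word together with a bounding box on the page scan. The `OcrWord` record and `highlight_boxes` function are hypothetical names for illustration, not anything Google has published.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OcrWord:
    """One recognised word plus its bounding box in page pixels (assumed layout)."""
    text: str
    x: int
    y: int
    width: int
    height: int


def highlight_boxes(words, query):
    """Return the (x, y, width, height) box of every word matching `query`."""
    wanted = query.lower()
    return [
        (w.x, w.y, w.width, w.height)
        for w in words
        if w.text.lower() == wanted
    ]


page = [OcrWord("for", 10, 20, 30, 12), OcrWord("fale", 45, 20, 40, 12)]
print(highlight_boxes(page, "fale"))
```

With plain extracted text there would be nothing to return here, because the boxes are exactly the part a text-only extraction does not provide.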

There are some documents where the OCR process goes very wrong. Consider this page of old printed English, with the "long s" symbol, which looks like a modern "f" character:

books.google.com/books?vid=0zC ...

By the way, is anyone else seeing the problem that the highlighting, which is supposed to be on the word "sale" (which I had to search for as "fale") in all cases misses the actual word – being always below and to the right of where it should be?
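For anyone wanting to search such pages, here is a small sketch of the "sale" to "fale" rewrite. It assumes the usual printing convention that the long s appears at the start and in the middle of a word but not as its last letter, and that OCR reads every long s as "f". The function name `long_s_query` is invented for this example, and the rule ignores the finer exceptions printers followed.

```python
def long_s_query(word):
    """Rewrite `word` as OCR might read it if every long s came out as "f".

    Simplifying assumption: every "s" except a word-final one was printed
    as a long s.
    """
    if not word:
        return word
    body, last = word[:-1], word[-1]
    # Only the non-final letters are candidates for the long s.
    return body.replace("s", "f").replace("S", "F") + last


print(long_s_query("sale"))
print(long_s_query("houses"))
```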

Philipp Lenssen [PersonRank 10]

14 years ago #

The yellow marker is off here too...
