Google Blogoscoped

Forum

Google Books Text Versions

Jordan [PersonRank 0]

Wednesday, July 4, 2007
16 years ago, 8,241 views

Google isn't getting S and F mixed up. The letters that look like Fs are long S's. It's a character that is no longer used in English, but was in Shakespeare's time. Shakespearean English is also not old English, but rather is 'Early Modern English'. If it was Old English, us native modern English speakers would probably not be able to understand it.

Wikipedia enlightens:

http://en.wikipedia.org/wiki/Long_s
http://en.wikipedia.org/wiki/Early_Modern_English

Keri Morgret [PersonRank 1]

16 years ago #

There's a program out there to help with some of this, called reCAPTCHA. http://recaptcha.net/

reCAPTCHA is working on books going into the Internet Archive. When there's a word that the OCR program says it cannot read correctly, it goes into a CAPTCHA. There are two words presented as the CAPTCHA, one unknown, one known. If the known word is typed correctly, the system assumes the unknown word is likely correct. It'll present the unknown word to several other people to verify the answer is correct.
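
The two-word check described above can be sketched roughly as follows. This is only an illustration of the idea, not reCAPTCHA's actual code: the function names, the case-insensitive comparison, and the three-vote threshold are all assumptions made for the example.

```python
from collections import Counter


def check_captcha(known_answer, typed_known, typed_unknown, votes):
    """Record a vote for the unknown word only if the known word was typed correctly."""
    if typed_known.strip().lower() != known_answer.lower():
        return False  # control word failed, so the other answer is ignored
    votes[typed_unknown.strip().lower()] += 1
    return True


def accepted_reading(votes, min_votes=3):
    """Return the most common reading once enough people agree, else None.

    min_votes is a made-up threshold for this sketch.
    """
    if not votes:
        return None
    word, count = votes.most_common(1)[0]
    return word if count >= min_votes else None
```

Each person who passes the known word adds one vote for their reading of the unknown word, and the reading is only trusted once several people have typed the same thing.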

AN [PersonRank 3]

16 years ago #

Jordan: "Google isn't getting S and F mixed up. The letters that look like Fs are long S's."

Exactly, and Google gets them mixed up because it doesn't seem to know the long s. Nothing a dictionary couldn't help with ("fashion" is in there, "fafhion" is not). I think that's what Philipp meant and said.
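
A minimal sketch of that dictionary idea, assuming a plain set of modern words as the lexicon: the function name `restore_long_s` and the tiny word list are invented for illustration and say nothing about how Google's OCR works. For a token that is not in the lexicon, it tries replacing lowercase "f" with "s" and keeps the first variant that is a known word.

```python
import itertools


def restore_long_s(token, lexicon):
    """Return token with misread long s letters restored, if the lexicon confirms it.

    Only lowercase 'f' is considered, since the long s has no capital form.
    Tokens that cannot be matched are returned unchanged.
    """
    if token.lower() in lexicon:
        return token
    positions = [i for i, ch in enumerate(token) if ch == "f"]
    # Try the smallest number of substitutions first.
    for n in range(1, len(positions) + 1):
        for combo in itertools.combinations(positions, n):
            chars = list(token)
            for i in combo:
                chars[i] = "s"
            candidate = "".join(chars)
            if candidate.lower() in lexicon:
                return candidate
    return token


WORDS = {"fashion", "most", "sick"}  # stand-in for a real dictionary
print(restore_long_s("fafhion", WORDS))  # fashion
print(restore_long_s("moft", WORDS))     # most
print(restore_long_s("felfe", WORDS))    # felfe (not in the word list, left alone)
```

Period spellings that a modern word list does not contain are left exactly as the OCR produced them, so the approach only fixes words whose corrected form is still spelled the same way today.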

Gerd [PersonRank 0]

16 years ago #

Google has a real problem if you're looking at the ASCII version of an original printed in the German Fraktur typeface. Unfortunately, that is the typeface used in most German books that are out of copyright.

Below is an example from the Brothers Grimm fairy tales.

Would you guess that:
Der frofd)kömg oîrcr îrtr eiftrne íjeinnd).
3n ben alten 3«íí", we baê SBunfdjen nod) geholfen
Ijat, teilte ein fiönig, beffen ïodjter maren aQe fd)ín, aber

actually reads:

Der Froschkönig oder der eiserne Heinrich
In den alten Zeiten, wo das Wünschen noch geholfen
hat, lebte ein König, dessen Töchter waren alle schön, aber

What about experiences with non-Latin scripts and other typefaces?

PS: I thought that OCRopus would help with this kind of problem. Since I'm visiting the DFKI (German Research Center for Artificial Intelligence) next week, I'll try to get more information on that.

Tony Ruscoe [PersonRank 10]

16 years ago #

<< Shakespearean English is also not old English, but rather is 'Early Modern English'. If it was Old English, us native modern English speakers would probably not be able to understand it. >>

Jordan, you're right, of course, but "old English" does not necessarily have to mean "Old English".

Martin Porcheron [PersonRank 10]

16 years ago #

> "Jordan, you're right, of course, but "old
> English" does not necessarily have to mean
> "Old English".

I've had an English teacher in the past who cringed when someone claimed Shakespeare was old English. But truth be told, it's pretty damn old compared to me, so it's Old English.

Philipp Lenssen [PersonRank 10]

16 years ago #

> Google isn't getting S and F mixed up. The letters
> that look like Fs are long S's.

OK, Google is getting F and long S mixed up, then... but that's the same end result: they should display as normal "s" what they now display as normal "f", like AN above mentions. I think that's called transliteration, added to the translation.

Here in Germany, they also had that letter in newspapers up to the 1940s, so you often stumble across it reading antique stuff...

J. McNair [PersonRank 10]

16 years ago #

Um, yeah. What everyone else said about long and minuscule "s". There are OCR systems specifically designed for certain script styles or ranges of fonts like Fraktur or Blackletter. Some can even decode the more obscure ligatures, simply because these scripts were semi-standardized. Somehow, I doubt they can do much for obscure scribal abbreviations (like SPQR for Senatus Populusque Romanus, "The Senate and People of Rome"). That takes a certain degree of scholarship.

On the other hand, this only helps books using Latin and Latin-ish alphabets. I know Google's OCRopus is asking for help here.

Michael Lines [PersonRank 0]

16 years ago #

There are other problems, like 'v' for 'u'. Also, using Unicode instead of ASCII would probably solve some of this:

Plain text view of p.1:

The Tragedie of
HAMLET
Prince ofDenmarke.
Enter Barnardo, and Francifco, two Centinels.
Bar. T T"T T"Hofe there ?
Bar. w ? Long liue the King,
Fran. f f Nay anfwere me.Stand and vnfolde your felfe. Fran. Earnardo.
Bar. Hee.
Fran. You come moft carefully vpon your houre,
Bar. Tis now ftrooke twelfe, get thee to bed Francifco,
Fran. For this reliefe much thanks, tis bitter cold,
And I am fick at hart.
Bar. Haue you had quiet guard ?
Fran. Not a moufe ftirring.
Bar. Well, good night:
If you doe meete Horatio and Marcellus,
The riualls of my watch, bid them make haft.
Enter Horatio, and Marcellus.
Fran. I thinke I heare them, ftand ho, who is there ?
Hora. Friends to this ground.
Mar. And Leedgemen to the Dane,
Fran. Giue you good night.
Mar. O, farwell honeft fouldiers, who hath relieu'd you ?
Fran. Earnardo hath my place ; giue you good night. Exit Fran.
B. Mar.
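
To illustrate the Unicode point: the long s has its own code point, U+017F (LATIN SMALL LETTER LONG S), so a text layer could store it faithfully and fold it to a plain "s" only for search. The sketch below uses Python's standard `unicodedata` module. The function name `searchable` is made up for the example, and the sample string is an assumed faithful transcription of the "anfwere" line above, not something taken from Google's output.

```python
import unicodedata

LONG_S = "\u017f"  # LATIN SMALL LETTER LONG S


def searchable(text):
    """Fold compatibility characters such as the long s to their plain forms."""
    return unicodedata.normalize("NFKC", text)


faithful = "Nay an" + LONG_S + "were me"  # assumed transcription of "Nay anfwere me"
print(unicodedata.name(LONG_S))  # LATIN SMALL LETTER LONG S
print(searchable(faithful))      # Nay answere me
```

Because the fold to "s" happens only in `searchable`, the stored text can stay faithful to the printed page. Note that this does nothing for the 'v' for 'u' spellings, which are ordinary letters rather than a separate character.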

Rob Fuller [PersonRank 1]

16 years ago #

I'm not sure there's anything a spellchecker could do to help. Spelling was not standardised in English until much later than Shakespeare, so there will be multiple variant spellings of the same word in an old text such as this. Apparently there are at least 25 different versions of the spelling of Shakespeare's own name in contemporary writing (http://shakespeareauthorship.com/name1.html).

With the exception of the long s, the Google OCR system seems to have reproduced faithfully what is printed – which is the way it should be, isn't it?

Colin Colehour [PersonRank 10]

16 years ago #

WebProNews mentions this post.

http://www.webpronews.com/topnews/2007/07/05/google-book-search-gets-text-layer
