Google isn't getting S and F mixed up. The letters that look like Fs are long S's. It's a character that is no longer used in English, but was in Shakespeare's time. Shakespearean English is also not old English, but rather is 'Early Modern English'. If it were Old English, we native Modern English speakers would probably not be able to understand it.
Wikipedia enlightens:
http://en.wikipedia.org/wiki/Long_s http://en.wikipedia.org/wiki/Early_Modern_English |
There's a program out there to help with some of this, called reCAPTCHA. http://recaptcha.net/
reCAPTCHA is working on books going into the Internet Archive. When there's a word that the OCR program says it cannot read correctly, it goes into a CAPTCHA. There are two words presented as the CAPTCHA, one unknown, one known. If the known word is typed correctly, the system assumes the unknown word is likely correct. It'll present the unknown word to several other people to verify the answer is correct. |
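The two-word check described above can be sketched roughly like this. It is only an illustration, not reCAPTCHA's actual code: the function names and the threshold of three matching answers are assumptions made for the example.

```python
from collections import Counter

def accept_submission(known_answer, typed_known, typed_unknown, votes):
    """Count a guess for the unknown word only if the known word was typed correctly."""
    if typed_known.strip().lower() == known_answer.strip().lower():
        votes[typed_unknown.strip().lower()] += 1
        return True
    return False

def resolved_word(votes, min_votes=3):
    """Return the agreed reading once one guess has min_votes votes, else None.

    min_votes=3 is an assumed threshold, not a documented reCAPTCHA value.
    """
    if not votes:
        return None
    word, count = votes.most_common(1)[0]
    return word if count >= min_votes else None
```

Each unknown word would get its own `Counter`, and the word would keep being shown to new people until `resolved_word` returns something.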
Jordan: "Google isn't getting S and F mixed up. The letters that look like Fs are long S's."
Exactly, and Google gets them mixed up because it doesn't seem to know a long s. Nothing a dictionary couldn't help with ("fashion" is in there, "fafhion" not). I think that's what Philipp meant and said.
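A minimal sketch of that dictionary idea, assuming lowercase words and a plain set as the word list (both `candidates` and `fix_long_s` are made-up names for this example, not anything Google uses):

```python
from itertools import product

def candidates(word):
    """Yield every variant of word with each 'f' either kept or replaced by 's'."""
    options = [("f", "s") if ch == "f" else (ch,) for ch in word]
    for combo in product(*options):
        yield "".join(combo)

def fix_long_s(word, dictionary):
    """Return the single dictionary word reachable by f->s swaps, else the word unchanged."""
    if word in dictionary:
        return word  # already a known word, so leave it alone
    matches = [c for c in candidates(word) if c in dictionary]
    return matches[0] if len(matches) == 1 else word

words = {"fashion", "most", "stand"}
print(fix_long_s("fafhion", words))  # fashion
```

A word is only changed when exactly one swap pattern gives a dictionary word, so anything the dictionary doesn't know in any form is left as the OCR produced it.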
|
Google has a real problem if you're looking at the ASCII version of an original printed in the German Fraktur typeface. Unfortunately, that is the typeface used in most German books that are out of copyright.
Below is an example from the Brothers Grimm fairy tales:
Or would you guess that: Der frofd)kömg oîrcr îrtr eiftrne íjeinnd). 3n ben alten 3«íí", we baê SBunfdjen nod) geholfen Ijat, teilte ein fiönig, beffen ïodjter maren aQe fd)ín, aber
actually reads:
Der Froschkönig oder der eiserne Heinrich In den alten Zeiten, wo das Wünschen noch geholfen hat, lebte ein König, dessen Töchter waren alle schön, aber
What about experiences with other non-Latin scripts, and other typefaces?
PS: I thought that OCRopus would help with that kind of problem. Since I'm visiting the DFKI (German Research Center for Artificial Intelligence) next week, I'm trying to get more information on that. |
<< Shakespearean English is also not old English, but rather is 'Early Modern English'. If it were Old English, we native Modern English speakers would probably not be able to understand it. >>
Jordan, you're right, of course, but "old English" does not necessarily have to mean "Old English". |
> "Jordan, you're right, of course, but "old English" does not necessarily have to mean "Old English".
I've had an English teacher in the past who cringed when someone claimed Shakespeare was old English. But truth be told, it's pretty damn old compared to me, so it's Old English. |
> Google isn't getting S and F mixed up. The letters that look like Fs are long S's.
OK, Google is getting F and long S mixed up, then... but that's the same end result: they should display as a normal "s" what they now display as a normal "f", as AN mentions above. I think that's called transliteration, added to the translation.
Here in Germany, they also had that letter in newspapers up to the 1940s, so you often stumble across it reading antique stuff... |
Um, yeah. What everyone else said about long and minuscule "s". There are OCR systems specifically designed for certain script styles or ranges of fonts like Fraktur or Blackletter. Some can even decode the more obscure ligatures, simply because these scripts were semi-standardized. Somehow, I doubt they can do much for obscure scribal abbreviations (like SPQR for Senatus Populusque Romanus – "The Senate and People of Rome"). That takes a certain degree of scholarship.
On the other hand, this only helps books using Latin and Latin-ish alphabets. I know Google's OCRopus is asking for help here.
|
There are other problems, like 'v' for 'u'. Also, it would probably solve some of this to use Unicode instead of ASCII:
Plain text view of p.1:
The Tragedie of HAMLET Prince ofDenmarke. Enter Barnardo, and Francifco, two Centinels. Bar. T T"T T"Hofe there ? Bar. w ? Long liue the King, Fran. f f Nay anfwere me.Stand and vnfolde your felfe. Fran. Earnardo. Bar. Hee. Fran. You come moft carefully vpon your houre, Bar. Tis now ftrooke twelfe, get thee to bed Francifco, Fran. For this reliefe much thanks, tis bitter cold, And I am fick at hart. Bar. Haue you had quiet guard ? Fran. Not a moufe ftirring. Bar. Well, good night: If you doe meete Horatio and Marcellus, The riualls of my watch, bid them make haft. Enter Horatio, and Marcellus. Fran. I thinke I heare them, ftand ho, who is there ? Hora. Friends to this ground. Mar. And Leedgemen to the Dane, Fran. Giue you good night. Mar. O, farwell honeft fouldiers, who hath relieu'd you ? Fran. Earnardo hath my place ; giue you good night. Exit Fran. B. Mar. |
I'm not sure there's anything a spellchecker could do to help. Spelling was not standardised in English until much later than Shakespeare, so there will be multiple variant spellings of the same word in an old text such as this. Apparently there are at least 25 different versions of the spelling of Shakespeare's own name in contemporary writing (http://shakespeareauthorship.com/name1.html).
With the exception of the long s, the Google OCR system seems to have reproduced faithfully what is printed – which is the way it should be, isn't it?
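On the Unicode point made earlier: the long s has its own code point (U+017F), so a transcription could keep it exactly as printed and still be searchable with a modern "s". A small sketch using Python's standard `unicodedata` module (`search_key` is a made-up helper name, and nothing here is meant to describe what Google Books does):

```python
import unicodedata

LONG_S = "\u017f"  # LATIN SMALL LETTER LONG S

# Faithful transcription: long s kept, original spelling ('vnfolde') untouched.
faithful = "Stand and vnfolde your " + LONG_S + "elfe"

def search_key(text):
    """Fold compatibility characters such as the long s to their modern form."""
    return unicodedata.normalize("NFKC", text)

print(search_key(faithful))  # Stand and vnfolde your selfe
```

NFKC normalization only folds the long s; the old spellings such as 'v' for 'u' stay as printed.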
|