German Spiegel Opens Up Archive - Google Blogoscoped Forum

German Spiegel Opens Up Archive
Suczker	Thursday, February 14, 2008 18 years ago • 5,702 views
Thanks for the link to the archive. But you are right, the implementation is impossible. I hope they will still do something with it.
Luca	18 years ago #
Around the same time, one important (the most important?) Italian newspaper opened its online historical archive: http://archiviostorico.corriere.it/archivio/ricercaAvanzata.jsp Unfortunately it only goes back to 1992, but it holds about 1 million articles.
Philipp Lenssen	18 years ago #
Update: I created a search engine for the Spiegel archive http://blogoscoped.com/spiegel/
Platypus	18 years ago #
">> Oder googel es für Wikipedia und mehr." ...a small spelling mistake ;-) Best wishes!
Philipp Lenssen	18 years ago #
Platypus, where's the spelling error you see? In case you mean "googel", both this spelling and "google" are OK for German... http://blogoscoped.com/archive/2004_08_26_index.html#109355071461570144
Alex Ksikes	18 years ago #
Are you retrieving results and screenscraping from SPIEGEL? Or did you actually build your own search engine with all the SPIEGEL data downloaded?
Martin	18 years ago #
Philipp, your search engine only returns one hit for "Köln". Hard to believe that's all.
Philipp Lenssen	18 years ago #
Alex, it's pure screenscraping. With access to the raw source data, one could build stuff like "show the first occurrence of this keyword" and so on, which would be very neat. With screenscraping, such a feature would require downloading several result pages for background computation, which would take longer (though I might add stuff like this). Martin, it's true, the script currently has problems with umlaut characters. Even though I'm handling everything in UTF-8 (which Spiegel says they use, and which my script uses for output too), somehow there are the usual encoding issues. This already made many workarounds necessary when outputting the stuff; it might be related to the XML functionality of PHP as well. I'll try to improve it where possible.
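A note on the umlaut problem: one common cause of output like "KÃ¶ln" is that UTF-8 bytes get decoded as Latin-1 somewhere along the way. Whether that is what actually happens in the script discussed here is an assumption. The sketch below is in Python rather than the script's PHP, and the function names `to_mojibake` and `repair_mojibake` are made up for illustration.

```python
def to_mojibake(text: str) -> str:
    """Simulate the bug: UTF-8 bytes wrongly decoded as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def repair_mojibake(text: str) -> str:
    """Undo the wrong decoding by reversing the two steps.

    If the text does not look like UTF-8 read as Latin-1 (for
    example because it was fine to begin with), return it unchanged.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except UnicodeError:
        return text


if __name__ == "__main__":
    broken = to_mojibake("Köln")
    print(broken)
    print(repair_mojibake(broken))
```

This kind of repair is only a workaround. If the cause is as assumed, the cleaner fix is to find the one step in the pipeline that decodes with the wrong charset.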
Ionut Alex. Chitu	18 years ago #
I wonder if they intend to include this in Google News Archive Search.
Philipp Lenssen	18 years ago #
PS: If anyone wants the PHP5 source to the engine please email me.
Philipp Lenssen	18 years ago #
> I wonder if they intend to include this in Google News Archive Search.
Interestingly enough, they apparently disallow crawling of most archive pages. Amazing how much traffic they throw away with that decision, compared to offering the archive as crawler-friendly linked pages. http://wissen.spiegel.de/robots.txt
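For anyone curious how a well-behaved crawler decides this, here is a minimal sketch using Python's standard `urllib.robotparser`. The rules and URLs in it are hypothetical examples, not the actual contents of the robots.txt linked above, and `is_allowed` is just an illustrative helper name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; NOT the real file.
EXAMPLE_RULES = """\
User-agent: *
Disallow: /archive/
"""


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given rules let user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "http://example.com/index.html",
        "http://example.com/archive/1975/article.html",
    ):
        print(url, is_allowed(EXAMPLE_RULES, url))
```

A crawler that honors such rules would skip everything under the disallowed path, which is why those pages would not show up in a search index.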
David Orban	18 years ago #
The 'Corriere della Sera', Italy's widest-circulation newspaper, opened its archive a couple of weeks ago as well. http://archiviostorico.corriere.it/archivio/ricercaAvanzata.jsp
