Google Blogoscoped

Forum

German Spiegel Opens Up Archive

Suczker [PersonRank 0]

Thursday, February 14, 2008
16 years ago, 5,035 views

Thanks for the link to the archive. But you are right, the implementation is impossible. I hope they will still do something about it.

Luca [PersonRank 10]

16 years ago #

Around the same time, one of the most important Italian newspapers (perhaps the most important?) opened its online historical archive:

http://archiviostorico.corriere.it/archivio/ricercaAvanzata.jsp

Unfortunately it only goes back to 1992, but it holds about 1 million articles.

Philipp Lenssen [PersonRank 10]

16 years ago #

Update: I created a search engine for the Spiegel archive: http://blogoscoped.com/spiegel/

Platypus [PersonRank 0]

16 years ago #

">> Oder googel es für Wikipedia und mehr." ...a small spelling error ;-)

Best wishes!

Philipp Lenssen [PersonRank 10]

16 years ago #

Platypus, where's the spelling error you see?
In case you mean "googel", both this spelling and "google" are OK for German...
http://blogoscoped.com/archive/2004_08_26_index.html#109355071461570144

Alex Ksikes [PersonRank 10]

16 years ago #

Are you retrieving results by screenscraping SPIEGEL? Or did you actually build your own search engine with all the SPIEGEL data downloaded?

Martin [PersonRank 0]

16 years ago #

Philipp, your search engine only returns one hit for "Köln". Hard to believe that's all.

Philipp Lenssen [PersonRank 10]

16 years ago #

Alex, it's pure screenscraping. With access to the raw source data, one could build features like "show the first occurrence of this keyword" and so on, which would be very neat. With screenscraping, such a feature would require downloading several result pages for background computation, which would take longer (though I might add stuff like this).
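To illustrate why a "first occurrence" feature is slower with screenscraping, here is a minimal Python sketch (not the PHP5 script mentioned in this thread). It assumes a hypothetical `fetch_page` function that returns the scraped hits of one result page as `(date, headline)` pairs; the names, the data shape, and the sample data are all assumptions made for illustration. The point is that every result page has to be fetched before the earliest hit is known.

```python
from datetime import date
from typing import Callable, List, Optional, Tuple

# Hypothetical shape of one scraped hit: (publication date, headline).
Hit = Tuple[date, str]


def first_occurrence(
    fetch_page: Callable[[int], List[Hit]], max_pages: int = 50
) -> Optional[Hit]:
    """Walk the result pages one by one and return the earliest hit.

    fetch_page(n) stands in for "download and parse result page n";
    an empty list means there are no more pages.
    """
    earliest: Optional[Hit] = None
    for page_number in range(1, max_pages + 1):
        hits = fetch_page(page_number)
        if not hits:
            break  # ran past the last result page
        page_earliest = min(hits, key=lambda hit: hit[0])
        if earliest is None or page_earliest[0] < earliest[0]:
            earliest = page_earliest
    return earliest


# Made-up sample data standing in for scraped result pages.
_FAKE_PAGES = {
    1: [(date(2007, 5, 1), "Article A"), (date(1999, 3, 9), "Article B")],
    2: [(date(1950, 1, 1), "Article C"), (date(1984, 7, 2), "Article D")],
}


def fake_fetch(page_number: int) -> List[Hit]:
    return _FAKE_PAGES.get(page_number, [])


result = first_occurrence(fake_fetch)
```

With raw access to the data, the same answer would be a single sorted lookup instead of one download per result page.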

Martin, it's true, the script currently has problems with umlaut characters. Even though I'm handling everything in UTF-8 (which Spiegel says they use, and which my script also uses for output), there are somehow the usual encoding issues. This already made many workarounds necessary when outputting the stuff; it might be related to the XML functionality of PHP as well. I'll try to improve it where possible.
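One common cause of this kind of umlaut trouble, though not necessarily the cause in this script, is that UTF-8 bytes get decoded as Latin-1 somewhere along the way. The Python sketch below reproduces that effect with "Köln" and shows a cautious repair; the helper name `repair_mojibake` is hypothetical.

```python
def repair_mojibake(text: str) -> str:
    """Undo a UTF-8-read-as-Latin-1 mix-up, if that is what happened.

    If the round trip fails, the text was probably fine already,
    so it is returned unchanged.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


original = "Köln"
utf8_bytes = original.encode("utf-8")    # what a UTF-8 page sends
garbled = utf8_bytes.decode("latin-1")   # what a Latin-1 reader sees
repaired = repair_mojibake(garbled)

print(garbled)   # KÃ¶ln
print(repaired)  # Köln
```

A search for the garbled form would of course match far fewer articles than a search for the real word, which would fit the single hit Martin saw.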

Ionut Alex. Chitu [PersonRank 10]

16 years ago #

I wonder if they intend to include this in Google News Archive Search.

Philipp Lenssen [PersonRank 10]

16 years ago #

PS: If anyone wants the PHP5 source to the engine, please email me.

Philipp Lenssen [PersonRank 10]

16 years ago #

> I wonder if they intend to include this in
> Google News Archive Search.

Interestingly enough, they apparently disallow crawling of most archive pages. Amazing how much traffic they throw away with that decision, compared to offering the archive as crawler-friendly linked pages.
http://wissen.spiegel.de/robots.txt
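For anyone who wants to check what a robots.txt file allows, here is a small Python sketch using the standard library's `urllib.robotparser`. The rules and URLs below are made-up examples on example.com, not the actual contents of the Spiegel file, and `ExampleBot` is a hypothetical crawler name.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration only; not the real Spiegel robots.txt.
sample_rules = """\
User-agent: *
Disallow: /archive/
""".splitlines()

parser = RobotFileParser()
parser.parse(sample_rules)

# A well-behaved crawler asks before fetching each URL.
archive_allowed = parser.can_fetch(
    "ExampleBot", "http://example.com/archive/1950.html"
)
home_allowed = parser.can_fetch("ExampleBot", "http://example.com/index.html")

print(archive_allowed)  # False
print(home_allowed)     # True
```

To test a live file instead, `set_url()` followed by `read()` loads the rules over the network before calling `can_fetch()`.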

David Orban [PersonRank 1]

16 years ago #

The 'Corriere della Sera', Italy's widest-circulation newspaper, opened its archive a couple of weeks ago as well.

http://archiviostorico.corriere.it/archivio/ricercaAvanzata.jsp
