Thanks for the link to the archive. But you are right, the implementation is impossible. I hope they will still do something with it. |
In the same days, one important (the most important?) Italian newspaper opened its online historical archive:
http://archiviostorico.corriere.it/archivio/ricercaAvanzata.jsp
Unfortunately it only starts in 1992, but it holds about 1 million articles.
|
Update: I created a search engine for the Spiegel archive http://blogoscoped.com/spiegel/ |
">> Oder googel es für Wikipedia und mehr." ...kleiner Schreibfehler ;-)
Best wishes! |
Platypus, where's the spelling error you see? In case you mean "googel", both this spelling and "google" are OK for German... http://blogoscoped.com/archive/2004_08_26_index.html#109355071461570144 |
Are you retrieving results by screenscraping SPIEGEL? Or did you actually build your own search engine with all the SPIEGEL data downloaded? |
Philipp, your search engine only returns one hit for "Köln". Hard to believe that's all.
|
Alex, it's pure screenscraping. With access to the raw source data, one could build features like "show the first occurrence of this keyword" and so on, which would be very neat. With screenscraping, such a feature would require downloading several result pages for background computation, which would take longer (though I might add something like this).
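A rough sketch of why "first occurrence" costs extra requests when scraping: every result page has to be fetched and parsed before the earliest hit is known. Everything here is an assumption for illustration (the `fetch_page` callable, the `Hit` shape, the fake data); it is not the actual script.

```python
from datetime import date
from typing import Callable, List, Optional, Tuple

# Hypothetical shape of one scraped hit: (publication date, headline).
Hit = Tuple[date, str]


def first_occurrence(fetch_page: Callable[[int], List[Hit]],
                     max_pages: int = 20) -> Optional[Hit]:
    """Walk result pages 1..max_pages and return the earliest-dated hit."""
    earliest: Optional[Hit] = None
    for page in range(1, max_pages + 1):
        hits = fetch_page(page)  # when scraping: one HTTP request + parse per page
        if not hits:             # an empty page is treated as "no more results"
            break
        page_min = min(hits, key=lambda h: h[0])
        if earliest is None or page_min[0] < earliest[0]:
            earliest = page_min
    return earliest


# Fake result pages standing in for scraped data.
fake_pages = {
    1: [(date(1999, 5, 3), "B"), (date(2001, 1, 8), "C")],
    2: [(date(1963, 10, 9), "A")],
}
print(first_occurrence(lambda p: fake_pages.get(p, [])))
```

With the raw data in a local index, the same question would be a single sorted lookup instead of a loop over remote pages.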
Martin, it's true, the script currently has problems with umlaut characters. Even though I'm handling everything in UTF-8 (which Spiegel says they use, and which my script uses for output too), somehow there are the usual encoding issues. This already made many workarounds necessary when outputting the results; it might be related to the XML functionality of PHP as well. I'll try to improve it where possible. |
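For anyone curious what "the usual encoding issues" tend to look like with a word like "Köln", here is a small sketch in Python (not the PHP script, and not a diagnosis of it). It only shows two common failure modes, under the assumption that one side speaks UTF-8 and the other Latin-1: the query gets percent-encoded differently, and decoded text turns into mojibake.

```python
from urllib.parse import quote

word = "K\u00f6ln"  # "Köln", written with an escape so the file encoding cannot interfere

# The same query percent-encoded under two different charsets.
as_utf8 = quote(word, encoding="utf-8")      # ö becomes two bytes: %C3%B6
as_latin1 = quote(word, encoding="latin-1")  # ö becomes one byte: %F6

# Mojibake: UTF-8 bytes mistakenly decoded as Latin-1.
garbled = word.encode("utf-8").decode("latin-1")

# The mistake is reversible as long as no bytes were lost on the way.
repaired = garbled.encode("latin-1").decode("utf-8")

print(as_utf8, as_latin1, garbled, repaired)
```

If a site expects one form of the query and receives the other, a search can silently match far fewer articles than it should.
|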
I wonder if they intend to include this in Google News Archive Search. |
PS: If anyone wants the PHP5 source of the engine, please email me. |
> I wonder if they intend to include this in > Google News Archive Search.
Interestingly enough, they apparently disallow crawling of most archive pages. Amazing how much traffic they throw away with that decision, compared to offering the archive as crawler-friendly linked pages. http://wissen.spiegel.de/robots.txt |
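A quick way to see how a crawler reads such rules, sketched with Python's standard `urllib.robotparser`. The rules and URLs below are made up for illustration; they are NOT the real contents of the Spiegel robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only.
rules = """\
User-agent: *
Disallow: /archive/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)  # parse in-memory lines, so no network access is needed

# A well-behaved crawler asks before fetching each URL.
blocked = parser.can_fetch("ExampleBot", "http://example.org/archive/1963/article.html")
allowed = parser.can_fetch("ExampleBot", "http://example.org/index.html")
print(blocked, allowed)
```

To check a live site instead, `set_url()` followed by `read()` would load the real file before calling `can_fetch()`.
|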
The 'Corriere della Sera', Italy's widest-circulation newspaper, opened its archive a couple of weeks ago as well.
http://archiviostorico.corriere.it/archivio/ricercaAvanzata.jsp |