Google Blogoscoped

Forum

Which Sources Does Google News Index?

iZeitgeist [PersonRank 10]

Tuesday, August 1, 2006
13 years ago (4,734 views)

Excellent Work Philipp!

I am Stunned =)

I hope this gets Google's attention to change the 4500 sources number!

TOMHTML [PersonRank 10]

13 years ago #

Only English sources ;-)

And the number hasn't changed since... March 21st, 2003!
web.archive.org/web/2003032102 ...

Seth Finkelstein [PersonRank 10]

13 years ago #

Cool! Wonderful work!

iman [PersonRank 1]

13 years ago #

amazing.

/pd [PersonRank 10]

13 years ago #

Philipp, there must certainly be more sources... after all, you were polling using "semi-random words", which means only a small fraction of items gets returned, based on the "seeded words"!!

I like the yield here... it has a very broad spectrum of sources.

What does the "type" column mean in the sample data? Does it imply paid/free subscriptions??

Hey, can you convert the returned list into an OPML share??

Iolaire McFadden [PersonRank 6]

13 years ago #

My employer, CoStar Group, which also publishes commercial real estate news, is listed. So in a way I see that as validating that the semi-random approach does pull in smaller, more narrowly focused news topics.

stefan2904 [PersonRank 10]

13 years ago #

great work, philipp!

Roger Browne [PersonRank 10]

13 years ago #

This blog is listed as "Outer-Court" rather than "Google Blogoscoped", so I guess other news sources might also be listed under less-familiar names.

Philipp Lenssen [PersonRank 10]

13 years ago #

Thanks for the comments everyone!

> Philipp, there must certainly be more sources...
> after all, you were polling using
> "semi-random words", which means only
> a small fraction of items gets returned, based on
> the "seeded words"!!

I don't think the fraction is that small. But let me explain the "semi-random" words: I used a couple of hundred popular words like "a" or "the" from the dictionary, a couple of hundred two-letter combinations like "vx", a couple of hundred rarer words, a list of words I manually created a while ago (like "Google", "Bush", "Iraq"), and a couple of numbers (like 1, 2, 3). I sorted ~70% of the queries by date published and ~30% by relevancy, on result lists of 100 each. Around 10% of the time I simply queried for either "a" or "the". Take a look at a search for "the" sorted by date... you can refresh every other minute to get new sources:
news.google.com/news?hl=en& ...
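
The query mix described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the word lists are tiny placeholders standing in for the lists of a few hundred words mentioned in the post, the names `WORD_POOL` and `next_query` are hypothetical, and no request to Google News is made. The sketch only decides what to query and how to sort.

```python
import itertools
import random
import string

# Placeholder pools; the post mentions a couple of hundred entries per category.
POPULAR_WORDS = ["a", "the", "of", "and", "in"]
RARE_WORDS = ["zephyr", "quixotic", "obelisk"]
MANUAL_WORDS = ["Google", "Bush", "Iraq"]
NUMBERS = ["1", "2", "3"]
# Two-letter combinations such as "vx". The post used a couple of hundred;
# all 26 * 26 = 676 are generated here for simplicity.
TWO_LETTER = ["".join(p) for p in itertools.product(string.ascii_lowercase, repeat=2)]

WORD_POOL = POPULAR_WORDS + TWO_LETTER + RARE_WORDS + MANUAL_WORDS + NUMBERS


def next_query(rng):
    """Pick one query plan: the search term, the sort order, and the page size."""
    if rng.random() < 0.10:
        # Roughly 10% of the time, fall back to the two most generic words.
        term = rng.choice(["a", "the"])
    else:
        term = rng.choice(WORD_POOL)
    # Roughly 70% of queries are sorted by date, the rest by relevance.
    sort_by = "date" if rng.random() < 0.70 else "relevance"
    return {"term": term, "sort_by": sort_by, "results_per_page": 100}


if __name__ == "__main__":
    rng = random.Random(2006)
    for _ in range(5):
        print(next_query(rng))
```

Each returned plan would then be turned into an actual search request and the source names collected from the result list; that part is left out because it depends on how the results page is fetched and parsed.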

Almost immediately, you'll get a couple of thousand sources using this method. But after a while, fewer and fewer new sources will be found on average... this might be because you're nearing something like 80% to 90% completeness. But that's guesswork of course... it might be that there are 20,000 sources and I just didn't discover them all. But if I need to take a guess, I would guess it's more like 10,000 sources than 20,000.
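
The flattening discovery rate can be illustrated with a small simulation. This is a sketch, not a model of the actual crawl: it assumes a hypothetical population of 10,000 sources, each equally likely to turn up in any result slot, which real Google News results would not satisfy (prolific sources appear far more often). The name `simulate_discovery` and all the numbers are made up for this example.

```python
import random


def simulate_discovery(total_sources, batches, batch_size, seed=0):
    """Return the number of newly seen sources in each batch of results."""
    rng = random.Random(seed)
    seen = set()
    new_per_batch = []
    for _ in range(batches):
        before = len(seen)
        # Each result slot names one source, drawn uniformly at random
        # (a simplifying assumption).
        for _ in range(batch_size):
            seen.add(rng.randrange(total_sources))
        new_per_batch.append(len(seen) - before)
    return new_per_batch


if __name__ == "__main__":
    total = 10_000  # hypothetical true number of sources
    new_counts = simulate_discovery(total_sources=total, batches=10, batch_size=3_000)
    found = 0
    for i, new in enumerate(new_counts, start=1):
        found += new
        print(f"batch {i:2d}: {new:5d} new, {found:5d} total ({found / total:.0%})")
```

Under the uniform assumption, the expected number of distinct sources after k result slots is N * (1 - (1 - 1/N)^k), which rises quickly and then levels off. With skewed source frequencies the tail would be longer, which is one reason a flattening curve alone cannot pin down the true total.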

Whether or not Google News are correct when they say "around 4,500" depends on how you count. If you count e.g. all CNN sources as 1, the number might be much smaller than 8,700. But since they keep adding sources and they don't adjust their number, we can assume they're counting as I did, but they're just not telling people the real number anymore. And "above 4,500" of course is always correct. It's like the joke where the doctor tells the patient, "You're gonna die." and the patient says "Really?" and the doctor replies, "Well, not anytime soon, but someday we're all gonna die..."

Philipp Lenssen [PersonRank 10]

13 years ago #

Jorn Barger of RobotWisdom comments:

<<if you specify a source and search for "+the" you can get an rss/atom feed for that source that's often better than the sources' own, if they even offer one (eg the New Yorker doesn't)>>
robotwisdom2.blogspot.com/2005 ...

Very interesting tip!

alek [PersonRank 10]

13 years ago #

I agree the actual number is probably close to 10,000 – your methodology appears to provide reasonable coverage and my guess is the rate of additions is diminishing.

Since Google hand-selects their sources, they know EXACTLY how many sources they are pulling – you should start a poll to see when they change the "4,500" number on their page! ;-)

Philipp Lenssen [PersonRank 10]

13 years ago #

> guess it's more like 10,000 sources than 20,000.

Gotta edit my sentence for clarity... I meant "closer to 10,000 sources than 20,000". E.g. could be 9,000, but probably not 18,000...

alek [PersonRank 10]

13 years ago #

I was agreeing with you Philipp – your coverage should be close to complete – my guess is if Google discloses the number, you are within 10-20% of it.

/pd [PersonRank 10]

13 years ago #

I sent a message to the News product team, asking them to true up the number of sources... let's see what they do.

Sohil [PersonRank 10]

13 years ago #

So are these 8,000+ sources permanent?

Owen [PersonRank 0]

13 years ago #

Just FYI – the sources are NOT permanent. Anyone can request to be added as a source. Google then makes a determination as to whether or not you really ARE a source. But you can also be de-listed if someone complains. I was involved once in getting a plagiarism site delisted (the site was scraping content from several other sources, changing two or three words and then republishing it with a new headline and author – we are talking two or three words out of 3000 here) and Google checked it out and delisted it within a few days.

Philipp Lenssen [PersonRank 10]

13 years ago #

Update: The crawler has been running this week and I've updated the tool. It's now showing 9,336 Google News sources.

alek [PersonRank 10]

13 years ago #

You might consider showing a graph of the total number of sources uncovered as a function of time – that is probably pretty asymptotic. Also, if you keep this running, it could be used to uncover NEW sources and also figure out DROPPED sources (assuming that sources post content with some frequency).
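
A rough sketch of both ideas, the discovery curve and the new/dropped detection, might look like this. The data layout is an assumption: each crawl run is taken to be a label plus the set of source names seen in that run, and the names `discovery_curve` and `diff_runs` are hypothetical. A source missing from one run may simply not have published anything in that window, so "missing" here only means "not seen in the later run".

```python
def discovery_curve(runs):
    """Return (label, cumulative distinct sources) for each crawl run, in order."""
    seen = set()
    curve = []
    for label, sources in runs:
        seen |= sources
        curve.append((label, len(seen)))
    return curve


def diff_runs(earlier, later):
    """Return (new, missing): sources only in the later run, and only in the earlier one."""
    return later - earlier, earlier - later


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    runs = [
        ("week 1", {"CNN", "BBC News", "Outer-Court"}),
        ("week 2", {"CNN", "BBC News", "CoStar Group"}),
        ("week 3", {"CNN", "CoStar Group", "New Yorker"}),
    ]
    print(discovery_curve(runs))
    new, missing = diff_runs(runs[0][1], runs[2][1])
    print("new:", sorted(new))
    print("missing:", sorted(missing))
```

Plotting the second element of each curve entry against time would give the graph suggested above. Comparing a run against several earlier runs rather than just one would reduce false "missing" reports for sources that post infrequently.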

WoW!ter [PersonRank 1]

13 years ago #

Philipp, great piece of work you did. But were you aware of a similar job that has been maintained at privateradio.org for some years already? Have a look at: privateradio.org/blog/i/google ...

Philipp Lenssen [PersonRank 10]

13 years ago #

Yes I saw that site, in fact it kind of prompted me to do this 'cause it lacked so many sources...
