Google News Content Hosting, Duplicate Detection

Paul Fisher	Saturday, September 1, 2007 16 years ago • 5,233 views
Unfortunately, I don't think the URLs are ever stable. After a couple of weeks, all of the AP/Reuters/etc.-type content expires, and it's removed from all the sites of the press agency (most of which are hosted anyway). I doubt Google will be any exception.
Hashim	16 years ago #
Bad move for AP and the like. They just lowered the value of their own product by striking a deal with an aggregator. If Google continues to remove "duplicate" AP stories from the news results, publishers will begin to notice that the AP stories they publish capture significantly less traffic and will not be worth the syndication price.
Roni	16 years ago #
Type site:ap.google.com in the Google News search..
Philipp Lenssen	16 years ago #
site:ap.google.com http://news.google.com/news?hl=en&ned=us&ie=UTF-8&q=site%3Aap.google.com&btnG=Search Right now, ~1,844 results for AP and ~5,539 for AFP. You can check the domains like this: http://ap.google.com/article/x http://afp.google.com/article/x Not sure what the subdomains for the UK Press Association Group and Canadian Press are though.
Ionut Alex. Chitu	16 years ago #
http://ukpress.google.com/article/ALeqM5hDPSFC-LFqeC2OS-tX6ORmplZ2Ww http://canadianpress.google.com/article/ALeqM5jKpfUI5q1ojPdDL2JTUnRWHveLdA
Mathias Schindler	16 years ago #
It will be important to watch whether Google is still going to defend the right to just crawl 3rd party web sites without having to get a licensing deal....
Oren Goldschmidt	16 years ago #
I'm pretty sure they'll move to monetize these feeds very fast. They'll probably find some cute way to display ads within them before the month is over. Long gone are the days when features spent 2 years in the "labs" as either buggy or featureless alphas. Every new feature that's been rolled out since early 2006 has been more or less mature from the start, and the first thing that happens once a Google product functions stably is ad integration.
mrbene	16 years ago #
Read through Google's SEC filings. "The operating margin we realize on revenues generated from ads placed on our Google Network members’ web sites through our AdSense program is significantly lower than the operating margin we realize from revenues generated from ads placed on our web sites because most of the advertiser fees from ads served on Google Network member web sites are shared with our Google Network members." Page 19: http://www.sec.gov/Archives/edgar/data/1288776/000119312507175880/d10q.htm And this is backed up with an ever-increasing share of revenue coming from "Google web sites", and less and less coming from "Google Network member web sites" – 59/41 June 2006, 63/37 March 2007, 65/35 June 2007. More content delivered from Google servers = more opportunity to both learn about the user and deliver ads to the user.
Philipp Lenssen	16 years ago #
By the way, the AdSense placed on other websites outside of Google also help Google learn about the user... every AdSense can both track a user as well as display ads.
mrbene	16 years ago #
I disagree to a certain extent, Philipp. AdSense is delivered from googlesyndication.com, and clicks are generally tracked through googleadservices.com. No content is pulled from the google.com domain – which means that no actions taken through AdSense can be easily attributed back to a Google account. Contrast this with the outbound link tracking from Google News – a logged-in user has their request redirected, via a URL rewrite in the onclick event, through news.google.com/news/url?<dynamiccontent>. The web browser automatically provides all cookies from the google.com domain, making it easy for any actions taken to be attributed to the Google account.
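The cookie-scoping point in this post can be made concrete with a minimal Python sketch. It assumes a simplified version of the domain matching browsers apply to cookies, and the cookie names are made up for illustration; only the domains come from the post.

```python
def domain_matches(request_host: str, cookie_domain: str) -> bool:
    """Simplified cookie domain match: the exact host, or a subdomain of it."""
    host = request_host.lower().rstrip(".")
    domain = cookie_domain.lower().lstrip(".")
    return host == domain or host.endswith("." + domain)


def cookies_sent(request_host: str, cookie_jar: dict) -> list:
    """Names of the cookies a browser would attach to a request for request_host.

    cookie_jar maps a cookie name to the domain the cookie was set for.
    """
    return sorted(name for name, domain in cookie_jar.items()
                  if domain_matches(request_host, domain))


# Hypothetical cookie names; the real ones are not given in the thread.
jar = {
    "account_session": "google.com",
    "ad_id": "googlesyndication.com",
}

print(cookies_sent("googlesyndication.com", jar))  # ['ad_id']
print(cookies_sent("news.google.com", jar))        # ['account_session']
```

Under these assumptions, the ad request carries no google.com cookie, while the news.google.com redirect does.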
Philipp Lenssen	16 years ago #
Yeah, they know even more about you when you are logged in and at a Google property. When I said "user" above I didn't necessarily mean the Google account holder, just the user surfing the web, and that user's surfing patterns. When the user goes from site A to site B, and both are non-google.com domains, and both have AdSense somewhere on the page, then Google (not Google.com, but Google Inc) can track this user. That doesn't mean they have the user's first or last name, just some random identifier. The same goes for all sites holding a Google Analytics tracker script. Though once you end up on Google.com they could connect this data with the AdSense data of course, by letting the one domain communicate with the other through e.g. a parameterized iframe... Oh, and there's DoubleClick, and the Google Toolbar + advanced options with the potential to follow "you" around :)
Trogdor	16 years ago #
Right you are, Philipp. This is the same idea that Brett Tabke put forward in his most recent post in the robots.txt blog (5/29/2007 – Can Google Predict the Future?). Basically, he explained how, thanks to AdSense being on so many sites and DoubleClick being on the others, Google can track visits to something like 90% of the web. And, because it can do that, with some relatively simple AI built in, most surfers can go to, say, 5 websites, and by that time Google can tie them to their previous visits (based on topical habits) with very high certainty. And don't forget the Google Toolbar, which quite often comes pre-loaded in the browser thanks to some deals Google has made. Got a Google Account? Even easier. Fascinating reading. http://www.webmasterworld.com/robots.txt
mrbene	16 years ago #
Ah. Yes, there's a difference in my mind between "user" and "ID" – a "user" being a specific sub-type of ID that can be recreated. The actions taken on sites with advertisements should be associated with at least three IDs, at least implicitly – the ID of the advertiser, the ID of the site displaying the ad, and the display of the ad itself (although the first two IDs could be stored on the back end against the last). The ID of the web browser could be stored as well – just as the ID assigned by Google Analytics could be stored... But show me a situation where these IDs are being sent to a google.com domain and you'll have news. DoubleClick falls into the same bucket as GA and AdSense – while Toolbar is definitely storing data against a specific "you", which is exposed through the web history functionality. I don't know if I'd call this a "supporting" strategy, but the Toolbar helps promote the "always logged in" user, which promotes easy access to all the different Google services, which is back to the original point – more traffic and more opportunity from the Google servers.
mrbene	16 years ago #
Trogdor – great little read. I also like how you used "surfers" and not "users" :) The issue that I see is the ID – DoubleClick uses a cookie in the doubleclick.net domain, AdSense uses googlesyndication.com and googleadservices.com. Cookies don't travel across domains – you've got to pass the ID along on the URL or something. Google accounts won't make associating doubleclick.net cookies with googlesyndication.com cookies any easier – unless you've got Toolbar installed, in which case there is additional information available. The overall point that I'm not following is this: How does the ID used by AdSense get transferred to the Google domain? Especially when the Toolbar is not installed!
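For readers wondering what "pass the ID along on the URL" could look like mechanically, here is a rough Python sketch. Nothing in the thread shows that any ad network actually does this; the endpoint, the parameter name `vid`, and both helper functions are hypothetical.

```python
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical parameter name and endpoint, invented for this illustration.
ID_PARAM = "vid"
BRIDGE_URL = "https://first-domain.example/bridge"


def build_bridge_url(visitor_id: str) -> str:
    """URL a page on the second domain could load (e.g. in an iframe) to hand over its ID."""
    return f"{BRIDGE_URL}?{urlencode({ID_PARAM: visitor_id})}"


def read_visitor_id(url: str) -> Optional[str]:
    """What the receiving domain would extract from the incoming request URL."""
    values = parse_qs(urlsplit(url).query).get(ID_PARAM)
    return values[0] if values else None


url = build_bridge_url("abc 123")
print(url)
print(read_visitor_id(url))
```

The receiving domain sees its own cookies plus the identifier in the query string, which is the only way the two IDs end up in the same request in this sketch.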
Philipp Lenssen	16 years ago #
[whoops, I accidentally deleted my comment]
> Fascinating reading.
Also, let's say you
1. visit Google while logged in = Google has "your" IP now, connected to your name (say, you registered Gmail as "John Doe")
2. visit some webpages with AdSense/DoubleClick = Google has "an" IP now
3. visit Google again shortly thereafter (say, the same day) = the Google.com cookie of course identifies you as John Doe again, and Google checks that your IP is still the same as above.
Adding up #1 + #3, Google can now be 99.99x% sure that all browsing of #2 was John Doe, because what are the chances that a different user gets the same dynamic IP in between? OK, perhaps excluding company proxies and such, but isn't there a way to filter out those by measuring the kind of traffic patterns they trigger? And as Google is working within US laws, the gov't can quietly poll them for any information – and in some contexts Google Inc is not allowed to disclose it when that happens. http://www.aclu.org/safefree/patriot/18490res20040819.html
Good user + Good government = no conflict
Bad user + Good government = conflict
Good user + Bad government = also conflict
Sometimes, people just look at the "I'm not a bad user" part. Which might be true, but a bad government can hurt even good people. And sometimes people just look at the "My government is not bad" part, which may be true again, but sometimes governments change. But I'm not trying to say that we're doomed, I'm just trying to find out about the potential theoretical issues that exist. Only if you know them can you measure the risks, right...
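The three-step scenario in this post amounts to bracketing anonymous hits between two logged-in sightings of the same IP. A minimal Python sketch of that idea follows; the `Hit` record, the IP addresses, and the matching rule are all assumptions made for illustration, not a description of any real system.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Hit:
    minute: int              # minutes since an arbitrary start time
    ip: str
    account: Optional[str]   # None for an ad-network hit with no login


def attribute_by_ip(hits: List[Hit]) -> List[Tuple[Hit, Optional[str]]]:
    """Pair each anonymous hit with an account, or None if the bracket fails.

    An anonymous hit is attributed only when the nearest logged-in hit before
    it and the nearest logged-in hit after it, on the same IP, name the same
    account (steps 1 and 3 in the post).
    """
    ordered = sorted(hits, key=lambda h: h.minute)
    results = []
    for i, hit in enumerate(ordered):
        if hit.account is not None:
            continue
        before = next((h.account for h in reversed(ordered[:i])
                       if h.ip == hit.ip and h.account), None)
        after = next((h.account for h in ordered[i + 1:]
                      if h.ip == hit.ip and h.account), None)
        matched = before if before is not None and before == after else None
        results.append((hit, matched))
    return results


hits = [
    Hit(0, "203.0.113.7", "john.doe"),   # step 1: logged-in visit
    Hit(30, "203.0.113.7", None),        # step 2: page with ads
    Hit(90, "203.0.113.7", "john.doe"),  # step 3: logged-in visit again
    Hit(40, "198.51.100.2", None),       # unrelated IP, never logged in
]
for hit, account in attribute_by_ip(hits):
    print(hit.minute, hit.ip, account)
```

In this sketch a shared IP with two different accounts on either side of an anonymous hit yields no attribution, which is the household and proxy case raised in the replies.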
mrbene	16 years ago #
99.99% certainty if the user is logged in to Google services and also browsing externally? I don't think so. I'd personally estimate that more than 0.01% of internet traffic is coming through AOL's caching proxies. And through corporate proxies. As a known quantity, there are about 1.5 million web users who don't allow the AdSense JavaScript to render the iframe from the pagead2 subdomain (http://easylist.adblockplus.org/) – in itself 0.125% of the roughly 1.2 billion internet users (http://www.internetworldstats.com/stats.htm). Then dealing with the time factor – I'd grant maybe over 80% chance of identifying a Google Account holder by their IP while they are actively logged in to their Google Account. However, actions taken when they are not logged in? Here you start to consider IP churn (my cable modem gets a new IP every 24 hours – this is set by the ISP so that I do not host any servers, and is becoming a more common practice) and multiple users in one household (all my housemates have Google accounts and computers, and are all logged in at the same time, and all go through the same IP – who is it that's visiting an AdSense site 4 hours after the last action?). I do concur that advanced analysis will be able to identify some users while they are not logged in, and associate some actions taken while in the logged-out state with some users. I strongly disagree that this will be anywhere close to 99% of actions for 99% of users. I also think that there are two different standards – I have looked at this from a revenue-generating model, where processing masses of data quickly is more important than examining the data in serious depth. You are looking at this data in terms of individual privacy, where examining data in depth is more important than handling it quickly. 
I'll admit that with time and processing, more detail could be found – but I do not think that this would reach beyond the 80% certainty range, and I do not believe that the automated systems can reach anywhere close to that yet.
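The ad-blocker figure in this post is easy to check. A short Python sketch, using only the two numbers quoted in the post (both are the poster's 2007 estimates):

```python
adblock_users = 1_500_000        # figure quoted in the post
internet_users = 1_200_000_000   # figure quoted in the post

share = adblock_users / internet_users
print(f"{share:.3%}")            # 0.125%

# A 99.99% certainty claim leaves room for only 0.01% of exceptions.
margin = 1 - 0.9999
print(share > margin)            # True: this group alone exceeds that margin
```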
Philipp Lenssen	16 years ago #
As I said... "excluding company proxies and such". So my question was more related to whether or not those can be effectively filtered out in any way in the first place. We already discussed above how browsing patterns can be related back to an individual – in the same vein, it may be possible to identify "group browsing patterns" (traffic coming from e.g. a company proxy, a household, a wifi in an internet cafe etc.). Those with ad-blockers can already be filtered out too; you just need to have a small test case on Google.com and associate that result back with the user name.
mrbene	16 years ago #
Ok, I'll definitely grant that: The actions associated with a specific IP address on AdSense sites between authenticated requests back to the Google servers, where the IP is not being simultaneously used by more than one Google account, can, with fairly strong certainty, be associated back to that Google account. You technically wouldn't have to filter out those with ad-blockers, since you wouldn't be getting any data from them :P As for browsing patterns – I've worked with site pathing at the database level, and the amount of recurrence is somewhere around 1% – even on highly sticky sites. To me, this means that even the same user will take a different path through a site each time they revisit. I'd expect the same to hold true when dealing with multi-site paths – if you are looking specifically enough to identify me, you'll only see me once. If you make the analysis generic enough to catch most of my variations, other people will match the behavior.
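The trade-off described at the end of this post (exact paths rarely recur, generic paths match everyone) can be illustrated with a small Python sketch. The definition of "recurrence" used here, the fraction of sessions whose page sequence repeats an earlier session, is an assumption; the post does not say how the 1% figure was measured, and the sample sessions are invented.

```python
from collections import Counter
from typing import List, Optional


def path_recurrence(sessions: List[List[str]], depth: Optional[int] = None) -> float:
    """Fraction of sessions whose path repeats an earlier session's path.

    depth limits the comparison to the first `depth` pages; None compares
    whole paths. A smaller depth makes the match more generic.
    """
    if not sessions:
        return 0.0
    counts = Counter(tuple(s[:depth]) for s in sessions)
    repeats = sum(n - 1 for n in counts.values())
    return repeats / len(sessions)


# Invented sample data for illustration only.
sessions = [
    ["/", "/news", "/a"],
    ["/", "/b"],
    ["/", "/news", "/a"],
    ["/c"],
]
print(path_recurrence(sessions))           # 0.25: one exact repeat in four sessions
print(path_recurrence(sessions, depth=1))  # 0.5: more matches once paths are truncated
```

Truncating the path raises the recurrence rate, but the sessions that now match may well belong to different people.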
Kevin	16 years ago #
[moved] As expected: http://ap.google.com/article/ALeqM5i21Jolkmw2yGRSaqt4pB4PaYHs1Q No ads as of right now.
