Google Blogoscoped

Monday, August 7, 2006

AOL Shared Private Search Queries

AOL released its users’ search queries – around 36 million queries collected from half a million users over a period of three months*. AOL claimed the log might be useful for “personalization, query reformulation or other type of search research.”

What’s really interesting is that queries were connected to a user ID... and there goes your privacy. Based on a sequence of searches it is often trivial to connect a person to a user ID. For example, user 500 may search for “link:mysite.com”, and then user 500 may search for the name “John Doe.” Now you can verify that mysite.com’s webmaster is John Doe from San Francisco, and you have a good indicator that user 500 is indeed John Doe. Finally, you look at other queries from this user – like, “jobs San Francisco” – and you have strong indicators that John Doe is looking for a job behind his current boss’s back.
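To make the linking step concrete, here is a minimal sketch in Python. The rows are hypothetical and simply mirror the made-up “user 500” example above; they are not taken from the AOL data, and the function name is my own. It only shows that once queries share an ID, collecting one person’s full history is a simple group-by.

```python
from collections import defaultdict

# Hypothetical (user_id, query) rows modeled on the example in the text.
# These are invented for illustration, not taken from any real log.
rows = [
    (500, "link:mysite.com"),
    (731, "weather boston"),
    (500, "john doe"),
    (500, "jobs san francisco"),
]


def group_by_user(rows):
    """Collect each user's queries in the order they appear in the log."""
    history = defaultdict(list)
    for user_id, query in rows:
        history[user_id].append(query)
    return dict(history)


histories = group_by_user(rows)
print(histories[500])
```

A single identifying query in such a history is enough to attach every other query in it to the same person.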

Michael Arrington wraps it up by saying, “The utter stupidity of this is staggering. AOL has released very private data about its users without their permission. While the AOL username has been changed to a random ID number, the ability to analyze all searches by a single user will often lead people to easily determine who the user is, and what they are up to.” Sure, the data is of great value to researchers, but as Asdf comments in the forum, “Poor AOL users.”

Now, AOL (whose search results are Google-powered) was smart enough to take down the data quickly after they launched the site. Think of the PR problems that erupted when word got out search engines were subpoena’d to release search queries a while ago. But of course, once you release something on the ’net, you can’t remove it anymore. Not only is the AOL page in question now available through the Google Cache, you can also download the full 439 MB dataset through mirror sites.

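Taking a look at the data, you can see it includes an anonymous user ID, the query, the time of the query and, if the user clicked on a result, the rank and domain of the clicked item.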

[Thanks Asdf.]

*The AOL Readme file explains just what data is provided: “AnonID - an anonymous user ID number. Query - the query issued by the user, case shifted with most punctuation removed. QueryTime - the time at which the query was submitted for search. ItemRank - if the user clicked on a search result, the rank of the item on which they clicked is listed. ClickURL - if the user clicked on a search result, the domain portion of the URL in the clicked result is listed.” AOL adds: “Please be aware that these queries are not filtered to remove any content.”
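As a rough illustration of the five fields the readme lists, the sketch below parses one log line into a record. The tab-separated layout, the field order, the sample line and all names (`LogRecord`, `parse_line`) are assumptions made for this example; the readme excerpt above does not specify the file format, so the timestamp is kept as plain text.

```python
from typing import NamedTuple, Optional


class LogRecord(NamedTuple):
    anon_id: int
    query: str
    query_time: str            # kept as text; the excerpt gives no timestamp format
    item_rank: Optional[int]   # None when the user did not click a result
    click_url: Optional[str]   # None when the user did not click a result


def parse_line(line: str) -> LogRecord:
    """Parse one line, assuming tab-separated fields in the readme's order."""
    fields = line.rstrip("\n").split("\t")
    fields += [""] * (5 - len(fields))  # pad rows that have no click fields
    anon_id, query, query_time, item_rank, click_url = fields[:5]
    return LogRecord(
        anon_id=int(anon_id),
        query=query,
        query_time=query_time,
        item_rank=int(item_rank) if item_rank else None,
        click_url=click_url or None,
    )


# Hypothetical sample line, invented for illustration.
sample = "500\tjobs san francisco\t2006-03-01 12:00:00\t1\thttp://www.example.com"
record = parse_line(sample)
print(record)
```

Rows without a click simply come back with `item_rank` and `click_url` set to `None`.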

Update: Oh, the irony – the top search on AOL is “google”.

Here are some more top lists (unlike what AOL released, my data is aggregated and non-private):

Top 10 containing “Cancel AOL”

  1. cancel aol (237)
  2. cancel aol service (89)
  3. cancel aol account (87)
  4. how to cancel aol (55)
  5. cancel aol membership (22)
  6. how do i cancel aol (18)
  7. cancel aol services (10)
  8. show me how to cancel aol.com (9)
  9. cancel aol.com (9)
  10. cancel aol subscription (8)

Top 20 containing “Sex”, “Porn”, “F***”, “Nude” or “Naked”

  1. porn (12,189)
  2. sex (11,426)
  3. free porn (7,118)
  4. porno (2,972)
  5. nude girls (2,601)
  6. nudes (2,444)
  7. nude (2,321)
  8. sex positions (2,107)
  9. naked girls (2,050)
  10. sex toys (1,740)
  11. free sex (1,698)
  12. sex stories (1,696)
  13. sex.com (1,652)
  14. free sex stories (1,594)
  15. porn.com (1,481)
  16. freeporn (1,457)
  17. animal sex (1,327)
  18. sexy girls (1,227)
  19. www.sex.com (1,221)
  20. sexual positions (1,166)
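Top lists like the two above can be produced without touching user IDs at all. The following is a minimal sketch, not necessarily how the lists here were built: it assumes simple case-insensitive substring matching, and the sample queries and the name `top_queries` are invented for the example.

```python
from collections import Counter


def top_queries(queries, terms, n=10):
    """Count queries containing any of the terms and return the n most common.

    Matching is a case-insensitive substring test (an assumption), so the
    term "sex" would also match a query like "sexy girls".
    """
    lowered_terms = [t.lower() for t in terms]
    counts = Counter(
        q.lower()
        for q in queries
        if any(t in q.lower() for t in lowered_terms)
    )
    return counts.most_common(n)


# Hypothetical queries, invented for illustration.
sample_queries = (
    ["cancel aol"] * 3
    + ["how to cancel aol"] * 2
    + ["google"] * 4
    + ["cancel aol account"]
)

for rank, (query, count) in enumerate(top_queries(sample_queries, ["cancel aol"]), 1):
    print(f"{rank}. {query} ({count})")
```

Because only query strings and their counts survive the aggregation, the output says nothing about which user typed what.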

Update 2: John Battelle and others got a statement from AOL saying:

This was a screw up, and we’re angry and upset about it. It was an innocent enough attempt to reach out to the academic community with new research tools, but it was obviously not appropriately vetted, and if it had been, it would have been stopped in an instant.

Although there was no personally-identifiable data linked to these accounts, we’re absolutely not defending this. It was a mistake, and we apologize. We’ve launched an internal investigation into what happened, and we are taking steps to ensure that this type of thing never happens again.

You have to wait for paragraph five to hear something closer to the truth (my emphasis):

There was no personally identifiable data provided by AOL with those records, but search queries themselves can sometimes include such information.

And that’s the whole point – there’s lots of personally identifiable data contained in AOL’s release, and this data can be linked back to even those searches not containing that data... because AOL grouped queries of individual users. Milly comments, “As usual with corporate (and many other ;) apologies, they couldn’t bring themselves to be truly honest.”

[Thanks Milly.]

Update 3: I’m following up with some user profile revelations.

 
