Determining a Comic Book's Popularity With Google

Wednesday, August 22, 2007

At Cover Browser I'm making use of the Google page count to determine the highlights of a particular comic series. For instance, this approach returns the following comic as the most popular single issue of The Amazing Spider-Man, which started in the 1960s and has over 500 issues out by now:

Comic buffs will know what's so special about Amazing Spider-Man #129: it features the first appearance of the Punisher, a character who went on to become very popular. But how can this be determined automatically, and not just for Spider-Man but for other comic book series as well (like X-Men, Batman, or Hulk)?

The algorithm behind this is quite straightforward. First, you can use the Google SOAP API to return the page count for the following queries (the SOAP API is discontinued by now, but if you don't have an existing API key you can also use screen-scraping to get the page count):

"amazing spider-man 1"
"amazing spider-man 2"
"amazing spider-man 3"
"amazing spider-man 4"
...
"amazing spider-man 500"

As you can see, I'm assembling the Google page count for each issue on its own. For instance, if you search Google for "amazing spider-man 1" (including the quotes) you'll get about 27,400 results, meaning this issue is mentioned online that often – an indicator of its overall popularity. Search for "amazing spider-man 2", and the page count will be 10,400.
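As a rough sketch of how that list of queries could be assembled (in Python rather than the PHP used at Cover Browser; the function name `build_issue_queries` is made up for illustration, and fetching the actual page counts is left out because it depends on whichever API or scraper you have access to):

```python
def build_issue_queries(series_title, first_issue, last_issue):
    """Return one exact-phrase query per issue number.

    The surrounding double quotes are part of the query string, so the
    search engine treats "title number" as a single phrase.
    """
    title = series_title.lower()
    return [f'"{title} {n}"' for n in range(first_issue, last_issue + 1)]


# Hypothetical usage: one query for each of issues 1 through 500.
queries = build_issue_queries("Amazing Spider-Man", 1, 500)
print(queries[0])
print(queries[-1])
```

Each of these strings would then be sent to the search engine, and the reported page count stored next to the issue number.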

Because we work with large numbers, we don't need to care much whether every web page found is really about the comic issue at hand (e.g. someone on a message board may write, "The X-Men 2 movie was amazing. Spider-Man 3 fades in comparison," which would return a hit). We just need to make sure the query is sufficiently distinct to not return too many false positives. There are other sources of error, but by and large, the page count we get is a good approximation of the comic issue's popularity among the web community.

I used PHP to write the individual numbers into a MySQL database, but to illustrate, here are these numbers plotted onto a graph using Google Spreadsheets (note I used previously polled, slightly older numbers for this graph; at Cover Browser, the most recent numbers are used):

Again you'll notice the peak on issue #129. First appearances of characters are often high-priced, and much talked about. Furthermore, you could also add seed words to each query – like "first appearance", or "John Romita" (the name of a comic book artist) – to get different graphs & peaks.

What is left to do is to automatically locate those peaks and rank them for our result page at Cover Browser. We can't just rank by an issue's raw page count, because issues from a certain time period may receive more mentions overall, regardless of how much any single issue stands out.

Now, these algorithms are homegrown, so there may be better approaches, but what you can do is simply look at every issue's 10 left and 10 right neighbors, and divide the individual issue's page count by the page count of each neighbor. For instance, if the previous issue has a page count of 5,000 and our current issue has a page count of 10,000, then the division results in a popularity value of 2; if the previous issue instead has a page count of 8,000, our current issue would only get a value of 1.25. Adding up all those divisions and taking their average results in a popularity value for each issue, which can then be used to rank the issues and display the highest first.
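Here is a minimal Python sketch of that neighbor comparison. It is not the code running at Cover Browser; the function names are made up, and two details are my own assumptions because the description above doesn't cover them: issues near the start or end of a series simply use whatever neighbors exist, and neighbors with a page count of 0 are skipped to avoid dividing by zero.

```python
def popularity_scores(page_counts, window=10):
    """Average ratio of each issue's page count to its neighbors' counts.

    page_counts[i] holds the page count for issue i + 1.
    """
    scores = []
    for i, count in enumerate(page_counts):
        lo = max(0, i - window)                     # assumption: clip at series start
        hi = min(len(page_counts), i + window + 1)  # assumption: clip at series end
        ratios = [
            count / page_counts[j]
            for j in range(lo, hi)
            if j != i and page_counts[j] > 0        # assumption: skip zero counts
        ]
        scores.append(sum(ratios) / len(ratios) if ratios else 0.0)
    return scores


def rank_issues(page_counts, window=10):
    """Return (issue_number, score) pairs, highest score first."""
    scores = popularity_scores(page_counts, window)
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i + 1, scores[i]) for i in order]


# The two-issue example from the text: 5,000 followed by 10,000.
print(popularity_scores([5000, 10000], window=1))  # [0.5, 2.0]
```

Because every issue is only measured against its own neighborhood, an era in which all issues are mentioned more often doesn't by itself push those issues to the top.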

Applying this algorithm to the Fantastic Four comic book, we get the following all-time highlight of the series:

[Fanta

~~This big-headed dude~~ Galactus can eat planets or destroy a whole galactic system if you catch him in the wrong mood. No wonder he's so popular.

Determining a Comic Book’ ... by Philipp Lenssen | Comments (5)
