Google Blogoscoped

Forum

Cloning the Google API

Mark Draughn [PersonRank 5]

Friday, November 4, 2005

I don't know about cloning the API, but you'd think Google would at least sell the service.

Let's see, at the peak marginal rate, the Smugmug photo hosting service sells bandwidth at US$2.50 per gigabyte. Google does more work per hit than Smugmug, but they don't need any extra storage, so let's assume the price would be similar. If a Google API response is about 10 KB, then at the same rate Google could sell 40,000 hits for a dollar. Or 4,000 hits per dollar if they're more like 100 KB each.
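To make that arithmetic explicit, here's a minimal Python sketch of the same back-of-envelope numbers, assuming Smugmug's US$2.50/GB rate and the 10 KB and 100 KB response sizes above:

```python
# Back-of-envelope: API hits per dollar if Google priced responses purely
# on bandwidth at Smugmug's rate (assumptions from the post above).
PRICE_PER_GB = 2.50          # US$ per gigabyte of bandwidth
GB = 1_000_000_000           # bytes per gigabyte (decimal, as bandwidth is billed)

for response_kb in (10, 100):
    response_bytes = response_kb * 1_000
    cost_per_hit = PRICE_PER_GB * response_bytes / GB
    hits_per_dollar = 1 / cost_per_hit
    print(f"{response_kb:>3} KB responses: "
          f"${cost_per_hit:.6f} per hit, {hits_per_dollar:,.0f} hits per dollar")

# prints roughly:
#  10 KB responses: $0.000025 per hit, 40,000 hits per dollar
# 100 KB responses: $0.000250 per hit, 4,000 hits per dollar
```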

Heck, they could offer to deduct it directly from your AdSense account.

Torsten [PersonRank 1]

18 years ago #

I think Mark is significantly underestimating the cost of a query in a large search engine. The cost is not mainly bandwidth; it's the actual query execution inside the engine, which is a lot more expensive than 10 or 100 KB of bandwidth. And yes, more machines and more storage are needed for extra queries. Just a guess: a Google query might cost more than 1 cent to produce, in terms of machine and energy costs, or at least a decent fraction of a cent. So the cost of a million queries would be several thousand dollars.

I don't know Google's internal numbers, of course. But a million queries per day for everybody, or even for a few tens of thousands of people, does not seem realistic any time soon. Would be nice, though.

Mark Draughn [PersonRank 5]

18 years ago #

Yeah, I estimated based on bandwidth because I have figures for that. Still, I think it's got to be a lot less than 1 cent per query; otherwise they wouldn't give away 1,000 a day to anyone who asks. On the other hand, most people don't use their full thousand on most days.

Hmm. It looks like a query usually finishes in under 200 ms, if the response page is to be believed. Glancing at IBM rack-mount prices, let's say Google pays $6,000 for a server. If they keep it 3 years, that's $2,000/year. Double it to $4,000 to cover operation, overhead, etc. Dividing by 31.5 million seconds per year gives $0.00012684 per second, or $0.00002537 per 200 ms query. At this rate, 40,000 queries will cost about a buck. Add another buck for the bandwidth, and it works out to 20,000 queries/dollar.
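A small sketch of the same arithmetic, using only the post's own assumptions (a $6,000 server, a 3-year life, doubled for operations, and 200 ms of server time per query):

```python
# Rough per-query CPU cost under the assumptions above: a $6,000 server
# amortized over 3 years, doubled to cover operations, fully busy, with
# each query taking 200 ms of server time.
SERVER_PRICE = 6_000                         # US$, assumed purchase price
YEARS = 3
ANNUAL_COST = 2 * SERVER_PRICE / YEARS       # $4,000/yr including overhead
SECONDS_PER_YEAR = 365 * 24 * 3600           # ~31.5 million

cost_per_second = ANNUAL_COST / SECONDS_PER_YEAR
cost_per_query = cost_per_second * 0.200     # 200 ms per query

print(f"cost per second: ${cost_per_second:.8f}")                 # ~$0.00012684
print(f"cost per query:  ${cost_per_query:.8f}")                  # ~$0.00002537
print(f"queries per dollar (CPU only): {1 / cost_per_query:,.0f}")        # ~39,420
print(f"queries per dollar (CPU + equal bandwidth): "
      f"{1 / (2 * cost_per_query):,.0f}")                         # ~19,710, i.e. ~20,000
```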

There are plenty of caveats here. For one thing, more than one processor may participate in a query (which would raise the cost), although there are limits to how well a single query can be parallelized. Also, the query rate must vary over the day, so they won't get 100% utilization. On the other hand, the most common queries can be cached. A single gigabyte of RAM in a front-end server could cache at least the results for the top 10,000 most common queries, which could knock the average query time down a lot.
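As a rough illustration of that caching point, here's a hedged sketch: if some fraction of queries are repeats served from a RAM cache of the most common results, the average backend cost per query drops proportionally. The hit rates and the cache wrapper below are invented for illustration, not anything Google is known to use:

```python
# Minimal sketch of a front-end results cache for common queries. The cost
# figure reuses the 200 ms estimate above; the hit rates are invented.
from functools import lru_cache

COST_PER_BACKEND_QUERY = 0.00002537      # ~$ of cluster time per query (from above)

def backend_search(query):
    return f"results for {query!r}"       # stand-in for the real cluster query

@lru_cache(maxsize=10_000)                # "top 10,000 most common queries"
def cached_search(query):
    return backend_search(query)          # backend cost is paid only on a miss

cached_search("google api")
cached_search("google api")               # second call is a free cache hit

for hit_rate in (0.0, 0.3, 0.5):          # assumed fraction of repeat queries
    avg_cost = (1 - hit_rate) * COST_PER_BACKEND_QUERY
    print(f"cache hit rate {hit_rate:.0%}: average ${avg_cost:.8f} per query")
```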

That was a higher CPU cost than I would have guessed. Still, I'll guess the marginal cost per query is on the order of 1/100th of a cent.

Philipp Lenssen [PersonRank 10]

18 years ago #

There's another "hidden" cost Google may be thinking of: give people all the API requests they need, and they can create their own Google. If you limit them to 1,000, 10,000, 100,000, even 1,000,000... their growth is clearly limited.

Torsten [PersonRank 1]

18 years ago #

Mark – my estimate of 1 cent per query may have been high; maybe it is 0.1 or 0.2 cents. However, your calculations are off. There are MANY servers involved in a single query, and the response time does not tell you anything about throughput, since there are many queries active at the same time on each machine. I'll give you my best guess: assume 200 machines involved in a query, each indexing 60 million pages (for a total of 12 billion) and handling 50 queries per second. This already includes caching of frequent results, and might be off by a factor of 2 or more in either direction. Machines will be less than $6K apiece, but there are significant operational costs.
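Just to see what those guesses imply, here's a quick sketch. It combines Torsten's cluster shape (every query touching all 200 machines, 50 queries per second through the cluster) with Mark's earlier $4,000/year per-machine figure; all of these are the thread's assumptions, not real Google numbers:

```python
# Implied cost per query under the thread's guesses: every query touches all
# 200 machines, and the cluster sustains 50 queries per second overall.
MACHINES = 200
CLUSTER_QPS = 50                          # queries/second for the whole cluster
ANNUAL_COST_PER_MACHINE = 4_000           # $/yr, borrowing Mark's doubled figure
SECONDS_PER_YEAR = 365 * 24 * 3600        # ~31.5 million

cluster_cost_per_second = MACHINES * ANNUAL_COST_PER_MACHINE / SECONDS_PER_YEAR
cost_per_query = cluster_cost_per_second / CLUSTER_QPS

print(f"cluster cost: ${cluster_cost_per_second:.4f}/second")     # ~$0.0254
print(f"cost per query: ${cost_per_query:.5f} (~{cost_per_query * 100:.2f} cents)")
# ~$0.00051 per query, i.e. roughly 0.05 cents -- within a factor of a few of
# the 0.1-0.2 cent guess, given how rough every input is.
```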

Another important consideration: unlimited queries allow better reverse engineering of ranking methods, leading also to better methods for manipulation.

Mark Draughn [PersonRank 5]

18 years ago #

The more machines involved, the higher the cost per query. But if each machine is handling several queries at a time, that breaks the other way since it divides the cost across all queries in progress. Storing the documents on 200 machines sounds reasonable, but they wouldn't all have to participate in a query.

I'm only guessing, but I think a search works like this: the search terms are canonicalized first, then a hash is calculated on each term to figure out which server holds its inverted index. Those servers are queried and send back an ordered list of documents, including a document ID, the position within the document, and a score based on word rarity, PageRank, etc.

Probably each word's document list fits on a single server, so the degree of parallel processing is less than or equal to the number of terms. This makes Google's 32-terms-per-query limit a bit suggestive, but who really knows?

As they come into the main query engine, these lists are merged to generate a relevance estimate for each document and then sorted by relevance. The process probably cuts off as the likelihood of finding high-relevance documents diminishes. Then each document ID in the top group is hashed to figure out which document cache server holds a copy. That server is then queried, perhaps for a more accurate estimate of its relevance. It also returns document information, including the URL and an excerpt. As with the terms, only servers containing the documents need to participate. The results are sorted again if the relevance changed and then sent out as a response.

At least that's my theory. Throw in parallel processors, caching algorithms, multi-channel disk controllers, various network topologies... and it all gets pretty fuzzy.
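For what it's worth, here's a toy Python sketch of that "hash each term to an index server, merge the posting lists, then fetch documents by hashed ID" scheme. Every name and number in it is made up purely to illustrate the shape of the idea, not how Google actually works:

```python
# Toy sketch of the "global" (term-partitioned) scheme described above:
# each term's posting list lives on the server chosen by hashing the term;
# the front end fetches the lists, merges them into per-document scores,
# then asks the document servers (chosen by hashing the doc ID) for the
# URL and an excerpt.
from collections import defaultdict

NUM_INDEX_SERVERS = 4
NUM_DOC_SERVERS = 4

# index_servers[i] maps term -> list of (doc_id, term_score) postings
index_servers = [defaultdict(list) for _ in range(NUM_INDEX_SERVERS)]
# doc_servers[i] maps doc_id -> (url, excerpt)
doc_servers = [dict() for _ in range(NUM_DOC_SERVERS)]

def index_server_for(term):
    return hash(term) % NUM_INDEX_SERVERS

def doc_server_for(doc_id):
    return hash(doc_id) % NUM_DOC_SERVERS

def add_document(doc_id, url, text):
    doc_servers[doc_server_for(doc_id)][doc_id] = (url, text[:60])
    for term in set(text.lower().split()):
        index_servers[index_server_for(term)][term].append((doc_id, 1.0))

def search(query, top_k=3):
    terms = query.lower().split()                   # "canonicalize"
    scores = defaultdict(float)
    for term in terms:                              # one index lookup per term
        for doc_id, score in index_servers[index_server_for(term)][term]:
            scores[doc_id] += score                 # merge posting lists
    top = sorted(scores, key=scores.get, reverse=True)[:top_k]
    return [(doc_id, scores[doc_id],
             doc_servers[doc_server_for(doc_id)][doc_id])
            for doc_id in top]                      # fetch URL + excerpt

add_document("d1", "http://example.com/a", "cloning the google api")
add_document("d2", "http://example.com/b", "google search engine costs")
print(search("google api"))
```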

By the way, this is certainly a lot more fascinating than any other blog discussion I've had in the last month. Thanks Philipp for providing the forum.

Torsten [PersonRank 1]

18 years ago #

What you describe is called a global index organization. The problem with it is that you need to send very large posting lists across the local network in order to combine (intersect) the postings for different terms. It is very unlikely that any major search engine uses such a scheme; I would expect that they basically use some version of the local index organization that I mentioned, with additional replication of course. Maybe there is a smart hybrid that we don't know about, but that would be a surprise.

Also, query processing can be done very nicely in parallel and the resulting overhead is relatively small, at least compared to the amount of work per query. Using 100 machines should not be a problem at all.

Mark Draughn [PersonRank 5]

18 years ago #

Oh, I think I see what you mean. They partition the web across, say, 100 machines, and all the machines search their piece of the web at the same time and return their top N documents to a central machine (or cascade of machines) for merging and sorting. Is that what you mean?

Yeah, that makes a lot more sense for a loosely-coupled system than what I was thinking. Scales better too. And you can add new nodes without re-hashing the index, so it's more flexible. And different nodes can re-index different parts of the web at different intervals. It explains a lot. Pushes up my cost-per-query guesstimate too.
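Here's the same kind of toy sketch for that document-partitioned ("local index") scheme: each node indexes its own slice of the web, answers every query over just that slice, and a front end merges the per-node top-N lists. Again, everything here is invented for illustration:

```python
# Toy sketch of the document-partitioned ("local index") scheme: the web is
# split across nodes, every node searches its own shard and returns its
# top-N hits, and a front end merges them into a global top-N.
import heapq

NUM_NODES = 4

# Each node holds its own mini index: term -> {doc_id: score}
nodes = [dict() for _ in range(NUM_NODES)]

def add_document(doc_id, text):
    shard = nodes[hash(doc_id) % NUM_NODES]    # any partitioning works;
    for term in set(text.lower().split()):     # no global term hashing needed
        shard.setdefault(term, {})[doc_id] = 1.0

def search_shard(shard, terms, n):
    scores = {}
    for term in terms:
        for doc_id, s in shard.get(term, {}).items():
            scores[doc_id] = scores.get(doc_id, 0.0) + s
    return heapq.nlargest(n, scores.items(), key=lambda kv: kv[1])

def search(query, top_n=3):
    terms = query.lower().split()
    # "Scatter": every node answers over its own shard (in parallel in reality).
    partials = [search_shard(shard, terms, top_n) for shard in nodes]
    # "Gather": merge the per-node top-N lists into a global top-N.
    merged = [hit for partial in partials for hit in partial]
    return heapq.nlargest(top_n, merged, key=lambda kv: kv[1])

add_document("d1", "cloning the google api")
add_document("d2", "google query cost estimates")
add_document("d3", "smugmug bandwidth pricing")
print(search("google api"))
```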

Like I said, it's been fascinating.

Michael "Flarn" Norton [PersonRank 0]

17 years ago #

A better idea would be a third-party Google API that searches the same way humans normally do. It wouldn't have the limits.
