Google Blogoscoped

Wednesday, August 30, 2006

Matt Cutts on Google Updates

This is a transcript of Google’s Matt Cutts explaining a bit more how the Google back-end works (edited for clarity):

Hey everybody, good to see you again! I thought I’d talk about datacenter updates, what to expect for the next few weeks in Google, and stuff like that this time. (...)

There is always an update going on, practically daily, if not daily. A pretty large fraction of our index is updated every day as we crawl the web.

We also have algorithm and data pushes that go out on a less frequent basis. For example, there was a data push on June 27th, July 27th, and then August 17th. And again, that's not recent; that's been going on for 1.5 years. If you seem to be caught in that, chances are you've been reading SEO boards. You might want to think about ways you could back your site off a bit... think less about what the SEOs on the boards are saying, and don't optimize your site quite as much. That's about as much advice as I can give, I'm afraid.

Bigdaddy was a software infrastructure upgrade, and that upgrade finished around February. It was pretty much a refresh of how we crawl the web, and partly of how we index the web. That's been done for several months, and things have been working quite smoothly.

There was also a complete refresh or update of our supplemental results index infrastructure. That happened a couple of months after Bigdaddy, so it's been done for a month or two. It was a complete rewrite, so that indexing infrastructure is different from our main indexing infrastructure.

You'd expect to see a few more issues whenever you roll out something like that. We saw more small, off-the-beaten-path stuff: minus or exclusion terms, the noindex meta tag, things like that. And in the way the supplemental results worked with the main web index, you'd often see "site:" results that were missing, and "site:" result estimates that were too high. There was at least one incident where there was a spammer that some people thought had 5 billion pages... and when I looked into it, the total number of pages on their biggest domain was under 50,000. So they'd been adding up all these "site:" estimates and ended up with a really big number that was just way, way off.

One nice thing is that we have another software infrastructure update coming whose main aspect is improved quality, but which also improves our "site:" estimates as a side benefit. It's out on all datacenters in the sense that it can run in some experimental modes, but it's not fully on in every datacenter. We were shooting for the end of the summer to have it live everywhere, but again, that's a hope, not a promise. If things need more testing, they'll work longer to make sure everything goes smoothly; and if everything goes great, they might roll it out faster.

[Matt goes on to explain that there is no big need for Google to make its "site:" page count estimates more precise, and that a webmaster's time is better spent improving the site (for example, by looking at server logs and making the site more relevant to specific niches). Matt adds that since many people are requesting a more precise "site:" count, they might think about it, though.]

The whole notion of watching datacenters is going to get harder and harder for individuals going forward. Because, number one, we have so much stuff launching in various ways. I've seen weekly launch meetings with a double-digit number of things, and these are things that are under the hood: they're strictly about quality, they're not changing the UI or anything like that. If you're not doing a specific search in Russian or Chinese, you might not notice the difference. But it goes to show that we're always rolling out different things, and different datacenters might have slightly different data.

The other reason it's not as worthwhile to watch datacenters is that there's an entire set of IP addresses behind them. If you're a super-duper gung-ho SEO, you'll know "72.2.14.whatever". That IP address will typically go to one datacenter, but that's not a guarantee. If that one datacenter comes out of rotation (say we're going to do something else to it, like actually change the hardware infrastructure, whereas everything I've been talking about so far is software infrastructure), then that IP address can point to a completely different datacenter.

So the currency of it, the ability to really compare changes and ask a fellow datacenter watcher, "What do you see at 72.2.14.whatever?", is really pretty limited. I would definitely encourage you to spend more time worrying about the searches you rank for, increasing the quality of your content, and looking for high-quality people who you think should be linking to you but aren't (and may not even know about you), stuff like that. (...)

The fact of the matter is, we're always going to be working on improving our infrastructure, so you can never guarantee a ranking, or a number 1 spot, for any given term. Because if we find out that we can improve quality by changing our algorithms, or data, or infrastructure, or anything else, we're going to make that change. The best SEOs, in my experience, are the ones who can adapt and say, "OK, this is the way the algorithms look to me right now; if I want to make a good site that will do well in search engines, this is the direction I want to head in next." And if you work on those sorts of skills, then you don't have to worry as much about being up at 3am on a forum asking, "What does this datacenter look like to you, did it change a whole lot?" and stuff like that.

Find more transcripts of Matt’s talks.

[Matt cartoon by Matt.]



