Google Blogoscoped

Wednesday, August 2, 2006

Google’s Matt Cutts on Duplicate Content and More

This is a partial transcript of Google’s Matt Cutts’ Q&A videos. The text is edited for clarity.

Q: When does Google detect duplicate content, and within which range will duplicate be duplicate?

A: Good question. That’s not a simple answer... the short answer is, we do a lot of duplicate content detection. It’s not like there’s one stage where we say, OK, right here is where we detect the duplicates. Rather, it’s all the way from the crawl, through the indexing, through the scoring, until finally just milliseconds before you answer things.

And there are different types of duplicate content. There’s certainly exact duplicate detection. So if one page looks exactly the same as another page, that can be quite helpful, but at the same time it’s not the case that pages are always exactly the same. And so we also detect near duplicates, and we use a lot of sophisticated logic to do that.
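As a rough illustration of the difference between exact and near-duplicate detection, here is a minimal sketch. It is not Google's method, which the answer does not describe. It assumes a hash of normalized text for exact matches and word shingles compared by Jaccard similarity for near matches. The function names, the shingle size of 3, and the sample pages are all hypothetical.

```python
import hashlib
import re


def fingerprint(text: str) -> str:
    """Hash of lowercased, whitespace-normalized text (hypothetical helper).

    Two pages with the same fingerprint are treated as exact duplicates.
    """
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def shingles(text: str, k: int = 3) -> set:
    """Set of overlapping k-word sequences found in the text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: str, b: str, k: int = 3) -> float:
    """Shared shingles divided by all shingles; 1.0 means identical sets."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


# Made-up sample pages: page_b differs from page_a only in whitespace,
# page_c differs in one word.
page_a = "Cheap blue widgets for sale. Order today and save."
page_b = "Cheap  blue widgets for sale.   Order today and save."
page_c = "Cheap blue widgets for sale. Order now and save."

print(fingerprint(page_a) == fingerprint(page_b))  # exact duplicate check
print(fingerprint(page_a) == fingerprint(page_c))  # not an exact duplicate
print(jaccard(page_a, page_c))                     # partial overlap score
```

A single changed word defeats the exact hash but still leaves a measurable overlap in the shingle sets, which is why the two kinds of check are usually treated separately.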

In general, if you think you might be having problems, your best bet is probably to make sure your pages are quite different from each other, because we do do a lot of different duplicate detection... to crawl less, and to provide better results and more diversity.

Q: I’d like to explicitly exclude a few of my sites from the default moderate SafeSearch filter, but Google seems to be less of a prude than I’d prefer. Is there any hope of a tag, attribute or other snippet to limit a page to unfiltered results – or should I just start putting a few nasty words in the alt tags of blank images?

A: Well, don’t do them in blank images, you know, put ’em in your meta tags.

When I was writing the very first version of SafeSearch, I noticed that there were a lot of sites which did not tag their pages at all in terms of “we’re adult content.” There are a lot of industry groups, there are a lot of industry standards, but at that time the vast majority of porn pages just sort of ignored those tags. So, it wasn’t that big of a win to just go ahead and include that.

A short answer to your question is: to the best of my knowledge there is no tag that could just say, “I am porn, please exclude me from your SafeSearch.” It’s wonderful that you’re asking about that. Your best bet? I would go with meta tags. Because SafeSearch, unlike a lot of different stuff, actually does look at the raw content of a page (or at least the version that I last saw looks at the raw content of a page). So if you put it in your meta tags, or even in comments – which is something that isn’t usually indexed by Google very much – we should be able to detect porn that way. Don’t use blank images... don’t use images that people can’t see.

Q: Sometimes I make a select box spiderable by just putting links in the “option” elements. Normal browsers ignore them, and spiders ignore the options. But since Google is using the MozillaBot, and Mozilla renders the page before it crawls it, Mozilla would remove the link element from the Document Object Model tree.

A: In essence, you’re saying: can I put links in an option box? You can, but I wouldn’t recommend it. This is pretty non-standard behavior, it’s very rare. It would definitely make my eyebrows go up if I saw it. It’s probably better for your users and for search engines to just take those links out, put them somewhere in [*] the sitemap. In that way, we’ll be able to crawl right through, and we don’t have to have hyperlinks or anything like that.

Q: I would love to see a “define” type post where you define terms that you Googlers use that we non-Googlers might get confused about. Things like data refresh, orthogonal, etc. You may have defined them in various places, but one sheet type of list would be great.

A: A very good question. At some point I have to make a blog post about hosts vs domains, a bunch of stuff like that. But several people have been asking questions about June 27th/July 27th. So let me talk about those a little bit in the context of a data refresh vs an algorithm update vs an index update.

I use the metaphor of a car. Back in 2003 we crawled the web and indexed the web about once every month. And when we did that, that was called an index update; algorithms could change, the data would change, everything could change all in one shot. So that was a pretty big deal. WebmasterWorld would name those index updates.

Now that we pretty much crawl and refresh some of our index every single day, it’s an everflux, a sort of always-ongoing process.

The biggest changes that people tend to see are algorithm updates. You don’t see many index updates anymore, because we moved away from this monthly update cycle. The only time you might see them is if you’re computing an index which is incompatible with the old index. For example, if you change how you do segmentation of CJK (Chinese, Japanese and Korean), something like that, you might have to completely change your index, and build another index in parallel. So index updates are relatively rare.

Algorithm updates basically are when you change your algorithm. Maybe that’s a change in how you score particular pages. You say to yourself, “The PageRank matters this much more” or “it matters this much less,” things like that. And those can happen pretty much at any time. We call that asynchronous, because whenever we get an algorithm update and it evaluates positively, and it improves quality, and it improves relevance... we go ahead and push that out.

And then the smallest change is called a data refresh. And that’s essentially like, you’re changing the input to the algorithm. You’re changing the data that the algorithm works on.

An index update, with a car metaphor, would be changing a large section of the car... like changing the car entirely. Whereas an algorithm update would be things like changing a part in the car... maybe changing out the engine for a different engine, or some other large part of the car. A data refresh is more like changing the gas in your car. Every one or two weeks (or three weeks if you’re driving a hybrid!) you change what actually goes in and how the algorithm operates on that data.
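To make the distinction between an algorithm update and a data refresh concrete, here is a toy sketch. Everything in it is an assumption for illustration: the signal names, the weights, the example hosts and the two scoring functions are hypothetical and say nothing about how Google actually scores pages. The point is only that rankings can move either because the scoring function changed or because its input data changed.

```python
from typing import Callable, Dict, List

Signals = Dict[str, float]


def score_v1(signals: Signals) -> float:
    """Hypothetical scoring function: both signals weighted equally."""
    return 1.0 * signals["link_signal"] + 1.0 * signals["text_signal"]


def score_v2(signals: Signals) -> float:
    """Hypothetical algorithm update: the link signal now counts double."""
    return 2.0 * signals["link_signal"] + 1.0 * signals["text_signal"]


def rank(pages: Dict[str, Signals], score: Callable[[Signals], float]) -> List[str]:
    """Order page names from highest to lowest score."""
    return sorted(pages, key=lambda name: score(pages[name]), reverse=True)


# Made-up input data for two example hosts.
old_data = {
    "a.example": {"link_signal": 0.2, "text_signal": 0.9},
    "b.example": {"link_signal": 0.6, "text_signal": 0.4},
}

# Data refresh: same algorithm, but the measured signal for a.example changed.
new_data = {
    "a.example": {"link_signal": 0.2, "text_signal": 0.3},
    "b.example": {"link_signal": 0.6, "text_signal": 0.4},
}

baseline = rank(old_data, score_v1)           # old data, old algorithm
after_refresh = rank(new_data, score_v1)      # new data, old algorithm
after_algo_update = rank(old_data, score_v2)  # old data, new algorithm

print(baseline, after_refresh, after_algo_update)
```

In this sketch both kinds of change reorder the same two pages, which is why the two are hard to tell apart from the outside.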

So for the most part, data refreshes are a very common thing. We try to be very careful about how we safety-check them. Some data refreshes happen all the time; for example, we compute PageRank [edit: continually and] continuously. There’s always a bank of machines refining the PageRank based on incoming data. And PageRank goes out all the time – anytime there’s a new update to our index, which happens pretty much every day. By contrast, some algorithms are updated every week or every couple of weeks, so those are data refreshes that happen at a slower pace.
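For readers who want a picture of what "refining the PageRank based on incoming data" could look like, here is a small sketch of the textbook power-iteration formulation of PageRank. It is not a description of Google's production system. The damping factor of 0.85, the iteration count, the handling of pages without outlinks and the three-page link graph are all assumptions chosen for illustration.

```python
from typing import Dict, List


def pagerank(links: Dict[str, List[str]], damping: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Textbook PageRank by repeated refinement (power iteration).

    `links` maps each page to the pages it links to. Pages with no
    outlinks spread their rank evenly over all pages.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the same small base amount.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Made-up link graph: a and b link to each other, c links to a.
graph = {"a": ["b"], "b": ["a"], "c": ["a"]}
ranks = pagerank(graph)

# "Incoming data": a newly discovered link from a to c, then a recompute.
refreshed_graph = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
refreshed_ranks = pagerank(refreshed_graph)

print(ranks)
print(refreshed_ranks)
```

The scoring code is identical in both runs. Only the link data changes, which is the sense in which a recomputation like this counts as a data refresh rather than an algorithm update.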

The particular algorithm that people were interested in on June 27th/July 27th has actually been live for over a year and a half now. It’s data refreshes that you’re seeing that change the way people’s sites rank. In general, if your site has been affected, you know, go back and take a fresh look and see if there’s anything that might be exceedingly over-optimized. Or, maybe you’ve been hanging out on SEO forums for such a long time that you need to have a regular person come in and take a look at the site and see if it looks OK to them. If you tried all the regular stuff and it still looks OK to you, then I would just keep building regular good content, try to make the site very useful. And if a site is useful, then Google should fight hard to make sure it ranks where it should be ranking.

*Oops, I didn’t get this part. Anyone?

Find more transcripts of Matt’s talks.

