Ensuring Google Knows It's Your Content

Wednesday, June 13, 2007


If you run a blog, you might have noticed several “shadow blogs” that scrape your content, omit your name from the credits, and display big ads instead. In a Google-centric world, where traffic (and revenue) often depends on your content’s position in search results, this harms you most when those shadow blogs actually appear above you in the SERPs. A stronger PageRank might act as a defense, and Google has now given some more details on this in a blog post discussing duplicate content:

At the summit at SMX Advanced, we asked what duplicate content issues were most worrisome. Those in the audience were concerned about scraper sites, syndication, and internal duplication. (...) Here’s the list of some of the potential solutions we discussed (...)

Providing a way to authenticate ownership of content

This would provide search engines with extra information to help ensure we index the original version of an article, rather than a scraped or syndicated version. Note that we do a pretty good job of this now and not many people in the audience mentioned this to be a primary issue. However, the audience was interested in a way of authenticating content as an extra protection. Some suggested using the page with the earliest date, but creation dates aren’t always reliable. Someone also suggested allowing site owners to register content, although that could raise issues as well, as non-savvy site owners wouldn’t know to register content and someone else could take the content and register it instead. We currently rely on a number of factors such as the site’s authority and the number of links to the page. If you syndicate content, we suggest that you ask the sites who are using your content to block their version with a robots.txt file as part of the syndication arrangement to help ensure your version is served in results.
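To make the robots.txt suggestion from the quote concrete, here is a minimal sketch of how you might check that a syndication partner really blocks its copy of your article. It uses Python's standard `urllib.robotparser`; the partner domain, the `/syndicated/` path and the robots.txt rules are hypothetical placeholders, not anything Google or a real partner prescribes.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a syndication partner might serve to keep
# its copies of your articles out of search engine indexes.
PARTNER_ROBOTS_TXT = """\
User-agent: *
Disallow: /syndicated/
"""


def is_blocked(robots_txt: str, url: str, user_agent: str = "Googlebot") -> bool:
    """Return True if the given robots.txt rules forbid user_agent from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical URLs on the partner's site.
    copy_url = "https://partner.example/syndicated/my-article.html"
    other_url = "https://partner.example/about.html"
    print(copy_url, "blocked:", is_blocked(PARTNER_ROBOTS_TXT, copy_url))
    print(other_url, "blocked:", is_blocked(PARTNER_ROBOTS_TXT, other_url))
```

In practice you would fetch the partner's live robots.txt (for example with `RobotFileParser.set_url()` and `read()`) rather than pasting the rules in as a string; the string version just keeps the sketch self-contained.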

I think that in an ideal world, the best approach is still for Google to record where a piece of content was first published, and to treat that place as the content creator. It’s not foolproof (I can copy something from a print magazine and post it online first, if the print magazine is a bit slower in updating its website), but it also saves webmasters from investing any additional work in telling Google they own the content. On the other hand, such a timestamp comparison requires near real-time indexing of the web – and currently, PageRank-heavy sites get indexed more quickly (and it’s not completely unrealistic to assume that some sites gain good PageRank by building up a scraped content network).
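The catch with timestamps can be shown with a toy sketch. This is not how Google works; it only assumes a crawler that credits whichever copy it happens to see first. The site names, publication times and crawl delays below are all made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Copy:
    """One copy of an article, as a hypothetical crawler would encounter it."""
    site: str
    published: datetime      # when the copy actually went online
    crawl_delay: timedelta   # assumed time until the crawler visits this site

    @property
    def first_seen(self) -> datetime:
        # The crawler cannot observe `published`, only when it first saw the copy.
        return self.published + self.crawl_delay


def presumed_original(copies: list[Copy]) -> Copy:
    """Naive rule: the copy the crawler saw first is treated as the original."""
    return min(copies, key=lambda c: c.first_seen)


# Made-up numbers: the real author posts first, but is crawled rarely;
# the scraper posts 30 minutes later, but is crawled within the hour.
copies = [
    Copy("my-blog.example", datetime(2007, 6, 13, 9, 0), timedelta(hours=12)),
    Copy("shadow-blog.example", datetime(2007, 6, 13, 9, 30), timedelta(hours=1)),
]

if __name__ == "__main__":
    print("Credited as original:", presumed_original(copies).site)
```

With these invented delays the scraper gets the credit, even though it published later. Only when every site is crawled almost immediately does "first seen" match "first published".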

In any case, I hope Google won’t start shifting too much of their work over to webmasters. Some of us like to help, but in general we’d like to just build our sites, and not put our energy into improving any specific corporate-owned search engine. Adding nofollows to links, reporting paid links with the recently released form, opting in to “enhanced image search” features for your site, or creating a Sitemaps file are often tasks where you wonder, “can’t the search engine makers figure this out themselves?”

by Philipp Lenssen
