Monday, January 30, 2006

How Much Did Google Agree to Censor?

How much does Google self-censor (or “filter”, as they call it) in China? We don’t know exactly, as Google hasn’t told us so far.

What we do know is that the Chinese gov’t handed Google some sort of list of sites which should be censored. We also know that Google says they want to comply with this list so they’ve got at least some presence in China, which they say will ultimately help the Chinese. From the search results we can further assume this is a list of domains (or sub-domains, or sub-folders on domains). A particular blacklisted domain can contain a dozen pages or millions of them, which are then censored altogether – regardless of whether every single page includes “blacklisted ideas.”

When you search Google China, you will see a disclaimer at the bottom of self-censored results. It’s not in a position that every user will focus on, as this eye tracking study shows. But if you are looking for it, you will find it. The disclaimer appears only if a result has been censored on that particular page. That means even when there is no such disclaimer on the first page, there may be one on the second or third page. (Try a search for “chairman”, and you will find that you only hit a censored result by scrolling through the first 4 result pages.)

(A user, by simply looking at the page, has no chance of finding out which page is missing – that’s sort of the point of censorship. A user also has no way of finding out, by looking at the page, how many results are missing. By directly querying for the domains included in a result list for the same search, however, we can reverse-engineer what is missing.)
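A simplified variant of that comparison can be sketched in a few lines of Python. This is only an illustration: rather than querying each domain directly, it compares two result lists for the same search that have already been collected by hand, and the function names and example URLs are made up for the sketch. A domain it reports is a candidate for censorship, not proof, because it may simply rank lower in the filtered list.

```python
from urllib.parse import urlparse


def domains(urls):
    """Return the set of host names that appear in a list of result URLs."""
    hosts = set()
    for url in urls:
        host = urlparse(url).hostname
        if host:  # skip entries that carry no host name
            hosts.add(host)
    return hosts


def missing_domains(unfiltered_urls, filtered_urls):
    """Hosts present in the unfiltered list but absent from the filtered one."""
    return domains(unfiltered_urls) - domains(filtered_urls)


# Hypothetical top results for one query, copied by hand from two editions.
unfiltered = [
    "http://news.example.org/story",
    "http://www.example.com/page",
    "http://blog.example.net/post",
]
filtered = [
    "http://www.example.com/page",
    "http://blog.example.net/other-post",
]

print(sorted(missing_domains(unfiltered, filtered)))
```

Comparing host names rather than full URLs matches the assumption above that whole domains are blacklisted, not single pages.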

But let’s concentrate on the top 10 only, and not later result pages. The first page of results is what Google considers most relevant by definition of its algorithms. This is what most people look at. (At least those of us who are used to trusting that the top 10 indeed includes the most relevant pages – this behavior can change depending on what we expect from search engines.)

I took a text from a page that is itself censored (following orders from the Chinese gov’t); it’s called “On the Tyranny of the Chinese Communist Party”, and I make no assumptions of any sort about the factual accuracy of the text (we do know the author of the text is currently imprisoned in China). I then queried the Yahoo API to extract meaningful terms from this text, and checked each term in a search – covering “all websites” by selection – to see whether it was censored or not (by looking for Google’s Chinese censorship disclaimer).
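The bookkeeping behind that check can be sketched as follows. This is a minimal illustration, not the script actually used: the term extraction and the search itself are left out, `has_disclaimer` is a hypothetical callback standing in for "run the search and look for the disclaimer on the first result page", and the terms in the example are placeholders rather than real data.

```python
from typing import Callable, Iterable, List


def censored_terms(terms: Iterable[str],
                   has_disclaimer: Callable[[str], bool]) -> List[str]:
    """Return the terms whose first result page shows the disclaimer."""
    return [term for term in terms if has_disclaimer(term)]


def censored_share(terms: Iterable[str],
                   has_disclaimer: Callable[[str], bool]) -> float:
    """Percentage of terms with at least one censored top-10 result."""
    term_list = list(terms)
    if not term_list:
        return 0.0  # avoid dividing by zero on an empty term list
    hits = censored_terms(term_list, has_disclaimer)
    return 100.0 * len(hits) / len(term_list)


# Placeholder data: in a real run the callback would perform the search.
flagged = {"term_b"}
sample = ["term_a", "term_b", "term_c", "term_d"]

share = censored_share(sample, lambda term: term in flagged)
print(f"{share:.1f}% of the sample terms hit a censored result")
```

Passing the disclaimer check in as a function keeps the counting separate from however the search pages are actually fetched and read.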

Are the terms extracted also typical daily searches for a Chinese user? No, absolutely not. That would be another experiment. To begin with, a Chinese user won’t search only in English, of course. Because the list of terms was extracted from a gov’t-critical text, we have reason to assume a much higher percentage of terms hitting censored results (16.4% in this case). So, this experiment’s setup cannot at all find out the approximate percentage of searches that hit censored results. Note that even if we could see search logs from Chinese search engines to try to find out the percentage of censored results, the fact that a Chinese search user may suspect a query is “sensitive” may in turn decrease the chance of her entering it – because she doesn’t want to risk being uncovered, or because she thinks that the results for this topic will be only propaganda anyway. We can also assume that the number of pages containing “sensitive” material is much lower than it would be if China’s internet users were allowed to publish anything freely. If Google ever tells us that only N% of all searches hit censored results, we should keep that in mind as well.

So, the following is a small selection of phrases where one or more results are censored in the top 10... according to Google itself. The list makes no assumption about the quality of the censored results, i.e. some results may still be relevant on human analysis, or human analysis may deem some results that are now pushed to the top to be deceptive propaganda. Besides saying that these results are definitely censored, the only other thing we can safely say is that, by Google’s own algorithmic definition of what is most relevant, these result pages have decreased relevance. (I’m including this list as an image as I don’t want to risk being listed above more relevant sites for these terms in Google, as this post offers no information on these terms.) You can see the list contains seemingly “non sensitive” words like “accuse”, “hatred” or “hearts” as well as more political terms:


