I don't know what all the fuss is about... Almost no Polish websites are in UTF-8 (though my sites are); they are all in ISO-8859-2 or Windows-1250. Those standards were adopted before UTF-8. The first is a Polish standard certified by a government agency, and the second is the native encoding in Polish versions of Windows.
So everyone who gets content from other sources deals with changing encodings a lot. And this takes just one function in PHP, Python or whatever language they are using... nothing to get excited about. Maybe it is taking Google so long to make Google News Poland because we use so many different encodings? That would be funny...
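To illustrate the "one function" point, here is a minimal Python sketch that re-encodes ISO-8859-2 or Windows-1250 bytes as UTF-8. The helper name `to_utf8` and the sample sentence are made up for this example, and it assumes you already know the source encoding, which is often the harder part in practice.

```python
# Sample Polish text, chosen only for illustration.
text = "Zażółć gęślą jaźń"

# Simulate the raw bytes two different Polish sites might serve.
latin2_bytes = text.encode("iso-8859-2")
cp1250_bytes = text.encode("windows-1250")


def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode bytes from a known source encoding and re-encode them as UTF-8.

    `to_utf8` is a hypothetical helper for this sketch, not a library function.
    """
    return raw.decode(source_encoding).encode("utf-8")


# Both inputs end up as the same UTF-8 bytes once decoded correctly.
utf8_from_latin2 = to_utf8(latin2_bytes, "iso-8859-2")
utf8_from_cp1250 = to_utf8(cp1250_bytes, "windows-1250")
```

Note that the two legacy encodings place some Polish letters (for example ą, ś and ź) at different byte values, so decoding with the wrong one of the two silently produces wrong characters rather than an error.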
The problem is that the sources in question use encodings that aren't widely supported, or schemes where the page is sent as us-ascii or iso-8859-1 and transformed into the Indic script by some browser-specific code.
For example, rajasthanpatrika.com/ or bhaskar.com/ won't show up correctly in Firefox even if you have the right fonts installed. (And in case anyone's looking for a solution, Padma – padma.mozdev.org/ – works. ;))
Available since 3/12.
When I saw that, I searched this forum to see whether we had already written about it, but I found nothing. So I supposed it was old. I was wrong :-(
Anyone here have the market share of Google News?
Dear sir, my thanks to the whole Google family. You have given me news from Maharashtra's daily papers, e.g. Lokmat, Sakal, janmaddhm. You have collected a lot of news. Thanks.
Google should be appreciated for this. Most Indian-language sites use proprietary fonts that display English/ASCII characters as Indic glyphs. Although Unicode is widely supported, these sites never bothered to migrate to it.
Converting from proprietary fonts to Unicode may look hard on paper, but it can be done. Unlike Google, the Firefox extension Padma (padma.mozdev.org/) transforms the various Indic scripts used on Indic-language sites. The Unicode Conversion Gateway (uni.medhas.org/), which is based on the Padma extension, dynamically converts Indic sites to Unicode.
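As a rough idea of what such a conversion involves, here is a toy Python sketch that maps the ASCII characters of a font-encoded page to Unicode code points. The table `FONT_TO_UNICODE` is entirely hypothetical and does not match any real font, and this is not how Padma itself is implemented; it only shows the basic lookup step.

```python
# Hypothetical glyph table: ASCII character shown by the proprietary font
# -> the Unicode character it is meant to represent. NOT a real font mapping.
FONT_TO_UNICODE = {
    "k": "\u0915",  # DEVANAGARI LETTER KA
    "m": "\u092E",  # DEVANAGARI LETTER MA
    "l": "\u0932",  # DEVANAGARI LETTER LA
    "a": "\u093E",  # DEVANAGARI VOWEL SIGN AA
}


def font_text_to_unicode(text: str, table: dict = FONT_TO_UNICODE) -> str:
    """Replace each font-encoded character with its Unicode equivalent.

    Characters missing from the table (spaces, digits, punctuation)
    are passed through unchanged.
    """
    return "".join(table.get(ch, ch) for ch in text)


converted = font_text_to_unicode("kml 2007")
```

A real converter needs one table per font and usually more than a character-by-character lookup, since legacy fonts often store some vowel signs in visual order while Unicode stores them in logical order after the consonant.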
An interesting quirk Google is dealing with here is the fact that India has almost 30 states, and most of them have their own language. I'm not talking about regional dialects here (i.e. Texans saying "y'all" and Minnesotans saying "dontcha know, ya"), I'm talking about separate languages ... and most Indians speak at most two of those languages. Making matters worse, even then those two are not widely shared: generally they are the language of the state they live in and that of a state nearby.
So to be able to digest news for a country whose own citizens deal with this kind of language barrier, and to do it well enough to brag about it ... now THAT is something special.