Toronto journalist and accessibility consultant Joe Clark comments on the lack of ATAG-compliant blogging tools (ATAG being the World Wide Web Consortium’s Authoring Tool Accessibility Guidelines) – this includes Google’s Blogger:
“Google doesn’t believe in valid code. Their computers are strong enough to bulldoze through whatever shite people put out, as far as they’re concerned.”
Joe points to an interesting statement made by Google co-founder Sergey Brin, who remarked on the Semantic Web:
“Look, putting angle brackets around things is not a technology, by itself. I’d rather make progress by having computers understand what humans write, than by forcing humans to write in ways computers can understand.”
While I’m not a big believer in the Semantic Web*, I do strongly believe in the merits of accessibility – as in, plain Berners-Lee HTML (the “Strict” variants; not necessarily validating, but no DHTML please). I sit on the train every day on my way to work, and while I can easily read, post and delete in the Google Blogoscoped forum with my Nokia cell phone**, or do a lot of other stuff on the web, I can’t check my Gmail.
*People didn’t even use the address element properly, and it’s been around longer and actually always had visible effects in browsers.
**I’m using NetFront Access, the hands-down best Symbian OS browser.
So far, when Google thinks of accessibility, it always seems to think of having to provide a separate HTML page. This error is obvious in the Google Palm homepage, in comments made at the beginning of the Gmail Beta program (namely, that accessibility would follow via a separate HTML version), and in Google News (there’s a link to a “text version”). But the key to web accessibility is not redundant content, it’s plain HTML the way Tim Berners-Lee designed it – device independence was one of the major problems he set out to solve with it.
Apparently, a vulnerability in Gmail’s handling of the reply-to field in some situations reveals emails of other people. [Thanks David C.]
Update: Slashdot has the news from Google that they fixed this bug.
Daniel Brandt of Google-Watch.org released the source to his Google screen-scraper, and this bit is making the rounds through search-engine news sites. I also released similar code in this blog before. Actually, acting as a Google proxy and displaying Google search results without ads on your own server is trivial for a programmer. (Just grab the result URL via HTTP in a language like PHP, do some text splitting and searching, possibly cache the data, and output the HTML you like after doing additional calculations with the result set – these are the basic steps of all screen-scraping.) On the other hand, it is not trivial for Google to create these search results, or for any developer to create a search engine from scratch.
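To illustrate just how trivial the text-splitting step is, here is a minimal Python sketch. The markup pattern below is entirely made up – real result pages look different and change without notice, which is precisely why screen-scrapers break so easily:

```python
import re

def extract_results(html):
    """Extract (title, url) pairs from a result page's HTML.

    The <a class="result"> pattern is hypothetical; a real scraper
    would have to match whatever markup the site currently emits.
    """
    pattern = re.compile(r'<a class="result" href="([^"]+)">([^<]+)</a>')
    return [(title, url) for url, title in pattern.findall(html)]

# A fabricated snippet standing in for a fetched result page:
sample = ('<a class="result" href="http://example.com/">Example</a>'
          '<a class="result" href="http://example.org/">Another</a>')

print(extract_results(sample))
```

Fetching the page over HTTP, caching it, and wrapping the extracted pairs in your own HTML are the remaining steps; none of them is harder than this one.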
“These engines crawl the public web without asking permission, and cache and reproduce the content without asking permission, and then use this information as a carrier for ads that generate private profit.”
So Daniel’s basic point is that screen-scraping of his sort is morally OK because Google does the same; i.e., it screen-scrapes the whole web, and then puts ads on parts of that content in its search results. That much is true. However, there is one huge difference, and it makes Daniel’s point a weak one:
Google respects the “Robots.txt” file, an agreed-upon standard that tells automated agents scraping the web just what is OK to fetch from one’s server. Daniel Brandt does not respect this “Robots.txt” file. (Google disallows all automated user-agents from grabbing its search URLs, and instead offers the limited Google Web API to interested developers.)
So yes, Google does ask for permission before republishing your content, using the web’s default way of doing so: checking whether a “Robots.txt” file is present, and respecting what it says. (Do I always respect “Robots.txt” myself? No; plain curiosity means I sometimes do want to screen-scrape sites like Google, simply because the Google Web API is so limited. I do not know whether the “Robots.txt” file has any actual legal implications. Still, I wouldn’t base anything serious on it, and I also shut down my tools upon request; my serious sites like FindForward, on the other hand, use only APIs. That also makes sure your tool won’t break from one day to the next simply because the other site decided to rearrange parts of its HTML output or URL scheme.)
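Checking “Robots.txt” the way a well-behaved crawler does is itself only a few lines of code. Here is a sketch using Python’s standard robots-file parser, fed an excerpt resembling Google’s own rules (the exact contents of their real file may differ; the user-agent name is made up):

```python
from urllib.robotparser import RobotFileParser

# An excerpt in the style of Google's "Robots.txt" rules, which
# keep automated agents away from the search result URLs:
robots_txt = """\
User-agent: *
Disallow: /search
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# A polite scraper asks before fetching each URL:
print(rp.can_fetch("MyBot", "http://www.google.com/search?q=test"))
print(rp.can_fetch("MyBot", "http://www.google.com/about"))
```

A crawler that runs every candidate URL through a check like this – and skips the disallowed ones – is playing by the web’s agreed-upon rules; one that doesn’t, isn’t.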
Daniel also says:
“Our review of the legal situation has convinced us that we are covered by ‘fair use’ under the Copyright Act.”
Now copyright is another issue, and a very fuzzy one at that. Technically, whenever you visit a web site, you download a copy of it to your hard disk. Including all images in it. Think of it as the same situation as when your brain memorizes copyrighted work: you just can’t help it because you need to look at this work.
The core of this problem, then, is not making copies, which we do all day. It’s distributing those copies to others, short-cutting the original channel and depriving its owners of the commercial benefits of ads and such. We can work with the material only within what copyright law deems “fair use.” Does Google make “fair use” of what it finds online, and does it give back to the community as much as it takes?
“The larger issue here is that the commercialization of the web became possible only because tens of thousands of noncommercial sites made the web interesting in the first place. All search engines should make a stable, bare-bones, ad-free, easy-to-scrape version of their results available for those who want to set up nonprofit repeaters. Even if it cuts into their ad profits slightly, there’s no easier way to give back some of what they stole from us.”
Daniel misses one major point: every webmaster in the world I know is very happy when Google displays his content. It brings visitors who click on the result link, and visitors are what you usually want to attract by publishing something online (out of commercial or other interest). People know how the “Robots.txt” file works, and yet they decide not to use it – they want Google to screen-scrape. SEO, Search Engine Optimization, is partly about making a site more friendly to screen-scrapers. Yes, Google is giving back to the web, and in great numbers – in numbers greater than any other search engine.
This being said, there is another side to it: yes, personally I would love for Google to officially allow screen-scraping of their site and to change their “Robots.txt” file; I would love for them to release better APIs (covering Google Images, Google Groups, and others); I would love to have an RSS feed for Google News; and so on. The difference between me and Daniel Brandt of Google-Watch and some others is that I believe Google Inc. is in no way obliged to do any of that.
And it’s not even evil not to do it.
Wouldn’t it be nice if, for every site we visit, we automatically saw how many pages the domain at large has in Google’s index? This could be made into a nice Firefox extension. E.g. when you visit “www.example.com/subpage/1.html”, Google would be queried for “site:www.example.com”, and the resulting page count displayed in the Firefox status bar at the bottom. In addition to PageRank – an imperfect yet greatly helpful “level of trust” indicator – this would give you an instant feeling for which “neighborhood” you are in (is it a fresh site with 10 pages and a PageRank of 1... or a large news network with 1,000,000 pages and a PageRank of 6?).
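The core of such an extension would be turning the visited URL into the right query. A real Firefox extension would do this in JavaScript; as a language-neutral sketch, here is the query-building step in Python (the function name is my own invention):

```python
from urllib.parse import urlsplit

def site_query(url):
    """Turn a visited page's URL into the Google query string
    that counts indexed pages for that host."""
    host = urlsplit(url).netloc  # e.g. "www.example.com"
    return "site:" + host

print(site_query("http://www.example.com/subpage/1.html"))
```

The extension would then send that query to Google, read the result count off the page (or via an API, if one covered it), and write the number into the status bar.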