Google Blogoscoped


The Google Auto Linker

Kevin Fox [PersonRank 4]

Thursday, February 3, 2005
19 years ago

Before I got to the end of your passage, I was thinking along the same lines: put in a threshold of, say, six words (any strings shorter than that don't get linkified), perhaps add some visual indication of how many hits that particular phrase has (e.g. changing text size to relate to the number of hits, à la Flickr tags), and then let teachers upload term papers to identify incidences of plagiarism.


Philipp Lenssen [PersonRank 10]

19 years ago #

I just tried the threshold and set it to 25 letters, which worked well and, as a side effect, also made the program much faster. The link weight is interesting as well: one could actually decrease the size of popular phrases, so the more original a thought, the bigger it ends up on the page. (I did something along those lines, but on a word-by-word basis.)
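To make the threshold and the inverted link weight concrete, here is a minimal Python sketch. It is not the tool's actual code: the function names, the 10 to 24 point size range, and the linear scaling are all assumptions chosen for illustration.

```python
def eligible(phrase: str, min_chars: int = 25) -> bool:
    """Only phrases of at least min_chars characters get looked up and linked."""
    return len(phrase) >= min_chars


def font_size(hits: int, max_hits: int, smallest: int = 10, largest: int = 24) -> int:
    """Map a hit count to a font size in points; fewer hits gives bigger text.

    The size range and the linear scale are hypothetical choices.
    """
    if max_hits <= 0:
        return largest
    # share is 0.0 for a phrase nobody has used and 1.0 for the most common one
    share = min(max(hits, 0), max_hits) / max_hits
    return round(largest - share * (largest - smallest))
```

With these assumptions, a phrase with zero hits is rendered at 24 points and the most common phrase on the page at 10 points. Real hit counts vary over many orders of magnitude, so a logarithmic scale might work better than the linear one used here.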

Now what's really missing for this to work is a way to pick the meaningful sentences out of the stream, rather than simply parsing from the beginning onward. E.g. I could say "if it's a very long string and yet has many hits", then the author hit on a popular idea or quoted something. But for that to work I'd need to check every possible substring within the string, and the result would consist of a sort of cross-nested links.
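A rough sketch of what "every possible substring" means when working word by word (the function name and the `min_words` parameter are made up for this example):

```python
def word_substrings(text: str, min_words: int = 1):
    """Yield every contiguous run of at least min_words words in text."""
    words = text.split()
    for start in range(len(words)):
        for end in range(start + min_words, len(words) + 1):
            yield " ".join(words[start:end])
```

A string of n words has n(n+1)/2 such runs when `min_words` is 1, so a 20-word sentence already yields 210 candidates. If each candidate needs its own search query, the cost grows quickly, which is one reason a minimum length helps.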

Hmmm. I will play around with the tool some more. I think it should not cross symbols like "." or "?", which indicate the end of a sentence. That might improve the overall linking relevancy.
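One simple way to keep phrases from crossing sentence ends is to split the text first and only build phrases inside each piece. This is a sketch under the assumption that ".", "?" and "!" followed by whitespace end a sentence; it is not how the tool itself does it.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Split after '.', '?' or '!' when followed by whitespace."""
    parts = re.split(r"(?<=[.?!])\s+", text.strip())
    return [p for p in parts if p]
```

Note that this naive rule also splits after abbreviations such as "e.g.", so real text would need a list of exceptions or a proper sentence tokenizer.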

Brian [PersonRank 0]

19 years ago #

Try this:

1) Rather than interfacing with Google for every string, first compute the readability of the strings locally using the Flesch-Kincaid Readability Test [1].
2) Take several of the lowest-scoring strings and plug those into Google. This will return only results that are "interesting", as in: wow, I really sort of expected I was the first person to ever say that :)

It's actually very useful as is, though.
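The two steps above could be sketched roughly as follows. This assumes the Flesch Reading Ease variant of the test, where a lower score means harder text, and it uses a crude vowel-group heuristic for syllables. The function names are invented for this example.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic: count groups of consecutive vowels, at least one per word."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease: 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (len(words) / sentences) - 84.6 * (syllables / len(words))


def least_readable(strings, n=3):
    """Step 2: keep only the n lowest-scoring strings as search candidates."""
    return sorted(strings, key=flesch_reading_ease)[:n]
```

Only the strings returned by `least_readable` would then be sent to the search engine, which cuts the number of queries from one per string to a fixed n.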

Brian [PersonRank 0]

19 years ago #



Philipp Lenssen [PersonRank 10]

19 years ago #

I tweaked the program a little and updated the original post (sentence ends are respected now).

Philipp Lenssen [PersonRank 10]

19 years ago #

By the way, I also have a database of word frequencies that I could plug in here somehow... it's the Google page count for each of the 27,000 words of a dictionary.
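One conceivable way to plug such a table in is to score each phrase by how rare its words are before spending a query on it. This is only a sketch: the `page_counts` mapping and the `total_pages` figure stand in for the real database, and the scoring rule is an assumption.

```python
import math


def rarity(phrase: str, page_counts: dict[str, int], total_pages: int) -> float:
    """Average negative log of each word's share of pages; higher means rarer.

    total_pages is assumed to be positive.
    """
    words = phrase.lower().split()
    if not words:
        return 0.0
    score = 0.0
    for w in words:
        count = max(1, page_counts.get(w, 1))  # unknown words are treated as very rare
        score += -math.log(min(count, total_pages) / total_pages)
    return score / len(words)
```

Phrases made of common words score near zero and could be skipped, while phrases with unusual words would be queried first.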

Brian [PersonRank 0]

19 years ago #

You could make a game out of this. Take any paragraph (famous ones would be fun) and reword it in a way that retains its semantics, uses proper syntax, and yet no phrase contained within has ever been uttered on the WWG =)
