Google Blogoscoped

Forum

[Meta] Small forum change in auto-linking

Philipp Lenssen [PersonRank 10]

Thursday, March 6, 2008
16 years ago (5,191 views)

There was a problem in this forum with how URLs auto-linked. As of now, you no longer always need to include a blank after the URL; e.g. you can use brackets immediately before and after, writing (http://example.com). Sometimes the blank is still needed, e.g. when writing a dot, as otherwise the auto-linker thinks the dot is part of the URL...

Roger Browne [PersonRank 10]

16 years ago #

Good move. Brackets can of course be part of a valid URL, so let's see what happens with this:

(http://www.example.com/foo(bar))

Roger Browne [PersonRank 10]

16 years ago #

Seems it didn't work.

Philipp Lenssen [PersonRank 10]

16 years ago #

Good point, Roger. Until I find something smarter, I just did a quick hack so that brackets will be handled old-style if there seem to be Wikipedia links anywhere within the comment... because isn't Wikipedia one of the biggest use cases for brackets in URLs?

http://en.wikipedia.org/foo(bar)

Philipp Lenssen [PersonRank 10]

16 years ago #

http://example.org/foo(bar)

Philipp Lenssen [PersonRank 10]

16 years ago #

PS: Does anyone know a good solid auto-linking regular expression dealing with all these cases?

Tony Ruscoe [PersonRank 10]

16 years ago #

Technically, the brackets should be URL encoded anyway, so I don't think this is a problem. (Same goes for spaces – they're valid characters in a URL but they usually get URL encoded to %20.)

e.g. http://en.wikipedia.org/wiki/Foo%28bar%29
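
A minimal sketch of the encoding Tony describes, assuming Python's standard urllib.parse module rather than whatever the forum software actually uses. The helper name encode_path_segment is hypothetical.

```python
from urllib.parse import quote, unquote

def encode_path_segment(segment):
    # safe="" means only letters, digits and "_.-~" are left untouched;
    # brackets and spaces are percent-encoded.
    return quote(segment, safe="")

encoded = encode_path_segment("Foo(bar)")
print(encoded)           # Foo%28bar%29
print(unquote(encoded))  # Foo(bar)
```

A URL encoded this way contains no literal brackets, so an auto-linker can treat any ")" it sees as ordinary punctuation.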

Philipp Lenssen [PersonRank 10]

16 years ago #

(Hmm, anyone got a link to an actual Wikipedia article using brackets?)

David Mulder [PersonRank 10]

16 years ago #

http://en.wikipedia.org/wiki/Hindenburg_%28Mangaka%29

Philipp Lenssen [PersonRank 10]

16 years ago #

(Thanks David. Looks like while you pasted it with encoded brackets, it shows up with normal brackets in e.g. Google, so I better keep excluding Wikipedia from the new linking mechanism.)

Roger Browne [PersonRank 10]

16 years ago #

If you click on David's link with Firefox, the address bar shows the brackets URL-encoded. If you click on the same link with Konqueror, the address bar shows the brackets in plaintext.

So brackets are inevitably going to be cut-and-pasted into posts.

Ionut Alex. Chitu [PersonRank 10]

16 years ago #

I think a better idea is to keep the brackets in the URL. Most of the issues are related to brackets that are closed after a URL. So you should detect:

"(like this page: http://something.com/) "

) should not be a part of the URL.

Nobody will write: " http://something.com/(great site, actually)."

Ramibotros [PersonRank 10]

16 years ago #

I dunno, but this might help: http://www.truerwords.net/articles/ut/urlactivation.html

Tony Ruscoe [PersonRank 10]

16 years ago #

> So brackets are inevitably going to be cut-and-pasted into posts.

If we're using those rules, does Konqueror also maintain spaces in URLs? If so, you'd never be able to create an accurate parser for auto-linking those, unless you use some kind of markup.

You could try to implement this like Microsoft, e.g. autolink obvious URLs but allow people to write longer URLs containing spaces, brackets, etc. in angle brackets. For example:

http:// example.com/foo (bar) links only:
http:// example.com/foo

but

<http:// example.com/foo (bar)> links:
http:// example.com/foo (bar)
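
A rough sketch of this angle-bracket idea in Python, under the assumption that a bare URL ends at the first whitespace and a delimited one ends at the closing ">". The pattern and the function name find_delimited_urls are illustrative, not the forum's code, and the example URLs are written without the blank after "http://".

```python
import re

# First alternative: anything between < and > that starts like a URL.
# Second alternative: a bare URL, which stops at whitespace or an angle bracket.
URL_PATTERN = re.compile(r"<(https?://[^<>]+)>|(https?://[^\s<>]+)")

def find_delimited_urls(text):
    # Exactly one of the two groups is set for each match.
    return [m.group(1) or m.group(2) for m in URL_PATTERN.finditer(text)]

print(find_delimited_urls("see http://example.com/foo (bar) now"))
print(find_delimited_urls("see <http://example.com/foo (bar)> now"))
```

The first call yields only http://example.com/foo, while the second keeps the space and the brackets because the writer marked the URL's extent explicitly.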

Philipp Lenssen [PersonRank 10]

16 years ago #

Well, I think the algorithm should be something like "the URL ends if there's a ')' unless there was a '(' before within the URL". Would have to figure out how to do that with a regular expression, or just parse/convert it myself using non-regex code. (Currently the regular expression loads a function which also handles stuff like YouTube auto-embedding, picture embedding, adding the top-arrow for internal-thread references and so on.) But it seems the currently active solution might work in most cases and only perhaps require once-in-a-blue-moon editing from a moderator...
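
The rule quoted above can be sketched without a single clever regex: grab a candidate up to the next whitespace, then scan it and cut at the first ")" that has no earlier "(" inside the URL. This is a Python illustration under those assumptions; the names CANDIDATE, trim_unbalanced_paren and find_bracket_aware_urls are made up for the example.

```python
import re

# Candidate URL: everything up to the next whitespace or angle bracket.
CANDIDATE = re.compile(r"https?://[^\s<>]+")

def trim_unbalanced_paren(url):
    """Cut the URL at the first ')' that has no matching '(' before it."""
    depth = 0
    for i, ch in enumerate(url):
        if ch == "(":
            depth += 1
        elif ch == ")":
            if depth == 0:
                return url[:i]  # this ')' belongs to the surrounding text
            depth -= 1
    return url

def find_bracket_aware_urls(text):
    return [trim_unbalanced_paren(m.group(0)) for m in CANDIDATE.finditer(text)]

print(find_bracket_aware_urls("(http://www.example.com/foo(bar))"))
```

This sketch only deals with brackets; a sentence-ending dot directly after the URL would still be swallowed.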

Philipp Lenssen [PersonRank 10]

16 years ago #

Update: The code no longer checks for Wikipedia when deciding whether to disable the new auto-linking mechanism; it now checks for an occurrence of "_(" anywhere in the comment. This should cause even less trouble (though it still requires moderation of some rarer cases, until a better solution is found...).
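
A guess at what such a comment-wide switch might look like, written in Python. The two patterns are assumptions: "new style" is taken to mean that brackets end the URL, and "old style" that only whitespace does. The name linkify_candidates is hypothetical.

```python
import re

NEW_STYLE = re.compile(r"https?://[^\s()]+")  # brackets end the URL
OLD_STYLE = re.compile(r"https?://\S+")       # only whitespace ends the URL

def linkify_candidates(comment):
    # One decision for the whole comment: any "_(" switches every URL
    # in it back to old-style handling.
    pattern = OLD_STYLE if "_(" in comment else NEW_STYLE
    return pattern.findall(comment)

print(linkify_candidates("Hello (http://en.wikipedia.org/foo) world."))
print(linkify_candidates("Hello http://en.wikipedia.org/foo_(bar) world."))
```

Because the check is per comment and not per URL, a comment mixing both kinds, such as "(see http://example.com) and http://en.wikipedia.org/foo_(bar)", would link the first URL with its closing bracket attached, which is the kind of rarer case left to moderators.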

Philipp Lenssen [PersonRank 10]

16 years ago #

Test 1:
Hello (http://en.wikipedia.org/foo) world.

Philipp Lenssen [PersonRank 10]

16 years ago #

Test 2:
Hello http://en.wikipedia.org/foo_(bar) world.

Ionut Alex. Chitu [PersonRank 10]

16 years ago #

Very interesting (although I don't know who would post http://en.wikipedia.org/foo_(bar)) .

Tony Ruscoe [PersonRank 10]

16 years ago #

Heh. I think we should see how it goes now. It's probably going to catch 99% of links, so it should save us quite a bit of time.

My tests:

This is http://example.com
This is http://example.com.
(This is http://example.com)
(This is http://example.com.)
(This is http://example.com).

Tony Ruscoe [PersonRank 10]

16 years ago #

Hmm. I think it's just as important that any full-stop / period at the end of the URL with white-space following it shouldn't get included in the link.

Roger Browne [PersonRank 10]

16 years ago #

Are there any real URLs with the full-stop/period as the last character?

Tony Ruscoe [PersonRank 10]

16 years ago #

I don't think so. A domain / IP obviously can't end in a full-stop and you can't have a file name ending in a full-stop (under Windows, at least).

Tony Ruscoe [PersonRank 10]

16 years ago #

(Weird. It seems the regular expression just added a space before my closing bracket. And it will probably do the same here...)

Haochi [PersonRank 10]

16 years ago #

http://ihaochi.com/files/auto-link-url-temp.php

Motti [PersonRank 10]

16 years ago #

@Tony: A domain can end in a full stop as technically all domains end with the zero-th-level domain "." (I forget what the technical term is) and is used (e.g.) in DNS records. If you type (say) microsoft.com. (with the final full-stop/period) in most browsers it will work.

Can anyone find a regular URL ending with a period/full-stop (the filetype: operator in Google doesn't help here obviously) or, even better, figure out a general method to find URLs ending with a "."?

Tony Ruscoe [PersonRank 10]

16 years ago #

Motti, exactly. Although the full-stop is used in DNS records, it's not generally used in links. We're not talking about theory; we're talking about practice. You don't need it and it shouldn't really be linked, although it does work in most browsers, just like you say.

David Mulder [PersonRank 10]

16 years ago #

Just had to try the new system...
((test)) http://www.foo.com/te_(test)
(((test)) http://www.foo.com/te_(test)
(((test)) http://www.foo.com/te_(test))

Motti [PersonRank 10]

16 years ago #

A comma just got included in a URL in my post here: http://blogoscoped.com/forum/125486.html#id125756

Philipp Lenssen [PersonRank 10]

16 years ago #

Well, first of all: the current linkifier does not fully handle the brackets right if you include several tests in a single comment, because it does some comment-wide checks. Second, there is apparently a bug in it right now which puts a blank before brackets, which I need to fix ASAP :)

If we talk about solutions, along the lines of what Tony says, I'm mostly interested in a 99.9% working thing. If 1 in 100,000 URLs ends in a dot, that's not as important as when 1 out of 50 comments includes a URL followed by a dot of the sentence-ending kind. Similarly for brackets: it seems there's rarely any URL ending in a bracket which does not contain an opening bracket as well, so I think we can safely disregard this if we go for a 99.9% perfect solution. (Besides, there's still us moderators for that once-in-a-blue-moon URL that turns out wrong if the algo fails.)

Will look into Haochi's regex, looks interesting. Ideally we need something that handles commas, question marks, exclamation marks etc. in the most "pragmatic" sense (not necessarily the most correct...).
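
One "pragmatic" approach, sketched in Python and not taken from Haochi's regex or the forum's code: match greedily up to whitespace, then peel characters off the end while they look like sentence punctuation or a surplus closing bracket. The punctuation set and the names clean_url and pragmatic_urls are assumptions for illustration.

```python
import re

CANDIDATE = re.compile(r"https?://\S+")
TRAILING_PUNCTUATION = ".,?!;:"

def clean_url(url):
    # Repeatedly drop the last character while it is sentence punctuation
    # or a ')' that outnumbers the '(' characters inside the URL.
    while url:
        last = url[-1]
        if last in TRAILING_PUNCTUATION:
            url = url[:-1]
        elif last == ")" and url.count(")") > url.count("("):
            url = url[:-1]
        else:
            break
    return url

def pragmatic_urls(text):
    return [clean_url(m.group(0)) for m in CANDIDATE.finditer(text)]

for line in ["This is http://example.com.",
             "(This is http://example.com.)",
             "(This is http://example.com)."]:
    print(pragmatic_urls(line))
```

All three of these lines from Tony's tests come out as http://example.com. The trade-off is the one discussed in the thread: a real URL that genuinely ends in a dot, comma or unmatched bracket would be linked wrongly and need a moderator.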

George R [PersonRank 10]

16 years ago #

If we could see a problem before it is actually posted, then we could adjust for it.

Could Philipp provide a preview button where we could enter some text without actually posting it, and which then returns a page showing how it would appear if it were actually posted? It would convert URLs to links and images, and show any other formatting or transformations.

Checking the validity of URLs and spell checking would be nice also, but that seems like unnecessary work.
