On Dec 26, 2009, at 2:17 PM, Sergiu Dumitriu wrote:
Hi devs,
The short version:
Should we always use UTF-8 for encoding and decoding URLs, regardless
of the wiki encoding, for better compliance with web standards?
The long version:
By definition, URLs can only contain ASCII characters; everything else
must be converted to its corresponding bytes and escaped as %XY
escapes. The problem is that "corresponding bytes" implies a
charset + encoding, and no specification *enforces* a specific pair,
although it is *recommended* to use Unicode + UTF-8, in line with the
modern tendency of the web in general.
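
To make the "corresponding bytes" point concrete, here is a minimal sketch using the JDK's java.net.URLEncoder. The class and helper names are made up for illustration and are not XWiki code. Note that URLEncoder implements form encoding (it turns spaces into "+"), so it only approximates URL path escaping; for a single accented letter the result is the same.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class PercentEncodingDemo {
    // Hypothetical helper: percent-encodes a name using the given charset.
    static String encode(String name, String charset) {
        try {
            return URLEncoder.encode(name, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        String name = "\u00e9"; // the letter e with acute accent
        // One byte in ISO-8859-1, two bytes in UTF-8, hence different escapes.
        System.out.println(encode(name, "ISO-8859-1")); // %E9
        System.out.println(encode(name, "UTF-8"));      // %C3%A9
    }
}
```

The same character yields two different URLs depending on the charset, which is why both ends have to agree on one.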
Traditionally, XWiki has been using the same encoding as the configured
global wiki encoding for the URLs, which means that before 1.9, when we
switched to UTF-8 as the default wiki encoding, all URLs were using the
ISO-8859-1 encoding. Since the switch to UTF-8, URLs are also using the
UTF-8 encoding by default, although the wiki encoding can be changed.
Now, since 2.1, a bugfix accidentally changed the behavior, so that
parsing back URLs always uses the UTF-8 encoding, even though composing
URLs continues to use the wiki encoding. This is a bug, which prevents
changing the encoding to anything other than UTF-8, and it should be
fixed.
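
The mismatch described above can be sketched roughly like this, again with plain JDK classes rather than the actual XWiki URL factory code (the class and method names are hypothetical): the URL is composed with the wiki encoding but parsed back as UTF-8.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

public class RoundTripDemo {
    // Hypothetical helper: encodes with one charset, decodes with another.
    static String roundTrip(String name, String encodeCharset, String decodeCharset) {
        try {
            String escaped = URLEncoder.encode(name, encodeCharset);
            return URLDecoder.decode(escaped, decodeCharset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset", e);
        }
    }

    public static void main(String[] args) {
        String name = "\u00e9";
        // Same charset on both sides: the name survives the round trip.
        System.out.println(name.equals(roundTrip(name, "UTF-8", "UTF-8")));
        // Composed as ISO-8859-1 but parsed as UTF-8: the byte 0xE9 alone is
        // not valid UTF-8, so the decoded name no longer matches the original.
        System.out.println(name.equals(roundTrip(name, "ISO-8859-1", "UTF-8")));
    }
}
```

With a wiki encoding other than UTF-8, the second case is what a link to a document with an accented name would run into.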
Now, we have two options:
1. Reintroduce the old behavior, so that URLs always use the wiki
encoding. This is a direct bugfix.
2. Also change the encoding part, so that UTF-8 is always used. This is
an improvement, going towards better compliance with web standards.
Personally I think that the second option is the better one, but it
requires a vote, since it has a few drawbacks.
Advantages:
+ better compliance with web standards, since UTF-8 is the recommended
encoding for URLs (although not imposed)
Is there any reference to this? Some RFC that we could quote in the
code?
+ support for a wider range of document names, since UTF-8 allows
full Unicode document names, while ISO-8859-1 limits names to Latin-1
characters
+ better support from browsers, since entering accented characters
directly in the address bar encodes the URL sent to the server using
UTF-8,
Is that true for all browsers? Is there a standard?
and decoding the URL also assumes UTF-8; this means that a document
named "é" will be printed as .../view/Main/%E9 and will have to be
entered the same way in the address bar when ISO-8859-1 is used, and
as .../view/Main/é when UTF-8 is used
Drawbacks:
- by default Tomcat uses ISO-8859-1 as the encoding for URLs, so the
Tomcat configuration will have to be changed as in
http://platform.xwiki.org/xwiki/bin/view/AdminGuide/Encoding#HTomcat
Any known issue logged against Tomcat? Any planned fix? If this is
really a web standard why wouldn't Tomcat change this behavior?
- some existing bookmarks will not work anymore once the encoding is
changed
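
As an illustration of the Tomcat drawback above, here is a small sketch of what happens when a container decodes a UTF-8 escaped path as ISO-8859-1. It uses java.net.URLDecoder as a stand-in for the container's own decoding, which is an assumption; the class and method names are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class ContainerDecodingDemo {
    // Hypothetical helper standing in for the container's URL decoding step.
    static String decodePath(String rawPath, String containerCharset) {
        try {
            return URLDecoder.decode(rawPath, containerCharset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + containerCharset, e);
        }
    }

    public static void main(String[] args) {
        String rawPath = "/view/Main/%C3%A9"; // UTF-8 escapes for e-acute
        // Decoded as UTF-8, the two bytes form the single expected character.
        System.out.println(decodePath(rawPath, "UTF-8"));
        // Decoded as ISO-8859-1, each byte becomes its own character
        // (A-tilde followed by the copyright sign), so the lookup would fail.
        System.out.println(decodePath(rawPath, "ISO-8859-1"));
    }
}
```

This is why the container's URL encoding has to be switched to UTF-8 together with the wiki, as described on the Encoding page linked above.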
+1 for option 2 from me,
Before voting I'd like to see how "standard" this is. If it's really
standard and we have "official" standard docs to back this then I
agree that we should choose 2.
Thanks
-Vincent