Hi devs,

The short version:

Should we always use UTF-8 for encoding and decoding URLs, regardless of the wiki encoding, for better compliance with web standards?

The long version:

By definition, URLs can only contain ASCII characters; everything else must be converted to its corresponding bytes and escaped as %XY escapes. The problem is that "their corresponding bytes" implies a charset + encoding, and no specification *enforces* a specific pair, although it is *recommended* to use Unicode + UTF-8, to comply with the modern tendency of the web in general.

Traditionally, XWiki has been using the same encoding as the configured global wiki encoding for the URLs, which means that before 1.9, when we switched to UTF-8 as the default wiki encoding, all URLs were using the ISO-8859-1 encoding. Since the switch to UTF-8, URLs are also using the UTF-8 encoding by default, although the wiki encoding can be changed.

Now, since 2.1, a bugfix accidentally changed the behavior, so that parsing back URLs always uses the UTF-8 encoding, even though composing URLs continues to use the wiki encoding. This is a bug, which prevents changing the encoding to anything other than UTF-8, and it should be fixed.

Now, we have two options:

1. Reintroduce the old behavior, so that URLs always use the wiki encoding. This is a direct bugfix.
2. Also change the encoding part, so that UTF-8 is always used. This is an improvement, going towards better compliance with web standards.

Personally I think that the second option is the better one, but it requires a vote, since it has a few drawbacks.
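As a rough illustration of the 2.1 mismatch described above (URLs composed with the wiki encoding but parsed back as UTF-8), here is a minimal Java sketch. The class and method names are hypothetical and this is not XWiki's actual code; it relies only on the JDK's URLEncoder and URLDecoder, which implement form-style escaping but show the same byte-level effect.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

public class UrlRoundTripDemo {
    // Hypothetical helper: escape a name with one charset, then parse it back with another.
    public static String roundTrip(String name, String composeCharset, String parseCharset) {
        try {
            String escaped = URLEncoder.encode(name, composeCharset);
            return URLDecoder.decode(escaped, parseCharset);
        } catch (UnsupportedEncodingException e) {
            // Both charsets used below are guaranteed to exist on every JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String name = "\u00e9"; // the document name "é"
        // Same charset on both sides: the name survives.
        System.out.println(name.equals(roundTrip(name, "UTF-8", "UTF-8")));           // true
        System.out.println(name.equals(roundTrip(name, "ISO-8859-1", "ISO-8859-1"))); // true
        // Composed as ISO-8859-1 (%E9) but parsed as UTF-8: the name is lost.
        System.out.println(name.equals(roundTrip(name, "ISO-8859-1", "UTF-8")));      // false
    }
}
```

The last case is the one that makes a non-UTF-8 wiki encoding unusable: the lone byte 0xE9 is not a valid UTF-8 sequence, so the decoder substitutes a replacement character.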
Advantages:
+ better compliance with web standards, since UTF-8 is the recommended encoding for URLs (although not imposed)
+ support for a wider range of document names, since UTF-8 allows full-unicode document names, while ISO-8859-1 limits names to latin1 characters
+ better support from browsers, since entering accented characters directly in the address bar encodes the URL sent to the server using UTF-8, and decoding the URL also assumes UTF-8; this means that a document named "é" will be printed as .../view/Main/%E9 and will have to be entered the same way in the address bar when ISO-8859-1 is used, and as .../view/Main/é when UTF-8 is used

Drawbacks:
- by default Tomcat uses ISO-8859-1 as the encoding for URLs, so the Tomcat configuration will have to be changed as in http://platform.xwiki.org/xwiki/bin/view/AdminGuide/Encoding#HTomcat
- some existing bookmarks will not work anymore once the encoding is changed

+1 for option 2 from me,
--
Sergiu Dumitriu
http://purl.org/net/sergiu/
On Dec 26, 2009, at 2:17 PM, Sergiu Dumitriu wrote:
> Advantages:
> + better compliance with web standards, since UTF-8 is the recommended encoding for URLs (although not imposed)

Is there any reference to this? Some RFC that we could quote in the code?

> + support for a wider range of document names, since UTF-8 allows full-unicode document names, while ISO-8859-1 limits names to latin1 characters
> + better support from browsers, since entering accented characters directly in the address bar encodes the URL sent to the server using UTF-8,

Is that true for all browsers? Is there a standard?

> and decoding the URL also assumes UTF-8; this means that a document named "é" will be printed as .../view/Main/%E9 and will have to be entered the same way in the address bar when ISO-8859-1 is used, and as .../view/Main/é when UTF-8 is used
>
> Drawbacks:
> - by default Tomcat uses ISO-8859-1 as the encoding for URLs, so the Tomcat configuration will have to be changed as in http://platform.xwiki.org/xwiki/bin/view/AdminGuide/Encoding#HTomcat

Any known issue logged against Tomcat? Any planned fix? If this is really a web standard why wouldn't Tomcat change this behavior?

> - some existing bookmarks will not work anymore once the encoding is changed
>
> +1 for option 2 from me,
Before voting I'd like to see how "standard" this is. If it's really standard and we have "official" standard docs to back this then I agree that we should choose 2.

Thanks
-Vincent
On 12/26/2009 02:29 PM, Vincent Massol wrote:
On Dec 26, 2009, at 2:17 PM, Sergiu Dumitriu wrote:
>> Advantages:
>> + better compliance with web standards, since UTF-8 is the recommended encoding for URLs (although not imposed)
>
> Is there any reference to this? Some RFC that we could quote in the code?
The URL RFC predates the wide adoption of UTF-8, so it does not mention any encoding (see http://tools.ietf.org/html/rfc1738#section-2.2 ). This is why I said that there's no enforcement. However, the URI RFC, which is a generalization of URLs, enforces UTF-8 (see http://tools.ietf.org/html/rfc3986#section-2.5 ). The URI RFC officially *updates* the URL RFC, so we can say that URLs are currently standardized by the new, UTF-8-enforcing RFC 3986, although for backwards compatibility URLs can still be used with the RFC 1738 definition.
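To make the RFC 3986 section 2.5 rule concrete (encode the text as UTF-8 octets, then percent-encode every octet that is not an unreserved character), here is a minimal Java sketch. The class and method names are made up for illustration and this is not XWiki code; it is deliberately conservative and also escapes characters such as "!" or "," that RFC 3986 would allow unescaped in a path segment.

```java
import java.nio.charset.StandardCharsets;

public class Rfc3986PathEncoder {
    // Unreserved characters from RFC 3986 section 2.3: ALPHA / DIGIT / "-" / "." / "_" / "~"
    private static final String UNRESERVED =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~";

    // Hypothetical helper: percent-encode one path segment using UTF-8 octets.
    public static String encodeSegment(String segment) {
        StringBuilder sb = new StringBuilder();
        for (byte b : segment.getBytes(StandardCharsets.UTF_8)) {
            int octet = b & 0xFF; // treat the byte as unsigned
            if (UNRESERVED.indexOf(octet) >= 0) {
                sb.append((char) octet);
            } else {
                sb.append('%').append(String.format("%02X", octet));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(encodeSegment("Main"));   // Main
        System.out.println(encodeSegment("\u00e9")); // %C3%A9
        System.out.println(encodeSegment("a b"));    // a%20b
    }
}
```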
>> + support for a wider range of document names, since UTF-8 allows full-unicode document names, while ISO-8859-1 limits names to latin1 characters
>> + better support from browsers, since entering accented characters directly in the address bar encodes the URL sent to the server using UTF-8,
>
> Is that true for all browsers? Is there a standard?
Firefox, Opera and Chrome all do. Konqueror even displays %E9 as an invalid UTF-8 character <?>, so it assumes even more strongly that the URL is in UTF-8. IE6 does not automatically convert %XY escapes to their equivalent character, so it displays both %E9 and %C3%A9 (the UTF-8 encoding of é) unconverted. However, entering é in the address bar converts it to UTF-8 bytes. Also note that IE6 predates RFC 3986 by several years, so it has the right not to assume UTF-8 in URLs.
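The Konqueror behavior can be approximated with a strict UTF-8 check on the unescaped bytes. The following Java sketch is only an assumption about how such a check might look (the class and method names are hypothetical, and no browser's real implementation is being quoted): %C3%A9 decodes cleanly, while %E9 on its own is rejected as malformed.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8EscapeCheck {
    // Turn %XY escapes into raw bytes; every other character is assumed to be ASCII.
    // A non-hex escape such as "%zz" makes Integer.parseInt throw NumberFormatException.
    public static byte[] unescape(String escaped) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < escaped.length(); i++) {
            char c = escaped.charAt(i);
            if (c == '%' && i + 2 < escaped.length()) {
                out.write(Integer.parseInt(escaped.substring(i + 1, i + 3), 16));
                i += 2; // skip the two hex digits just consumed
            } else {
                out.write(c);
            }
        }
        return out.toByteArray();
    }

    // True only if the unescaped bytes form a well-formed UTF-8 sequence.
    public static boolean isValidUtf8(String escaped) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(unescape(escaped)));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidUtf8("%C3%A9")); // true
        System.out.println(isValidUtf8("%E9"));    // false
    }
}
```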
>> and decoding the URL also assumes UTF-8; this means that a document named "é" will be printed as .../view/Main/%E9 and will have to be entered the same way in the address bar when ISO-8859-1 is used, and as .../view/Main/é when UTF-8 is used
>>
>> Drawbacks:
>> - by default Tomcat uses ISO-8859-1 as the encoding for URLs, so the Tomcat configuration will have to be changed as in http://platform.xwiki.org/xwiki/bin/view/AdminGuide/Encoding#HTomcat
>
> Any known issue logged against Tomcat? Any planned fix? If this is really a web standard why wouldn't Tomcat change this behavior?
According to http://wiki.apache.org/tomcat/Tomcat/UTF-8 , Tomcat developers interpret RFC 2616 (the HTTP 1.1 specification) as recommending the ISO-8859-1 charset as the default. However, this is not true. The RFC references the URI RFC as the standard for the requested address, and ISO-8859-1 is the default for the response *body* only. The original URI RFC referenced by the HTTP RFC, http://www.ietf.org/rfc/rfc2396.txt , does not recommend a default charset+encoding, although it only mentions UTF-8 as a possible encoding, and doesn't mention ISO-8859-1 at all. So I'd rather say that the HTTP 1.1 specification does not recommend any default encoding for addresses, rather than ISO-8859-1, and that it hints towards UTF-8. And RFC 2396 has since been obsoleted by RFC 3986, which clearly states that UTF-8 is used. Perhaps I should send this interpretation to the Tomcat guys.
>> - some existing bookmarks will not work anymore once the encoding is changed
>>
>> +1 for option 2 from me,
>
> Before voting I'd like to see how "standard" this is. If it's really standard and we have "official" standard docs to back this then I agree that we should choose 2.
IMO, RFC 3986 is the current standard and the one we should follow, and it does specify UTF-8 as the ONLY encoding, not just the default.
--
Sergiu Dumitriu
http://purl.org/net/sergiu/
On Sat, Dec 26, 2009 at 14:17, Sergiu Dumitriu wrote:
> +1 for option 2 from me,
+1 for 2
-- Thomas Mortagne
Participants (3): Sergiu Dumitriu, Thomas Mortagne, Vincent Massol