Re: [xwiki-devs] [VOTE] URLs and encoding

26 Dec 2009

On Dec 26, 2009, at 2:17 PM, Sergiu Dumitriu wrote:
...
  Hi devs,
 The short version:
 Should we always use UTF-8 for encoding and decoding URLs, regardless of
 the wiki encoding, for better compliance with web standards?
 The long version:
 By definition, URLs can only contain ASCII characters; everything else
 must be converted to its corresponding bytes and written as %XY
 escapes. The problem is that "corresponding bytes" implies a
 charset + encoding, and no specification *enforces* a specific pair,
 although it is *recommended* to use Unicode + UTF-8, to comply with the
 modern tendency of the web in general.
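
To make the "corresponding bytes" point concrete, here is a minimal Java sketch. The class and method names are hypothetical and not taken from XWiki. It relies on java.net.URLEncoder (the Charset overload needs Java 10 or later), which implements form encoding rather than strict URL path escaping, so treat it only as an illustration of how the charset changes the escapes.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only; not part of XWiki.
class UrlCharsetDemo {

    // Escape the name using its ISO-8859-1 bytes.
    static String encodeLatin1(String name) {
        return URLEncoder.encode(name, StandardCharsets.ISO_8859_1);
    }

    // Escape the name using its UTF-8 bytes.
    static String encodeUtf8(String name) {
        return URLEncoder.encode(name, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String name = "\u00e9"; // the single character "é"
        System.out.println(encodeLatin1(name)); // %E9     (one byte)
        System.out.println(encodeUtf8(name));   // %C3%A9  (two bytes)
    }
}
```

The same document name therefore yields two different URLs depending on the charset, which is why composing and parsing have to agree on one.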
 Traditionally, XWiki has been using the same encoding as the configured
 global wiki encoding for the URLs, which means that before 1.9, when we
 switched to UTF-8 as the default wiki encoding, all URLs were using the
 ISO-8859-1 encoding. Since the switch to UTF-8, URLs are also using the
 UTF-8 encoding by default, although the wiki encoding can be changed.
 Now, since 2.1, a bugfix accidentally changed the behavior, so that
 parsing back URLs always uses the UTF-8 encoding, even though composing
 URLs continues to use the wiki encoding. This is a bug, which prevents
 changing the encoding to anything other than UTF-8, and it should be
 fixed.
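
The mismatch described above can be sketched as follows. This is an illustration under assumptions, not XWiki's actual URL code: the class and method names are hypothetical, and java.net.URLEncoder / URLDecoder (Charset overloads, Java 10 or later) stand in for whatever the wiki uses to compose and parse URLs.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only; not part of XWiki.
class UrlRoundTripDemo {

    // Compose an escaped name with one charset, then parse it back with another.
    static String roundTrip(String name, Charset composeWith, Charset parseWith) {
        String escaped = URLEncoder.encode(name, composeWith);
        return URLDecoder.decode(escaped, parseWith);
    }

    // Wiki encoding ISO-8859-1 for composing, UTF-8 forced for parsing.
    static boolean survivesLatin1ToUtf8(String name) {
        return name.equals(roundTrip(name, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8));
    }

    // Same charset on both sides.
    static boolean survivesUtf8ToUtf8(String name) {
        return name.equals(roundTrip(name, StandardCharsets.UTF_8, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String name = "\u00e9"; // the single character "é"
        System.out.println(survivesLatin1ToUtf8(name)); // false: the name is not recovered
        System.out.println(survivesUtf8ToUtf8(name));   // true
    }
}
```

Pure ASCII names survive either way, so the problem only shows up for documents whose names contain non-ASCII characters.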
 Now, we have two options:
 1. Reintroduce the old behavior, so that URLs always use the wiki
 encoding. This is a direct bugfix.
 2. Also change the encoding part, so that UTF-8 is always used. This is
 an improvement, going towards better compliance with web standards.
 Personally I think that the second option is the better one, but it
 requires a vote, since it has a few drawbacks.
 Advantages:
 + better compliance with web standards, since UTF-8 is the recommended
 encoding for URLs (although not imposed) 
Is there any reference to this? Some RFC that we could quote in the
code?
...
  + support for a wider range of document names, since UTF-8 allows
 full-Unicode document names, while ISO-8859-1 limits names to Latin-1
 characters
 + better support from browsers, since entering accented characters
 directly in the address bar encodes the URL sent to the server using
 UTF-8, 
Is that true for all browsers? Is there a standard?
...
  and decoding the URL also assumes UTF-8; this means that a
 document named "é" will be printed as .../view/Main/%E9 and will have to
 be entered the same way in the address bar when ISO-8859-1 is used, and
 as .../view/Main/é when UTF-8 is used
 Drawbacks:
 - by default Tomcat uses ISO-8859-1 as the encoding for URLs, so the
 Tomcat configuration will have to be changed as in
 http://platform.xwiki.org/xwiki/bin/view/AdminGuide/Encoding#HTomcat 
Any known issue logged against Tomcat? Any planned fix? If this is
really a web standard why wouldn't Tomcat change this behavior?
...
  - some existing bookmarks will not work anymore once the encoding is
 changed
 +1 for option 2 from me, 
Before voting I'd like to see how "standard" this is. If it's really
standard and we have "official" standard docs to back this then I
agree that we should choose 2.
Thanks
-Vincent
