Re: [xwiki-dev] Trying to understand I8N...

6 Apr 2007

Hi Gilles,
On Apr 6, 2007, at 7:18 PM, Gilles Serasset wrote:
...
  Hi Vincent, Hi all,
 On 6 avr. 07, at 11:39, Vincent Massol wrote:
  Hi,
 I admit it: I'm not an expert in I8N. However I realize that XWiki
 being a wiki we need to have strong I8N features so I'm trying to
 catch up with I8N knowledge... 
 well, first it is I18N, 10 more difficulties... ;-) 
:)
[snip]
...
   2) I see in our encoding guide on
 http://www.xwiki.org/xwiki/bin/view/AdminGuide/Encoding that we need
 to set the encoding for the container. Why is that required? The
 servlet container reads pages which have the encoding specified
 (using Content-Type metadata), so why does it need to be told about
 the encoding to use? 
 I also saw that... I assume you mean the -Dfile.encoding=XXX. 
Well I meant this but for Tomcat, not for our code, for which it's
absolutely necessary as we're using the platform encoding in most
places and this defines the platform encoding.
...
  I did not do any extended testing to see if this parameter was
 useful. I can only say that there are many places in the code where
 you have:
 InputStreamReader ir = new InputStreamReader(is)
 This is one of the numerous examples of badly written code where the
 encoding is not specified... Hence the platform falls back to the
 file.encoding property value. 
yep. I'm not sure it's bad. Or at least I'm curious to understand why
it's bad.
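
To make the concern concrete, here is a minimal sketch (not XWiki code; the class and method names are invented for illustration). It decodes the same two bytes once with the charset they were written in and once with a different one, which is what happens when the platform default does not match the file:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DecodeDemo {
    /** Reads the whole stream as text using an explicitly named charset. */
    static String readAll(InputStream is, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(is, charset)) {
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // e-acute (U+00E9) written as UTF-8 is the two bytes 0xC3 0xA9.
        byte[] utf8Bytes = {(byte) 0xC3, (byte) 0xA9};

        String right = readAll(new ByteArrayInputStream(utf8Bytes), StandardCharsets.UTF_8);
        String wrong = readAll(new ByteArrayInputStream(utf8Bytes), StandardCharsets.ISO_8859_1);

        System.out.println(right.length()); // 1: one e-acute
        System.out.println(wrong.length()); // 2: two unrelated Latin-1 characters
    }
}
```

With `new InputStreamReader(is)` the second argument is effectively whatever the JVM was started with, so the same file can come out right on one machine and wrong on another.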
...
  However, there are currently no accented chars in the skin files
 (that are read from disk), hence it somehow works, because the usual
 platform encodings all encode ASCII chars the same way. It would
 not be the same if you were to use UTF-16, for instance... 
IMO the encoding to use should be left to the user and be a
configuration option (as it is now) but we should configure
everything to use UTF-8 by default.
...

  3) I see that in our standalone installation we use
 -Dfile.encoding=iso-8859-1. Now that I've read Joel's tutorial it
 seems to me this is not going to work for everyone and that we
 should rather use -Dfile.encoding=UTF-8 by default. WDYT? 
 This will mean that all files that are read by the server will
 have to be encoded in UTF-8... 
Or any compatible encoding like ISO 8859-1, etc. This is the case now
I think.
...

 (NOTE: resource files (such as the ones used for xwiki I18N) are
 special, as they should be encoded in ASCII with \uXXXX escapes to
 represent non-ASCII chars.)
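
As an illustration of that note, here is a small sketch using the JDK's `java.util.Properties`, whose `load(InputStream)` expands \uXXXX escapes. The key name and the helper class are made up for the example; this is not how XWiki itself loads its resources:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

class EscapedResourceDemo {
    /** Loads properties from pure-ASCII content, as a resource file on disk would be. */
    static Properties load(String asciiContent) {
        Properties props = new Properties();
        byte[] bytes = asciiContent.getBytes(StandardCharsets.US_ASCII);
        try {
            props.load(new ByteArrayInputStream(bytes));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        // The doubled backslash keeps the escape as six literal ASCII characters
        // in the file content; Properties.load() turns it into one e-acute.
        Properties props = load("welcome=caf\\u00e9\n");
        System.out.println(props.getProperty("welcome").length()); // 4
    }
}
```

Because the file content is plain ASCII, it reads the same whatever the platform encoding happens to be.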
  4) Should we use the platform encoding or default to using UTF-8
 all the time? (this question is related to 1)). I think we should
 use the platform encoding but I'm curious to know what others think. 
 We should NOT use the platform encoding. The reason is that all
 files read by the server (skin files mainly) will be read using the
 platform encoding and not their actual encoding. As they only
 contain ASCII chars up to now, it has worked, but if you add accents
 to them and write them in encoding X (at edit time), you are not
 guaranteed that the platform encoding will be X at run time.
 Hence you should specify the file encoding whenever you read a file. 
Exactly, which is why this is best left to the user to decide which
encoding they need to use... I don't think we should force our
encoding. However I'm proposing that we do:
System.setProperty("file.encoding", getParam("xwiki.encoding"))
in XWiki initialization to set the platform encoding to be the
encoding specified in xwiki.cfg.
...

  5) Jackson Wang is proposing in a patch to modify readPackage like
 this:
      private Document readPackage(InputStream is) throws IOException, DocumentException
      {
 -        byte[] data = new byte[4096];
 +        //UTF-8 characters could cause encoding as continued bytes over 4096 boundary,
 +        // so change byte to char.  ---Jackson
 +        char[] data = new char[4096];
 +        BufferedReader in= new BufferedReader(new InputStreamReader(is));
          StringBuffer XmlFile = new StringBuffer();
          int Cnt;
 -        while ((Cnt = is.read(data, 0, 4096)) != -1) {
 +        while ((Cnt = in.read(data, 0, 4096)) != -1) {
              XmlFile.append(new String(data, 0, Cnt));
 -        }
 +       }
          return fromXml(XmlFile.toString());
      }
 However with my new understanding I'm not sure this would help, as
 chars are stored on 2 bytes in Java and UTF-8 can use up to 4 bytes
 per character. Am I correct? 
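
The boundary problem that the patch comment describes can be reproduced in a few lines. This is only a sketch with invented names; unlike the code in the patch it names UTF-8 explicitly on both sides, so that the only difference left is whether each byte chunk is decoded on its own or a Reader does the decoding:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class ChunkBoundaryDemo {
    /** Decodes every fixed-size byte chunk independently of its neighbours. */
    static String decodePerChunk(byte[] input, int chunkSize) {
        StringBuilder sb = new StringBuilder();
        for (int off = 0; off < input.length; off += chunkSize) {
            int len = Math.min(chunkSize, input.length - off);
            sb.append(new String(input, off, len, StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    /** Lets a Reader decode; it carries incomplete byte sequences over between reads. */
    static String decodeWithReader(byte[] input, int chunkSize) {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[chunkSize];
        try (Reader in = new InputStreamReader(
                new ByteArrayInputStream(input), StandardCharsets.UTF_8)) {
            int n;
            while ((n = in.read(buf, 0, buf.length)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 'a' followed by e-acute in UTF-8: 0x61 0xC3 0xA9.
        // With 2-byte chunks the e-acute is cut in half.
        byte[] bytes = {(byte) 0x61, (byte) 0xC3, (byte) 0xA9};
        System.out.println(decodePerChunk(bytes, 2).equals("a\u00e9"));   // false
        System.out.println(decodeWithReader(bytes, 2).equals("a\u00e9")); // true
    }
}
```

So switching from a byte[] buffer to a Reader does remove the split-sequence problem; the size of a Java char is not what matters here, since the Reader decodes bytes into chars before they ever reach the buffer.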
 Well this patch is problematic as the new InputStreamReader(is)
 does not specify the encoding. The problem is where does the
 InputStream come from...
 Here, we can even avoid the question, as the InputStream contains
 an xml file that declares its encoding, so xwiki SHOULD NOT build a
 String from this stream, but rather pass the stream directly to the
 xml parser that will do its best to determine the effective
 encoding. This is what I did with my last patch to the packaging
 plugin.
 Here is a substitute patch:
 Index: core/src/main/java/com/xpn/xwiki/plugin/packaging/Package.java
 ===================================================================
 --- core/src/main/java/com/xpn/xwiki/plugin/packaging/Package.java      (revision 2581)
 +++ core/src/main/java/com/xpn/xwiki/plugin/packaging/Package.java      (working copy)
 @@ -673,13 +673,7 @@
      private Document readPackage(InputStream is) throws IOException, DocumentException
      {
      {
 -        byte[] data = new byte[4096];
 -        StringBuffer XmlFile = new StringBuffer();
 -        int Cnt;
 -        while ((Cnt = is.read(data, 0, 4096)) != -1) {
 -            XmlFile.append(new String(data, 0, Cnt));
 -        }
 -        return fromXml(XmlFile.toString());
 +        return fromXml(is);
      }
      public String toXml(XWikiContext context)
 @@ -835,13 +829,12 @@
          }
      }
 -    protected Document fromXml(String xml) throws DocumentException
 +    protected Document fromXml(InputStream xml) throws DocumentException
      {
          SAXReader reader = new SAXReader();
          Document domdoc;
 -        StringReader in = new StringReader(xml);
 -        domdoc = reader.read(in);
 +        domdoc = reader.read(xml);
          Element docEl = domdoc.getRootElement();
          Element infosEl = docEl.element("infos");
 It compiles, but I did not have time to test it extensively... But
 it passes the current packaging tests. 
I agree that preventing any conversion is the best way to go.
Thanks for your patch. I'll replace my change of today (which was
using IOUtils) with your patch as it's better (faster, safer,
simpler) :-)
... applied!
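
For readers who want to see the "hand the raw stream to the parser" idea in isolation, here is a sketch. It swaps in the JDK's built-in DOM parser for dom4j's SAXReader so that it runs without extra jars, and the class and method names are invented. The document is encoded in ISO-8859-1 and says so in its XML declaration; the parser picks the encoding up by itself:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

class XmlStreamDemo {
    /** Parses the raw bytes; no Reader and no String is built in between. */
    static String rootText(InputStream xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(xml);
            return doc.getDocumentElement().getTextContent();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><name>caf\u00e9</name>";
        // Encode the document exactly as its declaration claims.
        byte[] latin1 = xml.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(rootText(new ByteArrayInputStream(latin1)).equals("caf\u00e9")); // true
    }
}
```

The same call works unchanged for a UTF-8 document that declares UTF-8, which is the point: the code never has to guess.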
...

  However, I would rather use
 http://jakarta.apache.org/commons/io/api-release/org/apache/commons/io/IOUtils.html#toString(java.io.InputStream)
 than code it ourselves... Sounds safer, shorter, less maintenance,
 etc to me... :) 
 This method has exactly the same problem: it'll use the platform
 encoding, even if the input stream is not encoded in the platform
 encoding and even if it correctly declares its own encoding...
 Hence it will be buggy. 
Sure but that's ok if the encoding is specified (file.encoding),
right? That said I agree that no conversion is better.
...

 NOW A SMALL WORD ABOUT XWIKI ENCODING:
 Let's pose the problem like this:
 xwiki serves web pages (that are served using encoding Xw)
       gets POST and GET parameters (that are encoded using encoding
 Xp)
       is asked for page named Space/DocumentName (that is encoded
 using encoding Xp)
       reads skin files (that are encoded using encoding Xf)
       reads data from DB (which is encoded using Xd)
 Well maybe there is more, but let's simplify...
 1. We can rather safely assume that Xw and Xp are the same, but we
 may have older clients that disagree on this. We should maybe work
 with the "accept-charset" attribute of forms to avoid this kind of
 problem.
 2. As xwiki provides the skin files that are read, xwiki can choose
 whatever encoding it wants as Xf. Xf is, then, the encoding of the
 XWIKI code base (see previous discussion).
 3. If the database and hibernate are correctly configured with Xd,
 xwiki gets Strings that are correctly decoded from the DB by the JDBC
 driver, hence Xd is not a problem for xwiki.
 To simplify things, we usually try to have Xd = Xf = Xw = Xp. Hence
 if you want to serve multilingual content (multi being > 3), you
 are quickly obliged to use UTF-8 as Xw, and any encoding at least as
 powerful as Xd.
 For the rest, we should hunt down all Reader creations that do
 not specify the encoding, as well as Writer creations and calls to
 getBytes... 
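
On the writing side, the fix looks the same as for Readers. A minimal sketch, with invented names and UTF-8 picked only as an example charset:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ExplicitCharsetDemo {
    /** Encodes text through a Writer whose charset is named explicitly. */
    static byte[] encodeWithWriter(String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The writer is closed (and therefore flushed) at this point.
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String text = "na\u00efve"; // five chars, one of them non-ASCII

        byte[] viaWriter = encodeWithWriter(text, StandardCharsets.UTF_8);
        byte[] viaGetBytes = text.getBytes(StandardCharsets.UTF_8);

        System.out.println(viaWriter.length);                     // 6
        System.out.println(Arrays.equals(viaWriter, viaGetBytes)); // true
        System.out.println(new String(viaWriter, StandardCharsets.UTF_8).equals(text)); // true
    }
}
```

The one-argument `new OutputStreamWriter(out)` and the no-argument `text.getBytes()` are the calls to look for: they compile fine and silently use the platform default.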
Don't you think it should be ok if we set the file.encoding to be the
xwiki.encoding value and we set it to be UTF-8 by default?
Thanks
-Vincent
