Trying to understand I8N...
Hi,
I admit it: I'm not an expert in I8N. However I realize that XWiki being a wiki we need to have strong I8N features so I'm trying to catch up with I8N knowledge...
I started yesterday by reading this excellent short tutorial http://www.joelonsoftware.com/articles/Unicode.html (The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)). It's very good and an easy read. I recommend it to everyone.
This led me to a few questions:
1) Is UTF8 supported on all platforms? Is it supported on mobile platforms for example?
2) I see in our encoding guide on http://www.xwiki.org/xwiki/bin/view/AdminGuide/Encoding that we need to set the encoding for the container. Why is that required? The servlet container reads pages which have the encoding specified (using Content-Type meta data), so why does it need to be told about the encoding to use?
3) I see that in our standalone installation we use -Dfile.encoding=iso-8859-1. Now that I've read Joel's tutorial it seems to me this is not going to work for everyone and that we should rather use -Dfile.encoding=UTF-8 by default. WDYT?
4) Should we use the platform encoding or default to using UTF-8 all the time? (this question is related to 1)). I think we should use the platform encoding but I'm curious to know what others think.
5) Jackson Wang is proposing in a patch to modify readPackage like this:
 private Document readPackage(InputStream is) throws IOException, DocumentException
 {
-    byte[] data = new byte[4096];
+    //UTF-8 characters could cause encoding as continued bytes over 4096 boundary,
+    // so change byte to char. ---Jackson
+    char[] data = new char[4096];
+    BufferedReader in= new BufferedReader(new InputStreamReader(is));
     StringBuffer XmlFile = new StringBuffer();
     int Cnt;
-    while ((Cnt = is.read(data, 0, 4096)) != -1) {
+    while ((Cnt = in.read(data, 0, 4096)) != -1) {
         XmlFile.append(new String(data, 0, Cnt));
-    }
+    }
     return fromXml(XmlFile.toString());
 }
However with my new understanding I'm not sure this would help as char are stored on 2 bytes in Java and UTF-8 encoding can store on up to 4 bytes. Am I correct?
However, I would rather use http://jakarta.apache.org/commons/io/api-release/org/apache/commons/io/IOUtils.html#toString(java.io.InputStream) than code it ourselves... Sounds safer, shorter, less maintenance, etc to me... :)
Thanks for your help
-Vincent
Hi Vincent, On Apr 06, Vincent Massol wrote :
1) Is UTF8 supported on all platforms? Is it supported on mobile platforms for example?
I've had a quick look at mobile platforms. There is no simple answer. In the Java world, J2ME supports Unicode and UTF-8. But if Unicode-aware fonts are not present on the device, there is not much you can do. Still, I believe most modern PDAs today have some form of UTF-8 support. As for mobile phones, some of them have UTF-8 support and some do not. I have not found any comprehensive list. The Nokia 770 on which I'm doing my mobile xwiki experiments does support UTF-8.
5) Jackson Wang is proposing in a patch to modify readPackage like this:
 private Document readPackage(InputStream is) throws IOException, DocumentException
 {
-    byte[] data = new byte[4096];
+    //UTF-8 characters could cause encoding as continued bytes over 4096 boundary,
+    // so change byte to char. ---Jackson
+    char[] data = new char[4096];
+    BufferedReader in= new BufferedReader(new InputStreamReader(is));
     StringBuffer XmlFile = new StringBuffer();
     int Cnt;
-    while ((Cnt = is.read(data, 0, 4096)) != -1) {
+    while ((Cnt = in.read(data, 0, 4096)) != -1) {
         XmlFile.append(new String(data, 0, Cnt));
-    }
+    }
     return fromXml(XmlFile.toString());
 }
However with my new understanding I'm not sure this would help as char are stored on 2 bytes in Java and UTF-8 encoding can store on up to 4 bytes. Am I correct?
Yes, I think you are. I do not believe this is reliable: for one, we should use the constructor String(data, 0, Cnt, encoding); then there is the problem Jackson outlined: the data buffer may cut off the end of the last Unicode character. Using a StringWriter instead of building intermediate Strings would make things easier.
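To illustrate the chunk-boundary problem described above, here is a minimal Java sketch. It is not XWiki code: the class and method names are made up for the example, and it assumes Java 7 or later for StandardCharsets. It decodes the same UTF-8 bytes twice with a deliberately tiny buffer, once by building a String from each byte chunk and once through an InputStreamReader with an explicit charset and a StringWriter.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class ChunkBoundaryDemo {

    // Decodes each chunk of bytes on its own, like the original loop
    // (but with an explicit charset instead of the platform default).
    static String decodePerChunk(InputStream is, int chunkSize) {
        try {
            byte[] data = new byte[chunkSize];
            StringBuilder out = new StringBuilder();
            int count;
            while ((count = is.read(data, 0, chunkSize)) != -1) {
                out.append(new String(data, 0, count, StandardCharsets.UTF_8));
            }
            return out.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Lets a Reader do the decoding; the Reader holds back an incomplete
    // byte sequence until the remaining bytes have been read.
    static String decodeWithReader(InputStream is, int chunkSize) {
        try {
            Reader in = new InputStreamReader(is, StandardCharsets.UTF_8);
            StringWriter out = new StringWriter();
            char[] data = new char[chunkSize];
            int count;
            while ((count = in.read(data, 0, chunkSize)) != -1) {
                out.write(data, 0, count);
            }
            return out.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute accent) takes two bytes in UTF-8.
        String text = "a\u00e9b";
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        System.out.println("per chunk ok:   " + text.equals(decodePerChunk(new ByteArrayInputStream(bytes), 2)));
        System.out.println("with reader ok: " + text.equals(decodeWithReader(new ByteArrayInputStream(bytes), 2)));
    }
}
```

With a 2-byte buffer the accented character is split between two chunks, so the per-chunk version should not round-trip the text, while the Reader version should.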
However, I would rather use http://jakarta.apache.org/commons/io/api-release/org/apache/commons/io/IOUtils.html#toString(java.io.InputStream) than code it ourselves... Sounds safer, shorter, less maintenance, etc to me... :)
I agree. Pablo
On 4/6/07, Vincent Massol <[email protected]> wrote:
Hi,
I admit it: I'm not an expert in I8N. However I realize that XWiki being a wiki we need to have strong I8N features so I'm trying to catch up with I8N knowledge...
I started yesterday by reading this excellent short tutorial http://www.joelonsoftware.com/articles/Unicode.html (The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)). It's very good and an easy read. I recommend it to everyone.
It's a bit too short; just when things get interesting, it ends.
This led me to a few questions:
1) Is UTF8 supported on all platforms? Is it supported on mobile platforms for example?
Platforms old enough not to know UTF-8 are unlikely to be able to run XWiki at all.
2) I see in our encoding guide on http://www.xwiki.org/xwiki/bin/view/AdminGuide/Encoding that we need to set the encoding for the container. Why is that required? The servlet container reads pages which have the encoding specified (using Content-Type meta data), so why does it need to be told about the encoding to use?
If you mean the parameter in web.xml, then it's not the container encoding, but a parameter used to correctly identify outgoing files (it sets the Content-Type header according to this param). There are more files which don't have a Content-Type. First there are the files stored on disk. Second, when you POST some data or GET a resource, you don't have a content type. Requests don't have this HTTP header.
3) I see that in our standalone installation we use -Dfile.encoding=iso-8859-1. Now that I've read Joel's tutorial it seems to me this is not going to work for everyone and that we should rather use -Dfile.encoding=UTF-8 by default. WDYT?
UTF-8 is better. But we really should not depend on the file encoding.
4) Should we use the platform encoding or default to using UTF-8 all the time? (this question is related to 1)). I think we should use the platform encoding but I'm curious to know what others think.
UTF-8 all the time. Thus we have no dependency on the system, and we don't need guides on "how to change the encoding in only 7 places to make my wiki know Bulgarian".
5) Jackson Wang is proposing in a patch to modify readPackage like this:
 private Document readPackage(InputStream is) throws IOException, DocumentException
 {
-    byte[] data = new byte[4096];
+    //UTF-8 characters could cause encoding as continued bytes over 4096 boundary,
+    // so change byte to char. ---Jackson
+    char[] data = new char[4096];
+    BufferedReader in= new BufferedReader(new InputStreamReader(is));
     StringBuffer XmlFile = new StringBuffer();
     int Cnt;
-    while ((Cnt = is.read(data, 0, 4096)) != -1) {
+    while ((Cnt = in.read(data, 0, 4096)) != -1) {
         XmlFile.append(new String(data, 0, Cnt));
-    }
+    }
     return fromXml(XmlFile.toString());
 }
However with my new understanding I'm not sure this would help as char are stored on 2 bytes in Java and UTF-8 encoding can store on up to 4 bytes. Am I correct?
However, I would rather use http://jakarta.apache.org/commons/io/api-release/org/apache/commons/io/IOUtils.html#toString(java.io.InputStream) than code it ourselves... Sounds safer, shorter, less maintenance, etc to me... :)
+1. Always reuse existing proven code rather than reinvent a squeaky wheel.
Thanks for your help
-Vincent
Sergiu -- http://purl.org/net/sergiu
Hi Vincent, Hi all, On 6 avr. 07, at 11:39, Vincent Massol wrote:
Hi,
I admit it: I'm not an expert in I8N. However I realize that XWiki being a wiki we need to have strong I8N features so I'm trying to catch up with I8N knowledge...
well, first it is I18N, 10 more difficulties... ;-)
I started yesterday by reading this excellent short tutorial http://www.joelonsoftware.com/articles/Unicode.html (The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)). It's very good and an easy read. I recommend it to everyone.
Good historical overview, however it just shows several problems, but things are not so complicated...
This led me to a few questions:
1) Is UTF8 supported on all platforms? Is it supported on mobile platforms for example?
I don't know... (well, I hope so...). A good answer could be: it will...
2) I see in our encoding guide on http://www.xwiki.org/xwiki/bin/view/AdminGuide/Encoding that we need to set the encoding for the container. Why is that required? The servlet container reads pages which have the encoding specified (using Content-Type meta data), so why does it need to be told about the encoding to use?
I also saw that... I assume you mean the -Dfile.encoding=XXX. I did not do any extensive tests to see if this parameter was useful. I can only say that there are many places in the code where you have:
InputStreamReader ir = new InputStreamReader(is)
This is one of the numerous examples of badly written code where the encoding is not specified... Hence the platform falls back to the file.encoding property value.
However, there are currently no accented chars in the skin files (that are read from disk), hence it somehow works, because all platform encodings share the encoding of ASCII chars. It would not be the same if you were to use UTF-16 for instance...
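As a small illustration of the point about unspecified encodings, the sketch below passes the charset to the InputStreamReader explicitly instead of relying on file.encoding. The class name, helper method and sample text are invented for the example, and StandardCharsets assumes Java 7 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ExplicitCharsetDemo {

    // Reads the whole stream using the charset given by the caller,
    // never the platform default.
    static String readAll(InputStream is, Charset charset) {
        try (Reader in = new InputStreamReader(is, charset)) {
            StringBuilder out = new StringBuilder();
            char[] buf = new char[1024];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.append(buf, 0, n);
            }
            return out.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // A file written in ISO-8859-1 that contains one accented character.
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println("read as ISO-8859-1 ok: "
            + "caf\u00e9".equals(readAll(new ByteArrayInputStream(latin1), StandardCharsets.ISO_8859_1)));
        System.out.println("read as UTF-8 ok:      "
            + "caf\u00e9".equals(readAll(new ByteArrayInputStream(latin1), StandardCharsets.UTF_8)));
    }
}
```

The same bytes only decode correctly when the reader is told the encoding they were written in; pure ASCII content would hide the difference.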
3) I see that in our standalone installation we use -Dfile.encoding=iso-8859-1. Now that I've read Joel's tutorial it seems to me this is not going to work for everyone and that we should rather use -Dfile.encoding=UTF-8 by default. WDYT?
This will mean that all files that are read by the server will have to be encoded in UTF-8... (NOTE: resource files (such as the ones used for XWiki I18N) are special, as they should be encoded in ASCII with \uXXXX escapes to represent non-ASCII chars.)
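For the note about resource files, here is a hedged sketch of how java.util.Properties turns \uXXXX escapes in an ASCII-only file back into real characters. The key "welcome" and its value are made up for the example and are not taken from XWiki's resource bundles.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

class ResourceEscapeDemo {

    // Loads an ASCII-only properties file (given here as a String)
    // and returns the decoded value for one key.
    static String loadValue(String asciiFileContent, String key) {
        Properties props = new Properties();
        try {
            byte[] bytes = asciiFileContent.getBytes(StandardCharsets.US_ASCII);
            props.load(new ByteArrayInputStream(bytes));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty(key);
    }

    public static void main(String[] args) {
        // The doubled backslash keeps the escape literal in the "file":
        // the file content is the six characters \ u 0 0 e 0.
        String fileContent = "welcome=Bienvenue \\u00e0 bord\n";
        String value = loadValue(fileContent, "welcome");
        System.out.println("decoded ok: " + "Bienvenue \u00e0 bord".equals(value));
    }
}
```

Because the file itself contains only ASCII bytes, it reads the same whatever the platform encoding is. In a build, a tool such as native2ascii can produce such escapes.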
4) Should we use the platform encoding or default to using UTF-8 all the time? (this question is related to 1)). I think we should use the platform encoding but I'm curious to know what others think.
We should NOT use the platform encoding. The reason is that all files read by the server (skin files mainly) will be read using the platform encoding and not their actual encoding. As they only contain ASCII chars up to now, it worked, but if you add accents to them and write them in encoding X (at edit time), you are not guaranteed that the platform encoding will be X at run time. Hence you should specify the file encoding whenever you read a file.
5) Jackson Wang is proposing in a patch to modify readPackage like this:
 private Document readPackage(InputStream is) throws IOException, DocumentException
 {
-    byte[] data = new byte[4096];
+    //UTF-8 characters could cause encoding as continued bytes over 4096 boundary,
+    // so change byte to char. ---Jackson
+    char[] data = new char[4096];
+    BufferedReader in= new BufferedReader(new InputStreamReader(is));
     StringBuffer XmlFile = new StringBuffer();
     int Cnt;
-    while ((Cnt = is.read(data, 0, 4096)) != -1) {
+    while ((Cnt = in.read(data, 0, 4096)) != -1) {
         XmlFile.append(new String(data, 0, Cnt));
-    }
+    }
     return fromXml(XmlFile.toString());
 }
However with my new understanding I'm not sure this would help as char are stored on 2 bytes in Java and UTF-8 encoding can store on up to 4 bytes. Am I correct?
Well, this patch is problematic as the new InputStreamReader(is) does not specify the encoding. The problem is where the InputStream comes from...
Here, we can even avoid the question, as the InputStream contains an XML file that declares its encoding, so xwiki SHOULD NOT build a String from this stream, but rather pass the stream directly to the XML parser, which will do its best to determine the effective encoding. This is what I did with my last patch to the packaging plugin.
Here is a substitute patch:
Index: core/src/main/java/com/xpn/xwiki/plugin/packaging/Package.java
===================================================================
--- core/src/main/java/com/xpn/xwiki/plugin/packaging/Package.java (revision 2581)
+++ core/src/main/java/com/xpn/xwiki/plugin/packaging/Package.java (working copy)
@@ -673,13 +673,7 @@
     private Document readPackage(InputStream is) throws IOException, DocumentException
     {
-        byte[] data = new byte[4096];
-        StringBuffer XmlFile = new StringBuffer();
-        int Cnt;
-        while ((Cnt = is.read(data, 0, 4096)) != -1) {
-            XmlFile.append(new String(data, 0, Cnt));
-        }
-        return fromXml(XmlFile.toString());
+        return fromXml(is);
     }
     public String toXml(XWikiContext context)
@@ -835,13 +829,12 @@
         }
     }
-    protected Document fromXml(String xml) throws DocumentException
+    protected Document fromXml(InputStream xml) throws DocumentException
     {
         SAXReader reader = new SAXReader();
         Document domdoc;
-        StringReader in = new StringReader(xml);
-        domdoc = reader.read(in);
+        domdoc = reader.read(xml);
         Element docEl = domdoc.getRootElement();
         Element infosEl = docEl.element("infos");
It compiles, but I did not have time to test it extensively... But it passes the current packaging tests.
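The idea of handing the raw stream to the parser can be tried in isolation. The sketch below is an assumption-laden stand-in: it uses the JDK's built-in DocumentBuilder rather than the dom4j SAXReader from the patch, and the class name and sample document are invented. It feeds ISO-8859-1 bytes to the parser without ever building a String first.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class XmlDeclaredEncodingDemo {

    // Hands the raw bytes to the parser, which picks the encoding
    // from the XML declaration at the start of the document.
    static String rootText(byte[] xmlBytes) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new ByteArrayInputStream(xmlBytes));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample document", e);
        }
    }

    public static void main(String[] args) {
        String latin1 = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><name>caf\u00e9</name>";
        String utf8 = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><name>caf\u00e9</name>";
        // Each document is serialized in the encoding it declares.
        System.out.println("ISO-8859-1 ok: "
            + "caf\u00e9".equals(rootText(latin1.getBytes(StandardCharsets.ISO_8859_1))));
        System.out.println("UTF-8 ok:      "
            + "caf\u00e9".equals(rootText(utf8.getBytes(StandardCharsets.UTF_8))));
    }
}
```

Neither call depends on the platform encoding, which is the property the substitute patch is after.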
However, I would rather use http://jakarta.apache.org/commons/io/api-release/org/apache/commons/io/IOUtils.html#toString(java.io.InputStream) than code it ourselves... Sounds safer, shorter, less maintenance, etc to me... :)
This method has exactly the same problem: it'll use the platform encoding, even if the InputStream is not encoded in the platform encoding and even if it correctly declares its own encoding... Hence it will be buggy.
NOW A SMALL WORD ABOUT XWIKI ENCODING:
Let's pose the problem like this:
xwiki
- serves web pages (that are served using encoding Xw)
- gets POST and GET parameters (that are encoded using encoding Xp)
- is asked for page named Space/DocumentName (that is encoded using encoding Xp)
- reads skin files (that are encoded using encoding Xf)
- reads data from DB (which is encoded using Xd)
Well maybe there is more, but let's simplify...
1. We can rather safely assume that Xw and Xp are the same, but we may have older clients that disagree on this. We should maybe work with "accept-charset" attributes of forms to avoid this kind of problem.
2. As xwiki provides the skin files that are read, xwiki can choose whatever encoding it wants as Xf. Xf is, then, the encoding of the XWIKI code base (see previous discussion).
3. If the database and hibernate are correctly configured with Xd, xwiki gets Strings that are correctly decoded from the DB by the JDBC driver, hence Xd is not a problem for xwiki.
To simplify things, we usually try to have Xd = Xf = Xw = Xp. Hence if you want to serve multilingual content (multi being > 3), you are quickly obliged to use UTF-8 as Xw, and any more powerful encoding as Xd.
For the rest, we should hunt all calls to Reader creations that do not specify the encoding, as well as Writer creations and calls to getBytes...
Regards,
Gilles,
--
Gilles Sérasset
GETALP-LIG BP 53 - F-38041 Grenoble Cedex 9
Phone: +33 4 76 51 43 80 Fax: +33 4 76 44 66 75
Hi Gilles, On Apr 6, 2007, at 7:18 PM, Gilles Serasset wrote:
Hi Vincent, Hi all,
On 6 avr. 07, at 11:39, Vincent Massol wrote:
Hi,
I admit it: I'm not an expert in I8N. However I realize that XWiki being a wiki we need to have strong I8N features so I'm trying to catch up with I8N knowledge...
well, first it is I18N, 10 more difficulties... ;-)
:) [snip]
2) I see in our encoding guide on http://www.xwiki.org/xwiki/bin/view/AdminGuide/Encoding that we need to set the encoding for the container. Why is that required? The servlet container reads pages which have the encoding specified (using Content-Type meta data), so why does it need to be told about the encoding to use?
I also saw that... I assume you mean the -Dfile.encoding=XXX.
Well I meant this but for Tomcat, not for our code, for which it's absolutely necessary as we're using the platform encoding in most places and this defines the platform encoding.
I did not do any extensive tests to see if this parameter was useful. I can only say that there are many places in the code where you have:
InputStreamReader ir = new InputStreamReader(is)
This is one of the numerous examples of badly written code where the encoding is not specified... Hence the platform falls back to the file.encoding property value.
yep. I'm not sure it's bad. Or at least I'm curious to understand why it's bad.
However, there are currently no accented chars in the skin files (that are read from disk), hence it somehow works, because all platform encodings share the encoding of ASCII chars. It would not be the same if you were to use UTF-16 for instance...
IMO the encoding to use should be left to the user and be a configuration option (as it is now) but we should configure everything to use UTF8 by default.
3) I see that in our standalone installation we use -Dfile.encoding=iso-8859-1. Now that I've read Joel's tutorial it seems to me this is not going to work for everyone and that we should rather use -Dfile.encoding=UTF-8 by default. WDYT?
This will mean that all files that are read by the server, will have to be encoded in UTF-8...
Or any compatible encoding like ISO 8859-1, etc. This is the case now I think.
(NOTE: resource files (such as the ones used for XWiki I18N) are special, as they should be encoded in ASCII with \uXXXX escapes to represent non-ASCII chars.)
4) Should we use the platform encoding or default to using UTF-8 all the time? (this question is related to 1)). I think we should use the platform encoding but I'm curious to know what others think.
We should NOT use the platform encoding. The reason is that all files read by the server (skin files mainly) will be read using the platform encoding and not their actual encoding. As they only contain ASCII chars up to now, it worked, but if you add accents to them and write them in encoding X (at edit time), you are not guaranteed that the platform encoding will be X at run time. Hence you should specify the file encoding whenever you read a file.
Exactly, which is why this is best left to the user to decide which encoding they need to use... I don't think we should force our encoding. However I'm proposing that we do: System.setProperty("file.encoding", getParam("xwiki.encoding")) in XWiki initialization to set the platform encoding to be the encoding specified in xwiki.cfg.
5) Jackson Wang is proposing in a patch to modify readPackage like this:
 private Document readPackage(InputStream is) throws IOException, DocumentException
 {
-    byte[] data = new byte[4096];
+    //UTF-8 characters could cause encoding as continued bytes over 4096 boundary,
+    // so change byte to char. ---Jackson
+    char[] data = new char[4096];
+    BufferedReader in= new BufferedReader(new InputStreamReader(is));
     StringBuffer XmlFile = new StringBuffer();
     int Cnt;
-    while ((Cnt = is.read(data, 0, 4096)) != -1) {
+    while ((Cnt = in.read(data, 0, 4096)) != -1) {
         XmlFile.append(new String(data, 0, Cnt));
-    }
+    }
     return fromXml(XmlFile.toString());
 }
However with my new understanding I'm not sure this would help as char are stored on 2 bytes in Java and UTF-8 encoding can store on up to 4 bytes. Am I correct?
Well, this patch is problematic as the new InputStreamReader(is) does not specify the encoding. The problem is where the InputStream comes from...
Here, we can even avoid the question, as the InputStream contains an xml file that declares its encoding, so xwiki SHOULD NOT build a String from this stream, but rather pass the stream directly to the xml parser that will do its best to determine the effective encoding. This is what I did with my last patch to the packaging plugin.
Here is a substitute patch:
Index: core/src/main/java/com/xpn/xwiki/plugin/packaging/Package.java
===================================================================
--- core/src/main/java/com/xpn/xwiki/plugin/packaging/Package.java (revision 2581)
+++ core/src/main/java/com/xpn/xwiki/plugin/packaging/Package.java (working copy)
@@ -673,13 +673,7 @@
     private Document readPackage(InputStream is) throws IOException, DocumentException
     {
-        byte[] data = new byte[4096];
-        StringBuffer XmlFile = new StringBuffer();
-        int Cnt;
-        while ((Cnt = is.read(data, 0, 4096)) != -1) {
-            XmlFile.append(new String(data, 0, Cnt));
-        }
-        return fromXml(XmlFile.toString());
+        return fromXml(is);
     }
     public String toXml(XWikiContext context)
@@ -835,13 +829,12 @@
         }
     }
-    protected Document fromXml(String xml) throws DocumentException
+    protected Document fromXml(InputStream xml) throws DocumentException
     {
         SAXReader reader = new SAXReader();
         Document domdoc;
-        StringReader in = new StringReader(xml);
-        domdoc = reader.read(in);
+        domdoc = reader.read(xml);
         Element docEl = domdoc.getRootElement();
         Element infosEl = docEl.element("infos");
It compiles, but I did not have time to test it extensively... But it passes the current packaging tests.
I agree that preventing any conversion is the best way to go. Thanks for your patch. I'll replace my change of today (which was using IOUtils) with your patch as it's better (faster, safer, simpler) :-) ... applied!
However, I would rather use http://jakarta.apache.org/commons/io/api-release/org/apache/commons/io/IOUtils.html#toString(java.io.InputStream) than code it ourselves... Sounds safer, shorter, less maintenance, etc to me... :)
This method has exactly the same problem: it'll use the platform encoding, even if the InputStream is not encoded in the platform encoding and even if it correctly declares its own encoding... Hence it will be buggy.
Sure but that's ok if the encoding is specified (file.encoding), right? That said I agree that no conversion is better.
NOW A SMALL WORD ABOUT XWIKI ENCODING:
Let's pose the problem like this:
xwiki
- serves web pages (that are served using encoding Xw)
- gets POST and GET parameters (that are encoded using encoding Xp)
- is asked for page named Space/DocumentName (that is encoded using encoding Xp)
- reads skin files (that are encoded using encoding Xf)
- reads data from DB (which is encoded using Xd)
Well maybe there is more, but let's simplify...
1. We can rather safely assume that Xw and Xp are the same, but we may have older clients that disagree on this. We should maybe work with "accept-charset" attributes of forms to avoid this kind of problem.
2. As xwiki provides the skin files that are read, xwiki can choose whatever encoding it wants as Xf. Xf is, then, the encoding of the XWIKI code base (see previous discussion).
3. If the database and hibernate are correctly configured with Xd, xwiki gets Strings that are correctly decoded from the DB by the JDBC driver, hence Xd is not a problem for xwiki.
To simplify things, we usually try to have Xd = Xf = Xw = Xp. Hence if you want to serve multilingual content (multi being > 3), you are quickly obliged to use UTF-8 as Xw, and any more powerful encoding as Xd.
For the rest, we should hunt all calls to Reader creations that do not specify the encoding, as well as Writer creations and calls to getBytes...
Don't you think it should be ok if we set the file.encoding to be the xwiki.encoding value and we set it to be UTF-8 by default? Thanks -Vincent
Hi, On 6 avr. 07, at 22:28, Vincent Massol wrote:
I did not do any extensive tests to see if this parameter was useful. I can only say that there are many places in the code where you have:
InputStreamReader ir = new InputStreamReader(is)
This is one of the numerous examples of badly written code where the encoding is not specified... Hence the platform falls back to the file.encoding property value.
yep. I'm not sure it's bad. Or at least I'm curious to understand why it's bad.
It may be good, but then, you'll need your input (skin) files to be delivered in that encoding... Well for now it is as it only has ascii.
However, there are currently no accented chars in the skin files (that are read from disk), hence it somehow works, because all platform encodings share the encoding of ASCII chars. It would not be the same if you were to use UTF-16 for instance...
IMO the encoding to use should be left to the user and be a configuration option (as it is now) but we should configure everything to use UTF8 by default.
3) I see that in our standalone installation we use -Dfile.encoding=iso-8859-1. Now that I've read Joel's tutorial it seems to me this is not going to work for everyone and that we should rather use -Dfile.encoding=UTF-8 by default. WDYT?
This will mean that all files that are read by the server, will have to be encoded in UTF-8...
Or any compatible encoding like ISO 8859-1, etc. This is the case now I think.
ISO latin 1 IS NOT compatible with UTF-8... only ASCII (7bits) is...
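A quick way to see this is to compare the bytes each encoding produces. The sketch below uses an invented class name and sample strings, and assumes Java 7 or later for StandardCharsets. It checks whether a piece of text has the same byte representation in ISO-8859-1 and in UTF-8.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Latin1VsUtf8Demo {

    // True when the text is encoded to identical bytes in both encodings.
    static boolean sameBytes(String text) {
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(latin1, utf8);
    }

    public static void main(String[] args) {
        // Pure 7-bit ASCII text.
        System.out.println("ASCII text identical:    " + sameBytes("skin"));
        // Text with one accented character (U+00E9).
        System.out.println("accented text identical: " + sameBytes("caf\u00e9"));
        System.out.println("ISO-8859-1 bytes: "
            + Arrays.toString("caf\u00e9".getBytes(StandardCharsets.ISO_8859_1)));
        System.out.println("UTF-8 bytes:      "
            + Arrays.toString("caf\u00e9".getBytes(StandardCharsets.UTF_8)));
    }
}
```

Only the ASCII-only string encodes identically; the accented character is one byte in ISO-8859-1 and two bytes in UTF-8.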
(NOTE: resource files (such as the ones used for XWiki I18N) are special, as they should be encoded in ASCII with \uXXXX escapes to represent non-ASCII chars.)
4) Should we use the platform encoding or default to using UTF-8 all the time? (this question is related to 1)). I think we should use the platform encoding but I'm curious to know what others think.
We should NOT use the platform encoding. The reason is that all files read by the server (skin files mainly) will be read using the platform encoding and not their actual encoding. As they only contain ASCII chars up to now, it worked, but if you add accents to them and write them in encoding X (at edit time), you are not guaranteed that the platform encoding will be X at run time. Hence you should specify the file encoding whenever you read a file.
Exactly, which is why this is best left to the user to decide which encoding they need to use... I don't think we should force our encoding. However I'm proposing that we do: System.setProperty("file.encoding", getParam("xwiki.encoding")) in XWiki initialization to set the platform encoding to be the encoding specified in xwiki.cfg.
That's a good idea... [snip]
... applied!
good
However, I would rather use http://jakarta.apache.org/commons/io/api-release/org/apache/commons/io/IOUtils.html#toString(java.io.InputStream) than code it ourselves... Sounds safer, shorter, less maintenance, etc to me... :)
This method has exactly the same problem: it'll use the platform encoding, even if the InputStream is not encoded in the platform encoding and even if it correctly declares its own encoding... Hence it will be buggy.
Sure but that's ok if the encoding is specified (file.encoding), right? That said I agree that no conversion is better.
Well, not here, as the package file is a file that has been produced by somebody else, on another platform; hence either we decide that all files are always UTF-8, or it is encoded in the producer's platform encoding, not the one that is used to read it... That's why we have to delegate encoding detection to the XML parser.
NOW A SMALL WORD ABOUT XWIKI ENCODING:
Let's pose the problem like this:
xwiki
- serves web pages (that are served using encoding Xw)
- gets POST and GET parameters (that are encoded using encoding Xp)
- is asked for page named Space/DocumentName (that is encoded using encoding Xp)
- reads skin files (that are encoded using encoding Xf)
- reads data from DB (which is encoded using Xd)
Well maybe there is more, but let's simplify...
1. We can rather safely assume that Xw and Xp are the same, but we may have older clients that disagree on this. We should maybe work with "accept-charset" attributes of forms to avoid this kind of problem.
2. As xwiki provides the skin files that are read, xwiki can choose whatever encoding it wants as Xf. Xf is, then, the encoding of the XWIKI code base (see previous discussion).
3. If the database and hibernate are correctly configured with Xd, xwiki gets Strings that are correctly decoded from the DB by the JDBC driver, hence Xd is not a problem for xwiki.
To simplify things, we usually try to have Xd = Xf = Xw = Xp. Hence if you want to serve multilingual content (multi being > 3), you are quickly obliged to use UTF-8 as Xw, and any more powerful encoding as Xd.
For the rest, we should hunt all calls to Reader creations that do not specify the encoding, as well as Writer creations and calls to getBytes...
Don't you think it should be ok if we set the file.encoding to be the xwiki.encoding value and we set it to be UTF-8 by default?
As long as all "normal" files, i.e. non-XML files (skin mainly), are always in ASCII (i.e. compatible with all other current encodings; well, not all, but let's forget old IBM encodings...), it should be harmless... and if a user creates new skin files (non-ASCII) on the file system, it'll be compatible.
Regards,
Gilles,
--
Gilles Sérasset
GETALP-LIG BP 53 - F-38041 Grenoble Cedex 9
Phone: +33 4 76 51 43 80 Fax: +33 4 76 44 66 75
On Apr 7, 2007, at 10:49 AM, Gilles Serasset wrote:
Hi,
On 6 avr. 07, at 22:28, Vincent Massol wrote:
I did not do any extensive tests to see if this parameter was useful. I can only say that there are many places in the code where you have:
InputStreamReader ir = new InputStreamReader(is)
This is one of the numerous examples of badly written code where the encoding is not specified... Hence the platform falls back to the file.encoding property value.
yep. I'm not sure it's bad. Or at least I'm curious to understand why it's bad.
It may be good, but then, you'll need your input (skin) files to be delivered in that encoding... Well for now it is as it only has ascii.
... and it should remain like this and we should use native2ascii or something like that in our build to ensure it remains like this I think.
However, there are currently no accented chars in the skin files (that are read from disk), hence it somehow works, because all platform encodings share the encoding of ASCII chars. It would not be the same if you were to use UTF-16 for instance...
IMO the encoding to use should be left to the user and be a configuration option (as it is now) but we should configure everything to use UTF8 by default.
3) I see that in our standalone installation we use -Dfile.encoding=iso-8859-1. Now that I've read Joel's tutorial it seems to me this is not going to work for everyone and that we should rather use -Dfile.encoding=UTF-8 by default. WDYT?
This will mean that all files that are read by the server, will have to be encoded in UTF-8...
Or any compatible encoding like ISO 8859-1, etc. This is the case now I think.
ISO latin 1 IS NOT compatible with UTF-8... only ASCII (7bits) is...
(NOTE: resource files (such as the ones used for XWiki I18N) are special, as they should be encoded in ASCII with \uXXXX escapes to represent non-ASCII chars.)
4) Should we use the platform encoding or default to using UTF-8 all the time? (this question is related to 1)). I think we should use the platform encoding but I'm curious to know what others think.
We should NOT use the platform encoding. The reason is that all files read by the server (skin files mainly) will be read using the platform encoding and not their actual encoding. As they only contain ASCII chars up to now, it worked, but if you add accents to them and write them in encoding X (at edit time), you are not guaranteed that the platform encoding will be X at run time. Hence you should specify the file encoding whenever you read a file.
Exactly, which is why this is best left to the user to decide which encoding they need to use... I don't think we should force our encoding. However I'm proposing that we do: System.setProperty("file.encoding", getParam("xwiki.encoding")) in XWiki initialization to set the platform encoding to be the encoding specified in xwiki.cfg.
That's a good idea...
[snip]
... applied!
good
However, I would rather use http://jakarta.apache.org/commons/io/ api-release/org/apache/commons/io/IOUtils.html#toString (java.io.InputStream) than code it ourselves... Sounds safer, shorter, less maintenance, etc to me... :)
This method has exactly the same problem: it'll use the platform encoding, even if the input stream is not encoded in the platform encoding and even if it correctly declares its own encoding... Hence it will be buggy.
Sure but that's ok if the encoding is specified (file.encoding), right? That said I agree that no conversion is better.
Well, not here, as the package file is a file that has been produced by somebody else, on another platform. Hence either we decide that all files are always UTF-8, or it is encoded in the producer's platform encoding, not the one that is used to read it... That's why we have to delegate encoding detection to the XML parser.
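A sketch of delegating encoding detection to the XML parser: the raw bytes are handed over and the parser reads the encoding from the XML declaration. This uses the JDK's built-in DOM parser rather than the dom4j classes that readPackage appears to use, and the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class XmlDeclaredEncoding {
    /** Parses raw bytes; the parser picks the charset from the XML declaration, not from the platform. */
    static String rootText(byte[] xmlBytes) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xmlBytes));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><doc>caf\u00e9</doc>";
        // Simulate a package produced on a Latin-1 platform.
        byte[] producedElsewhere = xml.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(rootText(producedElsewhere).equals("caf\u00e9")); // true
    }
}
```

No byte-to-String conversion happens before parsing, so the reader's platform encoding never gets a chance to interfere.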
Good point. I agree. [snip] Thanks -Vincent PS: Thanks for everyone's help in bringing me up to date on I18N. I'm slowly starting to understand how that works... ;-)
Vincent Massol wrote:
Exactly which is why this is best left to the user to decide which encoding they need to use... I don't think we should force our encoding. However I'm proposing that we do: System.setProperty("file.encoding", getParam("xwiki.encoding")) in XWiki initialization to set the platform encoding to be the encoding specified in xwiki.cfg.
I think it should be vice versa. Stick to the platform encoding or use UTF-8; don't add a new encoding parameter. The only thing I am not sure about is the database. I think that some databases don't support collation for different languages with UTF-8. Is it important for XWiki?
On Apr 7, 2007, at 8:00 PM, Zeljko Trogrlic wrote:
Vincent Massol wrote:
Exactly which is why this is best left to the user to decide which encoding they need to use... I don't think we should force our encoding. However I'm proposing that we do: System.setProperty ("file.encoding", getParam("xwiki.encoding")) in XWiki initialization to set the platform encoding to be the encoding specified in xwiki.cfg.
I think it should be vice-versa. Stick to platform or use UTF-8 don't add new encoding parameter.
xwiki.encoding already exists. The point here is to make platform = xwiki encoding = database driver encoding = web encoding = all encoding used everywhere, all this without the user having to change 7 places... -Vincent
Vincent Massol wrote:
On Apr 7, 2007, at 8:00 PM, Zeljko Trogrlic wrote:
Vincent Massol wrote:
Exactly which is why this is best left to the user to decide which encoding they need to use... I don't think we should force our encoding. However I'm proposing that we do: System.setProperty("file.encoding", getParam("xwiki.encoding")) in XWiki initialization to set the platform encoding to be the encoding specified in xwiki.cfg.
I think it should be vice-versa. Stick to platform or use UTF-8 don't add new encoding parameter.
xwiki.encoding already exists. The point here is to make platform = xwiki encoding = database driver encoding = web encoding = all encoding used everywhere, all this without the user having to change 7 places...
Of course, but Java already has file.encoding. Why another property?
On Apr 8, 2007, at 10:10 PM, Zeljko Trogrlic wrote:
Vincent Massol wrote:
On Apr 7, 2007, at 8:00 PM, Zeljko Trogrlic wrote:
Vincent Massol wrote:
Exactly which is why this is best left to the user to decide which encoding they need to use... I don't think we should force our encoding. However I'm proposing that we do: System.setProperty("file.encoding", getParam("xwiki.encoding")) in XWiki initialization to set the platform encoding to be the encoding specified in xwiki.cfg.
I think it should be vice-versa. Stick to platform or use UTF-8 don't add new encoding parameter. xwiki.encoding already exists. The point here is to make platform = xwiki encoding = database driver encoding = web encoding = all encoding used everywhere, all this without the user having to change 7 places...
Of course, but Java already has file.encoding. Why another property?
Good question. Maybe simply because we need a way to configure it and XWiki already has a configuration mechanism, so it's probably logical to find the encoding to use defined there. I guess we could name the property "file.encoding" inside the xwiki.cfg file, but the naming rule is that properties should start with xwiki in that file. Or we could leave it outside of XWiki's configuration, but in that case we can't provide a default value that is externalized. I'm not sure about all this so all input is welcome. I'll try to prepare a synthesis of all we discussed on i8n next week. Thanks -Vincent
On 4/8/07, Zeljko Trogrlic <[email protected]> wrote:
Vincent Massol wrote:
On Apr 7, 2007, at 8:00 PM, Zeljko Trogrlic wrote:
Vincent Massol wrote:
Exactly which is why this is best left to the user to decide which encoding they need to use... I don't think we should force our encoding. However I'm proposing that we do: System.setProperty("file.encoding", getParam("xwiki.encoding")) in XWiki initialization to set the platform encoding to be the encoding specified in xwiki.cfg.
I think it should be vice-versa. Stick to platform or use UTF-8 don't add new encoding parameter.
xwiki.encoding already exists. The point here is to make platform = xwiki encoding = database driver encoding = web encoding = all encoding used everywhere, all this without the user having to change 7 places...
Of course, but Java already has file.encoding. Why another property?
Because we can't change the way the server starts, given the fact that there are dozens of containers and platforms. -- http://purl.org/net/sergiu
Hi, On 8 avr. 07, at 22:56, Sergiu Dumitriu wrote:
On 4/8/07, Zeljko Trogrlic <[email protected]> wrote: Vincent Massol wrote:
On Apr 7, 2007, at 8:00 PM, Zeljko Trogrlic wrote:
Vincent Massol wrote:
Exactly which is why this is best left to the user to decide which encoding they need to use... I don't think we should force our encoding. However I'm proposing that we do: System.setProperty("file.encoding", getParam("xwiki.encoding")) in XWiki initialization to set the platform encoding to be the encoding specified in xwiki.cfg.
I think it should be vice-versa. Stick to platform or use UTF-8 don't add new encoding parameter.
xwiki.encoding already exists. The point here is to make platform = xwiki encoding = database driver encoding = web encoding = all encoding used everywhere, all this without the user having to change 7 places...
Of course, but Java already has file.encoding. Why another property?
Because we can't change the way the server starts, given the fact that there are dozens of containers and platforms.
Moreover, we may want to install different instances of xwiki in the same container... hence the common file.encoding is not an option here (as far as I understand it...) Gilles, -- Gilles Sérasset GETALP-LIG BP 53 - F-38041 Grenoble Cedex 9 Phone: +33 4 76 51 43 80 Fax: +33 4 76 44 66 75
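One way to picture a per-instance encoding, as opposed to the JVM-wide file.encoding, is to resolve the configured name into a Charset object that each instance then passes to its readers and writers. This is only a sketch: the class name, the use of java.util.Properties to stand in for xwiki.cfg, and the fallback to UTF-8 are all assumptions, not how XWiki does it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

class InstanceEncoding {
    /** Returns the charset named by xwiki.encoding, or UTF-8 (an assumed default) if missing or unknown. */
    static Charset resolve(Properties config) {
        String name = config.getProperty("xwiki.encoding");
        if (name == null || !isSupported(name)) {
            return StandardCharsets.UTF_8;
        }
        return Charset.forName(name);
    }

    private static boolean isSupported(String name) {
        try {
            return Charset.isSupported(name);
        } catch (IllegalArgumentException e) {
            // Thrown for syntactically illegal charset names.
            return false;
        }
    }

    public static void main(String[] args) {
        Properties wikiA = new Properties();
        wikiA.setProperty("xwiki.encoding", "ISO-8859-1");
        Properties wikiB = new Properties(); // nothing configured

        System.out.println(resolve(wikiA)); // ISO-8859-1
        System.out.println(resolve(wikiB)); // UTF-8
    }
}
```

Two instances in the same container can then hold different Charset values without touching any system property.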
As a person from non-8859-1 country, I have some experience with such problems. Vincent Massol wrote:
1) Is UTF8 supported on all platforms? Is it supported on mobile platforms for example?
All MIDP 2 devices I worked with supported it.
3) I see that in our standalone installation we use -Dfile.encoding=iso-8859-1. Now that I've read Joel's tutorial it seems to me this is not going to work for everyone and that we should rather use -Dfile.encoding=UTF-8 by default. WDYT?
That is a problem if it's not your default encoding. You have two options:
* use the platform default encoding and don't use non-ASCII characters in the default configuration
* use UTF-8
Although UTF-8 sounds better, note that:
* you need an editor that supports it, otherwise the local encoding will creep in
* the encoding must be set manually, because the encoding can't be detected for plain text files
* you have to communicate this very clearly to users
* text will look funny in a non-UTF-8 editor and it will be hard to change it
4) Should we use the platform encoding or default to using UTF-8 all the time? (this question is related to 1)). I think we should use the platform encoding but I'm curious to know what others think.
See previous. you should either stick to UTF-8 or platform.
5) Jackson Wang is proposing in a patch to modify readPackage like this:
 private Document readPackage(InputStream is) throws IOException, DocumentException {
-    byte[] data = new byte[4096];
+    //UTF-8 characters could cause encoding as continued bytes over 4096 boundary,
+    // so change byte to char. ---Jackson
+    char[] data = new char[4096];
+    BufferedReader in= new BufferedReader(new InputStreamReader(is));
     StringBuffer XmlFile = new StringBuffer();
     int Cnt;
-    while ((Cnt = is.read(data, 0, 4096)) != -1) {
+    while ((Cnt = in.read(data, 0, 4096)) != -1) {
         XmlFile.append(new String(data, 0, Cnt));
-    }
+    }
     return fromXml(XmlFile.toString());
 }
However with my new understanding I'm not sure this would help as char are stored on 2 bytes in Java and UTF-8 encoding can store on up to 4 bytes. Am I correct?
I don't know what you are reading there, but Java can handle the encoding for you if you tell it to.
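The buffer-boundary concern behind the patch can be reproduced in a few lines. The sketch below (hypothetical class and method names, tiny chunk size to force the split) decodes the same UTF-8 bytes twice: once chunk by chunk, as the original byte[] loop effectively did, and once through an InputStreamReader that is told the charset.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class ChunkBoundaryDemo {
    /** Fragile: decodes each byte chunk on its own, so a character split across two chunks is lost. */
    static String decodePerChunk(byte[] data, int chunkSize) {
        StringBuilder out = new StringBuilder();
        for (int off = 0; off < data.length; off += chunkSize) {
            int len = Math.min(chunkSize, data.length - off);
            out.append(new String(data, off, len, StandardCharsets.UTF_8));
        }
        return out.toString();
    }

    /** Safe: the reader carries incomplete byte sequences over from one read to the next. */
    static String decodeWithReader(byte[] data, int chunkSize) {
        StringBuilder out = new StringBuilder();
        char[] buf = new char[chunkSize];
        try (Reader r = new InputStreamReader(new ByteArrayInputStream(data), StandardCharsets.UTF_8)) {
            int n;
            while ((n = r.read(buf, 0, buf.length)) != -1) {
                out.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "a\u00e9"; // 'a' plus e-acute: 3 bytes in UTF-8
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        System.out.println(decodePerChunk(bytes, 2).equals(text));   // false: the accent straddles the chunks
        System.out.println(decodeWithReader(bytes, 2).equals(text)); // true
    }
}
```

The number of bytes per character does not matter to the char[] buffer, since the reader hands back already-decoded characters. Note that the patch calls new InputStreamReader(is) without a charset, which decodes with the platform default rather than UTF-8.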
However, I would rather use http://jakarta.apache.org/commons/io/api-release/org/apache/commons/io/IOUti...) than code it ourselves... Sounds safer, shorter, less maintenance, etc to me... :)
If it adds value. I think that XWiki is plagued with different libraries doing the same thing or adding small amount of functionality. This makes it harder to analyse. Another place where to avoid local encoding: some source code files contain French characters, which are messed up on non-8859-1 platforms.
Hi Zeljko, On Apr 7, 2007, at 7:49 PM, Zeljko Trogrlic wrote: [snip]
3) I see that in our standalone installation we use -Dfile.encoding=iso-8859-1. Now that I've read Joel's tutorial it seems to me this is not going to work for everyone and that we should rather use -Dfile.encoding=UTF-8 by default. WDYT?
That is a problem if it's not your default encoding. You have two options:
* use the platform default encoding and don't use non-ASCII characters in the default configuration
* use UTF-8
Although UTF-8 sounds better, note that:
* you need an editor that supports it, otherwise the local encoding will creep in
* the encoding must be set manually, because the encoding can't be detected for plain text files
* you have to communicate this very clearly to users
* text will look funny in a non-UTF-8 editor and it will be hard to change it
Let's look at the files XWiki manipulates:
- config files. These should only contain ASCII characters, plus Unicode escapes when there's a need, as with resource bundles for example. Thus any encoding will work there.
- XAR files. If these are created with XWiki (with an export) they'll use the file.encoding specified, so if it's UTF-8 they'll be saved in UTF-8. In addition, I propose that in our build we run native2ascii on all our data files (including the XAR files). This can easily be done automatically with Maven. So all XAR files the XWiki team provides should work with any encoding.
- java files: these should use only ASCII chars.
That's about it I think. [snip]
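For readers who have not used native2ascii, the conversion it performs is roughly the one sketched below: every character outside 7-bit ASCII is replaced by a \uXXXX escape. This is an illustration of the idea only, not the tool's actual implementation, and the class name is made up.

```java
class AsciiEscaper {
    /** Replaces every char above 0x7F with a backslash-u escape, similar in spirit to native2ascii. */
    static String escape(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                out.append(c);
            } else {
                // Four hex digits per UTF-16 code unit.
                out.append(String.format("\\u%04x", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String escaped = escape("caf\u00e9");
        System.out.println(escaped.length());    // 9: c, a, f and a six-character escape
        System.out.println(escaped.endsWith("00e9")); // true
    }
}
```

This escape form is understood by Java property files and by the Java compiler. XML parsers do not interpret it; the XML counterpart is a numeric character reference such as &#233;.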
However, I would rather use http://jakarta.apache.org/commons/io/api-release/org/apache/commons/io/IOUtils.html#toString(java.io.InputStream) than code it ourselves... Sounds safer, shorter, less maintenance, etc to me... :)
If it adds value. I think that XWiki is plagued with different libraries doing the same thing or adding small amount of functionality. This makes it harder to analyse.
I'm not sure I would have used the word "plagued", which has a negative connotation... I would rather have said: "thanks to the effort of others in OSS we have been able to develop XWiki to a level we wouldn't have been able to reach otherwise... This allows us to reduce our maintenance efforts, our documentation efforts and our testing efforts..." :-) Now if you notice 2 libraries used in XWiki that do the same thing, let us know so that we can all decide if we want to remove one and only use one. I'd be in favor of that wherever possible. I've noticed a few places myself where I think the wrong library was chosen IMO (like when we use Jakarta ECS for something completely unrelated). There are also places where the choice was historic: like using ORO when the Regex is now in JDK 1.4 (this has already been identified).
Another place where to avoid local encoding: some source code files contain French characters, which are messed up on non-8859-1 platforms.
Ah we need to track these down. Could you please let us know which files? Thanks -Vincent
Vincent Massol wrote:
However, I would rather use http://jakarta.apache.org/commons/io/api-release/org/apache/commons/io/IOUti...) than code it ourselves... Sounds safer, shorter, less maintenance, etc to me... :)
If it adds value. I think that XWiki is plagued with different libraries doing the same thing or adding small amount of functionality. This makes it harder to analyse.
I'm not sure I would have used the word "plagued", which has a negative connotation... I would rather have said: "thanks to the effort of others in OSS we have been able to develop XWiki to a level we wouldn't have been able to reach otherwise... This allows us to reduce our maintenance efforts, our documentation efforts and our testing efforts..." :-)
Now if you notice 2 libraries used in XWiki that do the same thing let us know so that we can all decide if we want to remove one and only use one. I'd be in favor of that wherever possible.
I've noticed a few places myself where I think the wrong library was chosen IMO (like when we use Jakarta ECS for something completely unrelated). There are also places where the choice was historic: like using ORO when the Regex is now in JDK 1.4 (this has already been identified).
Other examples are Xerces and Xalan, which are also included in the JDK. I think that you also have duplicate cache libraries. I can take some time to analyze this after I finish playing with my current topic (Kerberos). Just to clarify: I'm working from SVN, so I don't know whether all these libraries are actually distributed.
Another place where to avoid local encoding: some source code files contain French characters, which are messed up on non-8859-1 platforms.
Ah we need to track these down. Could you please let us know which files?
When I bump into them again, I'll drop a message.
Additional tip: There are several boundaries where characters are converted from one encoding to another. If you miss just one of them, pages will look funny:
* file <-> Java
* database <-> Java
* HTTP <-> Java
It could happen that you store data and it looks good, but when you read it back it will be messed up. It is hard to find such errors unless you check Strings at the byte level, because System.out will also do a conversion!
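A sketch of what checking at the byte level can look like: print the hex values instead of the characters, so that the console's own conversion cannot hide or create a problem. The helper names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ByteLevelCheck {
    /** Hex dump of the bytes a string produces in the given charset, e.g. when written to a file. */
    static String hexBytes(String s, Charset charset) {
        StringBuilder out = new StringBuilder();
        for (byte b : s.getBytes(charset)) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(String.format("%02x", b & 0xff));
        }
        return out.toString();
    }

    /** Hex dump of the UTF-16 code units held in memory, independent of any charset. */
    static String hexChars(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                out.append(' ');
            }
            out.append(String.format("%04x", (int) s.charAt(i)));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String good = "\u00e9";          // e-acute, stored correctly
        String mangled = "\u00c3\u00a9"; // the same letter after a UTF-8 / Latin-1 mix-up

        System.out.println(hexChars(good));                             // 00e9
        System.out.println(hexChars(mangled));                          // 00c3 00a9
        System.out.println(hexBytes(good, StandardCharsets.UTF_8));     // c3 a9
        System.out.println(hexBytes(mangled, StandardCharsets.UTF_8));  // c3 83 c2 a9
    }
}
```

A value that shows up as two code units where one accented letter was expected usually means one of the boundaries listed above decoded UTF-8 bytes as ISO 8859-1.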
participants (5)
- Gilles Serasset
- Pablo Oliveira
- Sergiu Dumitriu
- Vincent Massol
- Zeljko Trogrlic