Hi Vincent, Hi all,
On 6 Apr. 07, at 11:39, Vincent Massol wrote:
Hi,
I admit it: I'm not an expert in I8N. However I realize that XWiki
being a wiki we need to have strong I8N features so I'm trying to
catch up with I8N knowledge...
Well, first, it is I18N: 10 more difficulties... ;-)
I started yesterday by reading this excellent short tutorial:
http://www.joelonsoftware.com/articles/Unicode.html (The Absolute Minimum
Every Software Developer Absolutely, Positively Must Know About
Unicode and Character Sets (No Excuses!)). It's very good and an
easy read. I recommend it to everyone.
Good historical overview. However, it only shows several problems;
things are not that complicated...
This led me to a few questions:
1) Is UTF8 supported on all platforms? Is it supported on mobile
platforms for example?
I don't know... (well, I hope so...). A good answer could be: it will...
2) I see in our encoding guide on
http://www.xwiki.org/xwiki/bin/view/AdminGuide/Encoding that we need to
set the encoding for the container. Why is that required? The servlet
container reads pages which have the encoding specified (using
Content-Type meta data), so why does it need to be told about the
encoding to use?
I also saw that... I assume you mean -Dfile.encoding=XXX. I did
not do any extensive testing to see whether this parameter is useful. I can
only say that there are many places in the code where you have:
InputStreamReader ir = new InputStreamReader(is)
This is one of the numerous examples of badly written code where the
encoding is not specified... Hence the platform falls back to the
file.encoding property value. However, there are currently no
accented chars in the skin files (which are read from disk), hence it
somehow works, because the usual platform encodings share the encoding
of ASCII chars. It would not be the same if you were to use UTF-16, for
instance...
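
To illustrate the fallback described above, here is a minimal sketch of how such a read could name its encoding explicitly. The class and method names (ExplicitCharsetRead, readAll) are hypothetical and are not part of the XWiki code base; only the two-argument InputStreamReader constructor is the point.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetRead {
    // Hypothetical helper: reads the whole stream as text, decoding with the
    // charset named by the caller instead of the JVM default (file.encoding).
    static String readAll(InputStream is, Charset charset) {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        try (Reader reader = new InputStreamReader(is, charset)) {
            int n;
            while ((n = reader.read(buf, 0, buf.length)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The e-acute is written as a Unicode escape so this source file stays ASCII.
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        String text = readAll(new ByteArrayInputStream(utf8), StandardCharsets.UTF_8);
        // The result no longer depends on the -Dfile.encoding setting.
        System.out.println(text.equals("caf\u00e9"));
    }
}
```

With the one-argument constructor, the same bytes would be decoded with whatever file.encoding happens to be at run time.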
3) I see that in our standalone installation we use
-Dfile.encoding=iso-8859-1. Now that I've read Joel's tutorial it
seems to me this is not going to work for everyone and that we
should rather use -Dfile.encoding=UTF-8 by default. WDYT?
This will mean that all files that are read by the server will have
to be encoded in UTF-8...
(NOTE: resource files (such as the ones used for XWiki I18N) are special, as
they should be encoded in ASCII, with \uXXXX escapes to represent non-ASCII
chars.)
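
As a small illustration of the note on resource files, the sketch below loads an ASCII-only properties content through java.util.Properties, which decodes the \uXXXX escapes itself. The class name, the helper and the "welcome" key are made up for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

public class EscapedResourceDemo {
    // Hypothetical helper: loads properties from ASCII-only content.
    // Properties.load(InputStream) decodes Unicode escapes found in the content.
    static Properties loadAscii(String asciiContent) {
        Properties props = new Properties();
        byte[] bytes = asciiContent.getBytes(StandardCharsets.US_ASCII);
        try {
            props.load(new ByteArrayInputStream(bytes));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        // The doubled backslash keeps the escape literal, so the "file" holds
        // the six ASCII characters backslash, u, 0, 0, e, 9.
        Properties props = loadAscii("welcome=Caf\\u00e9\n");
        // The loaded value contains a real e-acute character.
        System.out.println(props.getProperty("welcome").equals("Caf\u00e9"));
    }
}
```

Because such a file contains only ASCII bytes, it reads the same whatever the platform encoding is.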
4) Should we use the platform encoding or default to using UTF-8
all the time? (this question is related to 1)). I think we should
use the platform encoding but I'm curious to know what others think.
We should NOT use the platform encoding. The reason is that all
files read by the server (skin files mainly) would be read using the
platform encoding and not their actual encoding. As they only contain
ASCII chars up to now, it worked, but if you add accents to them
and write them in encoding X (at edit time), you are not
guaranteed that the platform encoding will be X at run time. Hence
you should specify the file encoding whenever you read a file.
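
A minimal sketch of that mismatch, assuming the file was written in UTF-8 at edit time and the platform encoding happens to be ISO-8859-1 at run time (both encodings are picked only for the example, and the class name is hypothetical):

```java
import java.nio.charset.StandardCharsets;

public class EncodingMismatchDemo {
    // Encodes the text as UTF-8 (standing in for the edit-time encoding) and
    // decodes the bytes as ISO-8859-1 (standing in for a different platform
    // default at run time).
    static String writeUtf8ReadLatin1(String text) {
        byte[] onDisk = text.getBytes(StandardCharsets.UTF_8);
        return new String(onDisk, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Pure ASCII content survives the mismatch unchanged.
        System.out.println(writeUtf8ReadLatin1("skin").equals("skin"));

        // One accented char turns into two unrelated chars:
        // e-acute is the bytes C3 A9 in UTF-8, read back as U+00C3 and U+00A9.
        String garbled = writeUtf8ReadLatin1("\u00e9");
        System.out.println(garbled.equals("\u00c3\u00a9"));
    }
}
```

This is why ASCII-only skin files hide the problem until the first accent is added.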
5) Jackson Wang is proposing in a patch to modify readPackage like
this:
     private Document readPackage(InputStream is) throws IOException, DocumentException
     {
-        byte[] data = new byte[4096];
+        //UTF-8 characters could cause encoding as continued bytes over 4096 boundary,
+        // so change byte to char. ---Jackson
+        char[] data = new char[4096];
+        BufferedReader in= new BufferedReader(new InputStreamReader(is));
         StringBuffer XmlFile = new StringBuffer();
         int Cnt;
-        while ((Cnt = is.read(data, 0, 4096)) != -1) {
+        while ((Cnt = in.read(data, 0, 4096)) != -1) {
             XmlFile.append(new String(data, 0, Cnt));
-        }
+        }
         return fromXml(XmlFile.toString());
     }
However with my new understanding I'm not sure this would help as
char are stored on 2 bytes in Java and UTF-8 encoding can store on
up to 4 bytes. Am I correct?
Well, this patch is problematic, as new InputStreamReader(is) does
not specify the encoding. The problem is: where does the InputStream
come from...
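
On the 4096 boundary itself, here is a minimal sketch, assuming UTF-8 input and a tiny chunk size to force the split. It compares decoding each byte chunk on its own (as the original loop does, though that loop uses the platform encoding) with letting a Reader that is given the charset do the decoding. All names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class ChunkBoundaryDemo {
    // Decodes every byte chunk independently: a multi-byte sequence that
    // straddles two chunks is cut in half and decoded as garbage.
    static String decodePerChunk(byte[] data, int chunkSize) {
        StringBuilder sb = new StringBuilder();
        for (int off = 0; off < data.length; off += chunkSize) {
            int len = Math.min(chunkSize, data.length - off);
            sb.append(new String(data, off, len, StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    // Lets an InputStreamReader decode: it keeps incomplete byte sequences
    // until the remaining bytes arrive, so chunking is harmless.
    static String decodeWithReader(byte[] data, int chunkSize) {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[chunkSize];
        try (Reader r = new InputStreamReader(
                new ByteArrayInputStream(data), StandardCharsets.UTF_8)) {
            int n;
            while ((n = r.read(buf, 0, buf.length)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "a\u00e9"; // 3 bytes in UTF-8: 61 C3 A9
        byte[] data = text.getBytes(StandardCharsets.UTF_8);
        System.out.println(decodePerChunk(data, 2).equals(text));   // false
        System.out.println(decodeWithReader(data, 2).equals(text)); // true

        // A char outside the BMP (U+1F600) takes 2 Java chars and 4 UTF-8 bytes.
        String emoji = "\ud83d\ude00";
        System.out.println(emoji.length() + " chars, "
                + emoji.getBytes(StandardCharsets.UTF_8).length + " bytes");
    }
}
```

So switching to a Reader does address the boundary problem, but only if the Reader is told the right encoding.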
Here, we can even avoid the question of where the InputStream comes
from, as it contains an XML file that declares its own encoding, so
XWiki SHOULD NOT build a String from this stream, but rather pass the
stream directly to the XML parser, which will do its best to determine
the effective encoding.
This is what I did with my last patch to the packaging plugin.
Here is a substitute patch:
Index: core/src/main/java/com/xpn/xwiki/plugin/packaging/Package.java
===================================================================
--- core/src/main/java/com/xpn/xwiki/plugin/packaging/Package.java (revision 2581)
+++ core/src/main/java/com/xpn/xwiki/plugin/packaging/Package.java (working copy)
@@ -673,13 +673,7 @@
     private Document readPackage(InputStream is) throws IOException, DocumentException
     {
-        byte[] data = new byte[4096];
-        StringBuffer XmlFile = new StringBuffer();
-        int Cnt;
-        while ((Cnt = is.read(data, 0, 4096)) != -1) {
-            XmlFile.append(new String(data, 0, Cnt));
-        }
-        return fromXml(XmlFile.toString());
+        return fromXml(is);
     }
     public String toXml(XWikiContext context)
@@ -835,13 +829,12 @@
         }
     }
-    protected Document fromXml(String xml) throws DocumentException
+    protected Document fromXml(InputStream xml) throws DocumentException
     {
         SAXReader reader = new SAXReader();
         Document domdoc;
-        StringReader in = new StringReader(xml);
-        domdoc = reader.read(in);
+        domdoc = reader.read(xml);
         Element docEl = domdoc.getRootElement();
         Element infosEl = docEl.element("infos");
It compiles and passes the current packaging tests, but I did not
have time to test it extensively...
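
To show why handing the raw stream to the parser helps, here is a minimal sketch that uses the JDK's own DocumentBuilder rather than the dom4j SAXReader of the patch (a swap made only to keep the example self-contained). The class name, helper and sample document are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class XmlDeclaredEncodingDemo {
    // Hypothetical helper: parses raw bytes and returns the root element text.
    // The parser reads the encoding from the XML declaration by itself.
    static String rootText(byte[] xmlBytes) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xmlBytes));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
                + "<name>caf\u00e9</name>";
        // The bytes really are ISO-8859-1, as the declaration says.
        byte[] bytes = xml.getBytes(StandardCharsets.ISO_8859_1);

        // Bytes given to the parser: the declared encoding is honoured.
        System.out.println(rootText(bytes).equals("caf\u00e9"));

        // Bytes turned into a String first, with a wrong guess (UTF-8 here):
        // the accent is already lost before any parser sees the text.
        String wrongGuess = new String(bytes, StandardCharsets.UTF_8);
        System.out.println(wrongGuess.contains("\u00e9"));
    }
}
```

The first line printed should be true and the second false: once a String has been built with the wrong encoding, the declaration inside the document can no longer repair it.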
However, I would rather use
http://jakarta.apache.org/commons/io/api-release/org/apache/commons/io/IOUtils.html#toString(java.io.InputStream)
than code it ourselves... Sounds safer,
shorter, less maintenance, etc to me... :)
This method has exactly the same problem: it will use the platform
encoding, even if the InputStream is not encoded in the platform
encoding and even if it correctly declares its own encoding... Hence
it will be buggy.
NOW A SMALL WORD ABOUT XWIKI ENCODING:
Let's pose the problem like this. xwiki:
- serves web pages (that are served using encoding Xw)
- gets POST and GET parameters (that are encoded using encoding Xp)
- is asked for page named Space/DocumentName (that is encoded
  using encoding Xp)
- reads skin files (that are encoded using encoding Xf)
- reads data from DB (which is encoded using Xd)
Well maybe there is more, but let's simplify...
1. We can rather safely assume that Xw and Xp are the same, but we
may have older clients that disagree on this. We should maybe work
with the "accept-charset" attribute of forms to avoid this kind of
problem.
2. As XWiki provides the skin files that are read, XWiki can choose
whatever encoding it wants as Xf. Xf is, then, the encoding of the
XWiki code base (see previous discussion).
3. If the database and Hibernate are correctly configured with Xd,
XWiki gets Strings that are correctly decoded from the DB by the JDBC
driver, hence Xd is not a problem for XWiki.
To simplify things, we usually try to have Xd = Xf = Xw = Xp. Hence
if you want to serve multilingual content (multi being > 3), you are
quickly obliged to use UTF-8 as Xw, and an encoding that is at least
as powerful as Xd.
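
A quick way to see why multilingual content pushes towards UTF-8 is to ask each charset whether it can represent a given text at all. A minimal sketch, with a hypothetical class name and a sample string made up for the example:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetCoverageDemo {
    // Returns true if every char of the text can be encoded in the charset.
    static boolean canRepresent(String text, Charset charset) {
        return charset.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        // French, Greek and Chinese in the same page, written with Unicode
        // escapes so that this source file stays ASCII.
        String mixed = "caf\u00e9 \u03b1\u03b2 \u4e2d\u6587";

        System.out.println(canRepresent("caf\u00e9", StandardCharsets.ISO_8859_1)); // true
        System.out.println(canRepresent(mixed, StandardCharsets.ISO_8859_1));       // false
        System.out.println(canRepresent(mixed, StandardCharsets.UTF_8));            // true
    }
}
```

The same check applies to Xd: the database encoding has to be able to hold everything that Xw can serve.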
For the rest, we should hunt down all Reader creations that do
not specify the encoding, as well as Writer creations and calls to
String.getBytes()...
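
For the writing side, a minimal sketch of what the fixed calls could look like, assuming the byte conversion meant above is String.getBytes(). The class and helper names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ExplicitCharsetWrite {
    // Hypothetical helper: writes text through a Writer that is given its
    // charset explicitly, instead of new OutputStreamWriter(out) alone.
    static byte[] writeText(String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(out, charset)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String text = "caf\u00e9";
        byte[] viaWriter = writeText(text, StandardCharsets.UTF_8);
        // Same rule for getBytes: pass the charset rather than call getBytes().
        byte[] viaGetBytes = text.getBytes(StandardCharsets.UTF_8);

        System.out.println(Arrays.equals(viaWriter, viaGetBytes)); // true
        System.out.println(viaWriter.length); // 5: three ASCII bytes plus two for e-acute
    }
}
```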
Regards, Gilles,
--
Gilles Sérasset
GETALP-LIG BP 53 - F-38041 Grenoble Cedex 9
Phone: +33 4 76 51 43 80 Fax: +33 4 76 44 66 75