On 4/6/07, Vincent Massol <[email protected]> wrote:
Hi,
I admit it: I'm not an expert in I18N. However, I realize that since XWiki is a wiki we need to have strong I18N features, so I'm trying to catch up on I18N knowledge...
I started yesterday by reading this excellent short tutorial: http://www.joelonsoftware.com/articles/Unicode.html (The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)). It's very good and an easy read. I recommend it to everyone.
It's a bit too short, though: just when things get interesting, it ends. This led me to a few questions:
1) Is UTF8 supported on all platforms? Is it supported on mobile platforms for example?
Platforms too old to know about UTF-8 are unlikely to be able to run XWiki anyway.
2) I see in our encoding guide on http://www.xwiki.org/xwiki/bin/view/AdminGuide/Encoding that we need to set the encoding for the container. Why is that required? The servlet container reads pages which have the encoding specified (using Content-Type meta data), so why does it need to be told about the encoding to use?
If you mean the parameter in web.xml, then it's not the container encoding, but a parameter used to correctly identify outgoing files (it sets the Content-Type header according to this param). There is also data which doesn't come with a Content-Type. First, there are the files stored on disk. Second, when you POST some data or GET a resource, the request does not say which encoding it uses: browsers generally send no charset with it.
3) I see that in our standalone installation we use -Dfile.encoding=iso-8859-1. Now that I've read Joel's tutorial it seems to me this is not going to work for everyone and that we should rather use -Dfile.encoding=UTF-8 by default. WDYT?
UTF-8 is better. But we really should not depend on the file encoding.
4) Should we use the platform encoding or default to using UTF-8 all the time? (This question is related to 1).) I think we should use the platform encoding but I'm curious to know what others think.
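To make the platform-encoding question concrete, here is a minimal Java sketch of what happens when text is written with one encoding and read with another, and of what ISO-8859-1 does to text outside its repertoire. The class name, the roundTrip helper, and the sample strings are made up for illustration; nothing here is XWiki code.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ExplicitCharsetDemo {

    // Hypothetical helper: encode text with one charset, decode the bytes with another.
    static String roundTrip(String text, Charset writeWith, Charset readWith) {
        return new String(text.getBytes(writeWith), readWith);
    }

    public static void main(String[] args) {
        String french = "caf\u00E9"; // "cafe" with an acute accent on the e
        // The Bulgarian word for "Bulgarian", written with escapes so that this
        // source file itself does not depend on any encoding.
        String bulgarian = "\u0431\u044A\u043B\u0433\u0430\u0440\u0441\u043A\u0438";

        // This is what -Dfile.encoding (or the OS locale) decides when no charset is given.
        System.out.println("platform default: " + Charset.defaultCharset());

        // Same charset on both sides: the text survives.
        System.out.println(french.equals(
            roundTrip(french, StandardCharsets.UTF_8, StandardCharsets.UTF_8)));

        // Written as UTF-8 but read as ISO-8859-1: the accented letter turns into two characters.
        System.out.println(
            roundTrip(french, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1).length());

        // ISO-8859-1 has no Cyrillic letters, so they are replaced when encoding.
        System.out.println(bulgarian.equals(
            roundTrip(bulgarian, StandardCharsets.ISO_8859_1, StandardCharsets.ISO_8859_1)));
    }
}
```

Passing the charset explicitly on both sides removes the dependency on whatever default the JVM was started with.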
UTF-8 all the time. Thus we have no dependency on the system, and we don't need guides on "how to change the encoding in only 7 places to make my wiki know Bulgarian".
5) Jackson Wang is proposing in a patch to modify readPackage like this:
 private Document readPackage(InputStream is) throws IOException, DocumentException {
-    byte[] data = new byte[4096];
+    //UTF-8 characters could cause encoding as continued bytes over 4096 boundary,
+    // so change byte to char. ---Jackson
+    char[] data = new char[4096];
+    BufferedReader in= new BufferedReader(new InputStreamReader (is));
     StringBuffer XmlFile = new StringBuffer();
     int Cnt;
-    while ((Cnt = is.read(data, 0, 4096)) != -1) {
+    while ((Cnt = in.read(data, 0, 4096)) != -1) {
         XmlFile.append(new String(data, 0, Cnt));
-    }
+    }
     return fromXml(XmlFile.toString());
 }
However, with my new understanding, I'm not sure this would help, as a char is stored on 2 bytes in Java while UTF-8 can use up to 4 bytes per character. Am I correct?
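One way to probe this question is a small experiment. The sketch below is not the XWiki code: the class and method names are invented, the chunk size is shrunk to 2 so that boundaries fall inside multi-byte sequences, and UTF-8 is passed explicitly (the patch's new InputStreamReader(is) would use the platform default instead). It compares decoding each byte chunk on its own with letting a Reader decode. A character that takes 4 bytes in UTF-8 ends up as two Java chars (a surrogate pair), which the Reader produces by itself.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class ChunkBoundaryDemo {

    // Mirrors the original loop: every chunk of bytes is decoded on its own, so a
    // multi-byte sequence cut by a chunk boundary is decoded as two broken halves.
    static String decodePerChunk(byte[] input, int chunkSize) {
        ByteArrayInputStream is = new ByteArrayInputStream(input);
        byte[] data = new byte[chunkSize];
        StringBuilder xml = new StringBuilder();
        int count;
        while ((count = is.read(data, 0, chunkSize)) != -1) {
            xml.append(new String(data, 0, count, StandardCharsets.UTF_8));
        }
        return xml.toString();
    }

    // Mirrors the patched loop: the Reader does the decoding and hands back chars,
    // so the loop never sees a partial byte sequence.
    static String decodeWithReader(byte[] input, int chunkSize) {
        StringBuilder xml = new StringBuilder();
        try (Reader in = new InputStreamReader(
                new ByteArrayInputStream(input), StandardCharsets.UTF_8)) {
            char[] data = new char[chunkSize];
            int count;
            while ((count = in.read(data, 0, chunkSize)) != -1) {
                xml.append(data, 0, count);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return xml.toString();
    }

    public static void main(String[] args) {
        // 'a' (1 byte), e-acute (2 bytes), U+1F600 (4 bytes, stored as 2 Java chars)
        String text = "a\u00E9\uD83D\uDE00";
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);

        System.out.println("chars: " + text.length() + ", UTF-8 bytes: " + utf8.length);
        System.out.println("per-chunk decoding intact: "
            + text.equals(decodePerChunk(utf8, 2)));
        System.out.println("reader decoding intact: "
            + text.equals(decodeWithReader(utf8, 2)));
    }
}
```

In this sketch the size of a Java char does not matter for correctness: what matters is that decoding is done by one stateful decoder over the whole stream rather than chunk by chunk.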
However, I would rather use http://jakarta.apache.org/commons/io/api-release/org/apache/commons/io/IOUtils.html#toString(java.io.InputStream) than code it ourselves... Sounds safer, shorter, less maintenance, etc. to me... :)
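For comparison, here is a rough standard-library sketch of what such a "read the whole stream into a String" helper does. It is not the Commons IO implementation; the class and method names are made up. Note that IOUtils.toString(InputStream) decodes with the platform default encoding, while Commons IO also offers an overload that takes an encoding name, which is the variant this sketch imitates by requiring a Charset.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ReadWholeStream {

    // Hypothetical helper: collect all bytes first, then decode once, so no
    // multi-byte sequence can be cut in half by a read boundary.
    static String readToString(InputStream is, Charset charset) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int count;
        try {
            while ((count = is.read(chunk, 0, chunk.length)) != -1) {
                buffer.write(chunk, 0, count);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(buffer.toByteArray(), charset);
    }

    public static void main(String[] args) {
        String xml = "<doc>caf\u00E9</doc>";
        InputStream is = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        // Prints whether the text survived the round trip through bytes.
        System.out.println(xml.equals(readToString(is, StandardCharsets.UTF_8)));
    }
}
```

Whichever helper is used, passing the encoding explicitly keeps the result independent of -Dfile.encoding.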
+1. Always reuse existing proven code rather than reinvent a squeaky wheel.
Thanks for your help
-Vincent
Sergiu
--
http://purl.org/net/sergiu