Trying to understand I18N...

6 Apr 2007

Hi,
I admit it: I'm not an expert in I18N. However, since XWiki is a
wiki, we need strong I18N features, so I'm trying to catch up on my
I18N knowledge...
I started yesterday by reading this excellent short tutorial:
http://www.joelonsoftware.com/articles/Unicode.html
(The Absolute Minimum Every Software Developer Absolutely, Positively
Must Know About Unicode and Character Sets (No Excuses!)). It's very
good and an easy read. I recommend it to everyone.
This led me to a few questions:
1) Is UTF-8 supported on all platforms? Is it supported on mobile
platforms, for example?
2) I see in our encoding guide at
http://www.xwiki.org/xwiki/bin/view/AdminGuide/Encoding
that we need to set the encoding for the container. Why is that
required? The servlet container reads pages which have the encoding
specified (using Content-Type metadata), so why does it need to be
told which encoding to use?
3) I see that in our standalone installation we use
-Dfile.encoding=iso-8859-1. Now that I've read Joel's tutorial, it
seems to me this is not going to work for everyone and that we should
rather use -Dfile.encoding=UTF-8 by default. WDYT?
4) Should we use the platform encoding or default to using UTF-8 all
the time? (This question is related to 1).) I think we should use the
platform encoding, but I'm curious to know what others think.
5) Jackson Wang is proposing in a patch to modify readPackage like this:
      private Document readPackage(InputStream is) throws IOException, DocumentException
      {
-        byte[] data = new byte[4096];
+        //UTF-8 characters could cause encoding as continued bytes over 4096 boundary,
+        // so change byte to char.  ---Jackson
+        char[] data = new char[4096];
+        BufferedReader in= new BufferedReader(new InputStreamReader(is));
          StringBuffer XmlFile = new StringBuffer();
          int Cnt;
-        while ((Cnt = is.read(data, 0, 4096)) != -1) {
+        while ((Cnt = in.read(data, 0, 4096)) != -1) {
              XmlFile.append(new String(data, 0, Cnt));
-        }
+       }
          return fromXml(XmlFile.toString());
      }
However, with my new understanding, I'm not sure this would help, as
a char is stored on 2 bytes in Java and a UTF-8 encoded character can
take up to 4 bytes. Am I correct?
In any case, I would rather use
http://jakarta.apache.org/commons/io/api-release/org/apache/commons/io/IOUtils.html#toString(java.io.InputStream)
than code it ourselves... Sounds safer, shorter, less maintenance,
etc. to me... :)
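To make the concern in point 5 concrete, here is a minimal sketch (not
the XWiki code) comparing the two loops. The class and method names are
made up for illustration, the charset is passed explicitly as UTF-8
(an assumption: the patch calls new InputStreamReader(is), which uses
the platform default), and a tiny chunk size stands in for 4096 so the
split is easy to trigger. Decoding each byte chunk on its own can cut a
multi-byte sequence in half, while a Reader keeps its decoder state
between reads. Characters outside the BMP come out of the Reader as two
Java chars (a surrogate pair), and appending the chunks to the same
buffer puts a split pair back together.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of XWiki.
class ChunkDecodingSketch {

    // Same shape as the original loop: every byte chunk is decoded separately,
    // so a multi-byte sequence that straddles two chunks is decoded in halves.
    static String decodePerChunk(InputStream is, Charset cs, int chunkSize) throws IOException {
        byte[] data = new byte[chunkSize];
        StringBuilder sb = new StringBuilder();
        int count;
        while ((count = is.read(data, 0, chunkSize)) != -1) {
            sb.append(new String(data, 0, count, cs));
        }
        return sb.toString();
    }

    // Same shape as the patched loop, but with an explicit charset:
    // the Reader decodes bytes to chars and remembers partial sequences.
    static String decodeWithReader(InputStream is, Charset cs, int chunkSize) throws IOException {
        Reader in = new InputStreamReader(is, cs);
        char[] data = new char[chunkSize];
        StringBuilder sb = new StringBuilder();
        int count;
        while ((count = in.read(data, 0, chunkSize)) != -1) {
            sb.append(data, 0, count);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String text = "a\u00e9";                             // 'a' followed by e-acute
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8); // 3 bytes: 61 C3 A9

        // A chunk size of 2 splits the two-byte sequence C3 A9 across reads.
        String perChunk = decodePerChunk(new ByteArrayInputStream(utf8), StandardCharsets.UTF_8, 2);
        String viaReader = decodeWithReader(new ByteArrayInputStream(utf8), StandardCharsets.UTF_8, 2);

        System.out.println("per-chunk decoding intact: " + perChunk.equals(text));
        System.out.println("reader decoding intact:    " + viaReader.equals(text));
    }
}
```

If commons-io is on the classpath, IOUtils.toString(InputStream, String)
takes the encoding name as its second argument and should replace the
whole loop with a single call.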
Thanks for your help
-Vincent
