Re: [xwiki-dev] Trying to understand I8N...

6 Apr 2007

On 4/6/07, Vincent Massol <vincent(a)massol.net> wrote:

 Hi,
 I admit it: I'm not an expert in I8N. However I realize that XWiki
 being a wiki we need to have strong I8N features so I'm trying to
 catch up with I8N knowledge...
 I started yesterday by reading this excellent short tutorial
 http://www.joelonsoftware.com/articles/Unicode.html (The Absolute Minimum
 Every Software Developer Absolutely, Positively Must Know About
 Unicode and Character Sets (No Excuses!)). It's very good and an easy
 read. I recommend it to everyone. 
It's a bit too short: just when things get interesting, it ends.
 This led me to a few questions:

 1) Is UTF8 supported on all platforms? Is it supported on mobile
 platforms for example? 
Platforms old enough not to know UTF-8 are unlikely to be able to run XWiki
at all.
 2) I see in our encoding guide on
 http://www.xwiki.org/xwiki/bin/view/AdminGuide/Encoding that we need to
 set the encoding for the
 container. Why is that required? The servlet container reads pages
 which have the encoding specified (using Content-Type meta data), so
 why does it need to be told about the encoding to use?
If you mean the parameter in web.xml, then it's not the container encoding,
but a parameter used to correctly label outgoing content (it sets the
Content-Type header according to this parameter).
Beyond that, not all data comes with a Content-Type. First, there are the
files stored on disk. Second, when you POST some data or GET a resource, the
request generally doesn't declare a character encoding in this HTTP header,
so the container has to be told which encoding to assume.
 3) I see that in our standalone installation we use
 -Dfile.encoding=iso-8859-1. Now that I've read Joel's tutorial it
 seems to me this is not going to work for everyone and that we should
 rather use -Dfile.encoding=UTF-8 by default. WDYT?
UTF-8 is better. But we really should not depend on the file encoding.
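To illustrate what "not depending on the file encoding" could look like in code, here is a minimal sketch. It assumes a JDK that provides java.nio.charset.StandardCharsets (Java 7 or later); the class and method names are hypothetical and not part of XWiki. The idea is simply to name the charset at every byte/char conversion, so that the value of -Dfile.encoding stops mattering.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not XWiki code.
class ExplicitCharsetDemo {
    // Encode text with an explicit charset instead of the platform default.
    static byte[] encode(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Decode bytes with an explicit charset; the result is the same
    // whatever -Dfile.encoding was passed to the JVM.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Four Cyrillic letters, written as escapes so the source file's
        // own encoding does not matter either.
        String original = "\u0411\u044A\u043B\u0433";
        byte[] utf8 = encode(original);
        System.out.println("bytes: " + utf8.length);
        System.out.println("round trip ok: " + decode(utf8).equals(original));
    }
}
```

By contrast, text.getBytes() and new String(bytes) without a charset argument use the platform default, which is exactly the dependency on the file encoding that this reply argues against.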
 4) Should we use the platform encoding or default to using UTF-8 all
 the time? (this question is related to 1)). I think we should use the
 platform encoding but I'm curious to know what others think.
UTF-8 all the time. That way we have no dependency on the system, and we don't
need guides on "how to change the encoding in only 7 places to make my wiki
know Bulgarian".
 5) Jackson Wang is proposing in a patch to modify readPackage like this:

       private Document readPackage(InputStream is) throws IOException, DocumentException
       {
 -        byte[] data = new byte[4096];
 +        //UTF-8 characters could cause encoding as continued bytes over 4096 boundary,
 +        // so change byte to char.  ---Jackson
 +        char[] data = new char[4096];
 +        BufferedReader in= new BufferedReader(new InputStreamReader(is));
           StringBuffer XmlFile = new StringBuffer();
           int Cnt;
 -        while ((Cnt = is.read(data, 0, 4096)) != -1) {
 +        while ((Cnt = in.read(data, 0, 4096)) != -1) {
               XmlFile.append(new String(data, 0, Cnt));
 -        }
 +       }
           return fromXml(XmlFile.toString());
       }
 However with my new understanding I'm not sure this would help as
 char are stored on 2 bytes in Java and UTF-8 encoding can store on up
 to 4 bytes. Am I correct?
 However, I would rather use
 http://jakarta.apache.org/commons/io/api-release/org/apache/commons/io/IOUtils.html#toString(java.io.InputStream)
 than code it ourselves... Sounds safer,
 shorter, less maintenance, etc to me... :)
+1. Always reuse existing, proven code rather than reinvent a squeaky wheel.
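For what it's worth, the boundary problem the patch is after can be reproduced with a small buffer. The sketch below is only an illustration under stated assumptions: the class and method names are hypothetical, the chunk size is a parameter so the effect shows up on tiny inputs, and UTF-8 is passed explicitly (the patch's new InputStreamReader(is) and the original new String(data, 0, Cnt) both fall back to the platform default encoding). On the char question: a Reader decodes bytes into UTF-16 chars, so a character that takes up to 4 bytes in UTF-8 ends up as one char or as a surrogate pair of two chars, and the decoder carries incomplete byte sequences over between reads.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for illustration only; not XWiki code.
class StreamDecodingDemo {
    // Decodes each byte chunk on its own, like the original loop.
    // A multi-byte sequence cut at a chunk boundary gets corrupted.
    static String readByByteChunks(InputStream is, int chunkSize) throws IOException {
        byte[] data = new byte[chunkSize];
        StringBuilder sb = new StringBuilder();
        int count;
        while ((count = is.read(data, 0, chunkSize)) != -1) {
            sb.append(new String(data, 0, count, StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    // Lets a Reader do the decoding; it keeps partial sequences
    // between reads, so chunk boundaries no longer matter.
    static String readWithReader(InputStream is, int chunkSize) throws IOException {
        Reader in = new InputStreamReader(is, StandardCharsets.UTF_8);
        char[] data = new char[chunkSize];
        StringBuilder sb = new StringBuilder();
        int count;
        while ((count = in.read(data, 0, chunkSize)) != -1) {
            sb.append(data, 0, count);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String text = "a\u00E9"; // 'a' is 1 byte in UTF-8, e-acute is 2 bytes
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        String chunked = readByByteChunks(new ByteArrayInputStream(utf8), 2);
        String viaReader = readWithReader(new ByteArrayInputStream(utf8), 2);
        System.out.println("byte chunks intact: " + chunked.equals(text));
        System.out.println("reader intact: " + viaReader.equals(text));
    }
}
```

If IOUtils is used instead, note that the one-argument toString(InputStream) also relies on the platform default encoding; the overload that takes an encoding name avoids that.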
 Thanks for your help
 -Vincent

Sergiu
--
http://purl.org/net/sergiu
