Hi,
On 6 Apr 07, at 22:28, Vincent Massol wrote:
I did not do
any extended tests to see if this parameter was useful.
I can only say that there are many places in the code where you
have:
InputStreamReader ir = new InputStreamReader(is)
This is one of the numerous examples of badly written code where
the encoding is not specified... Hence the platform falls back to
the file.encoding property value.
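To illustrate the fix: the two-argument InputStreamReader constructor pins the charset explicitly, so the result no longer depends on whatever file.encoding happens to be at run time. A minimal sketch (the helper name readUtf8 is mine, not from the XWiki code base):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;

public class ExplicitEncoding {
    // Reads a stream as UTF-8 regardless of the platform's file.encoding.
    static String readUtf8(InputStream is) throws IOException {
        Reader r = new InputStreamReader(is, "UTF-8");
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = r.read()) != -1) {
            sb.append((char) c);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // "caf\u00e9" == "café"; escaped so this source file is encoding-neutral.
        byte[] utf8Bytes = "caf\u00e9".getBytes("UTF-8");
        System.out.println(readUtf8(new ByteArrayInputStream(utf8Bytes)));
    }
}
```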
yep. I'm not sure it's bad. Or at least I'm curious to understand
why it's bad.
It may be good, but then you'll need your input (skin) files to be
delivered in that encoding... Well, for now it is, as they only contain ASCII.
However, there
are currently no accented chars in the skin files
(that are read from disk), hence it somehow works, because all
platform encodings share the encoding of ASCII chars. It will
not be the same if you were to use UTF-16, for instance...
IMO the encoding to use should be left to the user and be a
configuration option (as it is now) but we should configure
everything to use UTF8 by default.
3) I see that in our standalone installation we
use -Dfile.encoding=iso-8859-1. Now that I've read Joel's tutorial it
seems to me this is not going to work for everyone and that we
should rather use -Dfile.encoding=UTF-8 by default. WDYT?
This will mean that all files that are read by the server, will
have to be encoded in UTF-8...
Or any compatible encoding like ISO 8859-1, etc. This is the case
now I think.
ISO latin 1 IS NOT compatible with UTF-8... only ASCII (7-bit) is...
(NOTE: resource files (such as those used in XWiki i18n) are special,
as they should be encoded in ASCII with \uXXXX escapes to represent
non-ASCII chars.)
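The point about Latin-1 vs UTF-8 is easy to verify: ASCII chars produce identical bytes in both encodings, while an accented char does not. A small demonstration:

```java
import java.io.UnsupportedEncodingException;
import java.util.Arrays;

public class EncodingCompat {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // Pure ASCII survives both encodings: identical bytes.
        byte[] asciiLatin1 = "hello".getBytes("ISO-8859-1");
        byte[] asciiUtf8 = "hello".getBytes("UTF-8");
        System.out.println(Arrays.equals(asciiLatin1, asciiUtf8)); // true

        // An accented char does not: one byte in Latin-1, two in UTF-8.
        byte[] eLatin1 = "\u00e9".getBytes("ISO-8859-1"); // { 0xE9 }
        byte[] eUtf8 = "\u00e9".getBytes("UTF-8");        // { 0xC3, 0xA9 }
        System.out.println(eLatin1.length + " vs " + eUtf8.length); // 1 vs 2
    }
}
```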
4) Should we use the platform encoding or default
to using UTF-8
all the time? (this question is related to 1)). I think we should
use the platform encoding but I'm curious to know what others think.
We should NOT use the platform encoding. The reason is that all
files read by the server (skin files mainly) will be read using
the platform encoding and not their actual encoding. As they only
contain ASCII chars up to now, it worked, but if you add accents
to them, and you write them in encoding X (at edit time), you
are not guaranteed that the platform encoding will be X at run
time. Hence you should specify the file encoding whenever you read
a file.
Exactly, which is why this is best left to the user to decide which
encoding they need to use... I don't think we should force our
encoding. However, I'm proposing that we do
System.setProperty("file.encoding", getParam("xwiki.encoding")) in XWiki
initialization to set the platform encoding to be the encoding
specified in xwiki.cfg.
That's a good idea...
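One caveat worth flagging about that proposal: on Sun JVMs the default charset is captured at startup, so changing the file.encoding property afterwards does not reliably affect Readers/Writers created without an explicit charset. A sketch of the idea, with that caveat; getParam here is a hypothetical stand-in for reading xwiki.cfg:

```java
public class XWikiEncodingInit {
    // Hypothetical stand-in for reading xwiki.encoding from xwiki.cfg.
    static String getParam(String key) {
        return "UTF-8";
    }

    public static void main(String[] args) {
        // Caveat: the JVM caches its default charset at startup, so setting
        // file.encoding here mainly documents intent; code created later with
        // no explicit charset may still use the startup default. Passing the
        // charset explicitly to each Reader/Writer remains the safe option.
        System.setProperty("file.encoding", getParam("xwiki.encoding"));
        System.out.println(System.getProperty("file.encoding"));
    }
}
```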
[snip]
... applied!
good
However, I would rather use
http://jakarta.apache.org/commons/io/api-release/org/apache/commons/io/IOUtils.html#toString(java.io.InputStream)
than code it ourselves... Sounds safer,
shorter, less maintenance, etc. to me... :)
This method has exactly the same problem: it'll use the platform
encoding, even if the input stream is not encoded in the platform
encoding and even if it correctly declares its own encoding...
Hence it will be buggy.
Sure but that's ok if the encoding is specified (file.encoding),
right? That said I agree that no conversion is better.
Well, not here, as the package file is a file that has been produced
by somebody else, on another platform, hence either we decide that
all files are always UTF-8, or it is encoded in the producer's
platform encoding, not the one that is used to read it... That's why
we have to delegate encoding detection to the XML parser.
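Delegating to the XML parser works because the declared encoding travels with the bytes: feed the parser a raw InputStream (not a Reader, which would impose our own decoding) and it honors the <?xml ... encoding=...?> declaration. A small sketch with the JAXP API:

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class XmlEncodingDetection {
    public static void main(String[] args) throws Exception {
        // A document produced on a Latin-1 platform, declaring its encoding.
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
                + "<doc>caf\u00e9</doc>";
        byte[] bytes = xml.getBytes("ISO-8859-1");

        DocumentBuilder db =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // Feed raw bytes: the parser reads the declaration and decodes
        // correctly, whatever the producer's or reader's platform encoding.
        Document doc = db.parse(new ByteArrayInputStream(bytes));
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```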
NOW A SMALL WORD ABOUT XWIKI ENCODING:
Let's pose the problem like this: xwiki
- serves web pages (that are served using encoding Xw)
- gets POST and GET parameters (that are encoded using encoding Xp)
- is asked for pages named Space/DocumentName (that are encoded
  using encoding Xp)
- reads skin files (that are encoded using encoding Xf)
- reads data from the DB (which is encoded using Xd)
Well, maybe there is more, but let's simplify...
1. We can rather safely assume that Xw and Xp are the same, but we
may have older clients that disagree on this. We should maybe work
with the "accept-charset" attribute of forms to avoid this kind of
problem.
2. As xwiki provides the skin files that are read, xwiki can choose
whatever encoding it wants as Xf. Xf is, then, the encoding of the
XWiki code base (see previous discussion).
3. If the database and hibernate are correctly configured with Xd,
xwiki gets Strings that are correctly decoded from the DB by the JDBC
driver, hence Xd is not a problem for xwiki.
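As an illustration of configuring Xd (the property names below assume MySQL Connector/J; other drivers use different URL parameters), a hibernate.cfg.xml fragment might look like:

```xml
<!-- Sketch: tell the JDBC driver which encoding the DB connection uses (Xd). -->
<property name="connection.url">jdbc:mysql://localhost/xwiki?useUnicode=true&amp;characterEncoding=UTF-8</property>
```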
To simplify things, we usually try to have Xd = Xf = Xw = Xp.
Hence if you want to serve multilingual content (multi being > 3),
you are quickly obliged to use UTF-8 as Xw, and an equally or more
powerful encoding as Xd.
For the rest, we should hunt all calls to Reader creation that do
not specify the encoding, as well as Writer creations and calls
to toBytes...
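For the hunt, the pattern is the same for writing as for reading: every place the code relies on the default charset gets an explicit one. A sketch of the before/after for Writers and byte conversion:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;

public class ExplicitEverywhere {
    public static void main(String[] args) throws IOException {
        String s = "\u00e9t\u00e9"; // "été"

        // Bad:  s.getBytes()          -- platform-dependent result
        // Good: s.getBytes("UTF-8")   -- same bytes on every platform
        byte[] bytes = s.getBytes("UTF-8");
        System.out.println(bytes.length); // 5: two 2-byte chars plus 't'

        // Bad:  new FileWriter(f) or new OutputStreamWriter(os)
        // Good: wrap the stream with an explicit charset
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        Writer w = new OutputStreamWriter(out, "UTF-8");
        w.write(s);
        w.close();
        System.out.println(out.size()); // 5 again
    }
}
```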
Don't you think it should be OK if we set file.encoding to be
the xwiki.encoding value and we set it to UTF-8 by default?
As long as all "normal" files (i.e. non-XML files, skin mainly) are
always in ASCII (i.e. compatible with all other current encodings;
well, not all, but let's forget old IBM encodings...), it should be
harmless... and if a user creates new skin files (non-ASCII) on the
file system, they'll be compatible.
Regards,
Gilles,
--
Gilles SĂ©rasset
GETALP-LIG BP 53 - F-38041 Grenoble Cedex 9
Phone: +33 4 76 51 43 80 Fax: +33 4 76 44 66 75