Vincent Massol wrote:
On Feb 15, 2008, at 2:49 PM, Sergiu Dumitriu wrote:
Hi devs,
We need to decide how to handle the charset/encoding in XWiki. We have 3 options:
1. Leave it as it is. The default is ISO-8859-1, and the admin has to make sure the JVM is started with the correct -Dfile.encoding param. If another encoding is needed, it has to be changed in 4 places (web.xml, xwiki.cfg, -Dfile.encoding, database charset+collation).
2. Force it to always be UTF-8, overriding the file.encoding setting. This ensures internationalization, as UTF-8 works with any language. And I think it is a safe move, as any modern system supports UTF-8 (given that XWiki requires Java 5, we can assume it will run on a modern system). This has the advantage that the code will be simpler, as we don't have to check and switch encodings, but the disadvantage that MySQL has to be manually configured for UTF-8, as it ships with latin1 by default.
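For reference, a rough sketch of the four places option 1 refers to, on a MySQL setup. The xwiki.cfg key and the exact web.xml element depend on the XWiki version, so treat the names below as assumptions:

```
# 1. JVM startup parameter
java -Dfile.encoding=UTF-8 ...

# 2. xwiki.cfg (assumed key name)
xwiki.encoding=UTF-8

# 3. web.xml: the encoding passed to the character-encoding filter
#    (exact filter / init-param names vary by XWiki version)

# 4. MySQL database charset + collation, set at creation time
#    CREATE DATABASE xwiki CHARACTER SET utf8 COLLATE utf8_bin;
```

If any one of the four disagrees with the others, non-Latin characters get corrupted somewhere along the request path, which is what makes option 1 fragile.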
Isn't this a problem with databases which are configured in ISO-8859-1 by default most of the time?
Yes, it is. Right now there is a component somewhere that converts characters not supported by the encoding into &#xxx; escapes, but I can't remember which one. With these escapes, the database always receives data in the encoding XWiki is configured with.

What I would really like is for Hibernate to be smart enough to enforce encodings, or to transparently convert data between the application and the database. Unfortunately, that's not the case.

I'll have to check which encodings the various DBMSs use by default. I only know that MySQL comes with latin1, and I think HSQLDB and Derby default to UTF-8.
Same question for the servlet container.
The servlet container does not (usually) have an encoding of its own. It works in the system encoding, which varies by OS and country. Windows systems are usually set to an encoding that reflects the language/country, and Linux systems mostly do the same, though they increasingly default to UTF-8.
I checked what happens if I override the JVM encoding. It's not good: the setting is replaced for all the applications running in that JVM, and in a shared container that's really bad. Thus, I'm against overriding the JVM encoding. That makes option 2 impossible to implement, unless we decide to make XWiki products work only in certain environments. It may become possible in several years, once people forget all about different charsets. Sometimes, decisions made in the early stages are very hard to overcome and completely eliminate later.

Still, we can't work with reduced charsets anymore. People all over the world should be able to use XWiki, and right now that is not possible.
Unicode is the way to go, and UTF-8 seems the best choice. Even if we can't impose it on the environment, I still think it should be used both internally and externally. Internally means that whenever we have to convert from String to byte[] and back, we explicitly request UTF-8 for the conversion. Externally means that the container is just a middle-man transparently passing data to and from the client, and the client already works with UTF-8. The web, being born a bit later, understood that Unicode is the right answer when deciding which characters to support, so most web technologies are built to work with Unicode and its UTF-8 representation. We already have problems with URLs, GET parameters and AJAX calls because we're not working with UTF-8.
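To make the two points concrete, here is a minimal sketch (Java 5-compatible, hence the checked UnsupportedEncodingException instead of the later StandardCharsets constants); the class and method names are just illustrative:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class Utf8Conversions {
    /** String -> byte[] with an explicit encoding, never the platform default. */
    public static byte[] toBytes(String s) throws UnsupportedEncodingException {
        return s.getBytes("UTF-8");
    }

    /** byte[] -> String, again with an explicit encoding. */
    public static String fromBytes(byte[] b) throws UnsupportedEncodingException {
        return new String(b, "UTF-8");
    }

    /** The same value URL-encodes differently depending on the charset, which is
        why URLs and GET parameters break when the two sides disagree. */
    public static String asUrlParam(String s, String encoding)
            throws UnsupportedEncodingException {
        return URLEncoder.encode(s, encoding);
    }
}
```

For example, `asUrlParam("é", "UTF-8")` gives `%C3%A9` while `asUrlParam("é", "ISO-8859-1")` gives `%E9`; a client decoding with the wrong one of the two sees garbage.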
The tough part is that there are some tools that handle conversions internally using the JVM encoding. We have such problems with JRCS (rollbacks replace non-ASCII chars with question marks) and with FOP (the same question marks appear). I'll have to study what can be done to overcome these problems.
I know that this decision is an important one, as it affects large portions of the code. But it is a decision that must be made sooner rather than later, so that we can prepare for the switch (btw, if we vote for anything other than 1, then this will be part of a future M1, and not of the 1.3 RCs).
So, here's option number 4: leave the system as it is, since it must be shared with other applications, but work with UTF-8 both internally, by asking for any String <-> byte[] conversion to be made in that encoding, and externally, by sending responses and expecting requests in UTF-8. Given that the database accepts charset configurations at any level (database, table, column), it is OK to ask admins to configure the XWiki database for a certain encoding.
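For example, with MySQL the charset can be set at the database, table, or column level without touching the server default (the table and column names below are illustrative):

```sql
-- Whole database
ALTER DATABASE xwiki CHARACTER SET utf8 COLLATE utf8_bin;

-- A single table (converts existing data too)
ALTER TABLE xwikidoc CONVERT TO CHARACTER SET utf8;

-- A single column
ALTER TABLE xwikidoc MODIFY XWD_CONTENT MEDIUMTEXT CHARACTER SET utf8;
```

This keeps the change local to the XWiki schema, so other applications sharing the same MySQL server are unaffected.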
Even better, I think this can be done in an Aspect, so that we don't have to manually try-catch all the conversions and always remember to specify the encoding by hand. I'm not an AOP expert, though, so I'm not sure this is possible. Is it?
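To make the question concrete: in AspectJ it could look roughly like the sketch below, intercepting every default-charset conversion made from our (woven) code and replacing it with an explicit UTF-8 one. This is an unverified sketch, not tested against the XWiki code base, and whether weaving is acceptable in all the supported containers is exactly the open question.

```
// Unverified AspectJ sketch: force UTF-8 on default-charset conversions.
public aspect ForceUtf8 {
    // Intercept calls from woven code to the no-arg String.getBytes()
    byte[] around(String s): call(byte[] String.getBytes()) && target(s) {
        try {
            return s.getBytes("UTF-8");
        } catch (java.io.UnsupportedEncodingException e) {
            return proceed(s); // UTF-8 is always supported; fall back just in case
        }
    }

    // Intercept calls to the new String(byte[]) constructor
    String around(byte[] b): call(String.new(byte[])) && args(b) {
        try {
            return new String(b, "UTF-8");
        } catch (java.io.UnsupportedEncodingException e) {
            return proceed(b);
        }
    }
}
```

Note that call() pointcuts only match the calling side, so this would cover our own code but not conversions made inside third-party jars like JRCS or FOP unless those are woven too.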
I can't vote till I know the answer to these 2 questions.
Thanks
-Vincent
PS: As a principle I don't like hard-coding anything, so if these questions are answered satisfactorily I'll be OK, but with a single config parameter set to UTF-8 by default in xwiki.cfg.
> 3. Keep it configurable, but by only specifying it in one place
> (xwiki.cfg or web.xml), and enforcing that encoding in the JVM (by
> overriding file.encoding). The default should be UTF-8.
>
> Here's my +1 for option 2, -1 for option 1, and 0 for option 3.
--
Sergiu Dumitriu
http://purl.org/net/sergiu/