On Feb 16, 2008, at 09:00, Sergiu Dumitriu wrote:
that MySQL has to be manually configured for UTF-8, as by default it comes in latin1.
Isn't this a problem with databases that are configured in ISO-8859-1 by default most of the time?
Yes, it is. Right now there is a component somewhere that converts
characters not supported by the encoding into &#xxx; escapes, but I
can't remember which. With these escapes, the database always receives
data in the encoding XWiki is configured with.
These are the kind of escapes that have to go away, I feel: they clutter the whole place, you never know whether you're doing it twice, and they slow everything down.
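For reference, the escaping described above roughly amounts to the following (a minimal sketch, not the actual XWiki component; the class and method names here are made up, and plain ASCII stands in for whatever target encoding is configured):

```java
// Hypothetical sketch: any character outside the target encoding
// (here: ASCII, for illustration) becomes a numeric character reference.
public class NumericEscaper {
    public static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (cp < 128) {
                sb.append((char) cp);                    // representable as-is
            } else {
                sb.append("&#").append(cp).append(';');  // e.g. 'é' -> &#233;
            }
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("café")); // prints caf&#233;
    }
}
```

Doing this once, consistently, is exactly the "you never know if you're doing it twice" problem: applying it to already-escaped text turns `&#233;` into `&#38;#233;`.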
What I would really like is for Hibernate to be smart enough to enforce encodings, or to transparently convert data between the application and the database. Unfortunately, that's not the case.
I'll have to check which encodings the various DBMSs use by default. I only know that MySQL comes with latin1, and I think HSQLDB and Derby come with UTF-8.
Are you sure that such a need really exists?
We've been using mostly Derby, but at times we used MySQL... and with the following property and no extra configuration on a default (Fink-installed) MySQL, that worked:
<property name="connection.url">jdbc:mysql://dbserver/activemath?useUnicode=true&amp;characterEncoding=UTF-8</property>
(note that inside the XML configuration file the ampersand has to be escaped as &amp;)
Adding this to the installation instructions seems doable, no?
Maybe the best thing would be a small test application that lets everyone verify their setup.
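For the server side, the usual complement to the JDBC URL parameters (an assumption about typical MySQL setups; worth checking against the MySQL documentation for the version in use) is to set the default character set in my.cnf:

```ini
# my.cnf: make the server default to UTF-8 instead of latin1
[mysqld]
character-set-server = utf8
collation-server    = utf8_general_ci
```

With that in place, newly created databases and tables inherit UTF-8 without per-connection tricks.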
Same question for the servlet container.
The servlet container does not (usually) have an encoding.
There's one that's been kept implicit for too long but is now commonly written in server.xml: the charset used in URLs and in www-form-urlencoded POST content. Tomcat has long considered the platform encoding to be correct here, but this is clearly wrong. Again, people have made various workarounds...
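For Tomcat specifically, the usual workaround is to declare the URL charset explicitly on the connector in server.xml (URIEncoding is a documented Connector attribute; the port and protocol values below are just a typical example):

```xml
<!-- server.xml: force UTF-8 decoding of URL query strings on this connector -->
<Connector port="8080" protocol="HTTP/1.1"
           URIEncoding="UTF-8" />
```

Without it, Tomcat falls back to its default URI encoding regardless of what the application expects.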
It works in the system encoding, which varies by OS and country. Windows systems are usually set to an encoding that reflects the language/country, and Linux systems mostly do the same, though they tend to switch to UTF-8. Macs are yet another story (e.g. MacRoman instead of latin1), also varying per language.
I checked what happens if I override the JVM encoding. It's not good, as it is replaced for all the apps, and in a shared container that's really bad.
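To see what a given JVM actually picked up, a quick check (standard Java API, nothing XWiki-specific):

```java
import java.nio.charset.Charset;

public class DefaultCharsetCheck {
    public static void main(String[] args) {
        // The platform default charset, derived from OS/locale settings.
        System.out.println(Charset.defaultCharset().name());
        // The system property behind it. Overriding it with
        // -Dfile.encoding=UTF-8 affects every application in the JVM,
        // which is exactly the problem in a shared container.
        System.out.println(System.getProperty("file.encoding"));
    }
}
```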
Well... there the problem is even more pressing. Contemporary Apache servers deliver, by default, HTML files as having the charset of the Apache configuration!! (no joke!) Moreover, several specs state clearly that the content (e.g. HTML meta tags or XML headers) should not override a charset declared in the MIME type.
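On the Apache side, the behaviour described is controlled by the AddDefaultCharset directive; making it explicit, or turning it off so each file's own declaration wins, avoids the surprise:

```apache
# httpd.conf: send an explicit charset in the Content-Type header...
AddDefaultCharset UTF-8
# ...or disable it entirely so documents declare their own:
# AddDefaultCharset Off
```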
The platform charset is such a variable parameter that I know of no applications that can make sense of it... except maybe those that manipulate plain text in tools such as Notepad...
It's as simple as that: if you want to be on the web you need to
think globally and thus you need a universal encoding.
Thus, I'm against overriding the JVM encoding. This then makes option 2 impossible to implement, unless we decide to make XWiki products work only in certain environments.
I think that everyone has the problem.
It will be possible to do this in several years, once people have forgotten all about different charsets. Sometimes decisions made in the early stages are very hard to overcome and completely eliminate later on. Still, we can't work with reduced charsets anymore.
Add to that: math symbols (and Greek letters and symbols and...) cannot live within an 8-bit encoding, whichever it is.
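A quick illustration of why (standard Java, just for demonstration): these code points simply don't fit in a byte.

```java
public class CodePointDemo {
    public static void main(String[] args) {
        // Greek capital sigma and the n-ary summation sign:
        // both far beyond the 0-255 range an 8-bit encoding can address.
        System.out.println("Σ".codePointAt(0)); // prints 931
        System.out.println("∑".codePointAt(0)); // prints 8721
    }
}
```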
But the task seems big... as you describe below.
paul
The tough part is that there are some tools that handle conversions internally, and they work with the JVM encoding. We have such problems with JRCS (rollbacks replace non-ASCII chars with question marks) and with FOP (the same question marks appear). I'll have to study what can be done to overcome these problems.
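The question marks are the default behaviour of Java's charset encoders: an unmappable character is silently replaced by '?'. A minimal reproduction (plain Java, not JRCS or FOP code):

```java
public class QuestionMarkDemo {
    public static void main(String[] args) throws Exception {
        // Encoding through a charset that cannot represent 'é'
        // substitutes '?' instead of failing -- the same effect
        // seen when a tool falls back to a narrow JVM encoding.
        byte[] bytes = "café".getBytes("US-ASCII");
        System.out.println(new String(bytes, "US-ASCII")); // prints caf?
    }
}
```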
[...]