[xwiki-devs] [vote] XWiki charset strategy


Sergiu Dumitriu

15 Feb 2008

2:49 p.m.

Hi devs,

We need to decide how to handle the charset/encoding in XWiki. We have 3 options:

1. Leave it as it is. The default is ISO-8859-1, and the admin has to make sure the JVM is started with the correct -Dfile.encoding param. If another encoding is needed, it has to be changed in 4 places (web.xml, xwiki.cfg, -Dfile.encoding, database charset+collation).

2. Force it to always be UTF-8, overriding the file.encoding setting. This ensures internationalization, as UTF-8 works with any language. And I think it is a safe move, as any modern system supports UTF-8 (given that XWiki requires Java 5, we can assume it will be on a modern system). This has the advantage that the code will be simpler, as we don't have to check and switch encodings, but has the disadvantage that MySQL has to be manually configured for UTF-8, as by default it comes in latin1.

3. Keep it configurable, but by only specifying it in one place (xwiki.cfg or web.xml), and enforcing that encoding in the JVM (by overriding file.encoding). The default should be UTF-8.

Here's my +1 for option 2, -1 for option 1, and 0 for option 3.

-- Sergiu Dumitriu http://purl.org/net/sergiu/


Thomas Mortagne

15 Feb

6:18 p.m.

On Fri, Feb 15, 2008 at 2:49 PM, Sergiu Dumitriu <[email protected]> wrote:

...

Hi devs,

We need to decide how to handle the charset/encoding in XWiki. We have 3 options:

1. Leave it as it is. The default is ISO-8859-1, and the admin has to make sure the JVM is started with the correct -Dfile.encoding param. If another encoding is needed, it has to be changed in 4 places (web.xml, xwiki.cfg, -Dfile.encoding, database charset+collation)

2. Force it to always be UTF-8, overriding the file.encoding setting. This ensures internationalization, as UTF-8 works with any language. And I think it is a safe move, as any modern system supports UTF-8 (given that XWiki requires java 5, we can assume it will be in a modern system). This has the advantage that the code will be simpler, as we don't have to check and switch encodings, but has the disadvantage that mysql has to be manually configured for UTF-8, as by default it comes in latin1.

3. Keep it configurable, but by only specifying it in one place (xwiki.cfg or web.xml), and enforcing that encoding in the JVM (by overriding file.encoding). The default should be UTF-8.

Here's my +1 for option 2, -1 for option 1, and 0 for option 3. -- Sergiu Dumitriu http://purl.org/net/sergiu/ _______________________________________________ devs mailing list [email protected] http://lists.xwiki.org/mailman/listinfo/devs

I don't have anything to add beyond what you already said, so also +1 for option 2, -1 for option 1, and 0 for option 3.

-- Thomas Mortagne

Vincent Massol

7:22 p.m.

On Feb 15, 2008, at 2:49 PM, Sergiu Dumitriu wrote:

...

Hi devs,

We need to decide how to handle the charset/encoding in XWiki. We have 3 options:

1. Leave it as it is. The default is ISO-8859-1, and the admin has to make sure the JVM is started with the correct -Dfile.encoding param. If another encoding is needed, it has to be changed in 4 places (web.xml, xwiki.cfg, -Dfile.encoding, database charset+collation)

2. Force it to always be UTF-8, overriding the file.encoding setting. This ensures internationalization, as UTF-8 works with any language. And I think it is a safe move, as any modern system supports UTF-8 (given that XWiki requires java 5, we can assume it will be in a modern system). This has the advantage that the code will be simpler, as we don't have to check and switch encodings, but has the disadvantage that mysql has to be manually configured for UTF-8, as by default it comes in latin1.

Isn't this a problem with databases which are configured in ISO8859-1 by default most of the time?

Same question for the servlet container.

I can't vote till I know the answer to these 2 questions.

Thanks -Vincent

PS: As a principle I don't like hard-coding anything so if these questions are answered satisfactorily I'll be ok but with a single config parameter set to UTF8 by default in xwiki.cfg.

...

3. Keep it configurable, but by only specifying it in one place (xwiki.cfg or web.xml), and enforcing that encoding in the JVM (by overriding file.encoding). The default should be UTF-8.

Here's my +1 for option 2, -1 for option 1, and 0 for option 3. -- Sergiu Dumitriu http://purl.org/net/sergiu/

Paul Libbrecht

16 Feb

12:25 a.m.

On 15 Feb 08, at 19:22, Vincent Massol wrote:

...

...
I think it is a safe move, as any modern system supports UTF-8 (given that XWiki requires java 5, we can assume it will be in a modern system). This has the advantage that the code will be simpler, as we don't have to check and switch encodings, but has the disadvantage that mysql has to be manually configured for UTF-8, as by default it comes in latin1.

Isn't this a problem with databases which are configured in ISO8859-1 by default most of the time?

I believe this is always a question of the connection and not of the database itself. But that's just my guess. A related but distinct matter is the sorting, isn't it?

...

Same question for the servlet container.

The servlet container would default to the platform's encoding which is not always latin1. But I think that the only place this is relevant is at the encoding for URL-encoded values... and these are recommended to be utf-8 by the new URI spec.

I do not know if there's any vote I can express but I would strongly vote with -1 for 1 and 3 and stick to utf-8 for greatest simplicity and finally really really reach readable URIs.

paul

PS: except for simplicity I have never met anyone refusing to try to set utf-8 as standard for everything.

Vincent Massol

9:07 a.m.

On Feb 16, 2008, at 12:25 AM, Paul Libbrecht wrote:

...

On 15 Feb 08, at 19:22, Vincent Massol wrote:

...
...
I think it is a safe move, as any modern system supports UTF-8 (given that XWiki requires java 5, we can assume it will be in a modern system). This has the advantage that the code will be simpler, as we don't have to check and switch encodings, but has the disadvantage that mysql has to be manually configured for UTF-8, as by default it comes in latin1.

Isn't this a problem with databases which are configured in ISO8859-1 by default most of the time?

I believe this is always a question of connection and not of the database itself. But that's just my hint. A related but not equal matter is the sorting, or ?

...
Same question for the servlet container.

The servlet container would default to the platform's encoding which is not always latin1. But I think that the only place this is relevant is at the encoding for URL-encoded values... and these are recommended to be utf-8 by the new URI spec.

I do not know if there's any vote I can express but I would strongly vote with -1 for 1 and 3 and stick to utf-8 for greatest simplicity and finally really really reach readable URIs.

paul

PS: except for simplicity I have never met anyone refusing to try to set utf-8 as standard for everything.

The question is not whether we want it. It's whether it'll work out of the box or not. -Vincent

Jan Kodera

10:03 a.m.

Hi,

My opinion is: make UTF-8 the default. I use XWiki in my national language, Czech, which has some special characters that are not in latin1. Setting UTF-8 is not a problem (except for PDF export, I think because of missing fonts for Czech), but anyone who wants to use XWiki has to deal with the encoding settings, and some people cannot handle this. They will leave XWiki. A friend of mine wanted to use XWiki too, and he was unable to set the encoding correctly, even though he is an experienced user; I had to set it for him. He is satisfied with XWiki now.

As for the databases: you can modify the query that creates the tables, so that the tables are set to UTF-8. That's not a problem, I think.

I'm not voting, because I'm not a developer. This is only my point of view.

Jan

On Feb 16, 2008 9:07 AM, Vincent Massol <[email protected]> wrote:

...

On Feb 16, 2008, at 12:25 AM, Paul Libbrecht wrote:

...
On 15 Feb 08, at 19:22, Vincent Massol wrote:

...
...
I think it is a safe move, as any modern system supports UTF-8 (given that XWiki requires java 5, we can assume it will be in a modern system). This has the advantage that the code will be simpler, as we don't have to check and switch encodings, but has the disadvantage that mysql has to be manually configured for UTF-8, as by default it comes in latin1.

Isn't this a problem with databases which are configured in ISO8859-1 by default most of the time?

I believe this is always a question of connection and not of the database itself. But that's just my hint. A related but not equal matter is the sorting, or ?

...
Same question for the servlet container.

The servlet container would default to the platform's encoding which is not always latin1. But I think that the only place this is relevant is at the encoding for URL-encoded values... and these are recommended to be utf-8 by the new URI spec.

I do not know if there's any vote I can express but I would strongly vote with -1 for 1 and 3 and stick to utf-8 for greatest simplicity and finally really really reach readable URIs.

paul

PS: except for simplicity I have never met anyone refusing to try to set utf-8 as standard for everything.

The question is not whether we want it. It's whether it'll work out of the box or not.

-Vincent

Sergiu Dumitriu

9 a.m.

Vincent Massol wrote:

...

On Feb 15, 2008, at 2:49 PM, Sergiu Dumitriu wrote:

...
Hi devs,

We need to decide how to handle the charset/encoding in XWiki. We have 3 options:

1. Leave it as it is. The default is ISO-8859-1, and the admin has to make sure the JVM is started with the correct -Dfile.encoding param. If another encoding is needed, it has to be changed in 4 places (web.xml, xwiki.cfg, -Dfile.encoding, database charset+collation)

2. Force it to always be UTF-8, overriding the file.encoding setting. This ensures internationalization, as UTF-8 works with any language. And I think it is a safe move, as any modern system supports UTF-8 (given that XWiki requires java 5, we can assume it will be in a modern system). This has the advantage that the code will be simpler, as we don't have to check and switch encodings, but has the disadvantage that mysql has to be manually configured for UTF-8, as by default it comes in latin1.

Isn't this a problem with databases which are configured in ISO8859-1 by default most of the time?

Yes, it is. Right now there is a component somewhere that converts characters not supported by the encoding into &#xxx; escapes, but I can't remember which one. With these escapes, the database always receives data in the encoding XWiki is configured with. What I would really like is for Hibernate to be smart enough to enforce encodings, or to transparently encode data between the application and the database. Unfortunately, that's not the case. I'll have to check which encodings the DBMSs use by default. I only know that MySQL comes with latin1, and I think HSQL and Derby come with UTF.
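The &#xxx; escaping described above can be illustrated roughly as follows. This is only a sketch under assumptions: the class and method names (NcrEscaper, escapeUnsupported) are hypothetical and are not the XWiki component mentioned in the message, which is not identified in this thread. It relies only on the standard java.nio.charset API.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

/** Hypothetical helper for illustration; not the actual XWiki component. */
final class NcrEscaper {

    /**
     * Replaces every code point that the target charset cannot encode
     * with a decimal numeric character reference of the form &#NNN;.
     */
    static String escapeUnsupported(String text, Charset target) {
        CharsetEncoder encoder = target.newEncoder();
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            int codePoint = text.codePointAt(i);
            int length = Character.charCount(codePoint); // 2 for surrogate pairs
            String piece = text.substring(i, i + length);
            if (encoder.canEncode(piece)) {
                out.append(piece);
            } else {
                out.append("&#").append(codePoint).append(';');
            }
            i += length;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "cestina" spelled with c-caron and s-caron, written as Unicode
        // escapes so the source file itself stays pure ASCII.
        String czech = "\u010de\u0161tina";
        // prints &#269;e&#353;tina
        System.out.println(escapeUnsupported(czech, Charset.forName("ISO-8859-1")));
    }
}
```

With UTF-8 as the target charset the same call returns the input unchanged, since every code point is encodable and nothing needs escaping.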

...

Same question for the servlet container.

The servlet container does not (usually) have an encoding. It works in the system encoding, which varies by OS and country. Windows systems are usually set to an encoding that reflects the language/country, and Linux systems mostly do the same, but tend to switch to UTF-8.

I checked what happens if I override the JVM encoding. It's not good, as it is replaced for all the apps, and in a shared container that's really bad. Thus, I'm against overriding the JVM encoding. This then makes option 2 impossible to implement, unless we decide to make XWiki products work only in certain environments. It will be possible to do this in several years, once people forget all about different charsets. Sometimes, decisions made in early stages are very hard to overcome and completely eliminate in later stages.

Still, we can't work with reduced charsets anymore. People all over the world should be able to use XWiki, and right now that is not possible. Unicode is the way to go, and UTF-8 seems the best choice. Even if we can't impose it on the environment, I still think it should be used internally and externally. Internally means that whenever we have to switch from String to byte[] and back, we ask for the conversion to be made using UTF-8. Externally means that the container is just a middle-man transparently handling data from and to the client, and the client already works with UTF-8. The web, being born a bit later, understood that Unicode is the right answer when determining which characters to support, so most technologies are made to work with Unicode and its UTF-8 representation. We already have problems with URLs, GET parameters and AJAX calls because we're not working with UTF-8.

The tough part is that there are some tools that handle conversions internally, and they work with the JVM encoding. We have such problems with JRCS (rollbacks replace non-ASCII chars with question marks), and with FOP (the same question marks appear). I'll have to study what can be done to overcome these problems.

I know that this decision is an important one, as it affects large portions of code. But it is a decision that must be made sooner rather than later, so that we can prepare for the switch (btw, if we vote anything other than 1, then this will be part of a future M1, and not of the 1.3 RCs).

So, here's option number 4: Leave the system as it is, since it must be shared with other applications, but work with UTF-8 both internally, by asking any String <-> byte[] conversion to be made in that encoding, and externally, by sending responses and expecting requests in UTF-8. Given that the database accepts charset configurations at any level (database, table, column), it is OK to ask admins to configure the XWiki database to a certain encoding.

Even better, I think this can be done in an Aspect, so that we don't have to manually try-catch all transformations, and always be careful to manually specify the encoding. I'm not an AOP expert, so I'm not sure that this is possible. Is it?
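The "internal" half of option 4 (every String <-> byte[] conversion names its charset explicitly instead of relying on the JVM default) might look like the sketch below. The class and method names are made up for illustration. It also assumes Java 7 or later for StandardCharsets; on Java 5, as discussed in this thread, the equivalent is getBytes("UTF-8"), which declares UnsupportedEncodingException and is the try-catch burden mentioned above. The last method shows where question marks like those seen with JRCS and FOP can come from: encoding into a charset that lacks the character substitutes '?'.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative helper; names are hypothetical, not XWiki code. */
final class ExplicitCharsetDemo {

    /** String to bytes, independent of -Dfile.encoding. */
    static byte[] toUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    /** Bytes to String, independent of -Dfile.encoding. */
    static String fromUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** Simulates a tool that converts through ISO-8859-1 internally. */
    static String roundTripLatin1(String text) {
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(latin1, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "cestina" with c-caron and s-caron, as escapes to keep the source ASCII.
        String czech = "\u010de\u0161tina";

        // The UTF-8 round trip is lossless: prints true
        System.out.println(fromUtf8(toUtf8(czech)).equals(czech));

        // The two carons are not in ISO-8859-1 and are replaced: prints ?e?tina
        System.out.println(roundTripLatin1(czech));
    }
}
```

Because the charset is passed at every call site, the result does not change when the container's JVM is started with a different file.encoding, which is the property option 4 relies on in a shared container.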

...

I can't vote till I know the answer to these 2 questions.

Thanks -Vincent

PS: As a principle I don't like hard-coding anything so if these questions are answered satisfactorily I'll be ok but with a single config parameter set to UTF8 by default in xwiki.cfg.

...
3. Keep it configurable, but by only specifying it in one place (xwiki.cfg or web.xml), and enforcing that encoding in the JVM (by overriding file.encoding). The default should be UTF-8.

Here's my +1 for option 2, -1 for option 1, and 0 for option 3.

-- Sergiu Dumitriu http://purl.org/net/sergiu/

Paul Libbrecht

9:12 p.m.

On 16 Feb 08, at 09:00, Sergiu Dumitriu wrote:

...

...
...
that mysql has to be manually configured for UTF-8, as by default it comes in latin1.

Isn't this a problem with databases which are configured in ISO8859-1 by default most of the time?

Yes, it is. Right now there is a component somewhere that converts characters not supported by the encoding into &#xxx; escapes, but I can't remember which. With these escapes, the database always receives data in the encoding XWiki is configured with.

These are the kind of escapes that have to go away, I feel: they clutter the whole place, you never know whether you're applying them twice, and they slow everything down.

...

What I would really like is if hibernate was smart enough to enforce encodings. Or to transparently encode data between the application and the database. Unfortunately, that's not the case. I'll have to check which encodings do DBMSs have implicitly. I only know that mysql comes with latin1, and I think hsql and derby come with utf.

Is there really such a need? We've been using mostly Derby, but at times we used MySQL... and with the following property and no extra config on a default (fink-installed) MySQL, that worked:

    <property name="connection.url">jdbc:mysql://dbserver/activemath?useUnicode=true&amp;characterEncoding=UTF-8</property>

Adding this to the installation instructions seems doable, doesn't it? Maybe best would be to have a small test application that allows everyone to test it.

...

...
Same question for the servlet container.

The servlet container does not (usually) have an encoding.

There's one that's been kept implicit for too long but is now commonly written in server.xml: the charset used in URLs and www-form-urlencoded POST content. Tomcat has long considered the platform encoding to be correct here, but this is clearly wrong. Again, some people have made workarounds...
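To make the URL charset point concrete, here is a small sketch using only java.net.URLEncoder and java.net.URLDecoder from the JDK. The wrapper class UrlCharsetDemo is hypothetical. It shows that the same character produces different percent-escapes depending on the charset, and that decoding with the wrong charset garbles the value, which is what happens when a client sends UTF-8 and the container decodes with the platform encoding.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

/** Illustrative wrapper; the class name is hypothetical. */
final class UrlCharsetDemo {

    /** Percent-encodes a value using the named charset. */
    static String encode(String value, String charset) {
        try {
            return URLEncoder.encode(value, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    /** Decodes a percent-encoded value using the named charset. */
    static String decode(String encoded, String charset) {
        try {
            return URLDecoder.decode(encoded, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        String value = "caf\u00e9"; // "cafe" with an acute accent on the e

        System.out.println(encode(value, "UTF-8"));      // prints caf%C3%A9
        System.out.println(encode(value, "ISO-8859-1")); // prints caf%E9

        // A UTF-8 client talking to a server that decodes as ISO-8859-1:
        // the two UTF-8 bytes come back as two separate characters.
        String garbled = decode("caf%C3%A9", "ISO-8859-1");
        System.out.println(garbled.length()); // prints 5, not 4
    }
}
```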

...

It works in the system encoding, which varies from OS and country. Windows systems usually are set to an encoding that reflects the language/country, and Linux systems mostly do the same, but tend to switch to UTF8.

Macs have yet another family of encodings (e.g. MacRoman instead of latin1), which also varies per language.

...

I checked what happens if I override the jvm encoding. It's not good, as it is replaced for all the apps, and in a shared container that's really bad.

Well... there again the problem is even more acute. Contemporary Apache delivers, by default, HTML files as having the charset configured in Apache!! (no joke!) Moreover, several specs state clearly that the content (e.g. HTML meta tags or XML headers) should not override a charset declared in the MIME type. The platform charset is such a variable parameter that I know of no applications that would make sense of it... except maybe those that manipulate plain text for use in tools such as Notepad... It's as simple as that: if you want to be on the web you need to think globally, and thus you need a universal encoding.

...

Thus, I'm against overriding the jvm encoding. This then makes option 2 impossible to implement, unless we decide to make XWiki products work only in certain environments.

I think that everyone has the problem.

...

It will be possible to do this in several years, once people forget all about different charsets. Sometimes, decisions made in early stages are so hard to overcome and completely eliminate in later stages.

...

Still, we can't work with reduced charsets anymore.

Add to that: math symbols (and Greek letters and symbols and...) cannot live within an 8-bit encoding, whichever it is. But the task seems big... as you describe below.

paul

...

The tough part is that there are some tools that handle conversions internally, and they work with the JVM encoding. We have such problems with JRCS (rollbacks replace non-ascii chars with question marks), and with FOP (the same question marks appear). I'll have to study what can be done to overcome these problems. [...]

