Re: [xwiki-devs] [VOTE] Make UTF-8 mandatory for a valid installation

11 Dec 2012

On 12/07/2012 04:56 PM, Caleb James DeLisle wrote:
...

 On 12/07/2012 04:26 PM, Vincent Massol wrote:
  Hi,
 On Dec 7, 2012, at 9:59 PM, Sergiu Dumitriu <sergiu(a)xwiki.org> wrote:
  Hi devs,
 We've moved more and more toward a UTF-8-only application, and XWiki
 has only been tested with this configuration for several years.
 I propose that we require UTF-8 for a valid, supported installation.
 This means:
 - JVM encoding (-Dfile.encoding=UTF8)
 - Container default URL encoding (Tomcat has ISO-8859-1 by default)
 - Database encoding (MySQL is still configured with latin1 on some distros)
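
 As a rough illustration of the first item in the list above, a minimal Java sketch that reports whether the running JVM defaults to UTF-8. The class and method names are made up for this example and are not part of XWiki; the container and database settings would need their own checks.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not XWiki code: reports the JVM default charset.
class EncodingCheck {
    // True only when the given charset is UTF-8.
    static boolean isUtf8(Charset charset) {
        return StandardCharsets.UTF_8.equals(charset);
    }

    public static void main(String[] args) {
        // The default is typically influenced by -Dfile.encoding and the OS locale.
        Charset jvmDefault = Charset.defaultCharset();
        System.out.println("JVM default charset: " + jvmDefault
            + (isUtf8(jvmDefault) ? " (UTF-8)" : " (not UTF-8)"));
    }
}
```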
 There's one big site to update on our side: xwiki.org.
 Here's my +1. This is a move toward a future web, since more and more
 standards require (or at least assume as a default) UTF-8.
 After thinking a bit more, it would make sense to require a valid
 Unicode encoding, including UTF-16, which is preferable in countries
 that don't use a Latin alphabet. However, XWiki doesn't currently work
 under 16-bit encodings at all. 
 For XWiki 4.x I'm -1, since it's a big change and we don't want to break
 users who currently run 4.x with ISO-8859-1, for example.
 For XWiki 5.x I'm not sure.
 To be able to answer I need to understand more. For example, what currently
 doesn't work when the user picks an arbitrary encoding? Shouldn't we just be
 transparent and use whatever encoding is specified, without hardcoding anything?
 +1 for UTF-8 only.
 If we want to support an encoding, we need to run our test suite with it, so
 each encoding we support multiplies the test run time without bringing any
 features to the user's hands.
 +1 for waiting until 5.x at least before making it mandatory, because we will
 have to require MySQL >= 5.5.3 and set the encoding to utf8mb4 in order to
 avoid errors when saving pages with 4-byte code points.
 http://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8mb4.html 
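
 To make the "4-byte code points" concrete, here is a small Java sketch that measures how many bytes a string needs in UTF-8 and whether it contains characters outside the BMP. The class and method names are hypothetical and only serve the illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not XWiki code: inspects UTF-8 widths of strings.
class Utf8Width {
    // Number of bytes the string occupies once encoded as UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // True if any code point lies outside the Basic Multilingual Plane.
    static boolean hasSupplementary(String s) {
        return s.codePoints().anyMatch(Character::isSupplementaryCodePoint);
    }

    public static void main(String[] args) {
        String eAcute = "\u00e9";                               // U+00E9, 2 bytes in UTF-8
        String euro = "\u20ac";                                 // U+20AC, 3 bytes in UTF-8
        String emoji = new String(Character.toChars(0x1F600));  // U+1F600, outside the BMP

        System.out.println(utf8Bytes(eAcute) + " " + hasSupplementary(eAcute));
        System.out.println(utf8Bytes(euro) + " " + hasSupplementary(euro));
        System.out.println(utf8Bytes(emoji) + " " + hasSupplementary(emoji));
    }
}
```

 Only the last string needs four bytes, which is the case a 3-byte utf8 column cannot store.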
I'm afraid we'll get errors if we do that, since indexes are still
limited to a total of 1024 bytes, and we're already maxing out with
255-varchar columns + other fields. In short, MySQL sucks for serious
projects, but we can't really tell our users "use Postgres, it's
better". So I'd rather keep it to the current utf-8, and hope that
nobody will need the extended unicode planes, until we find a better
solution.
To be more specific: we can't switch to 4-byte utf8 until we stop using
names as primary key elements.
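
The arithmetic behind that can be sketched as follows. The 1024-byte limit is the figure quoted above (the actual limit depends on the storage engine and MySQL version), and the 8 bytes of "other fields" are purely an assumption for illustration, not the real XWiki schema.

```java
// Hypothetical back-of-the-envelope calculation, not XWiki code.
class IndexBudget {
    // Index size limit as quoted in the message; treat it as an assumption.
    static final int INDEX_LIMIT_BYTES = 1024;

    // Worst-case key size: every character at its maximum width, plus other key columns.
    static int keyBytes(int varcharLength, int maxBytesPerChar, int otherColumnsBytes) {
        return varcharLength * maxBytesPerChar + otherColumnsBytes;
    }

    static boolean fits(int keyBytes) {
        return keyBytes <= INDEX_LIMIT_BYTES;
    }

    public static void main(String[] args) {
        int other = 8; // assumed size of the remaining key columns (illustrative only)
        int utf8 = keyBytes(255, 3, other);     // 3-byte utf8
        int utf8mb4 = keyBytes(255, 4, other);  // 4-byte utf8mb4
        System.out.println("utf8:    " + utf8 + " bytes, fits: " + fits(utf8));
        System.out.println("utf8mb4: " + utf8mb4 + " bytes, fits: " + fits(utf8mb4));
    }
}
```

Under these assumptions a single 255-character key column alone grows from 765 to 1020 bytes, leaving almost no room for anything else in the key.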
Just tried it, and indeed trying to save characters outside the BMP will
fail. Thanks for pointing this out.
...
  I understand that some users currently set the encoding to latin1 so MySQL
  will just treat the data as opaque blobs.
Except that it doesn't work like that. If you use latin1, you'll get
errors with the default XE xar about invalid values in the RCS table.
The connector doesn't send bytes, it sends characters, and the database
will try to store them, which it can't. Every piece of MySQL has an
encoding, which isn't opaque. Pushing characters outside the table's
charset will trigger an exception.
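
The same idea can be reproduced on the Java side as a rough sketch: a charset either can or cannot represent a given character. Java's ISO-8859-1 is used here only as a stand-in for MySQL's latin1 (the two are not identical), and the class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not XWiki or connector code.
class Latin1Check {
    // True if every character of the text can be represented in the charset.
    static boolean fits(String text, Charset charset) {
        return charset.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String western = "caf\u00e9";  // "café", representable in ISO-8859-1
        String chinese = "\u4e2d";     // U+4E2D, not representable in ISO-8859-1

        System.out.println(fits(western, StandardCharsets.ISO_8859_1));
        System.out.println(fits(chinese, StandardCharsets.ISO_8859_1));
        System.out.println(fits(chinese, StandardCharsets.UTF_8));
    }
}
```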
...
  Thanks,
 Caleb
>
> Thanks
> -Vincent 
--
Sergiu Dumitriu
http://purl.org/net/sergiu
