On 04/23/2010 09:44 PM, Caleb James DeLisle wrote:
Sergiu Dumitriu wrote:
On 04/23/2010 08:50 AM, Denis Gervalle wrote:
On Fri, Apr 23, 2010 at 03:32, Sergiu
Dumitriu<sergiu(a)xwiki.com> wrote:
On 04/06/2010 05:03 PM, Vincent Massol wrote:
> Hi Milind,
>
> On Apr 6, 2010, at 5:00 PM, Milind Kamble wrote:
>
>> Denis,
>> I understand your point that XE, being used globally, needs to support
more than the ASCII char set.
>> While the new reference model matures, could you clarify if underscore
in a file name would break the functionality under the current model where
attachment name is used as a reference for attachments? If not, would it be
possible to eliminate the stripping of just the underscore chars and push
that fix in the next XE release -- I am OK with space chars getting stripped
off.
> I don't think that underscores are a problem even with the old "reference
as string" code. Actually I don't even know why we're stripping them.
Sergiu
might know more. Any idea Sergiu?
This is the issue that started it: XWIKI-2087
So, there were three main problems:
1. Impossible to actually restore the attachment from the database since
the ID was generated using the hash of the original, correct name, yet
it was stored using the broken name, with ? instead of non-latin1
characters
2. Impossible to link to such an attachment, since a non-UTF wiki would
encode non-ASCII chars to their &#xyz; escapes, and the filename wasn't
decoded when trying to get the attachment from the database
3. Encoding bug in the old WYSIWYG which composed the URL using a wrong
encoding
3 should be fixed since we're forcing UTF-8 in URLs.
2 and 1 should work if the wiki+database are using UTF8, but they might
still fail in latin1.
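
To make problems 1 and 2 concrete, here is a rough Python sketch (not the
XWiki code). The file name and the name_id() helper are made up for
illustration; the real ID computation is not shown in this thread, so an
MD5 of the name stands in for it.

```python
import hashlib
import html

# Hypothetical attachment name containing non-latin1 (Cyrillic) characters.
original = "\u043e\u0442\u0447\u0451\u0442.pdf"

# Problem 1: latin1 cannot represent these characters, so each one
# is replaced by '?' when the name is stored.
stored = original.encode("latin-1", errors="replace").decode("latin-1")


def name_id(name: str) -> str:
    """Stand-in for the real ID computation (assumption: any hash of the name)."""
    return hashlib.md5(name.encode("utf-8")).hexdigest()


# The ID computed from the original name no longer matches the stored name.
id_from_original = name_id(original)
id_from_stored = name_id(stored)

# Problem 2: a non-UTF wiki writes non-ASCII chars as numeric escapes.
# The lookup only works if the escapes are decoded before querying.
escaped = original.encode("ascii", errors="xmlcharrefreplace").decode("ascii")
decoded = html.unescape(escaped)
```

Looking up the attachment by `escaped` instead of `decoded` is the failure
described in point 2.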
Should we really support non-UTF-8 configurations? We have already lost so
much time with these encoding issues, and I really do not understand the
advantage of supporting non-UTF-8 environments.
Legacy. Maybe if we can provide a nice and quick guide for transforming
a latinX installation into a UTF-8 one, we'd be allowed to require UTF-8.
We could announce that from 2.5 onwards UTF-8 will be mandatory, if we
decide to go this way. Maybe the most important latin1 installation is
xwiki.org itself.
The most problematic thing is that by default mysql databases come as
latin1 (in most distributions, although my Gentoo makes it utf8), and
this is one of the most frequent sources of encoding problem reports.
Am I correct in saying that mysql with utf8 is unable to handle some
characters and so pages can't be saved? My understanding is using latin1
is a common workaround so that mysql doesn't know that it is handling the
characters. Forcing utf8 might lead to some unhappy users who suddenly find
not only their database must be changed but some of the characters used in
their language are no longer allowed.
No, utf8 should allow all characters, since UTF-8 allows complete
representation of all unicode characters, which cover all existing
non-UTF charsets.
The workaround that you talk about is actually a different problem, and
a common bad practice: storing UTF-8 or some other fixed encoding bytes
in a latin1 column, by decomposing and recomposing strings into bytes.
This is bad because it breaks sorting and upper/lowercase
transformations. It's also bad because it involves another byte
split/parse operation, since the data is already transformed once from
fake "ISO-8859-1" bytes into a String. It is good because it actually
doesn't care about the database encoding, it's independent, and works
with (almost) all database encodings transparently.
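
A small Python sketch of that workaround, assuming the database treats
every byte as one latin1 character (Python's "latin-1" codec is used here
as an approximation of what the database sees):

```python
text = "\u00c9T\u00c9"  # the three-character string "ETE" with accented E's

# The workaround: hand the UTF-8 bytes to the database as if each byte
# were one latin1 character.
as_stored = text.encode("utf-8").decode("latin-1")

# The database now sees 5 "characters" instead of 3.
lengths = (len(text), len(as_stored))

# As long as nothing touches the stored form, the round trip works.
untouched = as_stored.encode("latin-1").decode("utf-8")

# But a case transformation applied to the stored form operates on the
# fake latin1 characters and can produce bytes that are not valid UTF-8.
lowered = as_stored.lower()
try:
    round_trip = lowered.encode("latin-1").decode("utf-8")
except UnicodeDecodeError:
    round_trip = None  # the data can no longer be read back
```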
What actually breaks when switching from latin1 to utf8, in this
scenario, is that UTF-8 has some intrinsic data validation, meaning that
certain bytes and certain byte pairs/triples are not valid UTF-8
strings; thus, trying to push random bytes could sometimes fail.
Normally, storing UTF-8 bytes in a utf8 column should work, so it only
fails when bytes in another encoding are stored in a utf8 column.
Fortunately, we don't use this technique, so we're not affected by this
problem.
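
The validation point can be checked with a few lines of Python. This is
only an illustration of UTF-8 validity in general, not of what mysql does
internally; is_valid_utf8() is a helper invented for this example.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes form a well-formed UTF-8 sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


utf8_bytes = "\u00e9".encode("utf-8")      # e-acute as UTF-8: two bytes
latin1_bytes = "\u00e9".encode("latin-1")  # e-acute as latin1: one byte

# latin1 accepts every possible byte value, so nothing is ever rejected.
all_bytes_as_latin1 = bytes(range(256)).decode("latin-1")
```

Here is_valid_utf8(utf8_bytes) holds, while is_valid_utf8(latin1_bytes)
does not: the single latin1 byte is a UTF-8 lead byte with nothing
following it.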
Now, contrary to what you say, UTF-8 has been the default
encoding of the wiki for some time, and it fails if the database (mysql)
is NOT in utf8. We actually require the database to be in utf8, since
otherwise data will be lost after it gets out of the cache.
Why do some people still prefer latin1 in a world that is moving more and
more towards UTF-8? Well, there are a few disadvantages to utf8, when
used inside mysql:
- Data is bigger, since latin1 uses exactly 1 byte for each character,
while UTF-8 uses 2 for the accented characters of most European languages,
and even 3 for Asian and other exotic scripts. I'm speaking here only about
storage space needed.
- Not only is the storage bigger, but the algorithms are a bit more
complex/time consuming: counting how many characters are in a latin1
string is simple, just see how many bytes are in there. In UTF-8 it's more
complex, since 1, 2, 3 bytes can form one character, and the rules
require full examination of each byte. Thus, length(latin1) is O(1),
length(utf8) is O(n). Most other string functions are also affected by
this complexity problem.
- Moreover, indexes are limited to 1024 (or was it 2048?) BYTES in
length, and MySQL assumes the worst case scenario when computing how
many bytes a column takes up. So, while it's possible to use 4 small
columns (255 chars) combined in an index, if utf8 is used instead,
4*(255*3 bytes per char in the worst case scenario)>1024, thus using
utf8 in tables limits the size of indexes.
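
The three points above can be sketched in Python. The 1024-byte limit and
the 3 bytes per character worst case are the figures quoted in this
message (the exact limit is uncertain, as said above); the function names
are made up for the example.

```python
# Storage: bytes needed per character in UTF-8.
samples = {"ascii": "a", "accented": "\u00e9", "cjk": "\u4e2d"}
sizes = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}


def utf8_length(data: bytes) -> int:
    """Count characters by scanning every byte, which is O(n).

    Continuation bytes have the bit pattern 10xxxxxx and are skipped.
    With latin1 the character count is simply len(data).
    """
    return sum(1 for b in data if (b & 0xC0) != 0x80)


# Index size: worst-case bytes for several columns combined in one index.
INDEX_LIMIT_BYTES = 1024  # figure quoted in the message, not verified


def worst_case_index_bytes(columns: int, chars: int, bytes_per_char: int) -> int:
    return columns * chars * bytes_per_char


latin1_index = worst_case_index_bytes(4, 255, 1)
utf8_index = worst_case_index_bytes(4, 255, 3)
fits_latin1 = latin1_index <= INDEX_LIMIT_BYTES
fits_utf8 = utf8_index <= INDEX_LIMIT_BYTES
```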
There might be other major disadvantages, but these are the most
important that I know of.
The big advantage of utf8, when compared to all latinX charsets, is that
it can store many more characters. All latinX charsets can store only
256 possible characters (including all the control chars rarely used).
And frankly, the entire web is moving towards UTF-8.
These disadvantages are not problems with all UTF-8 applications; they are
just the result of a very lousy design/implementation in mysql, and it's one of the
main reasons why I don't like mysql. I hope that they will realize that
they did it all wrong and fix it at some point.
A thought.
Caleb
>
>>>> Thanks
>>>> -Vincent
>>>>
>>>>> ________________________________
>>>>> From: Denis Gervalle<dgl(a)softec.lu>
>>>>> To: XWiki Developers<devs(a)xwiki.org>
>>>>> Sent: Tue, April 6, 2010 8:30:34 AM
>>>>> Subject: Re: [xwiki-devs] Simple patch to enable/preserve underscore
>>> chars in attachment file names
>>>>> On Tue, Apr 6, 2010 at 14:02, Guillaume
Lerouge<guillaume(a)xwiki.com>
>>> wrote:
>>>>>> Hi Milind,
>>>>>>
>>>>>> On Tue, Apr 6, 2010 at 1:23 AM, Milind
Kamble<mbkads(a)yahoo.com>
>>> wrote:
>>>>>>> Hi. I would like the dev community to evaluate this simple
fix that
>>> will
>>>>>>> enable uploading of files with underscore chars in the file
name when
>>>>>> users
>>>>>>> perform the attach action. Our user community is quite
impressed about
>>>>>> the
>>>>>>> refreshing ease of use and the power, flexibility in their
>>> collaboration
>>>>>>> work flow made possible by XE. They would like to escape the
tyranny
>>> of
>>>>>>> Microsoft-MOSS as early as possible and the main roadblock to
do so is
>>>>>> the
>>>>>>> stripping of space and underscores from file names which were
created
>>> in
>>>>>> a
>>>>>>> MS-Office centric environment.
>>>>>>>
>>>>>> I can't do much about your underscore problem (though I
promise I'll
>>> poke
>>>>>> the developer sitting right next to me so that he looks at it).
>>>>>>
>>>>> I was already aware of this issue, and I have had similar problems
with
>>>>> attachment, not only with "_", but also with accented
chars etc...
>>>>> Restriction on attachment names will be easier to be changed when
the
>>> new
>>>
>>>>> model using references will be fully in place, since
attachment
>>> names
>>>>> are currently used as reference for attachments. Be sure I will take
>>> care to
>>>>> have it improved.
--
Sergiu Dumitriu
http://purl.org/net/sergiu/