Niels Mayer wrote:
> I previously identified 6 areas that I hard-coded over to UTF-8. Some of
> these were voodoo, some of these worked directly. I believe this part (which
> also started this thread) was the most relevant:
> So what exactly do these 6 areas affect (or side-effect) if set or not set?
(1) -Dfile.encoding -> UTF-8
This sets the charset/encoding to use as a default whenever strings
(which are always UTF-16, see
http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Character.html) are
converted to bytes and vice versa:
- String.getBytes
- new String(byte[])
- InputStream vs. Reader
- OutputStream vs. Writer
Fortunately, all of these accept an explicit encoding, and if we want to
be independent of the JVM encoding, we should always pass one. There are
still a few places where we rely on the JVM encoding, but a recent trunk
is safe to use without setting -Dfile.encoding.
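To illustrate the four conversion points listed above, here is a minimal sketch that passes the charset explicitly every time. It assumes Java 7 or later (for StandardCharsets and try-with-resources); the class and method names are made up for the example and are not XWiki code. On an older JVM the same overloads accept the charset name "UTF-8" as a String.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
public class ExplicitCharsetExample {

    // String -> bytes with an explicit charset instead of the JVM default.
    public static byte[] encode(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // bytes -> String with an explicit charset instead of the JVM default.
    public static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Writer over an OutputStream, then Reader over an InputStream,
    // both constructed with an explicit charset.
    public static String roundTripThroughStreams(String text) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            try (Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
                writer.write(text);
            }
            StringBuilder result = new StringBuilder();
            try (Reader reader = new InputStreamReader(
                    new ByteArrayInputStream(out.toByteArray()), StandardCharsets.UTF_8)) {
                int c;
                while ((c = reader.read()) != -1) {
                    result.append((char) c);
                }
            }
            return result.toString();
        } catch (IOException e) {
            // In-memory streams should not fail; rethrow unchecked if they do.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "caf\u00e9"; // "cafe" with an acute accent on the e
        System.out.println(encode(text).length); // 5: U+00E9 takes two bytes in UTF-8
        System.out.println(decode(encode(text)).equals(text));
        System.out.println(roundTripThroughStreams(text).equals(text));
    }
}
```

Code written this way behaves the same whatever -Dfile.encoding is set to, which is the independence described above.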
> I fixed this issue by running java with -Dfile.encoding=UTF-8 (note the
> lowercase setting suggested in
> http://platform.xwiki.org/xwiki/bin/view/AdminGuide/Performances seems
> incorrect?).
(2) -Djavax.servlet.request.encoding -> UTF-8
I can't find anything about this, so I doubt it does anything. The
servlet specification (2.5) states that when determining the encoding
used to parse a request, either it is specified in the request itself in
an HTTP header, or the default encoding of ISO-8859-1 is used. Maybe
some containers allow this default to be changed, but that is not standard.
This auto-detection can be overridden by a call to
request.setCharacterEncoding("enc").
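To show why the ISO-8859-1 default matters, here is a small sketch using only the standard library. It is not container code: it mimics the decoding step with java.net.URLDecoder, and the class and method names are hypothetical. In a real servlet the equivalent fix is the request.setCharacterEncoding call mentioned above, made before the first call to getParameter.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical demo class; it only imitates what a container does
// when it turns percent-encoded request bytes into a String.
public class RequestDecodingExample {

    // Decode a percent-encoded value, interpreting the bytes in the given charset.
    public static String decodeParameter(String rawValue, String charsetName) {
        try {
            return URLDecoder.decode(rawValue, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        // The bytes C3 A9 are the UTF-8 form of U+00E9 (e with acute accent),
        // as a browser would send them from a UTF-8 page.
        String raw = "caf%C3%A9";

        // Correct charset: 4 characters, the last one is U+00E9.
        System.out.println(decodeParameter(raw, "UTF-8").length());

        // Default charset from the servlet spec: 5 characters, because each
        // byte becomes its own character (U+00C3 followed by U+00A9).
        System.out.println(decodeParameter(raw, "ISO-8859-1").length());
    }
}
```

The second result is the usual two-characters-per-accent garbling seen when a UTF-8 form is parsed with the default encoding.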
> When that alone didn't work, I also added
> "-Djavax.servlet.request.encoding=UTF-8
(3) -DjavaEncoding -> UTF-8
I also can't find anything about this. Could be a mistaken name for
-Dfile.encoding?
There is a Jetty setting named javaEncoding, which affects the encoding
used by the JSP compiler to generate Java source files, but we don't
have any JSP pages.
> -DjavaEncoding=UTF-8" which had been suggested in solving this problem for
> other Tomcat users.
> (Now I run java with the following options: -server -Xms160m -Xmx1024m
> -XX:PermSize=160m -XX:MaxPermSize=320m -Djavax.servlet.request.encoding=UTF-8
> -Dfile.encoding=UTF-8 -DjavaEncoding=UTF-8 -Djava.awt.headless=true)
(4) LANG -> UTF-8
This sets an environment variable that is read by the JVM to determine
the right value for file.encoding and the default locale. Setting
-Dfile.encoding takes precedence, so for the encoding this makes no
difference to XWiki. It still influences the default locale, but the
OS and the JVM must support the requested locale.
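A quick way to see what the JVM actually picked up from LANG and -Dfile.encoding is to print the resulting defaults. This is only a diagnostic sketch with a made-up class name; the values it prints depend entirely on the OS, the JVM version and the launch options.

```java
import java.nio.charset.Charset;
import java.util.Locale;

// Hypothetical diagnostic class, for illustration only.
public class JvmDefaultsExample {

    // Summarize the encoding and locale defaults the running JVM ended up with.
    public static String describeDefaults() {
        return "file.encoding=" + System.getProperty("file.encoding")
                + ", defaultCharset=" + Charset.defaultCharset().name()
                + ", defaultLocale=" + Locale.getDefault();
    }

    public static void main(String[] args) {
        System.out.println(describeDefaults());
    }
}
```

Running it once with and once without the settings in question (for example from the same script that launches the container) shows whether they have any effect on a given system.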
> I also saw other suggestions to set LANG="en_US.UTF-8" in the tomcat
> launching script...
(5) com.xpn.xwiki.web.SetCharacterEncodingFilter's 'encoding' -> UTF-8
This is set in web.xml, or directly in the Java file if you modify the
source and recompile it. This influences how the request is parsed,
since request.getParameter transforms bytes into Strings. See the
explanation for (2); this filter always forces our encoding to be used
for reading requests.
In the future this should be merged with the setting in xwiki.cfg, but
this is not possible yet because the servlet filter does not have access
to the XWiki object (to call xwiki.getEncoding()), and XWiki does not
have access to the filter config. The ugly solution is to manually parse
one of these files, but it really is Ugly.
> however, I'm not sure which of my changes "did" it, but I believe that the
> following two steps I'd forgotten or skipped in
> http://platform.xwiki.org/xwiki/bin/view/AdminGuide/Encoding caused the
> correct encoding to be used:
> (1) WEB-INF> diff web.xml.~1~ web.xml
> 23c23
> < <param-value>ISO-8859-1</param-value>
> ---
> > <param-value>UTF-8</param-value>
(6) xwiki.encoding -> UTF-8
This is the main setting that we use for determining the encoding. This
allows us to be independent of the system settings, since hosted users
cannot always alter these settings, but they can change their application.
> (2) WEB-INF> diff xwiki.cfg.~2~ xwiki.cfg
> 29c29
> < xwiki.encoding=ISO-8859-1
> ---
> > xwiki.encoding=UTF-8
There are a few more settings for various containers. For example,
Tomcat users must use these
(http://tomcat.apache.org/tomcat-6.0-doc/config/http.html):
(7) URIEncoding - Tomcat treats different parts of the request differently:
- URL path using UTF-8
- URL query using ISO-8859-1
- request body using the specified encoding (see 2 and 5)
Earlier versions of Tomcat (4.1) also used the body encoding for the
query string, but this was changed, and the current behavior is in
accordance with the specs. So, in order for request.getParameter to work
correctly, this setting must be specified.
(8) useBodyEncodingForURI - see above, this makes Tomcat use the body
encoding for the query string as well. Either (7) or (8) should be set.
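For reference, both attributes go on the HTTP Connector element in Tomcat's conf/server.xml. The fragment below is only a sketch: the port and protocol values are placeholders, and a real Connector will carry whatever other attributes the installation already has. Use one variant or the other, not both.

```xml
<!-- Option (7): decode the URI explicitly as UTF-8 -->
<Connector port="8080" protocol="HTTP/1.1" URIEncoding="UTF-8" />

<!-- Option (8): reuse the request body encoding for the query string -->
<Connector port="8080" protocol="HTTP/1.1" useBodyEncodingForURI="true" />
```

With (8), the query string follows whatever encoding the filter from (5) sets on the request, so the two stay in sync automatically.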
(9) For Jetty, -Dorg.mortbay.util.URI.charset=UTF-8 can influence the
encoding used for parsing the query string. The difference from Tomcat is
that recent Jetty versions already use UTF-8 by default.
--
Sergiu Dumitriu
http://purl.org/net/sergiu/