As a member of an organization whose name has an ampersand in it, I am
pleased to say that I am discovering many of the ways that XML and HTML
can choke on it. This is because as a developer, I thoroughly enjoy
finding bugs (in other people's code).
However, I am spending more time than I want (greater than zero) using
my admin privileges to clean up after users who shouldn't have to worry
about what kind of text they enter into their documents - in particular,
they shouldn't have to know that every time they type the company name,
it has to be AT&amp;T.
First the document headings in the RSS feeds caused readers to fail,
then the CSS validator refused even to look at a document, and now the
Tomcat logfile is growing by dozens of megabytes per minute on a system
with ten or fewer active users. All because somebody innocently entered
"AT&T" somewhere in a document.
I have found several methods for transforming text, such as
$xwiki.getURLEncoded(String) and $doc.getEscapedContent() (which
apparently hides the entire content of a document from Velocity, but not
from Radeox). There is also the JavaScript in some form documents that
makes sure that accented characters don't get into document names.
Nowhere, however, have I yet found a method that will generally escape
things in user-entered text that will break XML parsing.
Is there such a thing? I note several regular expressions in some of
the config files for Radeox, etc.; there ought - somewhere - to be a
general method for doing this, n'est-ce pas?
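To make the request concrete, here is a minimal sketch in Java of the kind of method I have in mind. The class name XmlText and the method name escape are hypothetical, not anything I have found in XWiki or Radeox. It only replaces the five characters that XML predefines entities for, and it assumes the input is raw user text that has not been escaped already.

```java
// Hypothetical helper, not part of XWiki: escapes raw text for use in XML.
final class XmlText {
    private XmlText() {}

    /**
     * Replaces the five characters that have predefined entities in XML
     * (&, <, >, " and ') with the corresponding entity references.
     * A null input is treated as an empty string.
     */
    static String escape(String text) {
        if (text == null) {
            return "";
        }
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }
}
```

With this, XmlText.escape("AT&T") yields "AT&amp;T". The sketch is not idempotent (escaping already-escaped text would produce "AT&amp;amp;T"), so it would have to be applied exactly once, at the point where text is written into XML, rather than when the user enters it.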
Brian M. Thomas - Senior Technical Architect
AT&T Services, Inc.
One SBC Center, Room 24D3
St. Louis, MO 63101
314 235 3141