Okay, maybe you'd better write the code yourself?
I can only copy-paste from googled sources, without real Java knowledge or
any real ability to test.
So even if I do something, it would still have to be reviewed, and it might
not even compile.
http://www.benmccann.com/dev-blog/java-html-parsing-library-comparison/
There I can see how to build a DOM, but that would be overkill; SAX is the
more appropriate approach here.
But can SAX be run over HTML rather than XML?
java-sources.net suggests using hotsax.sf.net, but it probably lacks charset
auto-detection.
Another HTML SAX parser is JTagSoup; it also lacks auto-detection, but it
suggests looking at jchardet.sourceforge.net
As far as I can see, OpenOffice does not offer UTF-16 or similar exports, so we
have to choose between UTF-8, UTF-7 and single-byte encodings...
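For the case where nothing in the file names the charset, here is a rough sketch of one possible heuristic: try a strict UTF-8 decode, and fall back to a single-byte encoding only if the bytes are not valid UTF-8. The class name `Utf8OrFallback`, its method names and the choice of windows-1251 as the fallback are my assumptions for illustration only; nothing here comes from XWiki or OpenOffice, and UTF-7 is not handled at all.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of XWiki.
class Utf8OrFallback {

    /** Returns true if the bytes decode as UTF-8 with no malformed sequence. */
    static boolean isValidUtf8(byte[] data) {
        // REPORT makes the decoder throw instead of silently substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Picks UTF-8 when the bytes are valid UTF-8, otherwise the given fallback. */
    static Charset guess(byte[] data, Charset singleByteFallback) {
        return isValidUtf8(data) ? StandardCharsets.UTF_8 : singleByteFallback;
    }

    public static void main(String[] args) {
        // Assumption: windows-1251 is the single-byte charset we expect to meet.
        Charset fallback = Charset.forName("windows-1251");
        String sample = "\u0422\u0435\u0441\u0442"; // a short Cyrillic word
        System.out.println(guess(sample.getBytes(StandardCharsets.UTF_8), fallback));
        System.out.println(guess(sample.getBytes(fallback), fallback));
    }
}
```

Pure ASCII input is also valid UTF-8, so it is reported as UTF-8, which is harmless. The check cannot tell one single-byte encoding from another; that part still needs either the META tag or something like jchardet.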
The detected charset should replace the hardcoded
"htmlReader = new InputStreamReader(htmlStream, "UTF-8");"
at
https://github.com/xwiki/xwiki-platform/blob/master/xwiki-platform-core/xwi…
Maybe we can assume any ASCII-compatible charset (say Latin-1) initially, since
we only need the ASCII tags and values.
Yet... some tag attribute values might be non-Latin, and if the tag order were
different, they might come before the encoding tag...
Take this export, for example:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML>
<HEAD>
<META HTTP-EQUIV="CONTENT-TYPE" CONTENT="text/html;
charset=utf-8">
<TITLE></TITLE>
<META NAME="GENERATOR"
CONTENT="OpenOffice.org 3.4 (Win32)">
<META NAME="AUTHOR" CONTENT="Тестовый
менеджер">
<META NAME="CREATED" CONTENT="20120525;11540000">
<META NAME="CHANGEDBY" CONTENT="Тестовый
менеджер">
Here you can see that the charset is specified before everything else.
If we can assume this is the usual behaviour, then we could even just
skip a few bytes from the beginning and go straight to the '=utf-8"' part :-)
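Something along these lines might do it. This is only a sketch under my own assumptions: peek at the first kilobyte, decode it as ISO-8859-1 (every byte maps to a char, so ASCII tags survive whatever the real encoding is), search for "charset=" instead of relying on a fixed offset, then rewind and open the reader with whatever was found. The class `MetaCharsetSniffer`, its methods and the 1024-byte limit are all hypothetical, not existing XWiki code.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of XWiki.
class MetaCharsetSniffer {
    // Assumption: the charset declaration sits within the first 1024 bytes.
    private static final int PEEK_BYTES = 1024;
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)", Pattern.CASE_INSENSITIVE);

    /** Looks for "charset=NAME" in the first bytes; returns fallback if none is usable. */
    static Charset sniff(byte[] head, int length, Charset fallback) {
        String text = new String(head, 0, length, StandardCharsets.ISO_8859_1);
        Matcher m = CHARSET.matcher(text);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: use the fallback below.
            }
        }
        return fallback;
    }

    /** Peeks at the start of the stream, rewinds, and opens a Reader with the sniffed charset. */
    static Reader open(InputStream htmlStream, Charset fallback) throws IOException {
        // The buffer is larger than the mark limit so the mark stays valid while peeking.
        BufferedInputStream in = new BufferedInputStream(htmlStream, 2 * PEEK_BYTES);
        in.mark(PEEK_BYTES);
        byte[] head = new byte[PEEK_BYTES];
        int n = 0;
        while (n < head.length) {
            int r = in.read(head, n, head.length - n);
            if (r < 0) break; // stream is shorter than PEEK_BYTES
            n += r;
        }
        in.reset();
        return new InputStreamReader(in, sniff(head, n, fallback));
    }

    public static void main(String[] args) throws IOException {
        String html = "<HTML><HEAD><META HTTP-EQUIV=\"CONTENT-TYPE\" "
                + "CONTENT=\"text/html; charset=utf-8\">"
                + "<META NAME=\"AUTHOR\" CONTENT=\"\u0422\u0435\u0441\u0442\">";
        byte[] bytes = html.getBytes(StandardCharsets.UTF_8);
        try (Reader reader = open(new ByteArrayInputStream(bytes), StandardCharsets.ISO_8859_1)) {
            StringBuilder sb = new StringBuilder();
            for (int c; (c = reader.read()) != -1; ) sb.append((char) c);
            System.out.println(sb.toString().equals(html));
        }
    }
}
```

Note that the pattern matches the first "charset=" anywhere in the peeked bytes, so it would also fire on that text appearing in an earlier attribute value. It is no substitute for a real HTML parser, only a possible stand-in for the hardcoded "UTF-8".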