Re: [xwiki-users] which HTML parsing libs are already using/shiipped with XWiki ?

4 Jul 2012

Okay, maybe you'd better write the code?
I can only copy-paste from googled sources, with no real Java knowledge and
no real ability to test.
So even if I do something, it would still have to be reviewed and might
not even compile.
http://www.benmccann.com/dev-blog/java-html-parsing-library-comparison/
Here I can see how to create a DOM, yet that would be overkill; SAX is the
better approach here.
But can SAX be run over HTML rather than XML?
java-sources.net suggests using hotsax.sf.net, but it probably lacks
encoding auto-detection.
Another HTML SAX parser is JTagSoup; it also lacks auto-detection, yet suggests
looking at jchardet.sourceforge.net
As far as I can see, OpenOffice does not offer UTF-16 or similar exports, so we
have to choose between UTF-8, UTF-7 and single-byte encodings...
That should replace the hardcoded
            htmlReader = new InputStreamReader(htmlStream, "UTF-8");
at
https://github.com/xwiki/xwiki-platform/blob/master/xwiki-platform-core/xwi…
Maybe we can assume any charset initially, since we only need the Latin-1 tags and
values.
Yet... some tag attribute values might be non-Latin, and if the tag order were
different, they might come before the encoding tag...
Like in
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML>
<HEAD>
        <META HTTP-EQUIV="CONTENT-TYPE" CONTENT="text/html; charset=utf-8">
        <TITLE></TITLE>
        <META NAME="GENERATOR" CONTENT="OpenOffice.org 3.4 (Win32)">
        <META NAME="AUTHOR" CONTENT="Тестовый менеджер">
        <META NAME="CREATED" CONTENT="20120525;11540000">
        <META NAME="CHANGEDBY" CONTENT="Тестовый менеджер">
Here you can see that the charset is specified before everything else.
If we can assume that as the usual behaviour, then we can even just
skip a few bytes from the beginning and go directly to the '=utf-8"' part :-)
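
For illustration, the idea above (read the first bytes under an arbitrary single-byte charset, find the `charset=` value, then open the real reader) could be sketched roughly like this. This is only a sketch under assumptions, not the actual XWiki code: the class name `CharsetSniffer`, the 1024-byte peek limit, the regular expression and the UTF-8 fallback are all made up for the example, and only standard JDK classes are used.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of XWiki.
public class CharsetSniffer {
    // How many leading bytes to inspect; an arbitrary choice for this sketch.
    private static final int PEEK_LIMIT = 1024;

    // Matches e.g. charset=utf-8 inside a META tag, case-insensitively.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([A-Za-z0-9_.:-]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the charset declared in the first bytes, or the fallback. */
    public static Charset sniff(byte[] head, int length, Charset fallback) {
        // ISO-8859-1 maps every byte to exactly one char, so decoding never
        // fails and the ASCII tag names stay readable.
        String text = new String(head, 0, length, StandardCharsets.ISO_8859_1);
        Matcher m = CHARSET.matcher(text);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Malformed name or a charset this JVM does not support.
            }
        }
        return fallback;
    }

    /** Peeks at the stream head, rewinds, and opens a reader with the sniffed charset. */
    public static Reader openReader(InputStream htmlStream) throws IOException {
        BufferedInputStream in = new BufferedInputStream(htmlStream, PEEK_LIMIT);
        in.mark(PEEK_LIMIT);
        byte[] head = new byte[PEEK_LIMIT];
        int total = 0;
        int n;
        while (total < PEEK_LIMIT
                && (n = in.read(head, total, PEEK_LIMIT - total)) > 0) {
            total += n;
        }
        in.reset(); // rewind so the parser still sees the whole document
        return new InputStreamReader(in, sniff(head, total, StandardCharsets.UTF_8));
    }
}
```

With something like this, the hardcoded line would become `htmlReader = CharsetSniffer.openReader(htmlStream);`. It relies on the same assumption as above, that the charset declaration sits within the first bytes; a charset name the JVM does not know simply falls back to UTF-8.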
--
View this message in context:
http://xwiki.475771.n2.nabble.com/which-HTML-parsing-libs-are-already-using…
Sent from the XWiki- Users mailing list archive at Nabble.com.
