HtmlCleaner claims to have charset auto-detection in one of its methods:
http://htmlcleaner.sourceforge.net/doc/org/htmlcleaner/HtmlCleaner.html#cle…
....
Okay, this could probably be copy-pasted almost unmodified (if
HtmlCleaner's 3-clause BSD license allows it:
http://htmlcleaner.sourceforge.net/license.php).
A potential extension would be a loop over multiple charset declarations
(I have seen such malformed HTML in the wild; I doubt OpenOffice would ever
produce such a thing, but who knows what the HTML importer might get reused
for later?), breaking out on the first supported charset, roughly as
sketched below.
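An untested sketch of that loop; it assumes the compiled pattern from the
Utils.java snippet quoted below, and getFirstSupportedCharset is just an
illustrative name:

// Hypothetical extension: scan ALL charset declarations in the sniffed
// prefix and return the first one the JVM supports, instead of stopping
// at the first match.
// Needs: java.nio.charset.Charset, java.util.regex.Matcher,
// java.util.regex.Pattern
private static String getFirstSupportedCharset(String startContent,
        Pattern metaCharsetPattern) {
    Matcher matcher = metaCharsetPattern.matcher(startContent);
    while (matcher.find()) {
        String charset = matcher.group(1);
        if (Charset.isSupported(charset)) {
            return charset; // break out on the 1st supported charset
        }
    }
    return null; // no usable declaration found
}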
Or just copy-paste it as-is, returning the first match and doing no more
guessing...
....
org/htmlcleaner/Utils.java
// Needs: java.io.IOException, java.io.InputStream, java.net.URL,
// java.nio.charset.Charset, java.util.regex.Matcher, java.util.regex.Pattern
public static String getCharsetFromContent(URL url) throws IOException {
    InputStream stream = url.openStream();
    try {
        // Sniff only the first 2 KB; a meta charset declaration should
        // appear near the top of the document.
        byte[] chunk = new byte[2048];
        int bytesRead = stream.read(chunk);
        if (bytesRead > 0) {
            // Decode only the bytes actually read (the original used
            // new String(chunk), which also picks up trailing zero bytes).
            String startContent = new String(chunk, 0, bytesRead);
            // Matches <meta http-equiv="content-type"
            //               content="text/html; charset=...">
            String pattern =
                "\\<meta\\s*http-equiv=[\\\"\\']content-type[\\\"\\']\\s*"
                + "content\\s*=\\s*[\"']text/html\\s*;\\s*charset=([a-z\\d\\-]*)[\\\"\\'\\>]";
            Matcher matcher = Pattern.compile(pattern,
                    Pattern.CASE_INSENSITIVE).matcher(startContent);
            if (matcher.find()) {
                String charset = matcher.group(1);
                // Only return names this JVM can actually decode.
                if (Charset.isSupported(charset)) {
                    return charset;
                }
            }
        }
        return null;
    } finally {
        stream.close(); // the original leaked the stream
    }
}
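For the importer, usage might look roughly like this (untested; the URL and
the ISO-8859-1 fallback, HTML's traditional default, are just assumptions):

// Detect the charset up front, then open a correctly decoded reader.
// Needs: java.io.InputStreamReader, java.io.Reader, java.net.URL
URL url = new URL("http://example.com/page.html"); // hypothetical input
String charset = Utils.getCharsetFromContent(url);
Reader reader = new InputStreamReader(url.openStream(),
        charset != null ? charset : "ISO-8859-1"); // fall back to a default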
-----------------
Another approach might be to use HTML Parser:
http://htmlparser.sourceforge.net/faq.html#encodingchangeexception
This sounds like the right fit: a parser able to make an initial assumption
about the charset and re-scan if it proves wrong.
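If I read that FAQ correctly, the retry loop would look roughly like this
(an untested sketch against HTML Parser's API; parseWithRetry is just an
illustrative name):

import org.htmlparser.Parser;
import org.htmlparser.util.EncodingChangeException;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

public class CharsetRetryExample {
    public static NodeList parseWithRetry(String url) throws ParserException {
        Parser parser = new Parser(url); // starts with an assumed encoding
        try {
            return parser.parse(null);   // null filter: keep every node
        } catch (EncodingChangeException e) {
            // A meta tag declared a charset different from the assumed one;
            // the parser has switched encodings, so rewind and re-scan.
            parser.reset();
            return parser.parse(null);
        }
    }
}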