HtmlCleaner claims to have charset auto-detection in one of its methods:
http://htmlcleaner.sourceforge.net/doc/org/htmlcleaner/HtmlCleaner.html#cle…
....
Okay, this could probably be copy-pasted almost unmodified (if
HtmlCleaner's 3-clause BSD license allows it:
http://htmlcleaner.sourceforge.net/license.php).
A potential extension would be a loop over multiple charset declarations
(I have seen such malformed HTML in the wild; I doubt OpenOffice would ever
produce such a thing, but who knows what the HTML importer might get reused
for later?), breaking out on the first supported charset, roughly as
sketched below.
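An untested sketch of that loop; it assumes the compiled pattern from the
Utils.java snippet quoted below, and getFirstSupportedCharset is just an
illustrative name:

// Hypothetical extension: scan ALL charset declarations in the sniffed
// prefix and return the first one the JVM supports, instead of stopping
// at the first match.
// Needs: java.nio.charset.Charset, java.util.regex.Matcher,
// java.util.regex.Pattern
private static String getFirstSupportedCharset(String startContent,
        Pattern metaCharsetPattern) {
    Matcher matcher = metaCharsetPattern.matcher(startContent);
    while (matcher.find()) {
        String charset = matcher.group(1);
        if (Charset.isSupported(charset)) {
            return charset; // break out on the 1st supported charset
        }
    }
    return null; // no usable declaration found
}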
Or just copy-paste it as-is, returning the first match and doing no more
guessing...
....
org/htmlcleaner/Utils.java
// Needs: java.io.IOException, java.io.InputStream, java.net.URL,
// java.nio.charset.Charset, java.util.regex.Matcher, java.util.regex.Pattern
public static String getCharsetFromContent(URL url) throws IOException {
    InputStream stream = url.openStream();
    try {
        // Sniff only the first 2 KB; a meta charset declaration should
        // appear near the top of the document.
        byte[] chunk = new byte[2048];
        int bytesRead = stream.read(chunk);
        if (bytesRead > 0) {
            // Decode only the bytes actually read (the original used
            // new String(chunk), which also picks up trailing zero bytes).
            String startContent = new String(chunk, 0, bytesRead);
            // Matches <meta http-equiv="content-type"
            //               content="text/html; charset=...">
            String pattern =
                "\\<meta\\s*http-equiv=[\\\"\\']content-type[\\\"\\']\\s*"
                + "content\\s*=\\s*[\"']text/html\\s*;\\s*charset=([a-z\\d\\-]*)[\\\"\\'\\>]";
            Matcher matcher = Pattern.compile(pattern,
                    Pattern.CASE_INSENSITIVE).matcher(startContent);
            if (matcher.find()) {
                String charset = matcher.group(1);
                // Only return names this JVM can actually decode.
                if (Charset.isSupported(charset)) {
                    return charset;
                }
            }
        }
        return null;
    } finally {
        stream.close(); // the original leaked the stream
    }
}
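For the importer, usage might look roughly like this (untested; the URL and
the ISO-8859-1 fallback, HTML's traditional default, are just assumptions):

// Detect the charset up front, then open a correctly decoded reader.
// Needs: java.io.InputStreamReader, java.io.Reader, java.net.URL
URL url = new URL("http://example.com/page.html"); // hypothetical input
String charset = Utils.getCharsetFromContent(url);
Reader reader = new InputStreamReader(url.openStream(),
        charset != null ? charset : "ISO-8859-1"); // fall back to a default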
-----------------
Another approach might be to use HTML Parser:
http://htmlparser.sourceforge.net/faq.html#encodingchangeexception
This sounds like the right fit: a parser able to make an initial assumption
about the charset and re-scan if it proves wrong.
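If I read that FAQ correctly, the retry loop would look roughly like this
(an untested sketch against HTML Parser's API; parseWithRetry is just an
illustrative name):

import org.htmlparser.Parser;
import org.htmlparser.util.EncodingChangeException;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

public class CharsetRetryExample {
    public static NodeList parseWithRetry(String url) throws ParserException {
        Parser parser = new Parser(url); // starts with an assumed encoding
        try {
            return parser.parse(null);   // null filter: keep every node
        } catch (EncodingChangeException e) {
            // A meta tag declared a charset different from the assumed one;
            // the parser has switched encodings, so rewind and re-scan.
            parser.reset();
            return parser.parse(null);
        }
    }
}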