Hi,
I did not mention the related JIRA issue (if there is one) in any of the
test methods' javadoc comments, but I did mention the JIRA issue in the
class comment for the relevant HTMLFilter implementation.
+    /**
+     * {@code <style>} tags should be stripped from html content.

Copy-paste error, this should be <script>.

Needs to be fixed.
[snip]
+    /**
+     * {@code <br/>} elements placed next to paragraph elements should be converted to
+     * {@code <div class="wikimodel-emptyline"/>} elements.
+     */
+    public void testLineBreaksNextToParagraphElements()
+    {
+        checkLineBreakReplacements("<br/><br/><p>para</p>", 0, 2);
+        checkLineBreakReplacements("<p>para</p><br/><br/>", 0, 2);
+        checkLineBreakReplacements("<p>para</p><br/><br/><p>para</p>", 0, 2);
+    }
Shouldn't this be done by the default HTML Cleaner?
Same for the other tests in this category.
This filter was moved from the default html cleaner to OpenOfficeHTMLCleaner:
http://jira.xwiki.org/jira/browse/XWIKI-3101

So the default html cleaner should only clean <br> elements directly under
the <body> element. But here these <br/> elements under the <body> tag are
generated as a result of unwrapping <p><br/></p> elements in
OpenOfficeHTMLCleaner (to remove extra new lines). We could modify the
unwrapping of <p><br/></p> elements to directly transform internal <br/>
elements into <div class="wikimodel-emptyline"/> elements (if the
<p><br/></p> element is directly under <body>), as in the sketch below.
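To illustrate, here is a minimal sketch of what that modified unwrapping
could look like (the class and method names are hypothetical; only the
standard org.w3c.dom API is real):

import java.util.ArrayList;
import java.util.List;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public final class EmptyLineTransformer
{
    /**
     * Hypothetical sketch: replaces every {@code <p>} directly under
     * {@code <body>} whose only children are {@code <br/>} elements with one
     * {@code <div class="wikimodel-emptyline"/>} per {@code <br/>}.
     */
    public static void transformEmptyParagraphs(Document document)
    {
        Node body = document.getElementsByTagName("body").item(0);
        if (body == null) {
            return;
        }
        // Collect candidates first; mutating the tree while walking the
        // sibling chain would skip nodes.
        List<Node> candidates = new ArrayList<Node>();
        for (Node child = body.getFirstChild(); child != null; child = child.getNextSibling()) {
            if ("p".equals(child.getNodeName()) && containsOnlyLineBreaks(child)) {
                candidates.add(child);
            }
        }
        for (Node paragraph : candidates) {
            int lineBreaks = paragraph.getChildNodes().getLength();
            for (int i = 0; i < lineBreaks; i++) {
                Element emptyLine = document.createElement("div");
                emptyLine.setAttribute("class", "wikimodel-emptyline");
                body.insertBefore(emptyLine, paragraph);
            }
            body.removeChild(paragraph);
        }
    }

    private static boolean containsOnlyLineBreaks(Node paragraph)
    {
        NodeList children = paragraph.getChildNodes();
        if (children.getLength() == 0) {
            return false;
        }
        for (int i = 0; i < children.getLength(); i++) {
            if (!"br".equals(children.item(i).getNodeName())) {
                return false;
            }
        }
        return true;
    }
}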
+    /**
+     * The html generated by the open office server includes anchors of the form
+     * {@code <a name="table1"><h1>Sheet 2: <em>Hello</em></h1></a>} and the default html cleaner
+     * converts them to {@code <a name="table1"/><h1><a name="table1">Sheet 2: <em>Hello</em></a></h1>};
+     * this is because of the close-before-copy-inside behaviour of the default html cleaner. Thus the
+     * additional (copy-inside) anchor needs to be ripped off.
This looks like a bug in the default HTML cleaner, no?
Well, this behavior is well documented in the htmlcleaner default
configuration file, http://htmlcleaner.sourceforge.net/default.xml, which says:

"close-before-copy-inside-tags - tags that must be closed and copied - for
example, in <a href="#"><div>.... tag A must be closed before DIV but
copied again inside DIV"

Now the copy-inside behaviour is acceptable, for example:

<a name="test"><h1>hi</h1></a> --> invalid xhtml
<h1><a name="test">hi</a></h1> --> valid xhtml

But the problem is that htmlcleaner does not remove the original anchor,
which leads to:

<a name="test"/><h1><a name="test">hi</a></h1>

This is how it becomes "close-before-copy-inside".
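For illustration, here is one possible shape of that rip-off step; this
sketch removes the empty leftover anchor and keeps the copy-inside one
(class and method names are hypothetical, and the actual filter might
instead drop the copy-inside anchor):

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public final class RedundantAnchorFilter
{
    /**
     * Hypothetical sketch: removes the empty {@code <a name="..."/>} left
     * behind by htmlcleaner's close-before-copy-inside handling whenever the
     * next sibling already contains an anchor with the same name.
     */
    public static void stripRedundantAnchors(Document document)
    {
        NodeList anchors = document.getElementsByTagName("a");
        // Iterate backwards: removals shrink the live NodeList.
        for (int i = anchors.getLength() - 1; i >= 0; i--) {
            Element anchor = (Element) anchors.item(i);
            if (anchor.hasAttribute("name") && !anchor.hasChildNodes()
                && nextSiblingHasSameAnchor(anchor)) {
                anchor.getParentNode().removeChild(anchor);
            }
        }
    }

    private static boolean nextSiblingHasSameAnchor(Element anchor)
    {
        Node next = anchor.getNextSibling();
        if (!(next instanceof Element)) {
            return false;
        }
        NodeList innerAnchors = ((Element) next).getElementsByTagName("a");
        for (int i = 0; i < innerAnchors.getLength(); i++) {
            Element inner = (Element) innerAnchors.item(i);
            if (anchor.getAttribute("name").equals(inner.getAttribute("name"))) {
                return true;
            }
        }
        return false;
    }
}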
+    /**
+     * If there are leading spaces within the content of a list item ({@code <li/>}) they should be trimmed.
+     */
+    public void testListItemContentLeadingSpaceTrimming()
+    {
+        String html = header + "<ol><li> Test</li></ol>" + footer;
+        Document doc = openOfficeHTMLCleaner.clean(new StringReader(html));
+        NodeList nodes = doc.getElementsByTagName("li");
+        Node listContent = nodes.item(0).getFirstChild();
+        assertEquals(Node.TEXT_NODE, listContent.getNodeType());
+        assertEquals("Test", listContent.getNodeValue());
+    }
Shouldn't this be done in the default HTML cleaner? Actually I think
this is already done in the XHTML parser by the whitespace XML filter.
If not, then it's a bug in the whitespace filter.

Ok, it seems this issue has already been fixed in the XHTML parser.
For all bugs please refer to the jira issue in the javadoc and explain
that the code will be removed once the bug is fixed.
Ok.
+
+    /**
+     * If there is a leading paragraph inside a list item, it should be replaced with its content.
+     */
+    public void testListItemContentIsolatedParagraphCleaning()
+    {
+        String html = header + "<ol><li><p>Test</p></li></ol>" + footer;
+        Document doc = openOfficeHTMLCleaner.clean(new StringReader(html));
+        NodeList nodes = doc.getElementsByTagName("li");
+        Node listContent = nodes.item(0).getFirstChild();
+        assertEquals(Node.TEXT_NODE, listContent.getNodeType());
+        assertEquals("Test", listContent.getNodeValue());
+    }
+}
This should be handled by a combination of the XHTML parser and the Wiki
Syntax Renderer, and/or by the default HTML cleaner.

JIRA: http://jira.xwiki.org/jira/browse/XWIKI-3262

I have mentioned this JIRA in the corresponding HTMLFilter class; I need to
put it inside the test methods' javadoc comments as well.
+    /**
+     * Test cleaning of html paragraphs bearing namespaces.
+     */
+    public void testParagraphsWithNamespaces()
+    {
+        String html = header + "<w:p>paragraph</w:p>" + footer;
+        Document doc =
+            wysiwygHTMLCleaner.clean(new StringReader(html),
+                Collections.singletonMap(HTMLCleaner.NAMESPACES_AWARE, "false"));
+        NodeList nodes = doc.getElementsByTagName("p");
+        assertEquals(1, nodes.getLength());
+    }
hmmm... I think this needs to be reviewed, and we need to check if the
wikimodel XHTML parser supports namespaces.

Well, since I'm setting the nameSpacesAware attribute to 'false', all
namespace information gets ripped off while cleaning, so the XHTML parser
does not have to worry about handling namespaces.
+
+    /**
+     * The source of the images in copy-pasted html content should be replaced with 'Missing.png' since they
+     * can't be uploaded automatically.
+     */
+    public void testImageFiltering()
+    {
+        String html = header + "<img src=\"file://path/to/local/image.png\"/>" + footer;
+        Document doc = wysiwygHTMLCleaner.clean(new StringReader(html));
+        NodeList nodes = doc.getElementsByTagName("img");
+        assertEquals(1, nodes.getLength());
+        Element image = (Element) nodes.item(0);
+        Node startComment = image.getPreviousSibling();
+        Node stopComment = image.getNextSibling();
+        assertEquals(Node.COMMENT_NODE, startComment.getNodeType());
+        assertTrue(startComment.getNodeValue().equals("startimage:Missing.png"));
It should be lowercase "missing.png". So this means a missing.png
image needs to be present in all skins?
Has this been discussed, and is everyone aware of this?

This is only a workaround for the problem of not being able to upload images
in copy-pasted office content. We have discussed earlier that we should
have a placeholder for such images; a rough sketch of the filtering is below.
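A rough sketch of that workaround (the class name is hypothetical, and the
"stopimage" comment value is my assumption; the "startimage:Missing.png"
value comes from the test above):

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public final class ImagePlaceholderFilter
{
    /** Placeholder for images that cannot be uploaded automatically. */
    private static final String PLACEHOLDER = "Missing.png";

    /**
     * Hypothetical sketch: points local images at the placeholder and
     * surrounds them with the marker comments checked by testImageFiltering().
     */
    public static void filterLocalImages(Document document)
    {
        NodeList images = document.getElementsByTagName("img");
        for (int i = 0; i < images.getLength(); i++) {
            Element image = (Element) images.item(i);
            if (image.getAttribute("src").startsWith("file://")) {
                image.setAttribute("src", PLACEHOLDER);
                Node parent = image.getParentNode();
                parent.insertBefore(document.createComment("startimage:" + PLACEHOLDER), image);
                // "stopimage" is assumed; insertBefore with a null reference appends at the end.
                parent.insertBefore(document.createComment("stopimage"), image.getNextSibling());
            }
        }
    }
}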
+    /**
+     * Test filtering of those tags which don't have any attributes set.
+     */
+    public void testFilterIfZeroAttributes()
+    {
+        String htmlTemplate = header + "<p>Test%sRedundant%sFiltering</p>" + footer;
+        String[] filterIfZeroAttributesTags = new String[] {"span", "div"};
+        for (String tag : filterIfZeroAttributesTags) {
+            String startTag = "<" + tag + ">";
+            String endTag = "</" + tag + ">";
+            String html = String.format(htmlTemplate, startTag, endTag);
+            Document doc = openOfficeHTMLCleaner.clean(new StringReader(html));
+            NodeList nodes = doc.getElementsByTagName(tag);
+            assertEquals(0, nodes.getLength());
+        }
+    }
Shouldn't this be done in the default HTML cleaner?

These tags are generated as a result of StyleFilter (so putting this code in
the default html cleaner will not work). I could move this code into
StyleFilter itself, which would then strip off the resulting redundant tags
(see the sketch below).
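For illustration, the stripping inside StyleFilter could look roughly like
this (class and method names are hypothetical; only the DOM calls are real):

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public final class RedundantTagStripper
{
    /** Tags to unwrap when they end up with zero attributes. */
    private static final String[] FILTER_IF_ZERO_ATTRIBUTES = {"span", "div"};

    /**
     * Hypothetical sketch: unwraps attribute-less {@code <span>} and
     * {@code <div>} elements, promoting their children to the parent.
     */
    public static void stripAttributeLessTags(Document document)
    {
        for (String tag : FILTER_IF_ZERO_ATTRIBUTES) {
            NodeList elements = document.getElementsByTagName(tag);
            // Iterate backwards: unwrapping mutates the live NodeList.
            for (int i = elements.getLength() - 1; i >= 0; i--) {
                Element element = (Element) elements.item(i);
                if (element.getAttributes().getLength() == 0) {
                    Node parent = element.getParentNode();
                    while (element.hasChildNodes()) {
                        parent.insertBefore(element.getFirstChild(), element);
                    }
                    parent.removeChild(element);
                }
            }
        }
    }
}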
+
+    /**
+     * Test filtering of those tags which don't have any textual content in them.
+     */
+    public void testFilterIfNoContent()
+    {
+        String htmlTemplate = header + "<p>Test%sRedundant%s%s%sFiltering</p>" + footer;
+        String[] filterIfNoContentTags =
+            new String[] {"em", "strong", "dfn", "code", "samp", "kbd", "var", "cite", "abbr", "acronym",
+                "address", "blockquote", "q", "pre", "h1", "h2", "h3", "h4", "h5", "h6"};
+        for (String tag : filterIfNoContentTags) {
+            String startTag = "<" + tag + ">";
+            String endTag = "</" + tag + ">";
+            String html = String.format(htmlTemplate, startTag, endTag, startTag, endTag);
+            Document doc = openOfficeHTMLCleaner.clean(new StringReader(html));
+            NodeList nodes = doc.getElementsByTagName(tag);
+            assertEquals(1, nodes.getLength());
+        }
+    }
+}
Shouldn't this be done in the default HTML cleaner?
Yes.
+    /**
+     * An isolated paragraph inside a table cell item should be replaced with the paragraph's content.
+     */
+    public void testTableCellItemIsolatedParagraphCleaning()
+    {
+        String html = header + "<table><tr><td><p>Test</p></td></tr></table>" + footer;
+        Document doc = openOfficeHTMLCleaner.clean(new StringReader(html));
+        NodeList nodes = doc.getElementsByTagName("td");
+        Node cellContent = nodes.item(0).getFirstChild();
+        assertEquals(Node.TEXT_NODE, cellContent.getNodeType());
+        assertEquals("Test", cellContent.getNodeValue());
+    }
Isn't this already tested above?
In any case, shouldn't this be moved out of the importer?
Same for the other tests in the same category.

We have discussed earlier that we should wrap such isolated paragraphs in an
embedded document, and that the XWikiSyntaxRenderer will remove such
paragraphs, just to keep the generated wiki syntax clean. A sketch of the
cell-level cleaning is below.
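For the isolated-paragraph case, the cell-level cleaning could look roughly
like this sketch (class and method names are hypothetical):

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public final class IsolatedParagraphCleaner
{
    /**
     * Hypothetical sketch: if a table cell contains a single paragraph and
     * nothing else, replace the paragraph with its own content.
     */
    public static void cleanIsolatedParagraphs(Document document)
    {
        NodeList cells = document.getElementsByTagName("td");
        for (int i = 0; i < cells.getLength(); i++) {
            Element cell = (Element) cells.item(i);
            NodeList children = cell.getChildNodes();
            if (children.getLength() == 1 && "p".equals(children.item(0).getNodeName())) {
                Node paragraph = children.item(0);
                while (paragraph.hasChildNodes()) {
                    cell.insertBefore(paragraph.getFirstChild(), paragraph);
                }
                cell.removeChild(paragraph);
            }
        }
    }
}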
+    /**
+     * If multiple paragraphs are found inside a table cell item, they should be wrapped in an embedded document.
+     */
+    public void testTableCellItemMultipleParagraphWrapping()
+    {
+        assertEquals(true,
+            checkEmbeddedDocumentGeneration("<table><tr><td><p>Test</p><p>Test</p></td></tr></table>", "td"));
+    }
This looks like a bug in the XHTML parser.
Same for other tests in the same category.
JIRA:
http://jira.xwiki.org/jira/browse/XWIKI-3262
+
+    /**
+     * Empty rows should be removed.
+     */
+    public void testEmptyRowRemoving()
+    {
+        String html = header + "<table><tbody><tr><td>cell</td></tr><tr></tr></tbody></table>" + footer;
+        Document doc = openOfficeHTMLCleaner.clean(new StringReader(html));
+        NodeList nodes = doc.getElementsByTagName("tr");
+        assertEquals(1, nodes.getLength());
+    }
Shouldn't this be done in the default HTML cleaner?
Yes.
Finally, IMO officeimporter is a component which puts together several other
components to achieve a final result. As a module, officeimporter itself
does not contain much logic that needs lots of unit tests. The existing
unit tests either test some html cleaning function that is used as a
workaround for issues in rendering, or some OO-specific html cleaning
operation. Once those issues are fixed, the number of unit tests in
officeimporter will be even smaller.
What we can have for officeimporter is more integration tests that would
test the overall functionality. Still, I don't think the existing
integration tests are enough, because there are only a few of them and they
have complex office documents as inputs. What would be more appropriate is
to have lots of office documents, with each document testing a small piece
of office formatting, like:

bold_text.doc
italic_text.doc
empty_table.doc
// More

This way we can ensure that even complex documents will work fine, because
we have the individual formatting elements imported correctly. And the
results of these integration tests are easy to validate. (I'm assuming that
Selenium can be used to evaluate the final rendered result of the wiki
page.) A sketch of what such a test could look like follows.
btw, currently the officeimporter integration tests are not executed on
hudson, which is my fault. I will talk with raff and see if we can get a
working OO server there.
Thanks.
- Asiri