Re: [xwiki-devs] [xwiki-notifications] r17078 - in platform/core/trunk/xwiki-officeimporter/src: main/java/org/xwiki/officeimporter/filter test/java/org/xwiki/officeimporter/internal/cleaner

27 Feb 2009

Hi Asiri,

On Feb 27, 2009, at 12:32 PM, asiri (SVN) wrote:

...
  Author: asiri
 Date: 2009-02-27 12:32:21 +0100 (Fri, 27 Feb 2009)
 New Revision: 17078

 Added:
   platform/core/trunk/xwiki-officeimporter/src/test/java/org/xwiki/officeimporter/internal/cleaner/AbstractHTMLCleaningTest.java
   platform/core/trunk/xwiki-officeimporter/src/test/java/org/xwiki/officeimporter/internal/cleaner/EmptyLineParagraphOpenOfficeCleaningTest.java
   platform/core/trunk/xwiki-officeimporter/src/test/java/org/xwiki/officeimporter/internal/cleaner/ImageOpenOfficeCleaningTest.java
   platform/core/trunk/xwiki-officeimporter/src/test/java/org/xwiki/officeimporter/internal/cleaner/InvalidTagOpenOfficeCleaningTest.java
   platform/core/trunk/xwiki-officeimporter/src/test/java/org/xwiki/officeimporter/internal/cleaner/LineBreakOpenOfficeCleaningTest.java
   platform/core/trunk/xwiki-officeimporter/src/test/java/org/xwiki/officeimporter/internal/cleaner/LinkOpenOfficeCleaningTest.java
   platform/core/trunk/xwiki-officeimporter/src/test/java/org/xwiki/officeimporter/internal/cleaner/ListOpenOfficeCleaningTest.java
   platform/core/trunk/xwiki-officeimporter/src/test/java/org/xwiki/officeimporter/internal/cleaner/MiscWysiwygCleaningTest.java
   platform/core/trunk/xwiki-officeimporter/src/test/java/org/xwiki/officeimporter/internal/cleaner/RedundantTagOpenOfficeCleaningTest.java
   platform/core/trunk/xwiki-officeimporter/src/test/java/org/xwiki/officeimporter/internal/cleaner/TableOpenOfficeCleaningTest.java
 Removed:
   platform/core/trunk/xwiki-officeimporter/src/test/java/org/xwiki/officeimporter/internal/cleaner/AbstractHTMLCleanerTest.java
   platform/core/trunk/xwiki-officeimporter/src/test/java/org/xwiki/officeimporter/internal/cleaner/OpenOfficeHTMLCleanerTest.java
   platform/core/trunk/xwiki-officeimporter/src/test/java/org/xwiki/officeimporter/internal/cleaner/WysiwygHTMLCleanerTest.java
 Modified:
   platform/core/trunk/xwiki-officeimporter/src/main/java/org/xwiki/officeimporter/filter/LineBreakFilter.java
 Log:
 XWIKI-3265: Restructure officeimporter test cases + write more tests

 * Completed. 
[snip]

...
  +public class InvalidTagOpenOfficeCleaningTest extends AbstractHTMLCleaningTest
 +{
 +    /**
 +     * {@code <style>} tags should be stripped from html content.
 +     */
 +    public void testStyleTagRemoving()
 +    {
 +        String html =
 +            "<html><head><title>Title</title>" + "<style type=\"text/css\">h1 {color:red} p {color:blue} </style>"
 +                + "</head><body>" + footer;
 +        Document doc = openOfficeHTMLCleaner.clean(new StringReader(html));
 +        NodeList nodes = doc.getElementsByTagName("style");
 +        assertEquals(0, nodes.getLength());
 +    }
 +
 +    /**
 +     * {@code <style>} tags should be stripped from html content. 
Copy-paste error: this Javadoc should say <script>.

...
  +     */
 +    public void testScriptTagRemoving()
 +    {
 +        String html = header + "<script type=\"text/javascript\">document.write(\"Hello World!\")</script>" + footer;
 +        Document doc = openOfficeHTMLCleaner.clean(new StringReader(html));
 +        NodeList nodes = doc.getElementsByTagName("script");
 +        assertEquals(0, nodes.getLength());
 +    }
 +}

[snip]

...
  +    /**
 +     * {@code <br/>} elements placed next to paragraph elements should be converted to {@code<div
 +     * class="wikikmodel-emptyline"/>} elements.
 +     */
 +    public void testLineBreaksNextToParagraphElements()
 +    {
 +        checkLineBreakReplacements("<br/><br/><p>para</p>", 0, 2);
 +        checkLineBreakReplacements("<p>para</p><br/><br/>", 0, 2);
 +        checkLineBreakReplacements("<p>para</p><br/><br/><p>para</p>", 0, 2);
 +    } 
Shouldn't this be done by the default HTML Cleaner?
Same for the other tests in this category.

...
  +    /**
 +     * The html generated by open office server includes anchors of the form {@code<a name="table1"><h1>Sheet 2:
 +     * <em>Hello</em></h1></a>} and the default html cleaner converts them to {@code <a name="table1"/><h1><a
 +     * name="table1">Sheet 1: <em>Hello</em></a></h1>} this is because of the close-before-copy-inside
 +     * behaviour of default html cleaner. Thus the additional (copy-inside) anchor needs to be ripped off.
This looks like a bug in the default HTML cleaner, no?
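
To make the workaround concrete: what the Javadoc describes amounts to keeping the first named anchor and unwrapping the copy that ends up inside the heading. Here is a minimal sketch using only the JDK DOM API. The class and method names are hypothetical and this is not the importer's actual filter; it assumes the input is already well-formed XHTML.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical illustration; not the officeimporter's actual filter. */
final class DuplicateAnchorSketch
{
    private DuplicateAnchorSketch()
    {
    }

    /** Parses a well-formed XHTML fragment with the JDK's DOM parser. */
    static Document parse(String xhtml)
    {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse fragment", e);
        }
    }

    /** Keeps the first anchor carrying a given name and unwraps any later anchor that repeats it. */
    static void unwrapDuplicateAnchors(Document doc)
    {
        // getElementsByTagName returns a live list, so snapshot it before mutating the tree.
        NodeList live = doc.getElementsByTagName("a");
        List<Element> anchors = new ArrayList<>();
        for (int i = 0; i < live.getLength(); i++) {
            anchors.add((Element) live.item(i));
        }
        Set<String> seenNames = new HashSet<>();
        for (Element anchor : anchors) {
            String name = anchor.getAttribute("name");
            if (name.isEmpty() || seenNames.add(name)) {
                continue; // unnamed anchor, or first occurrence of this name
            }
            // Move the duplicate's children up in order, then drop the empty anchor.
            Node parent = anchor.getParentNode();
            while (anchor.getFirstChild() != null) {
                parent.insertBefore(anchor.getFirstChild(), anchor);
            }
            parent.removeChild(anchor);
        }
    }
}
```

If the default cleaner stopped copying the anchor inside the heading, a step like this would become unnecessary.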

...
  +    /**
 +     * If there are leading spaces within the content of a list item ({@code<li/>}) they should be trimmed.
 +     */
 +    public void testListItemContentLeadingSpaceTrimming()
 +    {
 +        String html = header + "<ol><li> Test</li></ol>" + footer;
 +        Document doc = openOfficeHTMLCleaner.clean(new StringReader(html));
 +        NodeList nodes = doc.getElementsByTagName("li");
 +        Node listContent = nodes.item(0).getFirstChild();
 +        assertEquals(Node.TEXT_NODE, listContent.getNodeType());
 +        assertEquals("Test", listContent.getNodeValue());
 +    } 
Shouldn't this be done in the default HTML cleaner? Actually I think  
this is already done in the XHTML parser by the whitespace XML filter.  
If not then it's a bug of the whitespace filter.

For all bugs, please reference the JIRA issue in the Javadoc and explain
that the code will be removed once the bug is fixed.
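
As an illustration of the kind of Javadoc I mean, here is a minimal sketch of a workaround that documents why it exists and when it can go away. The class name is hypothetical, the issue key XWIKI-NNNN is a placeholder (not a real issue number), and the trimming logic uses only the JDK DOM API rather than the importer's own filters.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper; not part of the officeimporter module. */
final class ListItemWhitespaceWorkaround
{
    private ListItemWhitespaceWorkaround()
    {
    }

    /** Parses a well-formed XHTML fragment with the JDK's DOM parser. */
    static Document parse(String xhtml)
    {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse fragment", e);
        }
    }

    /**
     * Trims leading whitespace from the first text node of every {@code <li>} element.
     * <p>
     * Workaround for XWIKI-NNNN (placeholder key): the whitespace filter is expected to do this.
     * Remove this method once that issue is fixed.
     */
    static void trimLeadingSpaces(Document doc)
    {
        NodeList items = doc.getElementsByTagName("li");
        for (int i = 0; i < items.getLength(); i++) {
            Node first = items.item(i).getFirstChild();
            if (first != null && first.getNodeType() == Node.TEXT_NODE) {
                first.setNodeValue(first.getNodeValue().replaceFirst("^\\s+", ""));
            }
        }
    }
}
```

With the issue key in the Javadoc, anyone fixing the underlying bug can search for it and delete the workaround along with its test.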

...
  +
 +    /**
 +     * If there is a leading paragraph inside a list item, it should be replaced with it's content.
 +     */
 +    public void testListItemContentIsolatedParagraphCleaning()
 +    {
 +        String html = header + "<ol><li><p>Test</p></li></ol>" + footer;
 +        Document doc = openOfficeHTMLCleaner.clean(new StringReader(html));
 +        NodeList nodes = doc.getElementsByTagName("li");
 +        Node listContent = nodes.item(0).getFirstChild();
 +        assertEquals(Node.TEXT_NODE, listContent.getNodeType());
 +        assertEquals("Test", listContent.getNodeValue());
 +    }
 +} 
This should be handled by a combination of both XHTML parser and Wiki  
Syntax Renderer and/or by the default HTML cleaner.

...
  +    /**
 +     * Test cleaning of html paragraphs brearing namespaces.
 +     */
 +    public void testParagraphsWithNamespaces()
 +    {
 +        String html = header + "<w:p>paragraph</w:p>" + footer;
 +        Document doc =
 +            wysiwygHTMLCleaner.clean(new StringReader(html), Collections.singletonMap(HTMLCleaner.NAMESPACES_AWARE,
 +                "false"));
 +        NodeList nodes = doc.getElementsByTagName("p");
 +        assertEquals(1, nodes.getLength());
 +    } 
Hmm... I think this needs to be reviewed, and we need to check whether the
WikiModel XHTML parser supports namespaces.

...
  +
 +    /**
 +     * The source of the images in copy pasted html content should be replaces with 'Missing.png' since they can't be
 +     * uploaded automatically.
 +     */
 +    public void testImageFiltering()
 +    {
 +        String html = header + "<img src=\"file://path/to/local/image.png\"/>" + footer;
 +        Document doc = wysiwygHTMLCleaner.clean(new StringReader(html));
 +        NodeList nodes = doc.getElementsByTagName("img");
 +        assertEquals(1, nodes.getLength());
 +        Element image = (Element) nodes.item(0);
 +        Node startComment = image.getPreviousSibling();
 +        Node stopComment = image.getNextSibling();
 +        assertEquals(Node.COMMENT_NODE, startComment.getNodeType());
 +        assertTrue(startComment.getNodeValue().equals("startimage:Missing.png"));
It should be lowercase "missing.png". So does this mean a missing.png
image needs to be present in all skins?

Has this been discussed and is everyone aware of this?

...
  +    /**
 +     * Test filtering of those tags which doesn't have any attributes set.
 +     */
 +    public void testFilterIfZeroAttributes()
 +    {
 +        String htmlTemplate = header + "<p>Test%sRedundant%sFiltering</p>" + footer;
 +        String[] filterIfZeroAttributesTags = new String[] {"span", "div"};
 +        for (String tag : filterIfZeroAttributesTags) {
 +            String startTag = "<" + tag + ">";
 +            String endTag = "</" + tag + ">";
 +            String html = String.format(htmlTemplate, startTag, endTag);
 +            Document doc = openOfficeHTMLCleaner.clean(new StringReader(html));
 +            NodeList nodes = doc.getElementsByTagName(tag);
 +            assertEquals(0, nodes.getLength());
 +        }
 +    } 
Shouldn't this be done in the default HTML cleaner?

...
  +
 +    /**
 +     * Test filtering of those tags which doesn't have any textual content in them.
 +     */
 +    public void testFilterIfNoContent()
 +    {
 +        String htmlTemplate = header + "<p>Test%sRedundant%s%s%sFiltering</p>" + footer;
 +        String[] filterIfNoContentTags =
 +            new String[] {"em", "strong", "dfn", "code", "samp", "kbd", "var", "cite", "abbr", "acronym", "address",
 +            "blockquote", "q", "pre", "h1", "h2", "h3", "h4", "h5", "h6"};
 +        for (String tag : filterIfNoContentTags) {
 +            String startTag = "<" + tag + ">";
 +            String endTag = "</" + tag + ">";
 +            String html = String.format(htmlTemplate, startTag, endTag, startTag, endTag);
 +            Document doc = openOfficeHTMLCleaner.clean(new StringReader(html));
 +            NodeList nodes = doc.getElementsByTagName(tag);
 +            assertEquals(1, nodes.getLength());
 +        }
 +    }
 +} 
Shouldn't this be done in the default HTML cleaner?

...
  +    /**
 +     * An isolated paragraph inside a table cell item should be replaced with paragraph's content.
 +     */
 +    public void testTableCellItemIsolatedParagraphCleaning()
 +    {
 +        String html = header + "<table><tr><td><p>Test</p></td></tr></table>" + footer;
 +        Document doc = openOfficeHTMLCleaner.clean(new StringReader(html));
 +        NodeList nodes = doc.getElementsByTagName("td");
 +        Node cellContent = nodes.item(0).getFirstChild();
 +        assertEquals(Node.TEXT_NODE, cellContent.getNodeType());
 +        assertEquals("Test", cellContent.getNodeValue());
 +    } 
Isn't this already tested above?
In any case shouldn't this be moved out of the importer?
Same for the other tests in the same category.

...
  +    /**
 +     * If multiple paragraphs are found inside a table cell item, they should be wrapped in an embedded document.
 +     */
 +    public void testTableCellItemMultipleParagraphWrapping()
 +    {
 +        assertEquals(true, checkEmbeddedDocumentGeneration("<table><tr><td><p>Test</p><p>Test</p></td></tr></table>",
 +            "td"));
 +    } 
This looks like a bug in the XHTML parser.
Same for other tests in the same category.

...
  +
 +    /**
 +     * Empty rows should be removed.
 +     */
 +    public void testEmptyRowRemoving()
 +    {
 +        String html = header + "<table><tbody><tr><td>cell</td></tr><tr></tr></tbody></table>" + footer;
 +        Document doc = openOfficeHTMLCleaner.clean(new StringReader(html));
 +        NodeList nodes = doc.getElementsByTagName("tr");
 +        assertEquals(1, nodes.getLength());
 +    } 
Shouldn't this be done in the default HTML cleaner?

Thanks
-Vincent
http://xwiki.com
http://xwiki.org
http://massol.net
