New subject: [xwiki-devs] [xwiki-notifications] r14425 - sandbox/xwiki-plugin-officeimporter/src/main/java/com/xpn/xwiki/plugin/officeimporter/filter

24 Nov 2008

Hi Asiri,
On Nov 24, 2008, at 6:08 PM, asiri (SVN) wrote:
...
  Author: asiri
 Date: 2008-11-24 18:08:51 +0100 (Mon, 24 Nov 2008)
 New Revision: 14425
 Modified:
   sandbox/xwiki-plugin-officeimporter/src/main/java/com/xpn/xwiki/plugin/officeimporter/filter/RedundantTagFilter.java
 Log:
 XAOFFICE-1 : Develop the initial feature set for office-importer plugin.
 * Added support for filtering empty / redundant paragraphs.
 Modified: sandbox/xwiki-plugin-officeimporter/src/main/java/com/xpn/xwiki/plugin/officeimporter/filter/RedundantTagFilter.java
 ===================================================================
 --- sandbox/xwiki-plugin-officeimporter/src/main/java/com/xpn/xwiki/plugin/officeimporter/filter/RedundantTagFilter.java   2008-11-24 15:17:17 UTC (rev 14424)
 +++ sandbox/xwiki-plugin-officeimporter/src/main/java/com/xpn/xwiki/plugin/officeimporter/filter/RedundantTagFilter.java   2008-11-24 17:08:51 UTC (rev 14425)
 @@ -31,12 +31,13 @@
     public void filter(Document document, ImporterContext context)
     {
 -        for(String key : attributeWiseFilteredTags) {
 +        for (String key : attributeWiseFilteredTags) {
             filterNodesWithZeroAttributes(document.getElementsByTagName(key));
         }
 -        for(String key : contentWiseFilteredTags) {
 +        for (String key : contentWiseFilteredTags) {
             filterNodesWithEmptyTextContent(document.getElementsByTagName(key));
 -        }
 +        }
 +        filterEmptyParagraphs(document);
     }
     /**
 @@ -70,10 +71,35 @@
     {
         for (int i = 0; i < elements.getLength(); i++) {
             Element element = (Element) elements.item(i);
 -            if (element.getTextContent().trim().equals("")) {
 +            if (element.getTextContent().trim().equals("")) {
                 element.getParentNode().removeChild(element);
                 i--;
             }
         }
     }
 +
 +    /**
 +     * OpenOffice server generates redundant paragraphs (with empty content) to achieve spacing.
 +     * These paragraphs should be stripped off / replaced with {@code <br/>} elements appropriately
 +     * because otherwise they result in spurious {@code (%%)} elements in generated xwiki content.
 +     *
 +     * @param document The html document.
 +     */
 +    private void filterEmptyParagraphs(Document document)
 +    {
 +        NodeList paragraphs = document.getElementsByTagName("p");
 +        for (int i = 0; i < paragraphs.getLength(); i++) {
 +            Element paragraph = (Element) paragraphs.item(i);
 +            if (paragraph.getTextContent().trim().equals("")) {
 +                // We suspect this is an empty paragraph but it is possible that it contains other
 +                // non-textual tags like images. For the moment we'll only search for internal image
 +                // tags, we might have to refine this criterion later.
 +                NodeList internalImages = paragraph.getElementsByTagName("img");
 +                if (internalImages.getLength() == 0) {
 +                    paragraph.getParentNode().removeChild(paragraph);
 +                    i--;

I don't understand this algorithm. There can be a lot of other valid
XHTML elements inside a P element (see the XHTML spec). Why are you
searching for IMG tags at all? What about all the other tags?
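
To make the concern concrete, here is a rough sketch of a more conservative check: a paragraph is removed only when it contains nothing but whitespace text and br elements, so any other child element (img, object, span, and so on) keeps it alive. This is an illustration under assumptions, not code from the office importer: the class EmptyParagraphSketch and its methods isEffectivelyEmpty, removeEmptyParagraphs and parse are hypothetical names, and the "br only" rule is just one possible criterion. It uses only the standard org.w3c.dom and javax.xml.parsers APIs.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, for illustration only.
class EmptyParagraphSketch
{
    /** Assumed rule: empty means only whitespace text and br descendants. */
    static boolean isEffectivelyEmpty(Element paragraph)
    {
        if (!paragraph.getTextContent().trim().isEmpty()) {
            return false;
        }
        // "*" matches every descendant element, whatever its tag name.
        NodeList descendants = paragraph.getElementsByTagName("*");
        for (int i = 0; i < descendants.getLength(); i++) {
            if (!"br".equalsIgnoreCase(descendants.item(i).getNodeName())) {
                return false;
            }
        }
        return true;
    }

    static void removeEmptyParagraphs(Document document)
    {
        NodeList paragraphs = document.getElementsByTagName("p");
        // The NodeList is live, so walk it backwards: removing an item
        // never shifts the indices that are still to be visited.
        for (int i = paragraphs.getLength() - 1; i >= 0; i--) {
            Element paragraph = (Element) paragraphs.item(i);
            if (isEffectivelyEmpty(paragraph)) {
                paragraph.getParentNode().removeChild(paragraph);
            }
        }
    }

    /** Parses a well-formed XHTML fragment; wraps checked exceptions. */
    static Document parse(String xhtml)
    {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
    }

    public static void main(String[] args)
    {
        Document doc = parse("<body><p> </p><p><img src='a.png'/></p>"
            + "<p><span/></p><p>text</p><p><br/></p></body>");
        removeEmptyParagraphs(doc);
        // Only the whitespace-only and the br-only paragraphs are removed.
        System.out.println(doc.getElementsByTagName("p").getLength()); // prints 3
    }
}
```

Whether a paragraph holding only an empty span should also count as empty is a separate decision; the sketch deliberately errs on the side of keeping it.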
Again I'm not sure this code is in the right place. For example what
will happen if the user provides an empty P as input for a wiki page
(using XHTML syntax)? Your code will never run in this case.
To be honest I haven't looked at the office importer code for some
time but we really need to be careful to move all generic code outside
of it and into the HTML cleaner or into the XHTML parser. Could you
please check that there is only office-related code in there?
Thanks
-Vincent
