Hi Vincent,
+    /**
+     * OpenOffice server generates redundant paragraphs (with empty content) to achieve spacing.
+     * These paragraphs should be stripped off / replaced with {@code <br/>} elements appropriately
+     * because otherwise they result in spurious {@code (%%)} elements in generated xwiki content.
+     *
+     * @param document The html document.
+     */
+    private void filterEmptyParagraphs(Document document)
+    {
+        NodeList paragraphs = document.getElementsByTagName("p");
+        for (int i = 0; i < paragraphs.getLength(); i++) {
+            Element paragraph = (Element) paragraphs.item(i);
+            if (paragraph.getTextContent().trim().equals("")) {
+                // We suspect this is an empty paragraph but it is possible that it contains other
+                // non-textual tags like images. For the moment we'll only search for internal image
+                // tags, we might have to refine this criterion later.
+                NodeList internalImages = paragraph.getElementsByTagName("img");
+                if (internalImages.getLength() == 0) {
+                    paragraph.getParentNode().removeChild(paragraph);
+                    i--;
I don't understand this algorithm. There can be a lot of other valid
XHTML elements inside a P element (see the XHTML spec). Why are you
searching for IMG tags at all? What about all the other tags?
Let me explain.
I'm looking for ways to implement a moderate syntax filtering mechanism.
Currently, in the strict filtering mode, we strip most of the attributes
from elements, including those of the <p> tag. Because of this, if a
particular <p> element is empty (it has no textual content) and all of
its attributes have been stripped off, the resulting xwiki content has no
(%%) elements.
Example: when <p></p> is converted into xwiki/2.0 the result is empty, but
when <p align="justify"></p> is converted into xwiki/2.0 syntax the result
is (% align="justify" %).
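To make that concrete, here is roughly what the two cases look like side by
side (I'm writing the xwiki/2.0 output from memory, so the exact whitespace
may differ slightly):

    <p align="justify">Some text.</p>   ->   (% align="justify" %)
                                             Some text.

    <p align="justify"></p>             ->   (% align="justify" %)

In the first case the parameter block is attached to real content; in the
second it is left dangling, which is exactly the clutter I want to avoid.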
To implement a moderate filtering mechanism, I need to allow the align
attribute of the <p> tag, because that's required to make the content appear
more pleasing to the user. But when I do this, since the OpenOffice server
uses a lot of <p align="justify"></p> (a.k.a. empty paragraph elements) for
spacing, the resulting wiki content contains a lot of isolated
(% align="justify" %) elements, which don't look nice.
This is the sole reason why I'm trying to strip off empty paragraph elements:
so that I can reduce the number of (%%) elements in the output.
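For the record, here is roughly what the <br/> variant mentioned in the
javadoc could look like. This is just a sketch (untested, and
replaceEmptySpacerParagraph is a placeholder name), using the same
org.w3c.dom types as the patch:

    import org.w3c.dom.Document;
    import org.w3c.dom.Element;

    /**
     * Sketch: instead of dropping a spacer paragraph outright, replace it
     * with a line break so the vertical spacing survives in the output.
     */
    private void replaceEmptySpacerParagraph(Document document, Element paragraph)
    {
        Element lineBreak = document.createElement("br");
        paragraph.getParentNode().replaceChild(lineBreak, paragraph);
    }

Note that since getElementsByTagName() returns a live NodeList, replacing a
<p> shrinks the list just like removeChild() does, so the i-- adjustment in
the loop would still be needed.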
Now, why do I look for <img> tags and not others?
Well, as you can see in the code, there is a check to see whether the <p>
element has any text content in its subtree. If there is no such text
content, we can guess that this is one of those empty <p> tags used by the
OpenOffice server for spacing. But there are situations where <p> tags are
used to enclose non-textual content like images, so we need to pay special
attention to those cases.
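If the <img> check turns out to be too narrow, one possible refinement
(again just a sketch, not tested) would be to treat a <p> as a spacer only
when it has no text and no element children at all, which would also cover
the other valid XHTML elements you mention:

    import org.w3c.dom.Element;
    import org.w3c.dom.Node;
    import org.w3c.dom.NodeList;

    /**
     * Sketch: a paragraph counts as a "spacer" only if it contains neither
     * text nor any child elements whatsoever. More conservative than the
     * img-only check: anything non-textual keeps the paragraph alive.
     */
    private boolean isSpacerParagraph(Element paragraph)
    {
        if (!paragraph.getTextContent().trim().equals("")) {
            return false;
        }
        NodeList children = paragraph.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            if (children.item(i).getNodeType() == Node.ELEMENT_NODE) {
                return false;
            }
        }
        return true;
    }

The trade-off is that a paragraph containing only, say, an empty <span>
would now be kept, but erring on the side of keeping content seems safer.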
I hope I made myself clear. It's true that it might be too early to think
about a moderate filtering function, but I couldn't resist. The thing is,
the wiki content produced under strict filtering looks very nice, but its
rendered output does not; it simply doesn't look like the original document!
Again I'm not sure this code is in the right place.
Well, this might be true. I don't know whether the HTML cleaner can be used
to strip off empty <p> tags or the like. Is that possible?
For example what will happen if the user provides an empty P as input for a
wiki page (using XHTML syntax)? Your code will never run in this case.
I don't understand what you mean by this. Are you referring to a situation
where the original office document contains HTML content entered by the
user? Please explain, because the only input against which this code
executes is the HTML output generated from office documents, not
user-entered content.
To be honest I haven't looked at the office importer code for some time but
we really need to be careful to move all generic code outside of it and
into the HTML cleaner or into the XHTML parser.
I agree.
Please let me know what you think about my above comments.
Thanks.
- Asiri