Re: [xwiki-devs] [xwiki-notifications] r14425 - sandbox/xwiki-plugin-officeimporter/src/main/java/com/xpn/xwiki/plugin/officeimporter/filter

24 Nov 2008

Hi Vincent,
Are you sure? Then there's a bug. It should generate 2 empty lines.
...

 But when is converted into xwiki/2.0 syntax the result is
 (% align="justify"%)
 Again there should be 2 new lines below the (%...%) text.
Yes, both of these are true, 2 empty lines are there.
...
 To implement a moderate filtering mechanism, I need to allow the align
 attribute of the tag, because that's required to make the content
 appear more pleasing to the user. But when I do this, since the
 OpenOffice server uses a lot of (a.k.a. empty paragraph elements) for
 spacing, the resulting wiki content contains a lot of isolated
 (%align="justify"%) elements, which doesn't look nice.
 ok so you want to remove attributes for empty paragraphs? Then the
 method above is badly named and the javadoc not correct since it says
 it removes empty paragraphs.
 This is the sole reason why I'm trying to rip off empty paragraph
 elements, so that I can reduce the number of (%%) in the output.
 I'm not sure why you'd want to remove empty paragraphs at all. I
 understand removing empty paragraph attributes though. An empty
 paragraph has a meaning, it means 2 empty lines. If you remove them
 won't you lose those empty lines?
You are correct. I thought of the empty paragraphs as garbage and wanted to
get rid of them. The correct approach indeed is to remove the attributes
from those empty paragraphs (so that we don't lose the empty lines).
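
A minimal sketch of that approach, using only the standard org.w3c.dom API. The class and method names (EmptyParagraphAttributes, parse, strip) are hypothetical and are not the office importer's actual filter code; the sketch also assumes the cleaned HTML is available as a well-formed DOM document.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the office importer.
class EmptyParagraphAttributes {

    /** Parses a well-formed XHTML fragment into a DOM document. */
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Not well-formed XHTML", e);
        }
    }

    /**
     * Removes every attribute from <p> elements that have no text content.
     * The elements themselves are kept, so the empty lines they stand for
     * are not lost. Returns the number of paragraphs that were cleaned.
     */
    static int strip(Document doc) {
        int cleaned = 0;
        NodeList paragraphs = doc.getElementsByTagName("p");
        for (int i = 0; i < paragraphs.getLength(); i++) {
            Element p = (Element) paragraphs.item(i);
            if (p.hasAttributes() && p.getTextContent().trim().isEmpty()) {
                // The attribute map is live, so keep removing the first entry.
                NamedNodeMap attrs = p.getAttributes();
                while (attrs.getLength() > 0) {
                    attrs.removeNamedItem(attrs.item(0).getNodeName());
                }
                cleaned++;
            }
        }
        return cleaned;
    }

    public static void main(String[] args) {
        Document doc = parse(
            "<body><p align=\"justify\"></p><p align=\"justify\">hello</p></body>");
        System.out.println("paragraphs cleaned: " + strip(doc));
    }
}
```

Only the empty paragraph loses its align attribute; the paragraph containing "hello" is left untouched.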
...
 Now why do I look for <img> tags and not others?
 Well, as you can see in the code, there is a check to see if the
 element has any text content in its subtree. If there is no such text
 content, we can guess that this is one of those empty tags used by the
 OpenOffice server for spacing.
 Why? For example:
 hello
 The text for P is empty. 
Not really; when you call Element.getTextContent() on this particular
example, it returns "hello". Please refer to the javadoc:
http://java.sun.com/j2se/1.5.0/docs/api/org/w3c/dom/Node.html#setTextContent%28java.lang.String%29
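
To illustrate the behaviour described above, here is a small sketch. The markup in it is an assumption (the example quoted above did not survive in this archive), and TextContentDemo and textOf are hypothetical names. It shows that getTextContent() concatenates the text of all descendants, so text nested inside a child element still counts.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class; the input markup is assumed for illustration.
class TextContentDemo {

    /** Parses a well-formed fragment and returns the root element's text content. */
    static String textOf(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalArgumentException("Not well-formed XML: " + xml, e);
        }
    }

    public static void main(String[] args) {
        // Text nested in a child element is included.
        System.out.println("[" + textOf("<p><span>hello</span></p>") + "]");
        // An element holding only an image has no text content.
        System.out.println("[" + textOf("<p><img src=\"a.png\"/></p>") + "]");
        // A truly empty paragraph has no text content either.
        System.out.println("[" + textOf("<p></p>") + "]");
    }
}
```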
...
 But there are situations where tags are used to enclose non-textual
 content like images. So we need to pay special attention to those
 cases.
 Why images? Why not the other 20 or so HTML elements that can go in a
 paragraph? (that was my initial question).
I think my previous answer addresses this question also. Even if there are
so many HTML elements, we don't have to worry about all of those as long as
Element.getTextContent() returns "". WDYT?
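
The check discussed here could be sketched roughly as follows. This is an assumption-laden illustration, not the committed filter: SpacingParagraphCheck, isSpacingOnly and NON_TEXT_TAGS are hypothetical names, and the list of non-textual tags contains only img because that is the only case named in this thread.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the office importer.
class SpacingParagraphCheck {

    // Assumption: only <img> is treated as meaningful non-textual content.
    private static final String[] NON_TEXT_TAGS = {"img"};

    /** True if the element has no text in its subtree and no non-textual content. */
    static boolean isSpacingOnly(Element p) {
        if (!p.getTextContent().trim().isEmpty()) {
            return false;
        }
        for (String tag : NON_TEXT_TAGS) {
            if (p.getElementsByTagName(tag).getLength() > 0) {
                return false;
            }
        }
        return true;
    }

    /** Convenience overload that parses a well-formed fragment first. */
    static boolean isSpacingOnly(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
            return isSpacingOnly(root);
        } catch (Exception e) {
            throw new IllegalArgumentException("Not well-formed XML: " + xml, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isSpacingOnly("<p align=\"justify\"></p>"));
        System.out.println(isSpacingOnly("<p><img src=\"a.png\"/></p>"));
    }
}
```

Any other element that carries meaning without text would pass this check as "spacing only" unless it is added to NON_TEXT_TAGS, which is the concern behind the question above.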
...

 I hope I made myself clear. It's true that it might be too early to
 think about a moderate filtering function, but I couldn't resist. The
 thing is, the wiki content produced under strict filtering looks very
 nice, but the output of it does not; it simply doesn't look like the
 original doc!
 Again I'm not sure this code is in the right place.
 Well, this might be true. I didn't know if html cleaner can be used to
 strip off empty tags or such. Is that possible?
 You are the one who should tell me! :)
 BTW I didn't say HTML cleaner. I said HTML cleaner or XHTML parser.
 I'm not working on the office importer so I can't answer but you
 should be able to answer. 
Well, the use of empty tags for spacing is specific to the OpenOffice server.
So yes, I think the code belongs to office-importer.
...
 For example, what will happen if the user provides an empty P as input
 for a wiki page (using XHTML syntax)? Your code will never run in this
 case.
 I don't understand what you mean by this. Are you referring to a
 situation where the original office document contains html content
 entered by the user? Please explain. Because the only input against
 which this code executes is the html output generated from office
 documents. Not user entered stuff.
 Users can enter content into pages using different syntaxes. One of
 them is HTML/XHTML. So as a user, I can enter:
 
 We can't prevent users entering this. So we need to handle it and
 ensure it produces what it should produce. As you can see in my use
 case there's no office importer code executed but we still need to
 handle empty paragraphs.

I agree. Stripping empty <p> elements is not a good solution. I will change
the code so that all attributes of empty paragraphs are removed instead.
Thanks a lot for pointing this out. :)
- Asiri
