Re: [xwiki-devs] [xwiki-notifications] r14425 - sandbox/xwiki-plugin-officeimporter/src/main/java/com/xpn/xwiki/plugin/officeimporter/filter
Hi Asiri,
On Nov 24, 2008, at 6:08 PM, asiri (SVN) wrote:
> Author: asiri
> Date: 2008-11-24 18:08:51 +0100 (Mon, 24 Nov 2008)
> New Revision: 14425
> Modified: sandbox/xwiki-plugin-officeimporter/src/main/java/com/xpn/xwiki/plugin/officeimporter/filter/RedundantTagFilter.java
> Log: XAOFFICE-1 : Develop the initial feature set for office-importer plugin.
> * Added support for filtering empty / redundant paragraphs.
> Modified: sandbox/xwiki-plugin-officeimporter/src/main/java/com/xpn/xwiki/plugin/officeimporter/filter/RedundantTagFilter.java
> ===================================================================
> --- sandbox/xwiki-plugin-officeimporter/src/main/java/com/xpn/xwiki/plugin/officeimporter/filter/RedundantTagFilter.java 2008-11-24 15:17:17 UTC (rev 14424)
> +++ sandbox/xwiki-plugin-officeimporter/src/main/java/com/xpn/xwiki/plugin/officeimporter/filter/RedundantTagFilter.java 2008-11-24 17:08:51 UTC (rev 14425)
> @@ -31,12 +31,13 @@
>      public void filter(Document document, ImporterContext context)
>      {
> -        for(String key : attributeWiseFilteredTags) {
> +        for (String key : attributeWiseFilteredTags) {
>              filterNodesWithZeroAttributes(document.getElementsByTagName(key));
>          }
> -        for(String key : contentWiseFilteredTags) {
> +        for (String key : contentWiseFilteredTags) {
>              filterNodesWithEmptyTextContent(document.getElementsByTagName(key));
> -        }
> +        }
> +        filterEmptyParagraphs(document);
>      }
>
>      /**
> @@ -70,10 +71,35 @@
>      {
>          for (int i = 0; i < elements.getLength(); i++) {
>              Element element = (Element) elements.item(i);
> -            if (element.getTextContent().trim().equals("")) {
> +            if (element.getTextContent().trim().equals("")) {
>                  element.getParentNode().removeChild(element);
>                  i--;
>              }
>          }
>      }
> +
> +    /**
> +     * OpenOffice server generates redundant paragraphs (with empty content) to achieve spacing.
> +     * These paragraphs should be stripped off / replaced with {@code <br/>} elements appropriately
> +     * because otherwise they result in spurious {@code (%%)} elements in generated xwiki content.
> +     *
> +     * @param document The html document.
> +     */
> +    private void filterEmptyParagraphs(Document document)
> +    {
> +        NodeList paragraphs = document.getElementsByTagName("p");
> +        for (int i = 0; i < paragraphs.getLength(); i++) {
> +            Element paragraph = (Element) paragraphs.item(i);
> +            if (paragraph.getTextContent().trim().equals("")) {
> +                // We suspect this is an empty paragraph but it is possible that it contains other
> +                // non-textual tags like images. For the moment we'll only search for internal image
> +                // tags, we might have to refine this criterion later.
> +                NodeList internalImages = paragraph.getElementsByTagName("img");
> +                if (internalImages.getLength() == 0) {
> +                    paragraph.getParentNode().removeChild(paragraph);
> +                    i--;
I don't understand this algorithm. There can be a lot of other valid XHTML elements inside a P element (see the XHTML spec). Why are you searching for IMG tags at all? What about all the other tags?
Again I'm not sure this code is in the right place.
For example what will happen if the user provides an empty P as input for a wiki page (using XHTML syntax)? Your code will never run in this case.
To be honest I haven't looked at the office importer code for some time but we really need to be careful to move all generic code outside of it and into the HTML cleaner or into the XHTML parser.
Could you please check that there is only office-related code in there?
Thanks
-Vincent
Hi Vincent,
> I don't understand this algorithm. There can be a lot of other valid XHTML elements inside a P element (see the XHTML spec). Why are you searching for IMG tags at all? What about all the other tags?
Let me explain,
I'm looking for ways to implement a moderate syntax filtering mechanism. Currently, in the strict filtering mode, we strip off most of the attributes from elements, including those of the <p> tag. When we do this, if a particular <p> element is empty (if it has no textual content), then since all of its attributes were stripped off, the resulting xwiki content has no (%%) elements.
Example: when <p></p> is converted into xwiki/2.0, the result is empty.
But when <p align="justify"></p> is converted into xwiki/2.0 syntax, the result is (% align="justify"%)
To implement a moderate filtering mechanism, I need to allow the align attribute of the <p> tag, because that's required to make the content appear more pleasing to the user. But when I do this, since the OpenOffice server uses a lot of <p align="justify"></p> (a.k.a. empty paragraph elements) for spacing, the resulting wiki content contains a lot of isolated (%align="justify"%) elements, which don't look nice.
This is the sole reason why I'm trying to rip off empty paragraph elements, so that I can reduce the number of (%%) in the output.
Now why do I look for <img> tags and not others?
Well, as you can see in the code, there is a check to see if the <p> element has any text content in its subtree. If there is no such text content, we can guess that this is one of those empty <p> tags used by the OpenOffice server for spacing.
But there are situations where <p> tags are used to enclose non-textual content like images. So we need to pay special attention to those cases.
I hope I made myself clear. It's true that it might be too early to think about a moderate filtering function, but I couldn't resist. The thing is, the wiki content produced under strict filtering looks very nice, but not its output; it simply doesn't look like the original doc!
> Again I'm not sure this code is in the right place.
Well, this might be true. I didn't know if the HTML cleaner can be used to strip off empty <p> tags or such. Is that possible?
> For example what will happen if the user provides an empty P as input for a wiki page (using XHTML syntax)? Your code will never run in this case.
I don't understand what you mean by this. Are you referring to a situation where the original office document contains HTML content entered by the user? Please explain. Because the only input against which this code executes is the HTML output generated from office documents, not user-entered stuff.
> To be honest I haven't looked at the office importer code for some time but we really need to be careful to move all generic code outside of it and into the HTML cleaner or into the XHTML parser.
I agree.
Please let me know what you think about my above comments.
Thanks.
- Asiri
On Nov 24, 2008, at 6:46 PM, Asiri Rathnayake wrote:
> Hi Vincent,
>> I don't understand this algorithm. There can be a lot of other valid XHTML elements inside a P element (see the XHTML spec). Why are you searching for IMG tags at all? What about all the other tags?
> Let me explain,
> I'm looking for ways to implement a moderate syntax filtering mechanism. Currently, in the strict filtering mode, we strip off most of the attributes from elements, including those of the <p> tag. When we do this, if a particular <p> element is empty (if it has no textual content), then since all of its attributes were stripped off, the resulting xwiki content has no (%%) elements.
> Example: when <p></p> is converted into xwiki/2.0, the result is empty.
Are you sure? Then there's a bug. It should generate 2 empty lines.
> But when <p align="justify"></p> is converted into xwiki/2.0 syntax, the result is (% align="justify"%)
Again there should be 2 new lines below the (%...%) text.
> To implement a moderate filtering mechanism, I need to allow the align attribute of the <p> tag, because that's required to make the content appear more pleasing to the user. But when I do this, since the OpenOffice server uses a lot of <p align="justify"></p> (a.k.a. empty paragraph elements) for spacing, the resulting wiki content contains a lot of isolated (%align="justify"%) elements, which don't look nice.
OK, so you want to remove attributes for empty paragraphs? Then the method above is badly named and the javadoc is not correct, since it says it removes empty paragraphs.
> This is the sole reason why I'm trying to rip off empty paragraph elements, so that I can reduce the number of (%%) in the output.
I'm not sure why you'd want to remove empty paragraphs at all. I understand removing empty paragraph attributes though. An empty paragraph has a meaning, it means 2 empty lines. If you remove them won't you lose those empty lines?
> Now why do I look for <img> tags and not others?
> Well, as you can see in the code, there is a check to see if the <p> element has any text content in its subtree. If there is no such text content, we can guess that this is one of those empty <p> tags used by the OpenOffice server for spacing.
Why? For example:
<p><b>hello</b></p>
The text for P is empty.
> But there are situations where <p> tags are used to enclose non-textual content like images. So we need to pay special attention to those cases.
Why images? Why not the other 20 or so HTML elements that can go in a paragraph? (that was my initial question).
> I hope I made myself clear. It's true that it might be too early to think about a moderate filtering function, but I couldn't resist. The thing is, the wiki content produced under strict filtering looks very nice, but not its output; it simply doesn't look like the original doc!
>> Again I'm not sure this code is in the right place.
> Well, this might be true. I didn't know if the HTML cleaner can be used to strip off empty <p> tags or such. Is that possible?
You are the one who should tell me! :)
BTW I didn't say HTML cleaner. I said HTML cleaner or XHTML parser.
I'm not working on the office importer so I can't answer but you should be able to answer.
>> For example what will happen if the user provides an empty P as input for a wiki page (using XHTML syntax)? Your code will never run in this case.
> I don't understand what you mean by this. Are you referring to a situation where the original office document contains HTML content entered by the user? Please explain. Because the only input against which this code executes is the HTML output generated from office documents, not user-entered stuff.
Users can enter content into pages using different syntaxes. One of them is HTML/XHTML. So as a user, I can enter:
<p></p>
We can't prevent users from entering this. So we need to handle it and ensure it produces what it should produce. As you can see in my use case there's no office importer code executed but we still need to handle empty paragraphs.
All this to say I don't understand why you're replacing empty paragraphs by BRs. I really don't think it's the role of the office importer to do this, unless I haven't understood what you're trying to do.
Thanks
-Vincent
>> To be honest I haven't looked at the office importer code for some time but we really need to be careful to move all generic code outside of it and into the HTML cleaner or into the XHTML parser.
> I agree.
> Please let me know what you think about my above comments.
> Thanks.
> - Asiri
Hi Vincent,
> Are you sure? Then there's a bug. It should generate 2 empty lines.
>> But when <p align="justify"></p> is converted into xwiki/2.0 syntax, the result is (% align="justify"%)
> Again there should be 2 new lines below the (%...%) text.
Yes, both of these are true; the 2 empty lines are there.
>> To implement a moderate filtering mechanism, I need to allow the align attribute of the <p> tag, because that's required to make the content appear more pleasing to the user. But when I do this, since the OpenOffice server uses a lot of <p align="justify"></p> (a.k.a. empty paragraph elements) for spacing, the resulting wiki content contains a lot of isolated (%align="justify"%) elements, which don't look nice.
> OK, so you want to remove attributes for empty paragraphs? Then the method above is badly named and the javadoc is not correct, since it says it removes empty paragraphs.
>> This is the sole reason why I'm trying to rip off empty paragraph elements, so that I can reduce the number of (%%) in the output.
> I'm not sure why you'd want to remove empty paragraphs at all. I understand removing empty paragraph attributes though. An empty paragraph has a meaning, it means 2 empty lines. If you remove them won't you lose those empty lines?
You are correct. I thought of the empty paragraphs as garbage and wanted to get rid of them. The correct approach indeed is to remove the attributes from those empty paragraphs (so that we don't lose the empty lines).
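A minimal sketch of that revised approach, assuming the plain org.w3c.dom API already used in the commit. The class name, the parse helper and stripAttributesFromEmptyParagraphs are hypothetical names, not code from the office importer, and treating a paragraph as empty only when it has neither text nor child elements is an assumption rather than a criterion settled in this thread:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class EmptyParagraphAttributeFilter
{
    /** Hypothetical helper: parses a well-formed XML/XHTML fragment into a DOM document. */
    static Document parse(String xml)
    {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Not well-formed XML: " + xml, e);
        }
    }

    /**
     * Removes all attributes from paragraphs that have no text and no child elements.
     * The paragraphs themselves are kept, so the spacing they stand for is not lost.
     */
    static void stripAttributesFromEmptyParagraphs(Document document)
    {
        NodeList paragraphs = document.getElementsByTagName("p");
        for (int i = 0; i < paragraphs.getLength(); i++) {
            Element paragraph = (Element) paragraphs.item(i);
            boolean blankText = paragraph.getTextContent().trim().isEmpty();
            // Assumption: any child element (img, br, ...) makes the paragraph non-empty.
            boolean noChildElements = paragraph.getElementsByTagName("*").getLength() == 0;
            if (blankText && noChildElements) {
                NamedNodeMap attributes = paragraph.getAttributes();
                // Iterate backwards because the attribute map is live and shrinks on removal.
                for (int j = attributes.getLength() - 1; j >= 0; j--) {
                    paragraph.removeAttributeNode((Attr) attributes.item(j));
                }
            }
        }
    }

    public static void main(String[] args)
    {
        Document document = parse(
            "<body><p align=\"justify\"></p><p align=\"center\">text</p></body>");
        stripAttributesFromEmptyParagraphs(document);
        NodeList paragraphs = document.getElementsByTagName("p");
        for (int i = 0; i < paragraphs.getLength(); i++) {
            System.out.println("p[" + i + "] attributes: "
                + paragraphs.item(i).getAttributes().getLength());
        }
    }
}
```

Because nothing is removed from the tree, no index adjustment (the i-- of the original loop) is needed here.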
>> Now why do I look for <img> tags and not others?
>> Well, as you can see in the code, there is a check to see if the <p> element has any text content in its subtree. If there is no such text content, we can guess that this is one of those empty <p> tags used by the OpenOffice server for spacing.
> Why? For example:
> <p><b>hello</b></p>
> The text for P is empty.
Not really: when you call Element.getTextContent(), this particular example will return "hello". Please refer to the javadoc: <http://java.sun.com/j2se/1.5.0/docs/api/org/w3c/dom/Node.html#setTextContent%28java.lang.String%29>
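This is easy to check with a small standalone program. It is only a sketch: the class and the parseRoot helper are made up for the demonstration, and the fragment is parsed as plain XML rather than going through the importer.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

final class TextContentDemo
{
    /** Hypothetical helper: parses a well-formed XML fragment and returns its root element. */
    static Element parseRoot(String xml)
    {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml))).getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Not well-formed XML: " + xml, e);
        }
    }

    public static void main(String[] args)
    {
        Element paragraph = parseRoot("<p><b>hello</b></p>");
        // getTextContent() concatenates the text of every descendant node,
        // so the text inside <b> counts as text of the enclosing <p>.
        System.out.println(paragraph.getTextContent()); // prints hello
    }
}
```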
>> But there are situations where <p> tags are used to enclose non-textual content like images. So we need to pay special attention to those cases.
> Why images? Why not the other 20 or so HTML elements that can go in a paragraph? (that was my initial question).
I think my previous answer addresses this question also. Even if there are so many HTML elements, we don't have to worry about all of those as long as Element.getTextContent() returns "". WDYT?
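For reference, a quick sketch of what getTextContent() reports for a few candidate paragraphs. hasBlankText and hasChildElements are hypothetical helpers written for this illustration, not importer code, and the sample fragments are invented:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

final class EmptyParagraphCheck
{
    /** Hypothetical helper: parses a well-formed XML fragment and returns its root element. */
    static Element parseRoot(String xml)
    {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml))).getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Not well-formed XML: " + xml, e);
        }
    }

    /** True when the subtree contains no text other than whitespace. */
    static boolean hasBlankText(Element element)
    {
        return element.getTextContent().trim().isEmpty();
    }

    /** True when the element has at least one descendant element ("*" matches every tag). */
    static boolean hasChildElements(Element element)
    {
        return element.getElementsByTagName("*").getLength() > 0;
    }

    public static void main(String[] args)
    {
        String[] samples = {
            "<p align=\"justify\"></p>",
            "<p><img src=\"a.png\"/></p>",
            "<p><br/></p>",
            "<p><b>hello</b></p>"
        };
        for (String sample : samples) {
            Element paragraph = parseRoot(sample);
            System.out.println(sample + " blankText=" + hasBlankText(paragraph)
                + " childElements=" + hasChildElements(paragraph));
        }
    }
}
```

The first three samples all report blank text, but only the first has no child elements. The text check therefore settles the <b>hello</b> case, while paragraphs holding only elements such as <img> or <br/> still need a separate decision.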
>> I hope I made myself clear. It's true that it might be too early to think about a moderate filtering function, but I couldn't resist. The thing is, the wiki content produced under strict filtering looks very nice, but not its output; it simply doesn't look like the original doc!
>>> Again I'm not sure this code is in the right place.
>> Well, this might be true. I didn't know if the HTML cleaner can be used to strip off empty <p> tags or such. Is that possible?
> You are the one who should tell me! :)
> BTW I didn't say HTML cleaner. I said HTML cleaner or XHTML parser.
> I'm not working on the office importer so I can't answer but you should be able to answer.
Well, the use of empty <p> tags for spacing is specific to the OpenOffice server. So yes, I think the code belongs in the office-importer.
>>> For example what will happen if the user provides an empty P as input for a wiki page (using XHTML syntax)? Your code will never run in this case.
>> I don't understand what you mean by this. Are you referring to a situation where the original office document contains HTML content entered by the user? Please explain. Because the only input against which this code executes is the HTML output generated from office documents, not user-entered stuff.
> Users can enter content into pages using different syntaxes. One of them is HTML/XHTML. So as a user, I can enter:
> <p></p>
> We can't prevent users from entering this. So we need to handle it and ensure it produces what it should produce. As you can see in my use case there's no office importer code executed but we still need to handle empty paragraphs.
I agree. Stripping empty <p> elements is not a good solution. I will change the code so that all attributes of empty paragraphs are removed.
Thanks a lot for pointing this out. :)
- Asiri
participants (2): Asiri Rathnayake, Vincent Massol