Hi Asiri,
Are you sure this should go in the office importer and not in the HTML
cleaner or in the XHTML parser?
Thanks
-Vincent
On Nov 24, 2008, at 6:08 PM, asiri (SVN) wrote:
Author: asiri
Date: 2008-11-24 18:08:51 +0100 (Mon, 24 Nov 2008)
New Revision: 14425
Modified:
sandbox/xwiki-plugin-officeimporter/src/main/java/com/xpn/xwiki/plugin/officeimporter/filter/RedundantTagFilter.java
Log:
XAOFFICE-1 : Develop the initial feature set for office-importer plugin.
* Added support for filtering empty / redundant paragraphs.
Modified: sandbox/xwiki-plugin-officeimporter/src/main/java/com/xpn/xwiki/plugin/officeimporter/filter/RedundantTagFilter.java
===================================================================
--- sandbox/xwiki-plugin-officeimporter/src/main/java/com/xpn/xwiki/plugin/officeimporter/filter/RedundantTagFilter.java 2008-11-24 15:17:17 UTC (rev 14424)
+++ sandbox/xwiki-plugin-officeimporter/src/main/java/com/xpn/xwiki/plugin/officeimporter/filter/RedundantTagFilter.java 2008-11-24 17:08:51 UTC (rev 14425)
@@ -31,12 +31,13 @@
public void filter(Document document, ImporterContext context)
{
- for(String key : attributeWiseFilteredTags) {
+ for (String key : attributeWiseFilteredTags) {
filterNodesWithZeroAttributes(document.getElementsByTagName(key));
}
- for(String key : contentWiseFilteredTags) {
+ for (String key : contentWiseFilteredTags) {
filterNodesWithEmptyTextContent(document.getElementsByTagName(key));
- }
+ }
+ filterEmptyParagraphs(document);
}
/**
@@ -70,10 +71,35 @@
{
for (int i = 0; i < elements.getLength(); i++) {
Element element = (Element) elements.item(i);
- if (element.getTextContent().trim().equals("")) {
+ if (element.getTextContent().trim().equals("")) {
element.getParentNode().removeChild(element);
i--;
}
}
}
+
+ /**
+ * OpenOffice server generates redundant paragraphs (with empty content) to achieve spacing.
+ * These paragraphs should be stripped off / replaced with {@code <br/>} elements appropriately
+ * because otherwise they result in spurious {@code (%%)} elements in generated xwiki content.
+ *
+ * @param document The html document.
+ */
+ private void filterEmptyParagraphs(Document document)
+ {
+ NodeList paragraphs = document.getElementsByTagName("p");
+ for (int i = 0; i < paragraphs.getLength(); i++) {
+ Element paragraph = (Element) paragraphs.item(i);
+ if (paragraph.getTextContent().trim().equals("")) {
+ // We suspect this is an empty paragraph but it is possible that it contains other
+ // non-textual tags like images. For the moment we'll only search for internal image
+ // tags, we might have to refine this criterion later.
+ NodeList internalImages = paragraph.getElementsByTagName("img");
+ if (internalImages.getLength() == 0) {
+ paragraph.getParentNode().removeChild(paragraph);
+ i--;
+ }
+ }
+ }
+ }
}
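
For anyone who wants to try the empty-paragraph check outside the plugin, here is a minimal standalone sketch of the same idea. It is not the committed code: the class name EmptyParagraphFilterSketch and the parse helper are hypothetical, it uses only the JDK's DOM parser, and it assumes the input is already well-formed XHTML. It walks the live NodeList backwards instead of decrementing the index after each removal, which should give the same result as the loop in the diff.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical standalone sketch; not part of the office importer plugin.
final class EmptyParagraphFilterSketch
{
    // Illustrative helper: parses a well-formed XHTML string into a DOM document.
    static Document parse(String xhtml)
    {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
    }

    // Removes <p> elements that have no text and no <img> descendants.
    // Returns the number of paragraphs removed.
    static int filterEmptyParagraphs(Document document)
    {
        // getElementsByTagName returns a live list, so iterate backwards:
        // removing item i only shifts items that were already visited.
        NodeList paragraphs = document.getElementsByTagName("p");
        int removed = 0;
        for (int i = paragraphs.getLength() - 1; i >= 0; i--) {
            Element paragraph = (Element) paragraphs.item(i);
            boolean noText = paragraph.getTextContent().trim().isEmpty();
            boolean noImages = paragraph.getElementsByTagName("img").getLength() == 0;
            if (noText && noImages) {
                paragraph.getParentNode().removeChild(paragraph);
                removed++;
            }
        }
        return removed;
    }

    public static void main(String[] args)
    {
        Document doc = parse("<html><body><p> </p><p>text</p>"
            + "<p><img src=\"a.png\"/></p><p></p></body></html>");
        int removed = filterEmptyParagraphs(doc);
        int kept = doc.getElementsByTagName("p").getLength();
        System.out.println(removed + " removed, " + kept + " kept"); // prints "2 removed, 2 kept"
    }
}
```

Note that, like the code in the diff, this only treats img as non-textual content; a paragraph holding only some other element with no text would still be removed.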