Re: [xwiki-devs] [xwiki-notifications] r16999 - platform/core/trunk/xwiki-officeimporter/src/test/java/org/xwiki/officeimporter/internal/cleaner - xwiki-devs@xwiki.org


Vincent Massol

23 Feb 2009

1:45 p.m.

Hi Asiri, On Feb 23, 2009, at 1:37 PM, asiri (SVN) wrote:

...

Author: asiri
Date: 2009-02-23 13:37:50 +0100 (Mon, 23 Feb 2009)
New Revision: 16999

Modified:
platform/core/trunk/xwiki-officeimporter/src/test/java/org/xwiki/officeimporter/internal/cleaner/OpenOfficeHTMLCleanerTest.java

Log:
XWIKI-3259: Table headers are not handled properly
* Added a unit test.

[snip]

...

+    /**
+     * Test proper cleaning of {@code <th>} elements.
+     */
+    public void testTableHeaderItemCleaning()
+    {
+        // Isolated paragraph elements inside 'th' elements should be removed.
+        String html =
+            header + "<table><thead><tr><th><p>Test</p></th></tr></thead><tbody><tr><td/></tr></tbody></table>"
+                + footer;
+        Document doc = cleaner.clean(new StringReader(html));
+        NodeList nodes = doc.getElementsByTagName("th");
+        Node hearderItemContent = nodes.item(0).getFirstChild();
+        assertEquals(Node.TEXT_NODE, hearderItemContent.getNodeType());
+        assertEquals("Test", hearderItemContent.getNodeValue());

Why is this only for th and not for td cells too?

Is this specific to the office importer? It looks very generic to me.

Why do paragraphs need to be removed?

What if there are 2 paragraph elements? What happens? Do you have a test for that too?

Thanks
-Vincent


Asiri Rathnayake

23 Feb

3:23 p.m.


Hi Vincent,

Currently the officeimporter test methods are a bit messy and I'm going to refactor / restructure them asap.

> Why is this only for th and not for td cells too?

It's tested inside the test method testTableFiltering(), which tests several scenarios. This will be corrected with the new restructuring.

> Is this specific to the office importer? It looks very generic to me, isn't it?
> Why do paragraphs need to be removed?

Ok, this is something I'm a bit confused about. Consider the following xhtml input:

<code>
<table><thead><tr><th><p>Test</p></th></tr></thead><tbody><tr><td/></tr></tbody></table>
</code>

When this xhtml fragment is parsed by the xhtml parser and rendered into xwiki 2.0 syntax, it generates the xwiki 2.0 code below:

<code>
|= Test |
</code>

which obviously results in a fragmented table when rendered.

Now that you asked about it, I might have been working around a possible bug in rendering. But these are what I saw as solutions:

1. Wrap the paragraph inside <div class="xwiki-document">. This results in enlarged table header elements.

2. Remove the paragraph if it's an isolated one (only one paragraph inside the 'th' element). If there is more than one paragraph, or there are other elements (like lists), wrap the content of the 'th' element inside a <div class="xwiki-document">.

I've been using the second approach because it yielded the best results so far... Now, have I been working around a bug which should be fixed in rendering? :)

> What if there are 2 paragraphs elements? what happens? Do you have a test for that too?

Yes, a test case is missing for that scenario and for several others. I will add more test cases when the test classes are refactored (very soon, within this week).

Thanks.
- Asiri
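The second approach described above (unwrap an isolated paragraph, otherwise wrap the cell content in a div) can be sketched roughly as follows. This is only an illustration built on the JDK's DOM API, not the actual officeimporter cleaner: the class name `IsolatedParagraphSketch`, its methods, and the choice of `p`, `ul` and `ol` as the "block" elements are all assumptions made for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not the real officeimporter code.
final class IsolatedParagraphSketch {

    /** Parses a well-formed XHTML fragment and cleans every th/td cell in it. */
    static Document clean(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
            for (String tag : new String[] {"th", "td"}) {
                NodeList cells = doc.getElementsByTagName(tag);
                for (int i = 0; i < cells.getLength(); i++) {
                    cleanCell((Element) cells.item(i));
                }
            }
            return doc;
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XHTML", e);
        }
    }

    /** Unwraps an isolated paragraph; wraps any other block content in a div. */
    static void cleanCell(Element cell) {
        List<Element> elements = new ArrayList<>();
        boolean hasText = false;
        boolean hasBlock = false;
        for (Node n = cell.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                Element e = (Element) n;
                elements.add(e);
                String name = e.getTagName();
                // Assumption: only these three tags count as block elements here.
                hasBlock |= name.equals("p") || name.equals("ul") || name.equals("ol");
            } else if (n.getNodeType() == Node.TEXT_NODE && !n.getNodeValue().trim().isEmpty()) {
                hasText = true;
            }
        }
        if (!hasBlock) {
            return; // plain inline content, nothing to do
        }
        boolean isolatedParagraph =
            elements.size() == 1 && !hasText && "p".equals(elements.get(0).getTagName());
        if (isolatedParagraph) {
            // Move the paragraph's children up into the cell, then drop the <p>.
            Element p = elements.get(0);
            while (p.getFirstChild() != null) {
                cell.insertBefore(p.getFirstChild(), p);
            }
            cell.removeChild(p);
        } else {
            // Several blocks or mixed content: wrap everything in one div.
            Element div = cell.getOwnerDocument().createElement("div");
            div.setAttribute("class", "xwiki-document");
            while (cell.getFirstChild() != null) {
                div.appendChild(cell.getFirstChild());
            }
            cell.appendChild(div);
        }
    }
}
```

With the fragment from the unit test, the first child of the `th` element becomes a text node, which is what the committed test asserts. With two paragraphs in the same `th`, the first child is the wrapping div instead, which is the scenario still lacking a test.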

Vincent Massol

3:33 p.m.


On Feb 23, 2009, at 3:23 PM, Asiri Rathnayake wrote:

...

isn't it? Why do paragraphs need to be removed?

This is a bug in the XHTML parser. It should generate an embedded document. This is true for any block element inside a table cell. However in order to get simpler xwiki syntax we could modify the XWiki Syntax Renderer to remove the embedded doc in case it contains only a paragraph.

...

Now that you asked about it, I might have been working myself around a possible bug in rendering. But these are what I saw as solutions: 1. Wrap the paragraph inside <div class="xwiki-document"> : This results in enlarged table header elements.

why?

...

2. Remove the paragraph if it's an isolated one (only one paragraph inside the 'th' element) if there are more than one paragraph or other elements (like lists), then wrap the content within the 'th' element inside a <div class="xwiki-document"> I've been using the second approach because it yielded the best results so far... Now, have i been working around a bug which should be fixed in rendering? :)

I think so. In addition you haven't fixed the problem in the general case, for example if someone chooses HTML 4.01 syntax in wiki pages. Even if the problem were not in the parser/renderer, you should still have put it in the default HTML cleaner and not in the office cleaner IMO, since I don't see the relationship with office import.

Can you open a jira issue?

Thanks
-Vincent

...

What if there are 2 paragraphs elements? what happens? Do you have a

test for that too?

Yes, it's missing a test case for that scenario and several other scenarios, I will add more tests cases when the test classes are refactored (very soon, within this week). Thanks. - Asiri

Asiri Rathnayake

24 Feb

7:26 a.m.


Hi Vincent,

> This is a bug in the XHTML parser. It should generate an embedded document. This is true for any block element inside a table cell. However in order to get simpler xwiki syntax we could modify the XWiki Syntax Renderer to remove the embedded doc in case it contains only a paragraph.

I will raise a JIRA issue for this.

...

why?

I'm talking with respect to the original Word document. This is a problem with the OO server's HTML generation: because it generates a paragraph inside each table cell / table header item, the generated HTML looks enlarged when rendered in a browser. Also, since we strip the <style> tags, the content gets even more enlarged. To work around this problem I chose to strip any isolated paragraph elements found inside table cells / table header items.

...

2. Remove the paragraph if it's an isolated one (only one paragraph

inside the 'th' element) if there are more than one paragraph or other elements (like lists), then wrap the content within the 'th' element inside a <div class="xwiki-document"> I've been using the second approach because it yielded the best results so far... Now, have i been working around a bug which should be fixed in rendering? :)

I think so. In addition you haven't fixed the problem in the general case. For example if someone chooses HTML 4.01 syntax in wiki pages.

...

Even if the problem was not in the parser/renderer you should still have moved it in the default HTML cleaner and not in the office cleaner IMO since I don't see the relationship with office import.

I don't think this is correct. If the user chooses HTML 4.01 syntax, he knows what he is doing and he expects table cells / table header items to appear large if he puts a <p> inside a <td> or <th> item. But the story is different for OO generated HTML, which puts a paragraph element where there shouldn't be one. That is why I believed that this particular issue belongs to the officeimporter module and not the HTML cleaner module.

Thanks.
- Asiri

Vincent Massol

9:57 a.m.


Hi Asiri, On Feb 24, 2009, at 7:26 AM, Asiri Rathnayake wrote:

...

Hi Vincent, This is a bug in the XHTML parser. It should generate an embedded

I will raise a JIRA issue for this.

why?

I was asking why having <div class="xwiki-document"> didn't work nicely since this is the correct behavior. We should get: <td><div class="xwiki-document"><p>whatever</p></div></td> I don't understand why this would not be represented the same as in OO.

...

To work around this problem I chose to strip any isolated paragraph elements found inside table cells / table header items.

2. Remove the paragraph if it's an isolated one (only one paragraph

I think so. In addition you haven't fixed the problem in the general case. For example if someone chooses HTML 4.01 syntax in wiki pages.

Even if the problem was not in the parser/renderer you should still have moved it in the default HTML cleaner and not in the office cleaner IMO since I don't see the relationship with office import.

This is not about large or not large (l&f is handled by the CSS only); we need to normalize the HTML in exactly the same manner.

...

But the story is different for OO generated html which puts a paragraph element when there shouldn't be one.

I don't agree since it's very valid to have <p> inside cells and not a OO problem.

...

That is why I believed that this particular issue belongs to officeimporter module and not html cleaner module.

I still think the HTML parser should generate the following events: beginCell, beginDocument, beginPara, onWord, endPara, endDocument, endCell.

I also still think that, as an optimization, the Wiki Syntax Renderer should remove the embedded doc in case there's a single para in the embedded doc.

Thanks
-Vincent
http://xwiki.com http://xwiki.org http://massol.net
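As a rough illustration of the event sequence listed above, the sketch below walks the first table cell of a fragment and records event names as strings. It is not the XWiki rendering API: the class `CellEventSketch`, the method `eventsForCell`, and the string-based events are hypothetical, and the sketch ignores details such as spaces between words or inline markup.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical parser-side sketch; event names follow the list in the message.
final class CellEventSketch {

    /** Returns the events emitted for the first td (or th) in the fragment. */
    static List<String> eventsForCell(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
            NodeList cells = doc.getElementsByTagName("td");
            if (cells.getLength() == 0) {
                cells = doc.getElementsByTagName("th");
            }
            if (cells.getLength() == 0) {
                throw new IllegalArgumentException("No table cell found");
            }
            Node cell = cells.item(0);
            List<String> events = new ArrayList<>();
            events.add("beginCell");
            // Block content inside a cell is wrapped in an embedded document.
            boolean embedded = hasParagraphChild(cell);
            if (embedded) {
                events.add("beginDocument");
            }
            emitChildren(cell, events);
            if (embedded) {
                events.add("endDocument");
            }
            events.add("endCell");
            return events;
        } catch (IllegalArgumentException e) {
            throw e;
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XHTML", e);
        }
    }

    private static boolean hasParagraphChild(Node cell) {
        for (Node n = cell.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE && "p".equals(n.getNodeName())) {
                return true;
            }
        }
        return false;
    }

    private static void emitChildren(Node parent, List<String> events) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.TEXT_NODE) {
                for (String word : n.getNodeValue().trim().split("\\s+")) {
                    if (!word.isEmpty()) {
                        events.add("onWord(" + word + ")");
                    }
                }
            } else if (n.getNodeType() == Node.ELEMENT_NODE && "p".equals(n.getNodeName())) {
                events.add("beginPara");
                emitChildren(n, events);
                events.add("endPara");
            }
        }
    }
}
```

For a cell that contains a single paragraph, the recorded list has the same shape as the sequence in the message. For a cell with plain text, no embedded document is opened at all.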

Asiri Rathnayake

10:20 a.m.


Hi Vincent,

...

But the story

is different for OO generated html which puts a paragraph element when there shouldn't be one.

I don't agree since it's very valid to have <p> inside cells and not a OO problem.

It's very valid to have <p> elements inside table cells. But my point is this: the original Word document, when viewed through _oo writer_, displays content within table cells with a particular size. But when saved as HTML and viewed in a browser, the same table cell becomes enlarged. This is because there is a paragraph element inside each table cell element generated by the oo html generator.

Now, since we wanted officeimporter to generate wiki content that would ultimately render an output which looks close to the original document, I decided to strip the paragraph element (to make it look smaller and close to the sizing of the original document rendered in oo writer).

But if it's only a matter of convention (wiki is wiki, office is office) and the paragraph should be left alone, I can make that change easily. WDYT?

Sorry about any confusion :)

Thanks.
- Asiri

Asiri Rathnayake

10:59 a.m.


Hi Again, On Tue, Feb 24, 2009 at 2:50 PM, Asiri Rathnayake < asiri.rathnayake(a)gmail.com> wrote:

...

Hi Vincent,

But the story

is different for OO generated html which puts a paragraph element when there shouldn't be one.

I don't agree since it's very valid to have <p> inside cells and not a OO problem.

Ok, I think I was mistaken somewhere. I just tested creating a few tables and saving them as HTML, and even though there is a <p> element inside the table cells, they all render correctly in the browser. I don't know how it appeared to me that <p> inside table cells renders differently. I need to investigate a bit further. Sorry for all the trouble :(

Thanks.
- Asiri

...

Sorry about any confusions :) Thanks. - Asiri

Sergiu Dumitriu

4:48 p.m.


Asiri Rathnayake wrote:

...

Hi Vincent,

But the story

is different for OO generated html which puts a paragraph element when there shouldn't be one.

I don't agree since it's very valid to have <p> inside cells and not a OO problem.

I for one prefer removing the paragraph. For me, this is clearly an OO shortcoming. Vincent, the idea is not about paragraphs inside table cells in general, but about this particular paragraph that obviously shouldn't be there. The HTML generated by OO is just an intermediary, we're not interested in keeping it as much as possible in the wiki, we just want to extract the data from it and convert it to wiki syntax. The Office importer transforms office documents to wiki documents, and not HTML to wiki. OO wrongly puts paragraphs in there, and the fact that the same HTML looks much different in a browser than the document looks in OO is a good enough argument, IMO. -- Sergiu Dumitriu http://purl.org/net/sergiu/

Vincent Massol

4:59 p.m.


On Feb 24, 2009, at 4:48 PM, Sergiu Dumitriu wrote:

...

Asiri Rathnayake wrote:

Hi Vincent,

But the story

is different for OO generated html which puts a paragraph element when there shouldn't be one.

I don't agree since it's very valid to have <p> inside cells and not a OO problem.

This is generic and not specific to OO. HTML allows putting one or several paragraphs in table cells, list items, etc., so we need to handle those independently of OO. If we handle it at the rendering module level then it fixes both OO and direct HTML input.

Thanks
-Vincent
http://xwiki.com http://xwiki.org http://massol.net

Sergiu Dumitriu

7:24 p.m.


Vincent Massol wrote:

...

On Feb 24, 2009, at 4:48 PM, Sergiu Dumitriu wrote:

Asiri Rathnayake wrote:

Hi Vincent,

>> But the story is different for OO generated html which puts a paragraph element when there shouldn't be one.

> I don't agree since it's very valid to have <p> inside cells and not a OO problem.

No. We should not strip all the paragraphs that are found inside table cells. Maybe the user wants those there. But we know for sure that the _intermediary_ HTML generated by OO contains Ps where it shouldn't. It is specific. In general we should respect the markup, but in this specific case it is just a workaround for a third party bug. HTML generated by office suites is messy in general. I for one really hate the bulky sh1t that MS Word names HTML.

--
Sergiu Dumitriu
http://purl.org/net/sergiu/

Vincent Massol

7:51 p.m.


On Feb 24, 2009, at 7:24 PM, Sergiu Dumitriu wrote:

...

Vincent Massol wrote:

On Feb 24, 2009, at 4:48 PM, Sergiu Dumitriu wrote:

Asiri Rathnayake wrote:

Hi Vincent,

>> But the story is different for OO generated html which puts a paragraph element when there shouldn't be one.

> I don't agree since it's very valid to have <p> inside cells and not a OO problem.

It's very valid to have <p> elements inside table cells. But my point is this: the original Word document, when viewed through _oo writer_, displays content within table cells with a particular size. But when saved as HTML and viewed in a browser, the same table cell becomes enlarged. This is because there is a paragraph element inside each table cell element generated by the oo html generator.

Now, since we wanted officeimporter to generate wiki content that would ultimately render an output which looks close to the original document, I decided to strip the paragraph element (to make it look smaller and close to the sizing of the original document rendered in oo writer).

But if it's only a matter of convention (wiki is wiki, office is office) and the paragraph should be left alone, I can make that change easily. WDYT?

No. We should not strip all the paragraphs that are found inside table cells.

I've never said this! What I told Asiri is that the XHTML parser should generate the following events: beginCell + beginDocument + beginPara + onWord(sometext) + endPara + endDocument + endCell.

...

Maybe the user wants those there.

I don't agree. We're making transformations and we're not leaving the user content untouched. For example if the user enters "**hello" it'll get converted to "**hello**". There are several cases where we're transforming what the user enters.

Here I'm proposing that the XWiki Syntax Renderer transforms the events above into:

| sometext

instead of:

| (((sometext)))
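The renderer-side optimization proposed here could look roughly like the sketch below: the embedded document is dropped when it holds a single paragraph, and kept as a `(((...)))` group otherwise. This is only a sketch under assumptions. The class `CellRendererSketch`, the method `renderCell`, and the use of plain strings as events are hypothetical and do not reflect the real XWiki Syntax Renderer.

```java
import java.util.List;

// Hypothetical renderer-side sketch; events are plain strings for brevity.
final class CellRendererSketch {

    /** Renders one table cell from its event list into XWiki 2.0 style syntax. */
    static String renderCell(List<String> events) {
        boolean embedded = false;
        boolean wordBefore = false;
        int paragraphs = 0;
        StringBuilder content = new StringBuilder();

        for (String event : events) {
            if ("beginDocument".equals(event)) {
                embedded = true;
            } else if ("beginPara".equals(event)) {
                // Separate consecutive paragraphs with a blank line.
                if (paragraphs++ > 0) {
                    content.append("\n\n");
                }
                wordBefore = false;
            } else if (event.startsWith("onWord(") && event.endsWith(")")) {
                if (wordBefore) {
                    content.append(' ');
                }
                // Strip the "onWord(" prefix (7 chars) and the trailing ")".
                content.append(event, 7, event.length() - 1);
                wordBefore = true;
            }
        }

        // The optimization: a lone paragraph does not need the group markers.
        boolean collapse = !embedded || paragraphs <= 1;
        return collapse ? "| " + content : "| (((" + content + ")))";
    }
}
```

Feeding it the seven events from the message yields the short form, while an embedded document with two paragraphs keeps the group markers so that the paragraph break survives inside the cell.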

...

But we know for sure that the _intermediary_ HTML generated by OO contains Ps where it shouldn't. It is specific. In general we should respect the markup, but in this specific case it is just a workaround for a third party bug. HTMLs generated by office suites is messy in general. I for one really hate the bulky sh1t that MS Word names HTML.

I still don't agree. See above. Thanks -Vincent


participants (3)

Asiri Rathnayake
Sergiu Dumitriu
Vincent Massol