[xwiki-devs] [OFFICEIMPORTER] Style Filtering Policy & Mechanism.

Asiri Rathnayake

19 Nov 2008

10:56 a.m.

Hi Devs,

I'm working on implementing the style filtering functionality of the xwiki-office-importer application. But first, I need to make sure that I'm clear about the policy and the correct approach towards filtering style information from imported office documents. I would really appreciate your input on this because I'm not an expert on either HTML or CSS :)

OK, I plan to do two types of filtering. One is filtering various attributes of various elements (like removing the bgcolor attribute from the <body> element). The second is filtering CSS-related stuff. Let's take them one by one.

1. Filtering attributes.

This is quite straightforward, but I see two possible approaches.

* The first approach is to keep a single list of attributes that we allow when importing documents. We'll scan every tag and strip off any attribute that is not on the list.

* The second approach is to associate each tag with the attributes we allow for that tag. A list of legal attributes for common tags is presented here: http://www.devx.com/projectcool/Article/19816. Similarly, we'll have our own tag_name->allowed_attributes mapping and filter out all other attributes present.

I'm currently leaning towards the second option, WDYT?

2. Filtering CSS styles.

OK, there are three ways one can associate CSS with HTML content. Let's take them one by one.

(i) External Style Sheet

Well, AFAIK the OpenOffice server does not produce this type of output when converting office documents into HTML. I mean it doesn't output HTML files that refer to external CSS files. So I guess this is something we don't need to worry about.

(ii) Internal Style Sheet

This is something we need to worry about. The OpenOffice server produces HTML pages with content like <head><style type="text/css">....</style></head>.

Currently we strip off <style> tags completely regardless of the filtering mode (i.e. <style> tags get removed whether or not styles are set to be filtered). Does this behaviour need to change?

(iii) In-line Styles

This is the most common type of styling found (example: <p style="....">). The present behaviour is to strip off this style attribute completely (if filterStyles is set to true). The question is, there are some inline styles that map directly to xwiki 2.0 syntax, like <p style="font-weight:bold">; what are we going to do about these?

In any case, I will have to parse the in-line style attribute string to filter out those style directives that are not necessary. The complete grammar for in-line style attributes seems a bit too complicated to hand-craft a parser for (http://www.w3.org/TR/css-style-attr), although in OpenOffice-converted documents I have only seen the "key:value;key:value" format. What would be the correct approach to parse the style attribute string?

Thank you very much for your ideas. :)
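To make the second attribute-filtering approach concrete, here is a minimal sketch of a tag_name->allowed_attributes mapping. The class name `AttributeWhitelist`, the method `filter`, and the listed tags and attributes are all hypothetical and chosen only for illustration; they are not taken from the importer's code. The mapping is hard-coded only to keep the sketch short.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of a per-tag attribute whitelist (not the importer's actual code). */
class AttributeWhitelist {
    // Illustrative entries only; the real mapping would be a project decision.
    private static final Map<String, Set<String>> ALLOWED = Map.of(
        "a", Set.of("href", "name"),
        "img", Set.of("src", "alt"),
        "td", Set.of("colspan", "rowspan"));

    /** Returns a copy of the attributes, keeping only those allowed for the given tag. */
    static Map<String, String> filter(String tagName, Map<String, String> attributes) {
        // Tags without an entry get an empty whitelist, so all their attributes are dropped.
        Set<String> allowed = ALLOWED.getOrDefault(tagName.toLowerCase(Locale.ROOT), Set.of());
        Map<String, String> kept = new LinkedHashMap<>();
        for (Map.Entry<String, String> attribute : attributes.entrySet()) {
            if (allowed.contains(attribute.getKey().toLowerCase(Locale.ROOT))) {
                kept.put(attribute.getKey(), attribute.getValue());
            }
        }
        return kept;
    }
}
```

With this sketch, an `img` element carrying `src`, `alt` and `align` would keep only `src` and `alt`, while a `body` element would lose its `bgcolor` because `body` has no entry in the map.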


Guillaume Lerouge

19 Nov

11:20 a.m.

Hi Asiri,

I can't help you much from the technical perspective. Regarding styles that can be directly mapped to XWiki 2.0 syntax, I think they should be converted to use that syntax. To summarize my opinion:

- When strict filtering is activated (conversion to XWiki 2.0 syntax)
  - Only style attributes that can be directly mapped to a wiki syntax element should be kept
  - This means that NO (% ... %) should appear

Is that fine with everyone?


-- Guillaume Lerouge Product Manager - XWiki Skype ID : wikibc http://blog.xwiki.com/

Asiri Rathnayake

20 Nov

3:47 p.m.

Hi Guillaume,

It just came to my mind: what about alignment (of text, images and the like)? Do you think it should also be stripped when strict filtering is on?

Thanks.

- Asiri

Asiri Rathnayake

4 p.m.


+ same question for image height & width... I'm asking these questions because I think that sort of information should be preserved, otherwise the document will look really strange after import. WDYT?

- Asiri


Asiri Rathnayake

4:11 p.m.


Again, maybe we should introduce another level of style filtering, so that we have 3 levels of filtering:

1. Strict filtering (filter everything, i.e. not a single (%%))
2. Moderate (filter styles as much as possible, but try to preserve those formatting elements which make the document look appealing, like alignment, image height & width etc.)
3. Filter nothing.

This is just an idea.

- Asiri

Vincent Massol

7:07 p.m.


You should have a design that allows plugging in any number of filtering strategies (a FilterStrategy interface) and implement 2 for now:

- StrictFilterStrategy
- NoopFilterStrategy (or some other name)

I think we can implement the moderate one later on, after we start using the tool and get user feedback. At least I think it would be good to have those 2 fully done first.

Thanks
-Vincent
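A pluggable design along these lines could be sketched as follows. Only the three type names (`FilterStrategy`, `StrictFilterStrategy`, `NoopFilterStrategy`) come from the message above; the `filter` method, its signature, and the set of attributes treated as presentational are assumptions made for illustration. In particular, the strict variant below merely drops a few presentational attributes and does not attempt the conversion of mappable styles to wiki syntax discussed earlier in the thread.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Pluggable filtering strategy; the method signature is an assumption. */
interface FilterStrategy {
    /** Returns the attributes that should survive for the given element. */
    Map<String, String> filter(String tagName, Map<String, String> attributes);
}

/** Filters nothing: every attribute is kept. */
class NoopFilterStrategy implements FilterStrategy {
    @Override
    public Map<String, String> filter(String tagName, Map<String, String> attributes) {
        return new LinkedHashMap<>(attributes);
    }
}

/** Stand-in for strict filtering: drops a hypothetical set of presentational attributes. */
class StrictFilterStrategy implements FilterStrategy {
    // Assumed list, for illustration only.
    private static final Set<String> PRESENTATIONAL = Set.of("style", "align", "bgcolor");

    @Override
    public Map<String, String> filter(String tagName, Map<String, String> attributes) {
        Map<String, String> kept = new LinkedHashMap<>();
        attributes.forEach((name, value) -> {
            if (!PRESENTATIONAL.contains(name.toLowerCase(Locale.ROOT))) {
                kept.put(name, value);
            }
        });
        return kept;
    }
}
```

A moderate strategy could later be added as a third implementation of the same interface without touching the code that calls `filter`.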


Guillaume Lerouge

21 Nov

9:54 a.m.

Just for the record, I agree with Vincent on this: I think we should have the strict filtering (wiki syntax only) working very well before worrying about moderate filtering strategies.

Guillaume


Vincent Massol

20 Nov

7:08 p.m.

This is a bit tricky... The real answer (which is probably hard to implement) is:

* Remove height and width for images if they match the original image size
* Otherwise keep them

Thanks
-Vincent

Asiri Rathnayake

21 Nov

4:46 a.m.

Hi Vincent,

Well, I don't think there is an easy way to figure out the actual dimensions of an image without using some image manipulation code. At the very least we'll have to load the image into some kind of structure. But this can be costly in terms of performance, especially when there are lots of images embedded in the document.

Thanks.

- Asiri
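For reference, the standard javax.imageio API can report an image's width and height through an `ImageReader` without calling `read()`, which for common formats typically avoids decoding the pixel data. The following is only a sketch: the class name `ImageSizeProbe` and both methods are hypothetical, and the comparison assumes that the width and height declared in the HTML are expressed in pixels, which would need to be checked against actual OpenOffice output.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.Iterator;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.stream.ImageInputStream;

/** Hypothetical helper that reads image dimensions without fully decoding the image. */
class ImageSizeProbe {
    /** Returns {width, height}, or null if no installed reader recognizes the data. */
    static int[] dimensions(byte[] imageBytes) throws IOException {
        try (ImageInputStream in = ImageIO.createImageInputStream(new ByteArrayInputStream(imageBytes))) {
            if (in == null) {
                return null;
            }
            Iterator<ImageReader> readers = ImageIO.getImageReaders(in);
            if (!readers.hasNext()) {
                return null;
            }
            ImageReader reader = readers.next();
            try {
                // seekForwardOnly = true, ignoreMetadata = true: only the size is needed.
                reader.setInput(in, true, true);
                return new int[] {reader.getWidth(0), reader.getHeight(0)};
            } finally {
                reader.dispose();
            }
        }
    }

    /** True if the declared size (assumed to be in pixels) equals the intrinsic size. */
    static boolean matchesDeclaredSize(byte[] imageBytes, int declaredWidth, int declaredHeight)
            throws IOException {
        int[] actual = dimensions(imageBytes);
        return actual != null && actual[0] == declaredWidth && actual[1] == declaredHeight;
    }
}
```

Whether this is cheap enough for documents with many embedded images would still have to be measured.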


Vincent Massol

9:02 a.m.


Yes, which is why I said it's tricky.

Does the OO-generated HTML always specify width and height even when the image isn't resized?

Thanks
-Vincent

Asiri Rathnayake

11:21 a.m.

Hi Vincent,

Yes, it always includes height & width attributes.

Thanks.

- Asiri


Vincent Massol

20 Nov

7:02 p.m.

On Nov 20, 2008, at 3:47 PM, Asiri Rathnayake wrote:


Just came into my mind, what about alignments ? (of texts, images and the like ?) Do you think they should also be ripped off when strict-filtering is on ?

Definitely.

Thanks
-Vincent

Vincent Massol

7:03 p.m.


On Nov 19, 2008, at 11:20 AM, Guillaume Lerouge wrote:


I can't help you much from the technical perspective. Re styles that can be directly mapped to XWiki 2.0 syntax, I think they should be converted to use that syntax. To summarize my opinion:

- When strict filtering is activated (conversion to XWiki 2.0 syntax)
  - Only style attributes that can be directly mapped to wiki syntax element should be kept
  - This means that NO (% ... %) should appear

Is that fine with everyone?

Yes, fine with me for strict filtering.

Thanks
-Vincent


Sergiu Dumitriu

19 Nov

1:27 p.m.

Asiri Rathnayake wrote:


* The first approach is to keep a list of attributes that we allow when importing documents. We'll scan each and every tag and strip off any unwanted attributes present.

* The second approach is to associate each tag with what attributes we allow for that tag. A list of legal attributes for common tags is presented here http://www.devx.com/projectcool/Article/19816. Similarly, we'll have our tag_name->allowed_attributes mapping and filter all other attributes present.

I'm currently leaning towards the second option, WDYT ?

The second sounds better from a functional perspective. However, we must be sure we define this list in a clean way, and take care about performance. We definitely don't want inline array definitions in the Java source files.


In any case, I will have to parse the in-line style attribute string to filter those style directives that are not necessary. The complete grammar for in-line style attributes seems to be a bit complicated to be hand crafted (http://www.w3.org/TR/css-style-attr) although in OpenOffice converted documents i have only seen the "key:value;key:value" format. What should be the correct approach to parse the style attribute string ?

All cases can be covered by using a CSS library. We already use css4j in the export (pdf) implementation, and although it has several limitations, this library is a good starting point.

Vincent, could you put the latest release (0.10) in our externals? There have been lots of changes since 0.4, which would also help our PDF export.

-- Sergiu Dumitriu http://purl.org/net/sergiu/

Asiri Rathnayake

4:06 p.m.

Hi Sergiu,

Thank you for this input. I think css4j is a good option.

One more question: would it be a good idea to maintain a list of allowed CSS properties for each element we are interested in, so that we can filter out everything else? By list I mean a comma-separated string of allowed properties like "font-family,font-size,font-weight". Then we could use the String.indexOf() method for the search... WDYT?

I'm asking these specific questions to make sure that I don't introduce any performance hogs :)

Thanks.

- Asiri
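One thing worth noting about the String.indexOf() idea: a substring search over "font-family,font-size,font-weight" would also report a hit for a shorter name such as "font", whereas a `Set` lookup matches whole property names only. The sketch below illustrates that alternative for the simple "key:value;key:value" format only. It is not css4j and does not implement the full style-attribute grammar (for example, values containing semicolons would be split incorrectly); the class name `InlineStyleFilter` is hypothetical, and a single set is used for all elements rather than one list per element.

```java
import java.util.Locale;
import java.util.Set;
import java.util.StringJoiner;

/** Hypothetical filter for simple "key:value;key:value" inline styles. */
class InlineStyleFilter {
    // Property names taken from the example in the message above.
    private static final Set<String> ALLOWED = Set.of("font-family", "font-size", "font-weight");

    /** Returns the style string with only the allowed declarations kept. */
    static String filter(String style) {
        StringJoiner kept = new StringJoiner(";");
        for (String declaration : style.split(";")) {
            int colon = declaration.indexOf(':');
            if (colon <= 0) {
                continue; // empty or malformed declaration
            }
            String property = declaration.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = declaration.substring(colon + 1).trim();
            // Exact match on the whole property name, unlike a substring search.
            if (!value.isEmpty() && ALLOWED.contains(property)) {
                kept.add(property + ":" + value);
            }
        }
        return kept.toString();
    }
}
```

For instance, `InlineStyleFilter.filter("font-weight:bold; color: red;margin-left:2px")` returns `font-weight:bold`.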
