[xwiki-devs] [OFFICEIMPORTER] Style Filtering Policy & Mechanism.
Hi Devs,

I'm working on implementing the style filtering functionality of the xwiki-office-importer application. But first, I need to make sure that I'm clear about the policy and the correct approach towards filtering style information from imported office documents. I would really appreciate your input on this because I'm not an expert on either HTML or CSS :)

OK, I plan to do two types of filtering. One is filtering attributes of various elements (like removing the bgcolor attribute from the <body> element). The second is filtering CSS-related content. Let's take them one by one.

1. Filtering attributes.

This is quite straightforward, but I see two possible approaches.

* The first approach is to keep a single list of attributes that we allow when importing documents. We'll scan every tag and strip off any attribute that is not on the list.

* The second approach is to associate each tag with the attributes we allow for that tag. A list of legal attributes for common tags is presented here http://www.devx.com/projectcool/Article/19816. Similarly, we'll have our tag_name->allowed_attributes mapping and filter out all other attributes.

I'm currently leaning towards the second option, WDYT?

2. Filtering CSS styles.

There are three ways one can associate CSS with HTML content. Let's take them one by one.

(i) External Style Sheet

AFAIK the OpenOffice server does not produce this type of output when converting office documents into HTML; that is, it doesn't output HTML files that refer to external CSS files. So I guess this is something we don't need to worry about.

(ii) Internal Style Sheet

This is something we need to worry about. The OpenOffice server produces HTML pages with content like <head><style type="text/css">....</style></head>.

Currently we strip off <style> tags completely regardless of the filtering mode (i.e. <style> tags get removed whether or not styles are set to be filtered). Does this behaviour need to change?

(iii) In-line Styles

This is the most common type of styling found (example: <p style="....">). The present behaviour is to strip off this style attribute completely (if filterStyles is set to true). The question is: some inline styles map directly to XWiki 2.0 syntax, like <p style="font-weight:bold">; what are we going to do about these?

In any case, I will have to parse the in-line style attribute string to filter out those style directives that are not necessary. The complete grammar for in-line style attributes seems a bit too complicated to hand-craft a parser for (http://www.w3.org/TR/css-style-attr), although in OpenOffice-converted documents I have only seen the "key:value;key:value" format. What would be the correct approach to parsing the style attribute string?

Thank you very much for your ideas. :)
Hi Asiri, On Wed, Nov 19, 2008 at 10:56 AM, Asiri Rathnayake < [email protected]> wrote:
This is the most common type of styling found (example: <p style="....">). The present behaviour is to strip off this style attribute completely (if filterStyles is set to true). The question is: some inline styles map directly to XWiki 2.0 syntax, like <p style="font-weight:bold">; what are we going to do about these?
I can't help you much from the technical perspective. Re styles that can be directly mapped to XWiki 2.0 syntax, I think they should be converted to use that syntax. To summarize my opinion:

- When strict filtering is activated (conversion to XWiki 2.0 syntax)
  - Only style attributes that can be directly mapped to wiki syntax element should be kept
  - This means that NO (% ... %) should appear

Is that fine with everyone?
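As a rough illustration of the "directly mapped" idea, the sketch below looks up a single inline declaration and returns the XWiki 2.0 marker it would translate to, or nothing if the declaration would have to be dropped in strict mode. The class name, the method name and the exact set of mapped declarations are assumptions made for this example; they are not taken from the importer's code.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper, not part of the actual office importer. */
class WikiSyntaxMapping {
    // Assumed list of declarations that have a direct XWiki 2.0 equivalent.
    private static final Map<String, String> MARKERS = new LinkedHashMap<>();
    static {
        MARKERS.put("font-weight:bold", "**");
        MARKERS.put("font-style:italic", "//");
        MARKERS.put("text-decoration:underline", "__");
        MARKERS.put("text-decoration:line-through", "--");
    }

    /** Returns the wiki marker for one declaration, or empty if strict mode would drop it. */
    static Optional<String> markerFor(String property, String value) {
        String key = property.trim().toLowerCase() + ":" + value.trim().toLowerCase();
        return Optional.ofNullable(MARKERS.get(key));
    }
}
```

With such a table, <p style="font-weight:bold"> would become bold wiki text, while a declaration like text-align:center has no entry and would simply disappear instead of producing a (% ... %) parameter.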
_______________________________________________ devs mailing list [email protected] http://lists.xwiki.org/mailman/listinfo/devs
-- Guillaume Lerouge Product Manager - XWiki Skype ID : wikibc http://blog.xwiki.com/
Hi Guillaume,
One more thing just came to mind: what about alignment (of text, images and the like)? Do you think it should also be stripped when strict filtering is on?

Thanks.

- Asiri
On Thu, Nov 20, 2008 at 8:17 PM, Asiri Rathnayake < [email protected]> wrote:
+ same question for image height & width... I'm asking these questions because I think that sort of information should be preserved; otherwise the document will look really strange after import.

WDYT?

- Asiri
On Thu, Nov 20, 2008 at 8:30 PM, Asiri Rathnayake < [email protected]> wrote:
Again, maybe we should introduce another level of style filtering, so that we have 3 levels of filtering:

1. Strict filtering (filter everything, i.e. not a single (%%))
2. Moderate (filter styles as much as possible but try to preserve those formatting elements which make the document look appealing, like alignment, image height & width etc.)
3. Filter nothing.

This is just an idea.

- Asiri
On Nov 20, 2008, at 4:11 PM, Asiri Rathnayake wrote:
You should have a design that allows plugging in any number of filtering strategies (a FilterStrategy interface) and implement 2 for now:

* StrictFilterStrategy
* NoopFilterStrategy (or some other name)

I think we can implement the moderate one later on, after we start using the tool and get user feedback. At least I think it would be good to have those 2 fully done first.

Thanks
-Vincent
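A minimal sketch of what such a pluggable design could look like follows. Only the three type names come from the message above; the filter() signature, the attribute-map representation and the set of attributes treated as presentational are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical contract: given an element's attributes, return the ones to keep. */
interface FilterStrategy {
    Map<String, String> filter(String tagName, Map<String, String> attributes);
}

/** Strict mode: drops styling attributes so no (% ... %) parameters are produced. */
class StrictFilterStrategy implements FilterStrategy {
    // Assumed list of presentational attributes; the real policy may differ.
    private static final Set<String> PRESENTATIONAL =
        new HashSet<>(Arrays.asList("style", "class", "align", "bgcolor"));

    @Override
    public Map<String, String> filter(String tagName, Map<String, String> attributes) {
        Map<String, String> kept = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : attributes.entrySet()) {
            if (!PRESENTATIONAL.contains(entry.getKey().toLowerCase())) {
                kept.put(entry.getKey(), entry.getValue());
            }
        }
        return kept;
    }
}

/** No-op mode: returns a copy of the attributes, unchanged. */
class NoopFilterStrategy implements FilterStrategy {
    @Override
    public Map<String, String> filter(String tagName, Map<String, String> attributes) {
        return new LinkedHashMap<>(attributes);
    }
}
```

A moderate strategy could later be added as a third implementation without touching the code that calls filter().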
On Thu, Nov 20, 2008 at 7:07 PM, Vincent Massol <[email protected]> wrote:
Just for the record, I agree with Vincent on this: I think we should have the strict filtering (wiki syntax only) working very well before worrying about moderate filtering strategies.

Guillaume
On Nov 20, 2008, at 4:00 PM, Asiri Rathnayake wrote:
+ same question for image height & width... I'm asking these questions because I think that sort of information should be preserved, otherwise the document will look really strange after import.
This is a bit tricky... The real answer (which is probably hard to implement) is:

* Remove height and width for images if they match the original image size
* Otherwise keep them

Thanks
-Vincent
Hi Vincent,
Well, I don't think there is an easy way to figure out the actual dimensions of an image without using some image manipulation code. At the very least we'll have to load the image into some kind of structure, and this can be costly in terms of performance, especially when there are lots of images embedded in the document.

Thanks.

- Asiri
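One possible way to limit that cost, sketched here under assumptions, is to ask a javax.imageio ImageReader for the width and height instead of decoding the whole picture into a BufferedImage; for common formats the reader typically only needs the image header for this. The class and method names are hypothetical, and the comparison assumes the HTML width and height attributes are expressed in pixels.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Iterator;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.stream.ImageInputStream;

/** Hypothetical helper, not part of the actual office importer. */
class ImageSizeCheck {
    /** Returns {width, height} of the first image in the stream, or null if the format is unknown. */
    static int[] readSize(InputStream in) throws IOException {
        try (ImageInputStream stream = ImageIO.createImageInputStream(in)) {
            if (stream == null) {
                return null;
            }
            Iterator<ImageReader> readers = ImageIO.getImageReaders(stream);
            if (!readers.hasNext()) {
                return null;
            }
            ImageReader reader = readers.next();
            try {
                // seekForwardOnly = true, ignoreMetadata = true: only dimensions are needed.
                reader.setInput(stream, true, true);
                return new int[] { reader.getWidth(0), reader.getHeight(0) };
            } finally {
                reader.dispose();
            }
        }
    }

    /** True if the HTML width/height (assumed to be pixels) equal the image's own size. */
    static boolean matchesOriginalSize(InputStream image, int htmlWidth, int htmlHeight)
            throws IOException {
        int[] size = readSize(image);
        return size != null && size[0] == htmlWidth && size[1] == htmlHeight;
    }
}
```

Whether this is cheap enough for documents with many embedded images would still need to be measured.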
On Nov 21, 2008, at 4:46 AM, Asiri Rathnayake wrote:
Yes, which is why I said it's tricky.

Does the OO-generated HTML always specify width and height even when the image isn't resized?

Thanks
-Vincent
Hi Vincent, On Fri, Nov 21, 2008 at 1:32 PM, Vincent Massol <[email protected]> wrote:
Yes, it always includes the height & width attributes.

Thanks.

- Asiri
On Nov 20, 2008, at 3:47 PM, Asiri Rathnayake wrote:
One more thing just came to mind: what about alignment (of text, images and the like)? Do you think it should also be stripped when strict filtering is on?

Definitely.

Thanks
-Vincent
On Nov 19, 2008, at 11:20 AM, Guillaume Lerouge wrote:
I can't help you much from the technical perspective. Re styles that can be directly mapped to XWiki 2.0 syntax, I think they should be converted to use that syntax. To summarize my opinion:
- When strict filtering is activated (conversion to XWiki 2.0 syntax)
  - Only style attributes that can be directly mapped to wiki syntax element should be kept
  - This means that NO (% ... %) should appear
Is that fine with everyone?
Yes, fine with me for strict filtering.

Thanks
-Vincent
Asiri Rathnayake wrote:
1. Filtering attributes.
This is quite straightforward, but I see two possible approaches.
* The first approach is to keep a single list of attributes that we allow when importing documents. We'll scan every tag and strip off any attribute that is not on the list.
* The second approach is to associate each tag with the attributes we allow for that tag. A list of legal attributes for common tags is presented here http://www.devx.com/projectcool/Article/19816. Similarly, we'll have our tag_name->allowed_attributes mapping and filter out all other attributes.
I'm currently leaning towards the second option, WDYT?
Second sounds better from a functional perspective. However, we must be sure we define this list in a clean way, and take care about performance. We definitely don't want inline array definitions in the java source files.
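To make the second option concrete without hard-coding arrays in the Java sources, one possibility is to keep the tag_name->allowed_attributes mapping in a properties-style resource and load it once into hash sets, so each check is a constant-time lookup. This is only a sketch: the class name, the "tag=attr1,attr2" file format and the decision to strip every attribute from unlisted tags are assumptions.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;
import java.util.Set;

/** Hypothetical helper, not part of the actual office importer. */
class AttributeWhitelist {
    private final Map<String, Set<String>> allowed = new HashMap<>();

    /** Builds the tag -> allowed attributes map from "tag=attr1,attr2" entries. */
    AttributeWhitelist(Properties config) {
        for (String tag : config.stringPropertyNames()) {
            Set<String> attributes = new HashSet<>();
            for (String attribute : config.getProperty(tag).split(",")) {
                if (!attribute.trim().isEmpty()) {
                    attributes.add(attribute.trim().toLowerCase());
                }
            }
            allowed.put(tag.toLowerCase(), attributes);
        }
    }

    /** Parses the mapping from text; a real setup would read a classpath resource instead. */
    static AttributeWhitelist parse(String text) throws IOException {
        Properties config = new Properties();
        config.load(new StringReader(text));
        return new AttributeWhitelist(config);
    }

    /** Keeps only the attributes allowed for the tag; unlisted tags lose all attributes. */
    Map<String, String> filter(String tag, Map<String, String> attributes) {
        Set<String> ok = allowed.getOrDefault(tag.toLowerCase(), Collections.emptySet());
        Map<String, String> kept = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : attributes.entrySet()) {
            if (ok.contains(entry.getKey().toLowerCase())) {
                kept.put(entry.getKey(), entry.getValue());
            }
        }
        return kept;
    }
}
```

For example, a configuration containing the line a=href,title would let an <a> element keep its href while a style attribute on the same element is dropped.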
In any case, I will have to parse the in-line style attribute string to filter out those style directives that are not necessary. The complete grammar for in-line style attributes seems a bit too complicated to hand-craft a parser for (http://www.w3.org/TR/css-style-attr), although in OpenOffice-converted documents I have only seen the "key:value;key:value" format. What would be the correct approach to parsing the style attribute string?
All cases can be covered by using a CSS library. We already use css4j in the export (pdf) implementation, and although it has several limitations, this library is a good starting point.

Vincent, could you put the latest release (0.10) in our externals? There have been lots of changes since 0.4, which would also help our PDF export.

--
Sergiu Dumitriu
http://purl.org/net/sergiu/
Hi Sergiu,
Thank you for this input. I think css4j is a good option.

One more question: would it be a good idea to maintain a list of allowed CSS properties for each element we are interested in, so that we can filter out everything else? By list I mean a comma-separated string of allowed properties like "font-family,font-size,font-weight", which we could then search using the String.indexOf() method. WDYT?

I'm asking these specific questions to make sure that I don't introduce any performance hogs :)

Thanks.

- Asiri
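One thing to watch with the String.indexOf() idea is that it matches substrings: "font-family,font-size,font-weight".indexOf("font") returns 0, so the shorthand font property would be accepted even though it is not in the list. A Set lookup avoids this and is also a constant-time check. The sketch below is a hand-rolled illustration that only handles the simple "key:value;key:value" format mentioned earlier in the thread; it does not use css4j, and the class and method names are made up for the example.

```java
import java.util.Set;

/** Hypothetical helper, not part of the actual office importer and not based on css4j. */
class InlineStyleFilter {
    /** Keeps only the declarations whose property name is in the allowed set. */
    static String filter(String style, Set<String> allowedProperties) {
        StringBuilder kept = new StringBuilder();
        // Naive split: values containing ';' (e.g. inside quotes or url()) are not supported.
        for (String declaration : style.split(";")) {
            int colon = declaration.indexOf(':');
            if (colon < 0) {
                continue; // skip empty or malformed chunks
            }
            String property = declaration.substring(0, colon).trim().toLowerCase();
            String value = declaration.substring(colon + 1).trim();
            if (allowedProperties.contains(property)) {
                if (kept.length() > 0) {
                    kept.append("; ");
                }
                kept.append(property).append(": ").append(value);
            }
        }
        return kept.toString();
    }
}
```

Given the allowed set {font-family, font-size, font-weight}, the input "font-weight:bold; font:12px Arial; color:red" is reduced to "font-weight: bold".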
Participants (4): Asiri Rathnayake, Guillaume Lerouge, Sergiu Dumitriu, Vincent Massol