Hi Asiri,
On Wed, Nov 19, 2008 at 10:56 AM, Asiri Rathnayake <asiri.rathnayake(a)gmail.com> wrote:
Hi Devs,
I'm working on implementing the style filtering functionality of the
xwiki-office-importer application. But first, I need to make sure that I'm
clear about the policy and the correct approach to filtering style
information from imported office documents. I would really appreciate your
input on this because I'm not an expert on either HTML or CSS :)
OK, I plan to do two types of filtering. The first is filtering attributes
of elements (like removing the bgcolor attribute from the <body>
element). The second is filtering CSS-related content. Let's take them one
by one.
1. Filtering attributes.
This is quite straightforward, but I see two possible approaches.
* The first approach is to keep a list of attributes that we allow when
importing documents. We'll scan each and every tag and strip off any
unwanted attributes present.
* The second approach is to associate each tag with the attributes we allow
for that tag. A list of legal attributes for common tags is presented here:
http://www.devx.com/projectcool/Article/19816. Similarly, we'll have our own
tag_name->allowed_attributes mapping and filter out all other attributes
present.
I'm currently leaning towards the second option. WDYT?
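The second approach could be sketched roughly as follows. This is only an illustration in Python, not the importer's actual code: the `ALLOWED_ATTRIBUTES` table, its contents, and the `filter_attributes` helper are all hypothetical names chosen for the example.

```python
# Hypothetical whitelist: tag name -> set of attributes allowed on that tag.
# The entries are examples only, not an agreed policy.
ALLOWED_ATTRIBUTES = {
    "a": {"href", "name"},
    "img": {"src", "alt"},
    "td": {"colspan", "rowspan"},
}


def filter_attributes(tag, attributes):
    """Return only the attributes that the whitelist allows for this tag.

    Tags missing from the whitelist keep no attributes at all.
    """
    allowed = ALLOWED_ATTRIBUTES.get(tag.lower(), set())
    return {
        name: value
        for name, value in attributes.items()
        if name.lower() in allowed
    }
```

With this table, `filter_attributes("body", {"bgcolor": "#ffffff"})` returns an empty dict, because `body` has no entry. The first approach (one global list) would be the same function with a single set instead of a per-tag lookup.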
2. Filtering CSS styles.
OK, there are three ways one can associate CSS with HTML content. Let's
take them one by one.
(i) External Style Sheet
Well, AFAIK the OpenOffice server does not produce this type of output when
converting office documents into HTML. I mean it doesn't output HTML files
that refer to external CSS files. So I guess this is something we don't need
to worry about.
(ii) Internal Style Sheet
This is something we need to worry about. The OpenOffice server produces HTML
pages with content like <head><style type="text/css">....</style></head>.
Currently we strip off <style> tags completely regardless of the filtering
mode (i.e., <style> tags get removed whether or not styles are set to be
filtered). Does this behaviour need to change?
(iii) In-line Styles
This is the most common type of styling found (example: <p style="....">).
The present behaviour is to strip off this style attribute completely (if
filterStyles is set to true). The question is, there are some inline styles
that map directly to XWiki 2.0 syntax, like <p style="font-weight:bold">.
What are we going to do about these?
I can't help you much from the technical perspective. Regarding styles that can
be directly mapped to XWiki 2.0 syntax, I think they should be converted to use
that syntax. To summarize my opinion:
- When strict filtering is activated (conversion to XWiki 2.0 syntax):
  - Only style attributes that can be directly mapped to a wiki syntax
    element should be kept
  - This means that NO (% ... %) should appear
Is that fine with everyone?
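To make the strict-filtering idea concrete, here is a minimal Python sketch. It assumes the style attribute has already been split into property/value pairs. The `STYLE_TO_WIKI` table and the `to_wiki_syntax` helper are hypothetical, and which styles count as "directly mappable" is my assumption rather than a decided list.

```python
# Hypothetical mapping from (property, value) pairs to XWiki 2.0 inline
# markers. Which styles to include here is an assumption for illustration.
STYLE_TO_WIKI = {
    ("font-weight", "bold"): "**",
    ("font-style", "italic"): "//",
    ("text-decoration", "underline"): "__",
    ("text-decoration", "line-through"): "--",
}


def to_wiki_syntax(text, declarations):
    """Wrap text in wiki markers for mappable styles and drop the rest.

    `declarations` is a dict of CSS property -> value. Anything without an
    entry in STYLE_TO_WIKI is discarded, so no (% ... %) block is produced.
    """
    for prop, value in declarations.items():
        key = (prop.strip().lower(), value.strip().lower())
        marker = STYLE_TO_WIKI.get(key)
        if marker:
            text = marker + text + marker
    return text
```

For example, `to_wiki_syntax("hello", {"font-weight": "bold", "color": "red"})` keeps the bold and silently drops the colour.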
In any case, I will have to parse the in-line style attribute string to
filter out those style directives that are not necessary. The complete grammar
for in-line style attributes (http://www.w3.org/TR/css-style-attr) seems to be
a bit too complicated to parse by hand, although in OpenOffice-converted
documents I have only seen the "key:value;key:value" format. What
should be the correct approach to parse the style attribute string?
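If the input really is limited to the "key:value;key:value" format, a hand-written split may be enough. The Python sketch below assumes exactly that: it does not implement the full grammar, and it would break on values containing a semicolon (for instance inside a quoted string). The `parse_style` name is made up for this example.

```python
def parse_style(style):
    """Parse a 'key:value;key:value' style string into a dict.

    Simplification: declarations are split on ';' and each one on its first
    ':'. Empty or malformed declarations are skipped. Quoted strings that
    contain ';' are NOT handled.
    """
    declarations = {}
    for declaration in style.split(";"):
        name, separator, value = declaration.partition(":")
        name = name.strip().lower()
        value = value.strip()
        if separator and name and value:
            declarations[name] = value
    return declarations
```

Splitting on the first colon only means a value such as `url(http://...)` stays in one piece. For anything beyond this simple format, a real CSS parser library would probably be the safer choice.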
Thank you very much for your ideas. :)
--
Guillaume Lerouge
Product Manager - XWiki
Skype ID : wikibc
http://blog.xwiki.com/