Re: [xwiki-devs] [OFFICEIMPORTER] Style Filtering Policy & Mechanism.

20 Nov 2008

On Nov 19, 2008, at 11:20 AM, Guillaume Lerouge wrote:
...
  Hi Asiri,
 On Wed, Nov 19, 2008 at 10:56 AM, Asiri Rathnayake <asiri.rathnayake(a)gmail.com> wrote:
  Hi Devs,
 I'm working on implementing the style filtering functionality of the
 xwiki-office-importer application. But first, I need to make sure that
 I'm clear about the policy and the correct approach to filtering style
 information from imported office documents. I would really appreciate
 your input on this because I'm not an expert on either HTML or CSS :)

 Ok, I plan to do two types of filtering. The first is filtering
 attributes of elements (like removing the bgcolor attribute from the
 <body> element). The second is filtering CSS. Let's take them one by
 one.

 1. Filtering attributes.

 This is quite straightforward, but I see two possible approaches.

 * The first approach is to keep a single list of attributes that we
   allow when importing documents. We'll scan every tag and strip off
   any attribute that is not on the list.
 * The second approach is to associate each tag with the attributes we
   allow for that tag. A list of legal attributes for common tags is
   presented here: http://www.devx.com/projectcool/Article/19816.
   Similarly, we'll have our own tag_name->allowed_attributes mapping
   and filter out all other attributes present.

 I'm currently leaning towards the second option, WDYT?
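
The second approach (a per-tag whitelist) could be sketched roughly as follows. This is a minimal illustration in Python using the standard library's html.parser, not the importer's actual code; the ALLOWED_ATTRIBUTES table, the AttributeFilter class and the filter_attributes function are hypothetical names, and the whitelist entries are only examples.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist: tag name -> attributes allowed on that tag.
# Tags that are not listed keep no attributes at all.
ALLOWED_ATTRIBUTES = {
    "a": {"href", "title"},
    "img": {"src", "alt"},
    "td": {"colspan", "rowspan"},
}


class AttributeFilter(HTMLParser):
    """Re-emits HTML, dropping attributes not whitelisted for their tag.

    Comments, doctypes and processing instructions are silently dropped
    in this sketch because their handlers are not overridden.
    """

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []

    def _render(self, tag, attrs, closing=">"):
        allowed = ALLOWED_ATTRIBUTES.get(tag, set())
        kept = "".join(
            f' {name}="{escape(value or "", quote=True)}"'
            for name, value in attrs
            if name in allowed
        )
        self.out.append(f"<{tag}{kept}{closing}")

    def handle_starttag(self, tag, attrs):
        self._render(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        self._render(tag, attrs, closing="/>")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def filter_attributes(html):
    """Return html with every non-whitelisted attribute removed."""
    parser = AttributeFilter()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

With this table, the bgcolor attribute on <body> is removed because <body> has no whitelist entry, while href on <a> survives and target does not. The first approach would be the same code with a single global set instead of the per-tag lookup.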

 2. Filtering CSS styles.

 Ok, there are three ways one can associate CSS with HTML content.
 Let's take them one by one.

 (i) External Style Sheet

 Well, AFAIK the OpenOffice server does not produce this type of output
 when converting office documents into HTML. I mean it doesn't output
 HTML files that refer to external CSS files. So I guess this is
 something we don't need to worry about.

 (ii) Internal Style Sheet

 This is something we need to worry about. The OpenOffice server
 produces HTML pages with content like
 <head><style type="text/css">....</style></head>.
 Currently we strip off <style> tags completely regardless of the
 filtering mode (i.e. <style> tags get removed whether or not styles
 are set to be filtered). Does this behaviour need to change?

 (iii) In-line Styles

 This is the most common type of styling found (example:
 <p style="....">). The present behaviour is to strip off this style
 attribute completely (if filterStyles is set to true). The question
 is, there are some inline styles that directly map to XWiki 2.0
 syntax, like <p style="font-weight:bold">. What are we going to do
 about these?

 I can't help you much from the technical perspective. Regarding styles
 that can be directly mapped to XWiki 2.0 syntax, I think they should
 be converted to use that syntax. To summarize my opinion:
   - When strict filtering is activated (conversion to XWiki 2.0
     syntax)
   - Only style attributes that can be directly mapped to a wiki syntax
     element should be kept
      - This means that NO (% ... %) should appear
 Is that fine with everyone?
Yes fine with me for strict filtering
Thanks
-Vincent
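
The strict-filtering rule agreed above (keep only what maps directly to wiki syntax, never emit (% ... %)) could look something like the sketch below. It assumes the style attribute has already been split into a property->value dict, and the WIKI_MARKERS table and to_wiki_syntax function are hypothetical; the set of properties treated as mappable is an assumption, not a decision taken in this thread.

```python
# Assumed mapping from (property, value) to XWiki 2.0 inline markers.
WIKI_MARKERS = {
    ("font-weight", "bold"): "**",
    ("font-style", "italic"): "//",
    ("text-decoration", "underline"): "__",
    ("text-decoration", "line-through"): "--",
}


def to_wiki_syntax(text, declarations):
    """Wrap text in wiki markers for mappable declarations; drop the rest.

    Declarations with no entry in WIKI_MARKERS are discarded, so no
    (% ... %) parameter block is ever produced.
    """
    for prop, value in declarations.items():
        marker = WIKI_MARKERS.get((prop.lower(), value.lower()))
        if marker:
            text = f"{marker}{text}{marker}"
    return text
```

For example, a paragraph styled with font-weight:bold and color:red would come out as **text**, with the colour silently dropped.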
...
> In any case, I will have to parse the in-line style attribute string
> to filter out those style directives that are not necessary. The
> complete grammar for in-line style attributes seems a bit too
> complicated for a hand-crafted parser
> (http://www.w3.org/TR/css-style-attr), although in OpenOffice-converted
> documents I have only seen the "key:value;key:value" format. What
> would be the correct approach to parse the style attribute string?
>
> Thank you very much for your ideas. :)
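
For the simple "key:value;key:value" form described above, a hand-written split may be enough. The sketch below is only that: it does not implement the full CSS style-attribute grammar, and it will mis-split values that contain a semicolon (for example inside a quoted string or a data: URL). The names parse_style and filter_style are hypothetical.

```python
def parse_style(style):
    """Parse 'key:value;key:value' into a dict of lower-cased properties.

    Only the first ':' in each declaration separates name from value, so
    values such as url(http://example.org/x.png) are kept intact.
    """
    declarations = {}
    for chunk in style.split(";"):
        name, sep, value = chunk.partition(":")
        name, value = name.strip().lower(), value.strip()
        if sep and name and value:
            declarations[name] = value
    return declarations


def filter_style(style, allowed):
    """Rebuild a style string that keeps only the allowed properties."""
    kept = {k: v for k, v in parse_style(style).items() if k in allowed}
    return "; ".join(f"{k}: {v}" for k, v in kept.items())
```

If the converted documents ever turn out to use the fuller grammar, swapping parse_style for a real CSS parser would leave filter_style unchanged.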
