On Mar 10, 2009, at 12:30 PM, Asiri Rathnayake wrote:
Hi devs,
To implement the above functionality, I have created the following UI: http://i43.tinypic.com/28l7x2u.png which was derived from the mockups located at http://incubator.myxwiki.org/xwiki/bin/view/Mockups/ImportCompositeDocument
Descriptions of various fields are as follows:
* Document - The office document to be uploaded (and imported)
* Style filtering - Whether to filter office styles or not
* Heading level to split - If the user wishes to split the imported document into multiple wiki pages, he has to select the heading level (h1, h2, h3 ... h6) to be used when splitting the document. If the user does not select a heading level, the document will be imported as is (no splitting).
* Custom split regex - If the user wants to further refine the split criterion (based on the content of the heading), this field allows him to specify that criterion through a regular expression.
Example regular expression: <b>Section</b>.* (see the matching sketch after this list).
Open question: aren't regular expressions a bit too technical for users?
* Target space - This is where the resulting document(s) will land.
* Target (master) page - The main document holding the TOC (in case of splitting); otherwise this is the name of the resulting wiki page.
* Child pages naming method - If the document is split into multiple pages, the pages should be named according to some criterion. This combo box allows users to specify that criterion.
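For the custom split regex, here is a minimal sketch (all names hypothetical) of how a heading's serialized content could be matched against the user-supplied expression with java.util.regex:

import java.util.regex.Pattern;

public class HeadingMatcher
{
    private final Pattern splitPattern;

    public HeadingMatcher(String userRegex)
    {
        // Compile the user-supplied split regex once and reuse it per heading.
        this.splitPattern = Pattern.compile(userRegex);
    }

    // Returns true if the document should be split at this heading,
    // e.g. for the serialized heading "<b>Section</b> 1. Introduction".
    public boolean isSplitPoint(String serializedHeading)
    {
        return splitPattern.matcher(serializedHeading).matches();
    }
}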
Regarding the implementation, we have two possible approaches:
1. Implement the splitting at the W3C DOM level (XHTML)
2. Implement the splitting at the XDOM level
* In the first approach we will navigate through the child elements directly under the <body> tag and find matching heading elements. For the regex, we will have to serialize each heading element so that the regex can be evaluated against it. Heading elements can be serialized as explained here: http://forums.sun.com/thread.jspa?threadID=698475 (a sketch follows below).
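As a rough illustration of the first approach (assuming an in-memory XHTML DOM; the serialization follows the thread above):

import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

public class DomHeadingFinder
{
    public void findSplitPoints(Document document) throws Exception
    {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        // Walk only the direct children of <body> looking for h1..h6.
        Node body = document.getElementsByTagName("body").item(0);
        for (Node child = body.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child.getNodeType() == Node.ELEMENT_NODE
                && ((Element) child).getTagName().matches("h[1-6]")) {
                // Serialize the heading element so the split regex can be
                // evaluated against it (see the forum thread above).
                StringWriter writer = new StringWriter();
                transformer.transform(new DOMSource(child), new StreamResult(writer));
                String serializedHeading = writer.toString();
                // ... apply the custom split regex to serializedHeading ...
            }
        }
    }
}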
* In the second approach we can either use XDOM operations or a SplittingChainingListener. But I don't know whether regex matching is possible with this scheme.
Definitely not possible. You need to use block information.
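As a rough illustration of what using block information could look like on the XDOM side (API names from memory and may well differ):

import java.util.List;
import org.xwiki.rendering.block.HeaderBlock;
import org.xwiki.rendering.block.XDOM;

public class XdomSplitter
{
    public void split(XDOM xdom, int headingLevel)
    {
        // Find all header blocks in the tree and treat those of the
        // selected level as split points.
        List<HeaderBlock> headers = xdom.getChildrenByType(HeaderBlock.class, true);
        for (HeaderBlock header : headers) {
            if (header.getLevel().getAsInt() == headingLevel) {
                // ... start a new wiki page here; the header's child blocks
                // can be rendered to a string so the custom split regex can
                // be applied to the heading content ...
            }
        }
    }
}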
Also, regardless of the method we follow, there will be a problem with large office documents (say 100MB or so). Loading such a file into memory (DOM or XDOM) would not be a good idea.
I haven't decided which method to go with yet, so it would be really great if we could sort this out as soon as possible.
Both 1) and 2) are not very good since they're limited by memory size.
A better solution, as mentioned by Sergiu, is to use a streaming parser. Our existing rendering parsers are all streaming parsers, so this is good.
However we have 2 limitations in the rendering which we need to fix anyway to have a scalable rendering engine:
- the HTML cleaner generates a W3C DOM object. The problem here is that the SF HTML Cleaner we use is not scalable, I think, since it creates an in-memory DOM... That's a big problem.
- a new parser API for going directly to listeners without going through the XDOM (that's very easy; see the sketch below)
So until we fix the HTML cleaner we'll be limited by memory size anyway.
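To illustrate the second point, here's a purely hypothetical sketch of what a parser API that goes directly to a listener could look like (this signature doesn't exist yet):

import java.io.Reader;
import org.xwiki.rendering.listener.Listener;

public interface StreamingParser
{
    // Parse the input and fire begin/end events on the listener as the
    // content is read, so memory use stays roughly constant regardless
    // of the document size (no in-memory XDOM is built).
    void parse(Reader source, Listener listener) throws Exception;
}

With something like this, the office importer could attach its splitting listener directly to the parser and never hold the whole document in memory.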
What do others think?
Thanks
-Vincent