Asiri Rathnayake wrote:
Hi devs,
To implement the above functionality I have created the following UI:
http://i43.tinypic.com/28l7x2u.png which was derived from the mockups
located at
http://incubator.myxwiki.org/xwiki/bin/view/Mockups/ImportCompositeDocument
I don't like this mockup. This should be part of our standard Office
import feature, and as such it should provide the simplest interface
possible for basic users, without cluttering the UI with regexps and
"Leave empty if..." messages. Instead, this should be split into two
dialogs. The first has only the file input field and checkboxes for
"Clean the document styles to better match the content of the wiki" and
"Split the document into several wiki pages". If the second checkbox is
selected, pressing Next opens the second dialog, where the user can
enter the rest of the information.
Descriptions of various fields are as follows:
* Document - The office document to be uploaded (and imported)
* Style filtering - Whether to filter office styles or not
* Heading level to split - If the user wishes to split the imported document
into multiple wiki pages, he has to select the heading level (h1, h2, h3...
h6) to be used when splitting the document. If the user does not select a
heading level, the document will be imported as it is (no splitting).
* Custom split regex - If the user wants to further refine the split
criterion (based on the content of the heading), this field allows him to
specify that criterion through a regular expression.
Example regular expression: <b>Section</b>.*
Open Question: Aren't regular expressions a bit too technical for users?
* Target space - This is where the resulting document(s) will land.
* Target (master) page - The main document holding the TOC (in case of
splitting); otherwise this is the name of the resulting wiki page.
* Child pages naming method - If the document is split into multiple pages,
pages should be named according to some criterion. This combo box allows
users to specify that criterion.
Regarding the implementation, we have two possible approaches.
1. Implement the splitting at the W3C DOM level (XHTML)
2. Implement the splitting at the XDOM level
* In the first approach we will navigate through the child elements directly
under the <body> tag and find matching heading elements. For the regex, we
will have to serialize the heading element so that the regex can be evaluated.
Heading elements can be serialized as explained here:
http://forums.sun.com/thread.jspa?threadID=698475
* In the second approach we can either use XDOM operations or use a
SplittingChainingListener. But I don't know whether regex matching is
possible with this scheme.
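
To make the first approach more concrete, here is a rough sketch in Java. It is only an illustration, not the importer's actual code: the class name DomHeadingSplitter and its methods are made up, it assumes the XHTML is well-formed XML with no DTD-only entities, and it only looks at element children of <body>. Headings are serialized with the standard JAXP identity Transformer so the regex can be matched against their inner markup.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, named here only for illustration.
class DomHeadingSplitter {

    /** Serializes a DOM node back to markup, without the XML declaration. */
    static String serialize(Node node) throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(node), new StreamResult(out));
        return out.toString();
    }

    /** Markup between the heading's tags, e.g. "<b>Section</b> 1" (simplified tag stripping). */
    static String innerMarkup(Element heading) throws Exception {
        if (!heading.hasChildNodes()) {
            return "";
        }
        String markup = serialize(heading);
        return markup.substring(markup.indexOf('>') + 1, markup.lastIndexOf("</"));
    }

    /**
     * Splits the element children of <body> at every heading of the given level
     * whose inner markup matches the regex (a null regex matches every heading).
     * The first entry holds whatever precedes the first matching heading.
     */
    static List<String> split(String xhtml, int level, String regex) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(xhtml)));
        Node body = doc.getElementsByTagName("body").item(0);
        Pattern pattern = regex == null ? null : Pattern.compile(regex, Pattern.DOTALL);

        List<String> sections = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (Node n = body.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (!(n instanceof Element)) {
                continue; // top-level text and comments are ignored in this sketch
            }
            boolean isHeading = n.getNodeName().equalsIgnoreCase("h" + level);
            if (isHeading && (pattern == null
                    || pattern.matcher(innerMarkup((Element) n)).matches())) {
                sections.add(current.toString());
                current = new StringBuilder();
            }
            current.append(serialize(n));
        }
        sections.add(current.toString());
        return sections;
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body><p>Intro</p>"
            + "<h1><b>Section</b> 1</h1><p>A</p>"
            + "<h1>Appendix</h1><p>B</p>"
            + "<h1><b>Section</b> 2</h1><p>C</p></body></html>";
        for (String section : split(xhtml, 1, "<b>Section</b>.*")) {
            System.out.println("--- " + section);
        }
    }
}
```

Note that the whole document sits in memory here, which is exactly the concern with large files.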
Also, regardless of the method we follow, there will be a problem with large
office documents (say 100 MB or so). Loading such a file into memory (DOM or
XDOM) would not be a good idea.
I haven't decided which method to go with yet, so it would be really great if
we could sort this out as soon as possible.
3. Implement the splitting at the SAX level. SAX doesn't load the whole
document into memory, so this is the best option if we want to handle large
documents and keep memory consumption low. This, however, will make the
splitter harder to integrate into the current rendering engine.
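
A minimal sketch of what I mean by SAX-level splitting, using only the standard JAXP SAX API. The class name SaxHeadingSplitter and the Consumer-based sink are invented for the example, it splits on heading level only (regex matching would need the heading content buffered until its end tag), and it keeps just the text of each section, where a real splitter would forward the events to the renderer instead.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler, named here only for illustration.
class SaxHeadingSplitter extends DefaultHandler {
    private final String headingTag;
    private final Consumer<String> sink;   // receives each section as soon as it is complete
    private StringBuilder section = new StringBuilder();
    private int bodyDepth = -1;            // -1 outside <body>, 0 in <body>, 1 in a direct child

    SaxHeadingSplitter(int level, Consumer<String> sink) {
        this.headingTag = "h" + level;
        this.sink = sink;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (bodyDepth < 0) {
            if (qName.equalsIgnoreCase("body")) {
                bodyDepth = 0;
            }
            return;
        }
        bodyDepth++;
        // A heading directly under <body> closes the current section.
        if (bodyDepth == 1 && qName.equalsIgnoreCase(headingTag)) {
            flush();
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (bodyDepth > 0) {
            bodyDepth--;
        } else if (bodyDepth == 0) {
            bodyDepth = -1; // leaving <body>
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (bodyDepth >= 1) {
            section.append(ch, start, length);
        }
    }

    @Override
    public void endDocument() {
        flush();
    }

    /** Hands the buffered section to the sink; only one section is ever held in memory. */
    private void flush() {
        sink.accept(section.toString());
        section = new StringBuilder();
    }

    static void split(String xhtml, int level, Consumer<String> sink) throws Exception {
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xhtml)), new SaxHeadingSplitter(level, sink));
    }

    public static void main(String[] args) throws Exception {
        List<String> sections = new ArrayList<>();
        split("<html><head><title>T</title></head><body><p>Intro</p>"
            + "<h1>One</h1><p>A</p><h1>Two</h1><p>B</p></body></html>", 1, sections::add);
        System.out.println(sections);
    }
}
```

In real use the InputSource would wrap the converted file's stream rather than a String, and the sink would save each page right away.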
--
Sergiu Dumitriu
http://purl.org/net/sergiu/