[xwiki-devs] OfficeImporter API changes - Take 2 - xwiki-devs@xwiki.org


Asiri Rathnayake

26 Oct 2009

6:28 a.m.

Hello Devs,

After a few discussions I have revised the new officeimporter API to use DocumentName instead of plain strings for representing document names. I'll repeat the details of the previous proposal with the new changes applied.

Currently we have the following officeimporter API:

<code>
OfficeImporter::importStream(InputStream is, String documentFormat, String targetDocumentName, Map params):void
OfficeImporter::importAttachment(String documentName, String attachmentName, Map params):String
</code>

Problems with this API:

* Loosely typed (params, document names).
* Both of the above methods perform almost the same task.
* Customizing the import process is implemented in a hackish way (not visible on the API).

The proposed new API looks like this:

<code>
OfficeImporter::officeToXHTML(byte[] officeFileData, DocumentName referenceDocument, boolean filterStyles):XHTMLOfficeDocument
OfficeImporter::xhtmlToXDOM(XHTMLOfficeDocument xhtmlOfficeDocument):XDOMOfficeDocument
OfficeImporter::officeToXDOM(byte[] officeFileData, DocumentName referenceDocument, boolean filterStyles):XDOMOfficeDocument
OfficeImporter::buildPresentation(byte[] officeFileData):XDOMOfficeDocument
OfficeImporter::splitImport(XDOMOfficeDocument xdomOfficeDocument, int[] headingLevelsToSplit, NamingCriterion namingCriterion, DocumentName baseDocumentName):Map<TargetPageDescriptor, XDOMOfficeDocument>
</code>

As you can see, these methods are more granular and the responsibilities are well defined. Customizing the import process can be done from the client code. For example:

1. Make the initial import from office to XHTMLOfficeDocument - OfficeImporter::officeToXHTML()
2. Perform customizations on the XHTMLOfficeDocument - w3c DOM manipulations.
3. Import the XHTMLOfficeDocument into XDOMOfficeDocument - OfficeImporter::xhtmlToXDOM()
4. Perform customizations on the XDOMOfficeDocument (XDOM) - XDOM manipulations.
5. Split the XDOMOfficeDocument into multiple XDOMOfficeDocument instances - OfficeImporter::splitImport()
6. Perform customizations on these child XDOMOfficeDocument instances - XDOM manipulations.
7. Render the XDOMOfficeDocument instances & save them into wiki pages - XWiki rendering operations.

I think this interface will make it easy to extend & maintain officeimporter functionality in the future.

Along with this, I would also like to refactor the xwiki-refactoring module a bit to get rid of string-based document names.

This whole refactoring operation would take approximately one day to complete. And since it does not add any new features, I think it can be committed on both trunk and the 2.0 branch.

Here's my +1 to all of the above.

Thanks.

- Asiri
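The first four steps of the flow above could look roughly like this from client code. This is only a sketch: DocumentName, XHTMLOfficeDocument and XDOMOfficeDocument are reduced to hypothetical string-holding stand-ins (the real types wrap a w3c DOM and an XDOM), and StubImporter and ImportPipeline are invented here purely so the example runs on its own.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-ins for the XWiki types named in the proposal.
final class DocumentName {
    final String wiki;
    final String space;
    final String page;

    DocumentName(String wiki, String space, String page) {
        this.wiki = wiki;
        this.space = space;
        this.page = page;
    }
}

final class XHTMLOfficeDocument {
    String markup; // stands in for a w3c DOM tree

    XHTMLOfficeDocument(String markup) {
        this.markup = markup;
    }
}

final class XDOMOfficeDocument {
    String content; // stands in for an XDOM tree

    XDOMOfficeDocument(String content) {
        this.content = content;
    }
}

// Two of the proposed methods, with the signatures from the proposal.
interface OfficeImporter {
    XHTMLOfficeDocument officeToXHTML(byte[] officeFileData, DocumentName referenceDocument, boolean filterStyles);

    XDOMOfficeDocument xhtmlToXDOM(XHTMLOfficeDocument xhtmlOfficeDocument);
}

// Invented stub: treats the bytes as markup and strips tags to fake a conversion.
final class StubImporter implements OfficeImporter {
    public XHTMLOfficeDocument officeToXHTML(byte[] officeFileData, DocumentName referenceDocument, boolean filterStyles) {
        return new XHTMLOfficeDocument(new String(officeFileData, StandardCharsets.UTF_8));
    }

    public XDOMOfficeDocument xhtmlToXDOM(XHTMLOfficeDocument xhtmlOfficeDocument) {
        return new XDOMOfficeDocument(xhtmlOfficeDocument.markup.replaceAll("<[^>]+>", ""));
    }
}

final class ImportPipeline {
    static XDOMOfficeDocument run(OfficeImporter importer, byte[] data, DocumentName reference) {
        // Step 1: office -> XHTML.
        XHTMLOfficeDocument xhtml = importer.officeToXHTML(data, reference, true);
        // Step 2: client-side customization of the XHTML (here: drop <font> tags).
        xhtml.markup = xhtml.markup.replace("<font>", "").replace("</font>", "");
        // Step 3: XHTML -> XDOM.
        XDOMOfficeDocument xdom = importer.xhtmlToXDOM(xhtml);
        // Step 4: client-side customization of the XDOM (here: trim whitespace).
        xdom.content = xdom.content.trim();
        // Steps 5-7 (split, customize children, render and save) would follow.
        return xdom;
    }
}
```

The point of the sketch is only that each customization happens in the caller, between two importer calls, rather than through a params map.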


Marius Dumitru Florea

26 Oct 2009

8:12 a.m.

Hi Asiri,

Asiri Rathnayake wrote:

> OfficeImporter::officeToXHTML(byte[] officeFileData, DocumentName referenceDocument, boolean filterStyles):XHTMLOfficeDocument

Can you explain why you replaced InputStream with byte[]? Arrays require contiguous memory locations and loading large arrays into memory is bad IMO.

Thanks,
Marius


Asiri Rathnayake

9 a.m.

Hi,

> OfficeImporter::officeToXHTML(byte[] officeFileData, DocumentName referenceDocument, boolean filterStyles):XHTMLOfficeDocument
>
> Can you explain why you replaced InputStream with byte[]? Arrays require contiguous memory locations and loading large arrays into memory is bad IMO.

Yes, you are correct. I still have some doubts about this:

1. OpenOffice requires the input to be a file on disk, so we need to write the stream / byte[] to disk. A stream is better than a byte[] in this case because we won't be loading the whole file into memory at once.
2. However, OpenOffice will load the whole file into memory at once when performing the conversion. Still, a stream is somewhat better because we are not consuming any additional memory ourselves.
3. Currently the file upload plugin has no method to get an input stream, so it returns a byte[]. Converting this byte[] into a stream is kind of useless.
4. However, an API based on streams is better because in the future we might upgrade the fileupload plugin to return a stream anyway.
5. Even after the conversion I need to load the results into memory (byte[]) so that I can attach them to wiki pages.
6. But we might change the DAB API in the future so that attachments can be saved in a streaming fashion (this might be hard).

I might be wrong here, but I don't think a stream-oriented API for officeimporter will bring any performance improvement in the near future. All it will add is some glue code to convert between streams and byte[] so that it can work with the rest of the XE infrastructure (fileupload plugin, DAB, attachment saving).

If we base our decision solely on "having a streamable API is always wise", I'd also go with the streams-based API. I'd like to stay away from this decision and let you all decide :)

Thanks.

- Asiri
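To make points 1 and 3 concrete, here is a minimal sketch using only the JDK. OfficeInputSpooler and its method names are made up for illustration and are not part of officeimporter; the sketch assumes the converter only needs a path to a file on disk.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not an officeimporter class.
final class OfficeInputSpooler {
    /**
     * Copies the stream to a temporary file without first materializing the
     * whole input as one byte[]. The caller is responsible for deleting the file.
     */
    static Path spoolToTempFile(InputStream officeFileStream, String suffix) throws IOException {
        Path tempFile = Files.createTempFile("officeimport-", suffix);
        Files.copy(officeFileStream, tempFile, StandardCopyOption.REPLACE_EXISTING);
        return tempFile;
    }

    /**
     * The "glue code" for callers that only have a byte[], such as an upload
     * plugin that returns arrays: wraps the array without copying it.
     */
    static InputStream asStream(byte[] officeFileData) {
        return new ByteArrayInputStream(officeFileData);
    }
}
```

With a stream-based signature, a byte[] caller pays only for the thin wrapper, while a caller that already holds a stream avoids building the array at all.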


Thomas Mortagne

10:30 a.m.

On Mon, Oct 26, 2009 at 09:00, Asiri Rathnayake <asiri.rathnayake(a)gmail.com> wrote:

> Hi,
>
> > OfficeImporter::officeToXHTML(byte[] officeFileData, DocumentName referenceDocument, boolean filterStyles):XHTMLOfficeDocument
> >
> > Can you explain why you replaced InputStream with byte[]? Arrays require contiguous memory locations and loading large arrays into memory is bad IMO.

+1 for "having a streamable API is always wise" :)


-- Thomas Mortagne

Sergiu Dumitriu

1:52 p.m.

On 10/26/2009 09:00 AM, Asiri Rathnayake wrote:

> Hi,
>
> > OfficeImporter::officeToXHTML(byte[] officeFileData, DocumentName referenceDocument, boolean filterStyles):XHTMLOfficeDocument
> >
> > Can you explain why you replaced InputStream with byte[]? Arrays require contiguous memory locations and loading large arrays into memory is bad IMO.

+1 for streams.

-- Sergiu Dumitriu
http://purl.org/net/sergiu/

Asiri Rathnayake

18 Dec 2009

8:38 a.m.

Hi Devs,

<code>
OfficeImporter::officeToXHTML(byte[] officeFileData, DocumentName referenceDocument, boolean filterStyles):XHTMLOfficeDocument
OfficeImporter::xhtmlToXDOM(XHTMLOfficeDocument xhtmlOfficeDocument):XDOMOfficeDocument
OfficeImporter::officeToXDOM(byte[] officeFileData, DocumentName referenceDocument, boolean filterStyles):XDOMOfficeDocument
OfficeImporter::buildPresentation(byte[] officeFileData):XDOMOfficeDocument
OfficeImporter::splitImport(XDOMOfficeDocument xdomOfficeDocument, int[] headingLevelsToSplit, NamingCriterion namingCriterion, DocumentName baseDocumentName):Map<TargetPageDescriptor, XDOMOfficeDocument>
</code>

I'd like to implement this proposal for the XE 2.2M1 release with a few changes:

1. Change all byte[] parameters to InputStream (as discussed).
2. Add a "String officeFileName" parameter to the first four methods described above. This is because the office importer currently fails on Office 2007 documents: it cannot determine the document type. (For all other document types jodconverter correctly identifies the type without a file extension, but for Office 2007 documents it seems to fail.)

Please let me know if you have any objections. I'm going ahead with the implementation.

Thanks.

- Asiri
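One way the extra officeFileName parameter could help is by supplying the file extension as a format hint. The sketch below shows only that extraction step; OfficeFileNames is a hypothetical helper, and how the hint would actually be passed to jodconverter is not covered here.

```java
import java.util.Locale;
import java.util.Optional;

// Hypothetical helper, not an officeimporter class.
final class OfficeFileNames {
    /**
     * Returns the lower-cased extension of the file name (without the dot),
     * or empty if the name has no extension or ends with a dot.
     */
    static Optional<String> extensionOf(String officeFileName) {
        int dot = officeFileName.lastIndexOf('.');
        if (dot < 0 || dot == officeFileName.length() - 1) {
            return Optional.empty();
        }
        return Optional.of(officeFileName.substring(dot + 1).toLowerCase(Locale.ROOT));
    }
}
```

For a name such as "Report.DOCX" this yields "docx", which a caller could then map to a document format when content-based detection fails.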

Vincent Massol

10:14 a.m.

Hi Asiri, On Oct 26, 2009, at 6:28 AM, Asiri Rathnayake wrote:

...

I don't like this API too much because it mixes several different things. All the "To" methods seem to me to belong to the conversion domain; they are not related to having a connected OpenOffice server running, nor to having documents. For me they should be in a Converter interface. This would allow using them in various contexts.

So I'd see 2 interfaces at the top level:

- OfficeConverter: no relation with a running OO server or with the XWiki Model
- OfficeImporter: connects to the running OO, gets the data, uses the OfficeConverter to perform the conversion, and knows about the XWiki Model to save the result in wiki pages.

+ the notion of Transformation (or Split) to split an XDOMOfficeDocument into several.

In OfficeImporter I'd see only 1 method:

import(Source (whatever object you use to represent the filename to import), Target (whatever object you use to represent the target location))

And in Target I'd add the possibility to pass a Transformation, or maybe simply have a SplittingTarget that extends Target and adds splitting.

WDYT?

Thanks
-Vincent
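As a rough illustration of the split suggested here, the shapes might look like the following. Every member name is an assumption, since the message deliberately leaves Source and Target open ("whatever object you use"); the method is called importDocument only because import is a reserved word in Java, and ImportPlan exists solely to make the sketch executable.

```java
import java.io.InputStream;

/** Hypothetical: where the office file comes from. */
interface Source {
    String fileName();

    InputStream open();
}

/** Hypothetical: where the imported content should be saved. */
interface Target {
    String documentName();
}

/** Hypothetical: a Target that additionally requests splitting. */
interface SplittingTarget extends Target {
    int[] headingLevelsToSplit();
}

/** Conversion only; no knowledge of the XWiki Model. */
interface OfficeConverter {
    String toXHTML(InputStream officeFile);
}

/** Single entry point; delegates conversion and handles saving. */
interface OfficeImporter {
    void importDocument(Source source, Target target);
}

/** Invented for the example: shows how an importer could branch on the target type. */
final class ImportPlan {
    static String describe(Source source, Target target) {
        boolean split = target instanceof SplittingTarget;
        return source.fileName() + " -> " + target.documentName() + (split ? " (split)" : "");
    }
}
```

The design choice being illustrated is that splitting becomes a property of the target rather than a separate method on the importer.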


Vincent Massol

10:55 a.m.

On Dec 18, 2009, at 10:14 AM, Vincent Massol wrote:

...

Hi Asiri, On Oct 26, 2009, at 6:28 AM, Asiri Rathnayake wrote:

Some corrections after discussing with Asiri:

- it seems what Asiri is proposing is a Velocity API, i.e. for usage of the importer from Velocity, not from Java
- the "To" methods do need the OO server running.

Thanks
-Vincent


Asiri Rathnayake

11:29 a.m.

Hi,

> Some corrections after discussing with Asiri:
> - it seems what Asiri is proposing is a Velocity API, ie for usage of the importer from velocity, not from Java

I was actually proposing a Java API, but now I understand that this is not the right approach. I already have different components responsible for carrying out the tasks these proposed methods are supposed to do, so I will simply add the proposed methods to the office importer Velocity API instead (it won't take more than a couple of hours). I will, however, have to change the DocumentName parameters to string parameters in the Velocity API and convert those strings to DocumentName instances when invoking the corresponding components.

Also, I will create a small document on code.xwiki.org explaining the internals of the officeimporter module so that we can discuss whether we want to improve the module further.

Thanks.

- Asiri
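The string-to-DocumentName conversion mentioned above might be sketched as follows. Both classes are stand-ins written for this example: the "wiki:Space.Page" format and the default-wiki / default-space fallbacks are assumptions, and the real module would presumably rely on XWiki's own facilities for parsing document names rather than a hand-written parser.

```java
// Hypothetical stand-in for XWiki's DocumentName.
final class DocumentName {
    final String wiki;
    final String space;
    final String page;

    DocumentName(String wiki, String space, String page) {
        this.wiki = wiki;
        this.space = space;
        this.page = page;
    }
}

// Hypothetical parser used only to illustrate the conversion step.
final class DocumentNames {
    /**
     * Parses names of the assumed form "wiki:Space.Page". A missing wiki or
     * space part falls back to the supplied defaults.
     */
    static DocumentName parse(String name, String defaultWiki, String defaultSpace) {
        String wiki = defaultWiki;
        String rest = name;
        int colon = name.indexOf(':');
        if (colon >= 0) {
            wiki = name.substring(0, colon);
            rest = name.substring(colon + 1);
        }
        int dot = rest.indexOf('.');
        String space = dot >= 0 ? rest.substring(0, dot) : defaultSpace;
        String page = dot >= 0 ? rest.substring(dot + 1) : rest;
        return new DocumentName(wiki, space, page);
    }
}
```

A Velocity-facing method would accept the plain string, run a conversion like this, and pass the typed result on to the underlying components.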

participants (5)

Asiri Rathnayake
Marius Dumitru Florea
Sergiu Dumitriu
Thomas Mortagne
Vincent Massol