[xwiki-devs] Office Importer Wysiwyg Integration - xwiki-devs@xwiki.org

20 Dec 2008

Devs,
I'm working on integrating the office importer functionality into our new
wysiwyg editor, and I thought I'd share a few ideas and the different
approaches available to us so that you can comment on them and help me select
the most appropriate path.
First of all, a crude version of the integration is available at
http://91.121.237.216/xwiki/bin/view/Main/. You can log in with asiri/asiri
and try editing the Main.WebHome page. You will notice the import button
placed right next to the first 4 text formatting buttons.
Now the full version of the importer dialog will include two tabs. The first
(visible) tab will allow a user to paste office content right into a local
rich-text-area and import it in place of the current selection in the
wysiwyg editor. The second tab allows the user to upload an office document,
which will also be imported in place of the current selection. My idea is to
include the first tab's functionality in the 1.8M1 release.
The general approach towards implementation is to introduce a new wysiwyg
plugin called "importer". The crude version mentioned above makes use of the
default WysiwygService::cleanHTML() gwt rpc call for cleaning the pasted
html content. But for the actual implementation we have to introduce a new
importer-specific gwt rpc call, because the default cleanHTML() is not going
to be enough when cleaning html content coming from various office suites.
So here I propose we introduce a WysiwygService::cleanOfficeHTML() gwt rpc
call.
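To make the proposal a bit more concrete, here is a minimal sketch of how the
two calls could sit side by side. Only the names cleanHTML() and
cleanOfficeHTML() come from the proposal above; the interface name, the
String-in/String-out signatures and the toy implementation are assumptions
made for illustration. In the real editor the interface would presumably
extend GWT's com.google.gwt.user.client.rpc.RemoteService; that is left out
here so the sketch compiles without GWT.

```java
/** Hypothetical stand-in for WysiwygService; signatures are assumed. */
interface WysiwygServiceSketch {
    /** Existing generic cleaning call (signature assumed). */
    String cleanHTML(String dirtyHTML);

    /** Proposed importer-specific call for html pasted from office suites. */
    String cleanOfficeHTML(String officeHTML);
}

/** Toy implementation: an office-specific pass first, then the generic cleaner. */
class ToyWysiwygService implements WysiwygServiceSketch {
    @Override
    public String cleanHTML(String dirtyHTML) {
        // Placeholder for the real cleaner: it only trims whitespace.
        return dirtyHTML == null ? "" : dirtyHTML.trim();
    }

    @Override
    public String cleanOfficeHTML(String officeHTML) {
        if (officeHTML == null) {
            return "";
        }
        // Drop MS Office conditional comments such as
        // <!--[if gte mso 9]> ... <![endif]--> before the generic pass.
        String withoutConditionals = officeHTML.replaceAll(
            "(?s)<!--\\[if[^\\]]*\\]>.*?<!\\[endif\\]-->", "");
        return cleanHTML(withoutConditionals);
    }
}
```

The point of the sketch is only the layering: the office-specific call can
reuse the generic cleaner after its own extra pass.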
The next question is about the implementation of the cleanOfficeHTML()
method. Here the complication arises from the fact that the incoming html
can come from various sources: MSWord, MSExcel, OpenOffice Writer, or any
editor capable of exporting html content to the user's clipboard (so that
the user can paste it). I see several approaches to solve this issue:
1. Perform an exhaustive cleaning, leaving only the basic html elements that
we can handle. Here we'll have to consider the various formatting elements
used by different office suites and convert them as necessary. For example,
http://office.microsoft.com/en-us/help/HA010549981033.aspx contains
information about various MSOffice 2000 specific markup.
2. By analysing the incoming html content, we first determine the office
suite which generated the content (with a default case for unknowns), and
thereafter perform a cleaning procedure specific to that office suite.
Note that there is no generic way of determining the office suite and there
can be many such suites.
3. We pass the incoming html through jodconverter (openoffice server) and
get the resulting html. Since this html comes from openoffice, there is no
need to worry about different office suites and we can perform our cleaning
afterwards. The downsides of this approach are:
 * Passing through the openoffice server can be time consuming.
 * Jodconverter seems to have some limitations with html formats, as
mentioned at
http://www.artofsolving.com/opensource/jodconverter/guide/supportedformats
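To illustrate what the detection step in approach 2 might involve, here is a
rough sketch. It is purely heuristic: the marker strings it looks for (the
Microsoft Office xml namespaces, the MsoNormal class, mso- style properties
and an OpenOffice.org generator string) are assumptions about what survives a
paste into the rich-text-area, and the enum and class names are hypothetical.
Anything unrecognised falls into the default case.

```java
import java.util.Locale;

/** Hypothetical set of sources; UNKNOWN is the default case. */
enum OfficeSource { MS_WORD, MS_EXCEL, OPENOFFICE, UNKNOWN }

final class OfficeSourceGuesser {
    private OfficeSourceGuesser() { }

    /** Guesses the generating suite from marker strings; heuristic only. */
    static OfficeSource guess(String html) {
        if (html == null) {
            return OfficeSource.UNKNOWN;
        }
        String h = html.toLowerCase(Locale.ROOT);
        if (h.contains("openoffice.org")) {
            return OfficeSource.OPENOFFICE;
        }
        // Excel is checked before Word because Excel html can also carry
        // mso- style properties.
        if (h.contains("urn:schemas-microsoft-com:office:excel")
                || h.contains("microsoft excel")) {
            return OfficeSource.MS_EXCEL;
        }
        if (h.contains("urn:schemas-microsoft-com:office:word")
                || h.contains("class=msonormal")
                || h.contains("class=\"msonormal\"")
                || h.contains("mso-")) {
            return OfficeSource.MS_WORD;
        }
        return OfficeSource.UNKNOWN;
    }
}
```

Every new suite (or new version of an existing one) would need its own
markers added here, which is the maintenance cost of this approach.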
Of the three approaches mentioned above, I prefer the first since it's the
simplest. Also, the other two approaches don't seem to provide any advantages
over the first.
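As a rough idea of what the first approach could look like, here is a
minimal allow-list sketch. The class name and the list of allowed elements
are assumptions chosen for illustration, and it uses regular expressions only
to stay self-contained; a real implementation would more likely work on a
parsed document, since this version breaks on attribute values containing
">" and does not remove the text inside script or style elements.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WhitelistCleanerSketch {
    // Elements we keep; everything else is unwrapped (tag dropped, text kept).
    private static final Set<String> ALLOWED = new HashSet<>(Arrays.asList(
        "p", "br", "b", "i", "u", "strong", "em", "ul", "ol", "li",
        "table", "tr", "td", "th", "h1", "h2", "h3", "h4", "h5", "h6"));

    // Comments, including <!--[if ...]> ... <![endif]--> blocks.
    private static final Pattern COMMENT = Pattern.compile("(?s)<!--.*?-->");
    // Leftover declarations such as <![if !supportLists]> or <!DOCTYPE ...>.
    private static final Pattern DECLARATION = Pattern.compile("<![^>]*>");
    // Any opening or closing tag, including namespaced ones like <o:p>.
    private static final Pattern TAG =
        Pattern.compile("(?s)<(/?)([A-Za-z][A-Za-z0-9:]*)[^>]*>");

    private WhitelistCleanerSketch() { }

    static String clean(String html) {
        String text = COMMENT.matcher(html).replaceAll("");
        text = DECLARATION.matcher(text).replaceAll("");
        Matcher m = TAG.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String name = m.group(2).toLowerCase(Locale.ROOT);
            // Allowed tags are re-emitted without any attributes.
            String replacement =
                ALLOWED.contains(name) ? "<" + m.group(1) + name + ">" : "";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

For example, a Word-style paragraph with MsoNormal classes, mso- styles,
span wrappers and o:p elements comes out as a plain paragraph with only its
bold markup left.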
And the final question is about elements embedded in pasted html content
(like images). This is not a problem if the whole document is uploaded for
importing; the document is a container for its images and they can be
retrieved on the server. But with pasted html content, images are simply
links to local files. I don't know if it's possible to trigger an automated
upload of these files, but for the moment I'm proposing we strip such
content.
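A minimal sketch of the stripping step, assuming that "local" means a src
starting with file: or a Windows drive path; the class name and that
definition are assumptions for illustration. Images pointing elsewhere (for
example http urls) are left untouched.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LocalImageStripperSketch {
    // Matches an <img ...> tag and captures its src value, whether it is
    // double-quoted (group 1), single-quoted (group 2) or unquoted (group 3).
    private static final Pattern IMG = Pattern.compile(
        "(?is)<img\\b[^>]*?\\bsrc\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s>]+))[^>]*>");

    private LocalImageStripperSketch() { }

    /** Assumed definition of "local": file: urls and drive paths like C:\x. */
    static boolean isLocal(String src) {
        String s = src.trim().toLowerCase(Locale.ROOT);
        return s.startsWith("file:") || s.matches("^[a-z]:[\\\\/].*");
    }

    /** Removes img tags whose src points at a local file. */
    static String strip(String html) {
        Matcher m = IMG.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String src = m.group(1) != null ? m.group(1)
                       : m.group(2) != null ? m.group(2)
                       : m.group(3);
            String replacement = isLocal(src) ? "" : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

If an automated upload turns out to be possible later, the same match could
be used to collect the local paths instead of dropping the tags.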
Thanks.
- Asiri