Hi Asiri,
Asiri Rathnayake wrote:
Devs,
I'm working on integrating office importer functionality into our new
wysiwyg editor and I thought of sharing a few ideas and the different
approaches available to us so that you can comment on them and help me
select the most appropriate path.
First of all, a crude version of the integration is available at
http://91.121.237.216/xwiki/bin/view/Main/. You can log in with asiri/asiri
and try editing the Main.WebHome page. You will notice the import button
placed right next to the first four text formatting buttons.
Now the full version of the importer dialog will include two tabs. The first
visible tab will allow a user to paste office content right into a local
rich-text-area and import it in place of the current selection in the
wysiwyg editor. The second tab allows the user to upload an office document
which will also be imported in place of the current selection. My idea is to
include the first tab's functionality in the 1.8M1 release.
Allowing the user to edit the pasted content using a rich text area
before cleaning is not a good idea because the HTML generated by these
Office suites can easily mess up the rich text area. The best option I
think would be to have a panel (div) that can catch paste events and set
the pasted content as its inner HTML.
Btw, by moving the Import dialog box you lose the pasted content
because the in-line frame used by the rich text area is detached during
dragging and a new document is created each time the in-line frame is
(re)attached. This is a dialog box bug anyway.
The general approach towards implementation is to introduce a new wysiwyg
plugin called "importer". The crude version mentioned above makes use of the
default WysiwygService::cleanHTML() GWT RPC call for cleaning the pasted
html content. But for the actual implementation we have to introduce a new
importer-specific GWT RPC call because the default cleanHTML() is not going
to be enough when cleaning html content coming from various office suites.
So here I propose we introduce a WysiwygService::cleanOfficeHTML() GWT RPC
call.
+1
The next question is about the implementation of the cleanOfficeHTML()
method. Here the complication arises from the fact that the incoming html
can be from various sources; it can be MSWord, MSExcel, OpenOffice Writer
or any editor capable of exporting html content into the user's clipboard
(so that the user can paste it). I see several approaches to solve this
issue:
1. Perform an exhaustive cleaning leaving only the basic html elements that
we can handle. Here we'll have to consider various formatting elements used
by different office suites and convert them as necessary. For example,
http://office.microsoft.com/en-us/help/HA010549981033.aspx contains
information about various MSOffice 2000 specific markup.
2. By analysing the incoming html content, first we determine the office
suite which generated the content (with a default case for unknowns), and
thereafter perform a specific cleaning procedure for that office suite.
Note that there is no generic way of determining the office suite and there
can be many such suites.
3. We pass the incoming html through jodconverter (openoffice server) and
get the resulting html. Since this html is from openoffice, there is no need
to worry about different office suites and we can perform our cleaning
afterwards. The downsides of this approach are:
* Passing through the openoffice server can be time consuming.
* Jodconverter seems to have some limitations with html formats as
mentioned at
http://www.artofsolving.com/opensource/jodconverter/guide/supportedformats
4. Ask the user. He may know better which Office suite he's using. Of
course, you also need to provide an option like "Don't know" or "Others",
in which case you can apply one of the previous 3 solutions. Ideally, if
possible, you would use solution 2 to detect the Office suite while still
allowing the user to change it.
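A rough sketch of how the detection in solution 2 might start, assuming the pasted HTML still carries either a generator meta tag or some Microsoft-specific markers. The class name, the enum and the marker list are all made up for illustration, and the heuristics are certainly incomplete:

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper; nothing like this exists in WysiwygService today. */
final class OfficeSuiteDetector {

    enum Suite { MS_OFFICE, OPEN_OFFICE, UNKNOWN }

    // Matches e.g. <meta name=Generator content="Microsoft Word 11">
    private static final Pattern GENERATOR = Pattern.compile(
        "<meta\\s+name\\s*=\\s*[\"']?generator[\"']?\\s+content\\s*=\\s*[\"']([^\"']*)[\"']",
        Pattern.CASE_INSENSITIVE);

    // Assumed markers that typically show up in MS Office HTML fragments.
    private static final String[] MS_MARKERS = {
        "urn:schemas-microsoft-com:office", "class=mso", "class=\"mso", "mso-"
    };

    static Suite detect(String html) {
        if (html == null) {
            return Suite.UNKNOWN;
        }
        // First choice: an explicit generator meta tag.
        Matcher m = GENERATOR.matcher(html);
        if (m.find()) {
            String generator = m.group(1).toLowerCase(Locale.ROOT);
            if (generator.contains("microsoft")) {
                return Suite.MS_OFFICE;
            }
            if (generator.contains("openoffice")) {
                return Suite.OPEN_OFFICE;
            }
        }
        // Fallback: look for MS-specific namespaces, classes and styles.
        String lower = html.toLowerCase(Locale.ROOT);
        for (String marker : MS_MARKERS) {
            if (lower.contains(marker)) {
                return Suite.MS_OFFICE;
            }
        }
        // The default case for unknowns.
        return Suite.UNKNOWN;
    }
}
```

The detected value would only pre-select an entry in the dialog, so a wrong guess costs the user one click.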
Of the three approaches mentioned above, I prefer the first approach
since it's the simplest one. Also, the other two approaches don't seem
to provide any advantages over the first approach.
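For what it's worth, the first approach could start out roughly like the sketch below. It is only an illustration under my own assumptions: the class name and the list of allowed elements are invented, it works with regular expressions instead of a parsed DOM (which the real cleaner would probably want), and it is not a security filter.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical whitelist cleaner; not the actual cleanOfficeHTML(). */
final class OfficeHtmlCleaner {

    // Assumed set of "basic html elements that we can handle".
    private static final Set<String> ALLOWED = new HashSet<>(Arrays.asList(
        "p", "br", "b", "i", "u", "strong", "em", "ul", "ol", "li",
        "table", "tr", "td", "th", "h1", "h2", "h3", "h4", "h5", "h6", "a"));

    // HTML comments (including MS conditional comments) and <![if ...]> markers.
    private static final Pattern COMMENTS =
        Pattern.compile("<!--.*?-->|<!\\[.*?\\]>", Pattern.DOTALL);
    // Elements whose content is never wanted.
    private static final Pattern BLOCKS = Pattern.compile(
        "<(style|script|xml)\\b[^>]*>.*?</\\1\\s*>",
        Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
    private static final Pattern TAG =
        Pattern.compile("<(/?)([a-zA-Z][a-zA-Z0-9:]*)([^>]*)>");
    private static final Pattern HREF = Pattern.compile(
        "\\bhref\\s*=\\s*(\"[^\"]*\"|'[^']*')", Pattern.CASE_INSENSITIVE);

    static String clean(String html) {
        String text = COMMENTS.matcher(html).replaceAll("");
        text = BLOCKS.matcher(text).replaceAll("");
        Matcher tag = TAG.matcher(text);
        StringBuffer out = new StringBuffer();
        while (tag.find()) {
            tag.appendReplacement(out, Matcher.quoteReplacement(rewrite(tag)));
        }
        tag.appendTail(out);
        return out.toString();
    }

    // Returns the cleaned form of one tag, or "" to drop it (text is kept).
    private static String rewrite(Matcher tag) {
        boolean closing = !tag.group(1).isEmpty();
        String name = tag.group(2).toLowerCase(Locale.ROOT);
        if (!ALLOWED.contains(name)) {
            return ""; // covers span, font and namespaced tags such as o:p
        }
        if (closing) {
            return "</" + name + ">";
        }
        if (name.equals("br")) {
            return "<br/>";
        }
        if (name.equals("a")) {
            Matcher href = HREF.matcher(tag.group(3));
            if (href.find()) {
                return "<a href=" + href.group(1) + ">";
            }
        }
        // All other attributes (class, style, lang, ...) are discarded.
        return "<" + name + ">";
    }
}
```

Suite-specific conversions (for example mapping Word's list paragraphs to real lists) would have to be layered on top of this; the sketch only throws things away.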
And the final question is about embedded elements inside pasted html content
(like images). This is not a problem if the whole document is uploaded for
importing; the document is a container for images and they can be retrieved
on the server. But with pasted html content, images are simply links to
local files. Now I don't know if it's possible to trigger an automated
upload of these files but for the moment I'm proposing we should strip off
such content.
Or use a placeholder for local images with a tooltip explaining that
those images couldn't be uploaded.
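The placeholder idea could look something like this. It is a minimal sketch that assumes local images show up as img tags with a file: URL in src; the class name, the "missing-image" CSS class and the tooltip text are placeholders of mine, not existing names:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper that swaps local images for a visible placeholder. */
final class LocalImagePlaceholder {

    // An <img> whose src points at a local file, e.g. src="file:///C:/...".
    private static final Pattern LOCAL_IMG = Pattern.compile(
        "<img\\b[^>]*\\bsrc\\s*=\\s*[\"']?file:[^>]*>",
        Pattern.CASE_INSENSITIVE);

    // Assumed markup; the title attribute provides the tooltip.
    static final String PLACEHOLDER =
        "<span class=\"missing-image\" "
        + "title=\"This local image could not be uploaded.\">[image]</span>";

    static String replaceLocalImages(String html) {
        // Images with http(s) sources do not match and are left untouched.
        return LOCAL_IMG.matcher(html)
            .replaceAll(Matcher.quoteReplacement(PLACEHOLDER));
    }
}
```

This would have to run before any generic cleaning step that drops img tags or span attributes, otherwise the placeholder itself gets cleaned away.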
Thanks.
- Asiri
Thanks,
Marius