Hi Denis,
Your improvements sound really great!
I saw a patch mentioned several times but didn't manage to find it... Any
hints on where I can find it?
Thanks!
Anamaria
On Thu, Sep 17, 2009 at 7:49 PM, Denis Gervalle <dgl(a)softec.st> wrote:
 On Sep 17, 2009, at 19:10, Asiri Rathnayake wrote:
  Hi,
 1) The current implementation mostly builds a DOM in memory only to
 immediately serialize it into a stream. So I have removed the
 intermediate DOM and provided direct streaming of Element content by:
        1.1) extending org.dom4j.XMLWriter to allow direct streaming
 of Element content into the output stream, as is or Base64 encoded.
 Incidentally, my extension also ensures proper pairing of open/close
 tags.
        1.2) writing a minimal DOMXMLWriter which extends my XMLWriter
 and can be used with the same toXML() code to build a DOMDocument,
 providing the toXMLDocument() methods so the older implementation is
 supported unchanged if ever needed.
        1.3) using the above, only minimal changes to the current XML
 code were required:
                1.3.1) replacing element.add(Element) by either
 writer.writeElement(Element) or writer.writeOpen(Element)
                1.3.2) for large content, using my extensions, either
 writer.write(Element, InputStream) or writer.writeBase64(Element,
 InputStream), which use the InputStream for the element content
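 To make 1.1 concrete, here is a JDK-only sketch of the writeBase64
 idea (not the actual DOM4J-based patch; class and element names are
 made up): the content is Base64-encoded on the fly between a properly
 paired open/close tag, with no intermediate DOM node built for it.

```java
import java.io.*;
import java.util.Base64;

public class StreamingElementWriter {
    private final Writer out;

    public StreamingElementWriter(Writer out) { this.out = out; }

    // Sketch of writer.writeBase64(Element, InputStream): the tags are
    // paired here, and the body is encoded incrementally from the stream.
    public void writeBase64(String name, InputStream content) throws IOException {
        out.write("<" + name + ">");
        // Adapter so the Base64 output (pure ASCII) lands directly on the Writer.
        OutputStream ascii = new OutputStream() {
            @Override public void write(int b) throws IOException { out.write(b); }
        };
        OutputStream b64 = Base64.getEncoder().wrap(ascii);
        byte[] buf = new byte[4096];
        for (int n; (n = content.read(buf)) != -1; ) b64.write(buf, 0, n);
        b64.close(); // flushes the final Base64 quantum; ascii.close() is a no-op
        out.write("</" + name + ">");
    }

    public static void main(String[] args) throws IOException {
        StringWriter sw = new StringWriter();
        new StreamingElementWriter(sw).writeBase64("content",
            new ByteArrayInputStream("hello".getBytes("UTF-8")));
        System.out.println(sw);
    }
}
```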
 
 I'm not aware of the context so I might be wrong, but can't we use
 StAX to achieve this XML streaming?
 (http://www.xml.com/pub/a/2003/09/17/stax.html?page=2)
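 For comparison, the StAX style of streaming looks like this
 (javax.xml.stream ships with the JDK; the element name here is
 invented for illustration): events are written straight to the output
 as they occur, with no in-memory tree at all.

```java
import java.io.StringWriter;
import javax.xml.stream.*;

public class StaxSketch {
    public static void main(String[] args) throws XMLStreamException {
        StringWriter sw = new StringWriter();
        // Each write* call emits directly; nothing is accumulated in a DOM.
        XMLStreamWriter w = XMLOutputFactory.newInstance().createXMLStreamWriter(sw);
        w.writeStartElement("attachment");
        w.writeCharacters("...content streamed as it is produced...");
        w.writeEndElement();
        w.close();
        System.out.println(sw);
    }
}
```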
 In fact, I have not changed the library currently in use (DOM4J),
 especially because I want 1.3).
 Minimizing code changes in the XML processing part avoids the
 introduction of new bugs there.
 Since the code has to be almost completely rewritten sooner or later,
 I prefer this approach for now.
  2) The current implementation for binary data such as attachments
 and the export zip file is mostly based on passing in-memory byte[]
 from function to function, while these data initially come from a
 request.getInputStream() or are written to a
 response.getOutputStream(). So I have changed these to pass the
 stream instead of the data:
       2.1) using IOUtils.copy when required
        2.2) using org.apache.commons.codec.binary.Base64OutputStream
 for base64 encoding when required
        2.3) using an extension of ZipInputStream to cope with
 unexpected close()
        2.4) avoiding buffer duplication in favor of stream filters
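 A JDK-only sketch of points 2.1 and 2.3 (the patch itself uses
 commons-io IOUtils.copy and commons-codec; the class names below are
 made up): a copy loop that never materializes the payload as one
 byte[], plus a ZipInputStream whose close() only closes the current
 entry, so a careless consumer cannot kill the underlying archive.

```java
import java.io.*;
import java.util.zip.*;

public class StreamingZipDemo {

    // 2.3: close() is downgraded to closeEntry(); the archive stream survives.
    static class UnclosableZipInputStream extends ZipInputStream {
        UnclosableZipInputStream(InputStream in) { super(in); }
        @Override public void close() throws IOException { closeEntry(); }
        void reallyClose() throws IOException { super.close(); }
    }

    // 2.1: stream-to-stream copy with a small fixed buffer, no full byte[].
    static void copy(InputStream in, OutputStream out) throws IOException {
        byte[] buf = new byte[4096];
        for (int n; (n = in.read(buf)) != -1; ) out.write(buf, 0, n);
    }

    public static void main(String[] args) throws IOException {
        // Build a tiny two-entry zip in memory just for the demo.
        ByteArrayOutputStream zipped = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(zipped)) {
            zos.putNextEntry(new ZipEntry("a.txt"));
            zos.write("first".getBytes("UTF-8"));
            zos.closeEntry();
            zos.putNextEntry(new ZipEntry("b.txt"));
            zos.write("second".getBytes("UTF-8"));
            zos.closeEntry();
        }

        UnclosableZipInputStream zin =
            new UnclosableZipInputStream(new ByteArrayInputStream(zipped.toByteArray()));
        zin.getNextEntry();
        ByteArrayOutputStream first = new ByteArrayOutputStream();
        copy(zin, first);
        zin.close(); // an unexpected close(): only the entry is closed
        zin.getNextEntry(); // the archive is still readable
        ByteArrayOutputStream second = new ByteArrayOutputStream();
        copy(zin, second);
        zin.reallyClose();
        System.out.println(first + " " + second);
    }
}
```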
 
 Sounds good. Maybe FileUploadPlugin should also pass streams instead
 of data so we do not run out of memory.
 (http://maven.xwiki.org/site/xwiki-core-parent/xwiki-core/apidocs/com/xpn/xw…)
 This is in my patch already ;)
 But, due to 3), this is not enough to prevent out-of-memory errors
 completely: it puts the limit higher for export, but not much higher
 for upload. Of course, this is no longer an issue of the plugin now.
 Storage and caching are where improvements should go later.
  3) Since the most often used large data come from the database
 through attachment content, it would be nice to have these
 attachments streamed from the database when they are too large.
 However, I feel that it is still too early to convert our binary
 into a blob, mainly because HSQLDB and MySQL still do not really
 support blobs, just an emulation. These contents are also cached in
 the document cache, and this will require improvements to support
 blobs. However, I propose to take the occasion to go in the
 direction of blobs by:
       3.1) deprecating setContent(byte[]) and byte[] getContent() in
 favor of newly created setContent(InputStream, int), InputStream
 getContentInputStream() and getSize()
       3.2) beginning to use these new functions as much as possible,
 as 2) implies
       3.3) this also opens the ability to store attachments in
 another repository that better supports the streaming aspect (i.e. a
 filesystem)
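 A sketch of what 3.1 could look like (the method names follow this
 mail, but the class shape is an assumption, not the actual XWiki
 patch): the deprecated byte[] methods become thin adapters over the
 stream-based API, so the backing store can later move to a blob or a
 file without touching callers.

```java
import java.io.*;

public class AttachmentContent {
    private byte[] data = new byte[0]; // in-memory today; a blob/file later

    // New API (3.1): content travels as a stream plus an explicit size.
    public void setContent(InputStream in, int size) throws IOException {
        byte[] buf = new byte[size];
        new DataInputStream(in).readFully(buf);
        this.data = buf;
    }

    public InputStream getContentInputStream() {
        return new ByteArrayInputStream(data);
    }

    public long getSize() { return data.length; }

    /** @deprecated use {@link #setContent(InputStream, int)} */
    @Deprecated public void setContent(byte[] content) { this.data = content.clone(); }

    /** @deprecated use {@link #getContentInputStream()} */
    @Deprecated public byte[] getContent() { return data.clone(); }

    public static void main(String[] args) throws IOException {
        AttachmentContent a = new AttachmentContent();
        byte[] payload = "attachment bytes".getBytes("UTF-8");
        a.setContent(new ByteArrayInputStream(payload), payload.length);
        System.out.println(a.getSize() + " " + new String(a.getContent(), "UTF-8"));
    }
}
```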
 
 +1
 Still, I wonder if streaming blobs from databases is a mature
 technology. (MySQL has an extension: http://blobstreaming.org/)
 
 Well, I was not thinking about direct streaming, but about using
 classical blobs with a locator, which is not well supported by open
 source databases. As far as I know, Oracle began to do it quite well
 starting with the 9.x versions, but there are still some caveats. So
 I prefer to first provide an interface that allows the switch, and
 put the switch itself in another patch when appropriate. This one is
 already large enough, and it moves an issue that is everywhere in
 the core to a single place. (Almost: the Archive is not yet fixed.)
  Keep up the good work :) 
 Thanks ;)
 Denis
 _______________________________________________
 devs mailing list
 devs(a)xwiki.org
 
http://lists.xwiki.org/mailman/listinfo/devs