[Proposal] Document history storage
Hi,
Some time ago, there was a discussion about how the document history could be stored in a better way.
Right now, the complete history is stored as one field in the xwikidoc table. From my PoV, this has some major disadvantages:
- loading an older version requires parsing all the history -> memory inefficiency
- as documents grow older, loading a document takes more and more time -> time inefficiency
- queries on archives cannot return just one version; they match the whole document (somewhere in the history, there was a version containing "search term")
The blocking issue with storing old versions in a different table was, at that time, the fact that a document archive should contain all the information needed for completely restoring the document: content, metadata, objects.
I don't think that is actually an issue. We are archiving document versions, but we're joining all versions in one large string. Why don't we archive the complete version, but one version per row?
So, the archive table should look like:
- document name
- version number
- language (for translations)
- content
- archived metadata (one field, or the same fields as in xwikidoc)
- archived objects (one field)
- attachment names and versions
This is not the same as storing the version the way a normal document is stored, with separate objects and properties, but at least it provides a better storage and retrieval mechanism, and it separates the parts of a wiki document a bit: content, metadata, objects.
WDYT?
-- http://purl.org/net/sergiu
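A minimal sketch of the "one version per row" idea, using an in-memory map as a stand-in for the proposed archive table. The class and field names (`ArchivedVersion`, `ArchiveTable`, `metadataXml`, `objectsXml`) are assumptions made for illustration and are not XWiki's schema; attachment names and versions are left out for brevity.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** One archived version of one document; field names are illustrative only. */
final class ArchivedVersion {
    final String documentName;
    final String version;
    final String language;
    final String content;
    final String metadataXml; // archived metadata kept as one field
    final String objectsXml;  // archived objects kept as one field

    ArchivedVersion(String documentName, String version, String language,
                    String content, String metadataXml, String objectsXml) {
        this.documentName = documentName;
        this.version = version;
        this.language = language;
        this.content = content;
        this.metadataXml = metadataXml;
        this.objectsXml = objectsXml;
    }
}

/** In-memory stand-in for the proposed archive table (hypothetical). */
final class ArchiveTable {
    private final Map<String, ArchivedVersion> rows = new LinkedHashMap<>();

    private static String key(String name, String version, String language) {
        return name + "|" + version + "|" + language;
    }

    void save(ArchivedVersion v) {
        rows.put(key(v.documentName, v.version, v.language), v);
    }

    /** Loads exactly one version without parsing the rest of the history. */
    ArchivedVersion load(String name, String version, String language) {
        return rows.get(key(name, version, language));
    }

    /** Returns only the versions whose content contains the term. */
    List<ArchivedVersion> search(String term) {
        List<ArchivedVersion> hits = new ArrayList<>();
        for (ArchivedVersion v : rows.values()) {
            if (v.content.contains(term)) {
                hits.add(v);
            }
        }
        return hits;
    }
}
```

With rows keyed this way, a search hit identifies the specific version that matched, which is the third disadvantage listed above.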
Hi,
I think it's a good idea to have a versions table. One thing I'm not sure of is whether this table should hold the master information or just be a cache for the information stored in the revision. If it is a cache, it need not hold all the information, just the most important parts.
What I'm worried about is the volume of information when there are many changes. Suppose we get a comment spam of 500 comments. The JRCS revision system will only add the actual spam. If the full archived info is stored in the table, you get 500 times the size of the document.
And how would you export the whole document including archives? Would you use RCS, or would you have the whole history inside an XML field inside the document?
One downside of RCS is that you need to parse the whole RCS document to get a version. But we could solve this by cutting the RCS file into chunks of 50 versions so that we get faster retrieval. It's true that this is a little painful to code.
The cache table with the most important metadata (version, date, author, comment) would give us what we need for getting information about contributors and number of contributions, and for retrieving comments at edit time.
Ludovic
-- Ludovic Dubost Blog: http://www.ludovic.org/blog/ XWiki: http://www.xwiki.com Skype: ldubost GTalk: ldubost AIM: nvludo Yahoo: ludovic
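As a rough illustration of the metadata cache table mentioned above, the following sketch counts contributions per author from lightweight (version, date, author, comment) rows, without reading any diff. `VersionInfo` and `HistoryStats` are hypothetical names, not existing XWiki classes.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Lightweight per-version metadata row; names are illustrative only. */
final class VersionInfo {
    final String version;
    final long dateMillis;
    final String author;
    final String comment;

    VersionInfo(String version, long dateMillis, String author, String comment) {
        this.version = version;
        this.dateMillis = dateMillis;
        this.author = author;
        this.comment = comment;
    }
}

final class HistoryStats {
    /** Counts contributions per author using only the metadata rows. */
    static Map<String, Integer> contributionsByAuthor(List<VersionInfo> history) {
        Map<String, Integer> counts = new TreeMap<>();
        for (VersionInfo info : history) {
            counts.merge(info.author, 1, Integer::sum);
        }
        return counts;
    }
}
```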
To clarify one misunderstanding: the attachments themselves are not stored, just each attachment's name and version number. AFAIK, the attachment history is stored separately.
I know that it is not very efficient to store the complete document, with content and objects, even for a small change like an added comment. But this is how it is done now too, and this change tries to do better, not to be perfect.
Hi.
I implemented part of this proposal in XWIKI-1459, and I want to discuss some problems with it.
1) Separate diffs.
Sergiu proposes to store the document archive in separate fields (content, metadata, objects, attachments) instead of one field. But this is incompatible with the old document history system: if we only have one diff for everything, it is impossible to tell which field has changed. If we implement this, we will either lose the old document history or need to write a complicated migrator from the old history to the new one.
Do we need to keep compatibility with the 1.0 document history in xwiki-1.1?
I think separate diffs will bring more complexity than benefit, and they are not needed, at least in xwiki-platform-1.1.
WDYT?
2) Fetching strategy.
Right now I load all version infos at once and the version contents (diffs) one by one on demand (fetching strategy #2).
I see the following possible fetching strategies for history storage:
1. Load all content at once. This is as bad as the old history storage.
2. Load each content on demand and cache it (RCSNodeInfo contains a soft reference to RCSNodeContent) (code: foreach needed version do getContent(context)). Downside: many SQL requests the first time.
3. Load the list of needed contents per request (hql: from NodeContent where version>=1.2). One SQL request per HTTP request, but always.
4. Cache the list of latest nodes (from some node to the latest node). Make only the needed requests and re-cache. (cache = soft reference to SortedMap<version, RCSNodeContent>; if not found in the cache, fetch by hql (where version>=1.2 and version<=2.3)). I think this is the best fetching strategy with regard to performance.
5. Something else?
Which fetching strategy is best for history storage?
Any comments about XWIKI-1459 are also welcome.
-- Artem Melentyev
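Fetching strategy 4 could be sketched roughly as follows. This is only an illustration under stated assumptions: versions are simplified to integer ordinals instead of RCS version numbers, node contents are plain strings, and `RangeLoader` is a hypothetical stand-in for the HQL range query; none of these names come from XWIKI-1459.

```java
import java.lang.ref.SoftReference;
import java.util.SortedMap;
import java.util.TreeMap;

final class NodeContentCache {
    /** Hypothetical stand-in for "from NodeContent where version>=? and version<=?". */
    interface RangeLoader {
        SortedMap<Integer, String> load(int fromInclusive, int toInclusive);
    }

    private final RangeLoader loader;
    // Soft reference: the garbage collector may drop the cache under memory pressure.
    private SoftReference<SortedMap<Integer, String>> cacheRef = new SoftReference<>(null);
    int loads = 0; // number of range queries issued, kept for demonstration

    NodeContentCache(RangeLoader loader) {
        this.loader = loader;
    }

    String get(int version, int latest) {
        SortedMap<Integer, String> cache = cacheRef.get();
        if (cache == null) {
            cache = new TreeMap<>();
            cacheRef = new SoftReference<>(cache);
        }
        if (!cache.containsKey(version)) {
            // Fetch only the missing range: up to the oldest cached node if
            // there is one above the requested version, else up to the latest node.
            int to = (!cache.isEmpty() && version < cache.firstKey())
                    ? cache.firstKey() - 1
                    : latest;
            cache.putAll(loader.load(version, to));
            loads++;
        }
        return cache.get(version);
    }
}
```

Requests for versions already in the cached range issue no query; an older version extends the cached range downwards with a single range query.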
Artem Melentyev wrote:
Hi.
I implement some of this proposal in XWIKI-1459.
And I want to discuss some problems about it.
1) Separate diffs. Sergiu propose to store document archive in separate fields (content, metadata, objects, attachments) instead of one field. But it is incompatible with old document history system (If we know one diff for all, it is impossible to understand what field has changed) If we will implement this, we will lose old document history or we will be needed to write complicated migrator from old history to new.
Need we save compatibility of 1.0 document history in xwiki-1.1 ?
I think separate diff will bring more complex than profit and no needed at least in xwiki-platform-1.1.
WDYT?
I think we don't need the history to be compatible, but we need a migration path (a script to migrate the previous history). More and more, I'm thinking we should get rid of RCS as the versioning system. In the P2P XWiki Project we have been talking about implementing an "XWiki Patch" notion, because we need to send it over the P2P network for replication. This "XWiki Patch" could be the new minimal set of information we need for a version. I think we also need a table of versions that holds some key metadata directly (not as a diff), so that we can display it quickly in the history page. We could decide to store either the patch (less space) or the full XML version in this table (more space, but very safe and faster).
2) Fetching strategy.
Now I load all version infos at once and version contents (diff) one by one demand (fetching strategy #2).
I see following possible fetching strategies for history storage:
1. Load all content at once This is bad as old history storage
We already have a lazy fetching strategy; the exception is that when we need a specific version, we have to load the full RCS file to retrieve it.
2. Load one content by demand and cache (RCSNodeInfo contains softreference to RCSNodeContent) (code: foreach needed versions do getContent(context) ) - Many sql requests for first time.
3. Load list of the needed content per request (hql: from NodeContent where version>=1.2) One sql request per http request but always.
4. Cache list of latest nodes (from some node to latest node). Make only needed requests and recache. (cache = softref to SortedMap<version, RCSNodeContent>, If not finded in cache - fetch by hql (where version>=1.2 and version<=2.3) ) I think it is the best fetching strategy concerning performance.
5. Something else?
What fetching strategy is best for history storage?
We could decide to store the full document every 10 versions and store only the patch (RCS or the new XWiki Patch) for each intermediate version. This would mean that to retrieve any version you need one full version plus up to 10 nodes.
It would be great to work on the new "XWiki Patch" system since it is needed for the P2P. What we discussed at the meeting was a language like:
ins(content,6,'Hello') = insert the text 'Hello' in field 'content' at char 6
del(content,6,5) = delete 5 chars from field 'content' starting at char 6
set(author,'XWiki.LudovicDubost') = set the author field to XWiki.LudovicDubost
setObjectProperty('XWiki.ArticleClass',0,'propname','propvalue')
insObjectProperty('XWiki.ArticleClass',0,'propname',6,'propvalue')
etc...
Ludovic
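A small sketch of how the ins/del/set operations of such a patch language might be applied, assuming a document is modelled as a map from field name to string and that character positions are 0-based (the thread does not say which). `XWikiPatchSketch` is a hypothetical name; the object-property operations are not covered.

```java
import java.util.Map;

final class XWikiPatchSketch {
    /** ins(field,pos,text): insert text into the field at character position pos. */
    static void ins(Map<String, String> doc, String field, int pos, String text) {
        String old = doc.getOrDefault(field, "");
        doc.put(field, old.substring(0, pos) + text + old.substring(pos));
    }

    /** del(field,pos,count): delete count characters from the field, starting at pos. */
    static void del(Map<String, String> doc, String field, int pos, int count) {
        String old = doc.getOrDefault(field, "");
        doc.put(field, old.substring(0, pos) + old.substring(pos + count));
    }

    /** set(field,value): replace the whole field value. */
    static void set(Map<String, String> doc, String field, String value) {
        doc.put(field, value);
    }
}
```

Because each operation names the field it touches, a patch of this kind also records which part of the document changed, which a single diff over the whole serialized document does not.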
Any comments about XWIKI-1459 also welcome.
-- Ludovic Dubost Blog: http://www.ludovic.org/blog/ XWiki: http://www.xwiki.com Skype: ldubost GTalk: ldubost AIM: nvludo Yahoo: ludovic
On Jul 24, 2007, at 11:30 PM, Ludovic Dubost wrote:
Artem Melentyev wrote:
Hi.
I implement some of this proposal in XWIKI-1459.
And I want to discuss some problems about it.
1) Separate diffs. Sergiu propose to store document archive in separate fields (content, metadata, objects, attachments) instead of one field. But it is incompatible with old document history system (If we know one diff for all, it is impossible to understand what field has changed) If we will implement this, we will lose old document history or we will be needed to write complicated migrator from old history to new.
Need we save compatibility of 1.0 document history in xwiki-1.1 ?
I think separate diff will bring more complex than profit and no needed at least in xwiki-platform-1.1.
WDYT?
I think we don't need history to be compatible but we need a migration path (a script to migrate the previous history). I'm more and more thinking we should get rid of RCS as the versioning system. In the P2P XWiki Project we have been talking about implementing a "XWiki Patch" notion because we need it to send it over the P2P network for replication. This "XWiki Patch" could be the new minimal set of information we need for a version.
I also sent an email about having a diff object inside XWikiDocument to know what changes were made to a document. I'm definitely +1 for doing this. It's quite easy to do at the level of XWikiDocument because you can then set the state of this diff object at the time when the different methods of XWikiDocument are called. The only "hard" part would be to break down a content change into several patches, but this can be done as we're now storing the original document in the XWikiDocument class.
[snip]
-Vincent
Hi.
Ludovic Dubost wrote:
I think we don't need history to be compatible but we need a migration path (a script to migrate the previous history).
My current implementation can be migrated via the package plugin.
I'm more and more thinking we should get rid of RCS as the versioning system.
Me too. JRCS is not extensible, and there are no real alternatives. In my code I tried, wherever possible, to get rid of the dependency on JRCS, so it is easy to replace JRCS with something else. I mainly used jrcs.diff. jrcs.rcs is used only by the package plugin ([de]serialization of the whole archive to/from a string) for compatibility.
In the P2P XWiki Project we have been talking about implementing a "XWiki Patch" notion because we need it to send it over the P2P network for replication. This "XWiki Patch" could be the new minimal set of information we need for a version.
Now I think we also need a table of versions to hold some key meta data directly available (not as diff) so that we can display it in the history page quickly.
I store version, date, comment and author in the history table (xwikircs, XWikiRCSNodeInfo), so the history page (?viewer=history) loads without loading any diffs (node contents).
We could decide to store either the patch (less space) or the full XML version in this table (more space but very safe and faster).
2) Fetching strategy.
Now I load all version infos at once and version contents (diff) one by one demand (fetching strategy #2).
I see following possible fetching strategies for history storage:
1. Load all content at once This is bad as old history storage
Currently we have a lazy fetching strategy already except when we need a specific version we need to load the full RCS file to be able to retrieve it.
Yes. The caching in the other strategies is lazy^2 :) and they load only the necessary content.
2. Load one content by demand and cache (RCSNodeInfo contains softreference to RCSNodeContent) (code: foreach needed versions do getContent(context) ) - Many sql requests for first time.
3. Load list of the needed content per request (hql: from NodeContent where version>=1.2) One sql request per http request but always.
4. Cache list of latest nodes (from some node to latest node). Make only needed requests and recache. (cache = softref to SortedMap<version, RCSNodeContent>, If not finded in cache - fetch by hql (where version>=1.2 and version<=2.3) ) I think it is the best fetching strategy concerning performance.
5. Something else?
What fetching strategy is best for history storage?
We could decide to store the full document every 10 versions and store only the patch (RCS or new XWiki Patch) for each intermediary version.. This would mean that to retrieve any version you need one full version + 10 nodes..
I will try to implement this now. Implementation thoughts:
onsave: if (count % 50 == 0), save the full version
onload: load the nearest full version (by hql), or the latest node if none is found.
It would be great to work on the new "XWiki Patch" system since it is needed for the P2P. What we discussed at the meeting was a language like:
ins(content,6,'Hello') = insert in field 'content' at char 6 the text 'Hello'
del(content,6,5) = delete 5 char from field content starting at char 6
set(author,'XWiki.LudovicDubost') = set author field to XWiki.LudovicDubost
setObjectProperty('XWiki.ArticleClass',0,'propname','propvalue')
insObjectProperty('XWiki.ArticleClass',0,'propname',6,'propvalue')
Great. I will try to find some time to implement this, but not now. -- Artem Melentyev
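The "full version every N nodes" idea could look roughly like the sketch below. It rests on several assumptions that are not from the thread: patches are forward deltas against the previous version (RCS itself keeps the newest revision in full and stores reverse deltas), a patch is a crude "keep this many leading characters, then append this tail" pair rather than a real diff, and `KeyframeHistory` is a hypothetical class.

```java
import java.util.ArrayList;
import java.util.List;

final class KeyframeHistory {
    /** Either a full copy (full == true) or a crude forward patch against the previous version. */
    private static final class Node {
        final boolean full;
        final int keep;    // number of leading characters kept from the previous version
        final String text; // full content, or the replacement tail for a patch

        Node(boolean full, int keep, String text) {
            this.full = full;
            this.keep = keep;
            this.text = text;
        }
    }

    private final int nodesPerFull;
    private final List<Node> nodes = new ArrayList<>();
    private String latest = "";

    KeyframeHistory(int nodesPerFull) {
        this.nodesPerFull = nodesPerFull;
    }

    /** onsave: store a full version every nodesPerFull nodes, a patch otherwise. */
    void save(String content) {
        if (nodes.size() % nodesPerFull == 0) {
            nodes.add(new Node(true, 0, content));
        } else {
            int keep = 0;
            int max = Math.min(latest.length(), content.length());
            while (keep < max && latest.charAt(keep) == content.charAt(keep)) {
                keep++;
            }
            nodes.add(new Node(false, keep, content.substring(keep)));
        }
        latest = content;
    }

    /** onload: start from the nearest earlier full version and replay the patches after it. */
    String load(int index) {
        int start = index;
        while (!nodes.get(start).full) {
            start--; // node 0 is always a full version, so this terminates
        }
        String text = nodes.get(start).text;
        for (int i = start + 1; i <= index; i++) {
            Node n = nodes.get(i);
            text = text.substring(0, n.keep) + n.text;
        }
        return text;
    }

    int fullCount() {
        int count = 0;
        for (Node n : nodes) {
            if (n.full) {
                count++;
            }
        }
        return count;
    }
}
```

In this sketch, loading any version replays fewer than `nodesPerFull` patches, so the interval trades storage size against retrieval cost.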
Hi. Ludovic Dubost wrote:
I think we don't need history to be compatible but we need a migration path (a script to migrate the previous history). I'm more and more thinking we should get rid of RCS as the versioning system. In the P2P XWiki Project we have been talking about implementing a "XWiki Patch" notion because we need it to send it over the P2P network for replication. This "XWiki Patch" could be the new minimal set of information we need for a version.
Now I think we also need a table of versions to hold some key meta data directly available (not as diff) so that we can display it in the history page quickly. We could decide to store either the patch (less space) or the full XML version in this table (more space but very safe and faster). .... We could decide to store the full document every 10 versions and store only the patch (RCS or new XWiki Patch) for each intermediary version.. This would mean that to retrieve any version you need one full version + 10 nodes..
What if we make it configurable (via an xwiki.cfg parameter, "xwiki.store.rcs.fullpernodes" for example) how many nodes there are per full version? If we choose 1 node, xwiki will store the full document for each version. If we choose 0 nodes, xwiki will store only diffs. This parameter would be 50 by default. I think it is the best way to settle full vs diff, and it is not too hard to implement. I'm implementing it now. WDYT?
-- Artem Melentyev
On Jul 25, 2007, at 4:25 PM, Artem Melentyev wrote:
Hi.
Ludovic Dubost wrote:
I think we don't need history to be compatible but we need a migration path (a script to migrate the previous history). I'm more and more thinking we should get rid of RCS as the versioning system. In the P2P XWiki Project we have been talking about implementing a "XWiki Patch" notion because we need it to send it over the P2P network for replication. This "XWiki Patch" could be the new minimal set of information we need for a version. Now I think we also need a table of versions to hold some key meta data directly available (not as diff) so that we can display it in the history page quickly. We could decide to store either the patch (less space) or the full XML version in this table (more space but very safe and faster). .... We could decide to store the full document every 10 versions and store only the patch (RCS or new XWiki Patch) for each intermediary version.. This would mean that to retrieve any version you need one full version + 10 nodes..
What if we allow to configure (via xwiki.cfg. parameter "xwiki.store.rcs.fullpernodes" for example) per how many nodes to store full version? If we choose per 1 node, xwiki will store full document for each version. If we choose per 0 node, xwiki will store only diffs. This parameter will be 50 by default.
That is looking complex to me. I'd rather we implement a solution that works without having to configure anything. That said I haven't followed the discussion... Will catch up later. -Vincent
I think it is best solution of choose full vs diff and it is not to hard to implement. I'm implementing it now.
WDYT?
Hi. Vincent Massol wrote:
...
What if we allow to configure (via xwiki.cfg. parameter "xwiki.store.rcs.fullpernodes" for example) per how many nodes to store full version? If we choose per 1 node, xwiki will store full document for each version. If we choose per 0 node, xwiki will store only diffs. This parameter will be 50 by default.
That is looking complex to me. I'd rather we implement a solution that works without having to configure anything. That said I haven't followed the discussion... Will catch up later.
If we decide to store a full version every 10 nodes, I think it is reasonable to allow configuring that number.
I think it is best solution of choose full vs diff and it is not to hard to implement. I'm implementing it now.
Implemented in XWIKI-1459-r4.patch:
"Added possibility to store the full XML version instead of a diff. By default a full version is stored per 4 diffs (a small value for debugging, for now). The xwiki.cfg option xwiki.store.rcs.nodesPerFull configures this value. It is safe to change this parameter at any time. Applied coding style to new files and some modified"
-- Artem Melentyev
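A hedged sketch of how such a setting might be read and interpreted. It uses `java.util.Properties` as a stand-in for xwiki.cfg and follows the semantics proposed earlier in the thread (1 = full document for every version, 0 = diffs only, 50 as the proposed default). The `ArchiveConfig` class and the fallback behaviour for invalid values are assumptions, not the code from the patch.

```java
import java.util.Properties;

final class ArchiveConfig {
    static final String KEY = "xwiki.store.rcs.nodesPerFull";
    static final int DEFAULT_NODES_PER_FULL = 50; // default proposed in the thread

    /** Reads the option, falling back to the default when it is missing or invalid (assumed behaviour). */
    static int nodesPerFull(Properties cfg) {
        String raw = cfg.getProperty(KEY);
        if (raw == null) {
            return DEFAULT_NODES_PER_FULL;
        }
        try {
            int value = Integer.parseInt(raw.trim());
            return value < 0 ? DEFAULT_NODES_PER_FULL : value;
        } catch (NumberFormatException e) {
            return DEFAULT_NODES_PER_FULL;
        }
    }

    /** Decides whether the next node should be a full version. */
    static boolean storeFull(int existingNodes, int nodesPerFull) {
        if (nodesPerFull == 0) {
            return false; // 0 means "only diffs"
        }
        return existingNodes % nodesPerFull == 0; // 1 means "full version every time"
    }
}
```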
Hi.
I implemented part of this proposal in XWIKI-1459, and I want to discuss some problems with it.
1) Separate diffs.
Sergiu proposes to store the document archive in separate fields (content, metadata, objects, attachments) instead of one field. But this is incompatible with the old document history system: if we only have one diff for everything, it is impossible to tell which field has changed. If we implement this, we will either lose the old document history or need to write a complicated migrator from the old history to the new one.
Do we need to keep compatibility with the 1.0 document history in xwiki-1.1?
I think separate diffs will bring more complexity than benefit, and they are not needed, at least in xwiki-platform-1.1.
WDYT?
2) Fetching strategy.
Right now I load all version infos at once and the version contents (diffs) one by one on demand (fetching strategy #2).
I see the following possible fetching strategies for history storage:
1. Load all content at once. This is as bad as the old history storage.
2. Load each content on demand and cache it (RCSNodeInfo contains a soft reference to RCSNodeContent) (code: foreach needed version do getContent(context)). Downside: many SQL requests the first time.
3. Load the list of needed contents per request (hql: from NodeContent where version>=1.2). One SQL request per HTTP request, but always.
4. Cache the list of latest nodes (from some node to the latest node). Make only the needed requests and re-cache. (XWikiDocumentArchive contains cache = soft reference to SortedMap<Version, RCSNodeContent>; if a content is not found in the cache, fetch it and the others by hql (where version>=1.2 and version<=2.3)). I think this is the best fetching strategy with regard to performance.
5. Something else?
Which fetching strategy is best for history storage?
Any comments about XWIKI-1459 are also welcome.
-- Artem Melentyev
participants (4)
- Artem Melentyev
- Ludovic Dubost
- Sergiu Dumitriu
- Vincent Massol