+1
Caleb
On 01/07/2012 04:39 PM, Denis Gervalle wrote:
Now that the database migration mechanism has been
improved, I would like
to go ahead with my patch to improve document ids.
Currently, ids are simple string hashcode of a locally serialized document
reference, including the language for translated documents. The likelihood
of having duplicates with the string hashing algorithm of java is really
high.
What I propose is:
1) use an MD5 hashing which is particularly good at distributing.
2) truncate the hash to the first 64bits, since the XWD_ID column is a
64bit long.
3) use a better string representation as the source of hashing
Based on previous discussion, point 1) and 2) has already been agreed, and
this vote is in particular about the string used for 3).
I propose it in 2 steps:
1) before locale are fully supported in document reference, use this
format:
<lengthOfLastSpaceName>:<lastSpaceName><lengthOfDocumentName>:<documentName><lengthOfLanguage>:<language>
where language would be an empty string for the default document, so it
would look like 7:mySpace5:myDoc0: and its french translation could be
7:mySpace5:myDoc2:fr
2) when locale are included in reference, we will replace the
implementation by a reference serializer that would produce the same kind
of representation, but that will include all spaces (not only the last
one), to be prepared for the future.
While doing so, I also propose to fix the cache key issue by using the same
reference, but prefixed by <lengthOfWikiName>:<wikiName>, so the previous
examples will have the following key in the document cache:
5:xwiki7:mySpace5:myDoc0: and 5:xwiki7:mySpace5:myDoc2:fr
Using such a key (compared to the usual serialization) has the following
advantages:
- ensure uniqueness of the reference without requiring a complex escaping
algorithm, which is unneeded here.
- potentially reversible
- faster than the usual serialization
- support language
- independent of the current serialization that may evolved independently,
so it will be stable over time which is really important when it is used as
a base for the hashing algorithm used for document ids stored in the
database.
I would like to introduce this as early as possible, which means has soon
has we are confident with the migration mechanism recently introduced.
Since the migration of ids will convert 32bits hashes into 64bits ones, the
risk of collision is really low, and to be careful, I have written a
migration algorithm that would support such collision (unless it cause a
circular reference collision, but this is really unexpected). However,
changing ids again later, if we change our mind, will be really more risky
and the migration difficult to implements, so it is really important that
we agree on the way we compute these ids, once for all.
Here is my +1,