Hi,
I have one small concern that leads to a bigger one, plus a question and
a side note.
1/ Small concern: what did we do to verify the expected level of
collisions, and is there a chance they happen in our case?
I see we want to truncate the MD5 hash to 64 bits. I was wondering whether
this increases the risk of collisions.
My question here is what we did to verify the level of collisions on
real data.
We could provide some XWiki SAS client DBs, including our Intranet, which
is quite big, for testing if there was a testing program.
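As a rough first-order check (not a substitute for testing on real data),
the birthday bound gives an estimate, assuming the truncated MD5 values
behave like uniformly random numbers. A minimal Java sketch; the class
name, the method name and the one-million-document figure are hypothetical:

```java
public class CollisionEstimate {
    // Birthday-bound approximation of the probability of at least one
    // collision among n uniformly distributed ids of the given bit width:
    // p ~= 1 - exp(-n(n-1) / 2^(bits+1))
    static double collisionProbability(long n, int bits) {
        double pairs = (double) n * (n - 1) / 2.0;
        double space = Math.pow(2.0, bits);
        // -expm1(-x) equals 1 - exp(-x) and stays accurate for tiny x
        return -Math.expm1(-pairs / space);
    }

    public static void main(String[] args) {
        long docs = 1_000_000L; // hypothetical wiki size
        System.out.printf("32 bits: %.6f%n", collisionProbability(docs, 32));
        System.out.printf("64 bits: %.3e%n", collisionProbability(docs, 64));
    }
}
```

Under that assumption, one million documents give a probability very close
to 1 with 32-bit ids and on the order of 1e-8 with 64-bit ids.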
2/ Bigger concern: wouldn't it be better to have a way to
activate/deactivate the new feature? This would allow us to still upgrade
and run tests on real-life data without risking being backed into a corner.
3/ Wondering: wouldn't it be better to use the real reference as the ID and
move to strings for it?
Given that in an XWiki database this part is really small (compared to
attachments and the data itself), are there really any reasons to use IDs
for this reference? Wouldn't using a String be better in the end? We
already use this for the join between xwikidoc and xwikiobjects and haven't
seen any big problem with that, have we?
If we used that method, wouldn't it mean ZERO collisions?
4/ Small additional point
There is also the migration of object IDs, right? The object IDs use the
same system and also have a risk of collision (which would lead to property
data being shared with completely unrelated documents).
Ludovic
2012/1/7 Denis Gervalle <dgl(a)softec.lu>
Now that the database migration mechanism has been improved, I would like
to go ahead with my patch to improve document ids.
Currently, ids are the simple string hashcode of a locally serialized
document reference, including the language for translated documents. The
likelihood of duplicates with Java's string hashing algorithm is really
high.
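To illustrate how easily String.hashCode() collides, here is a minimal
Java sketch. The two names are hypothetical, and the "Space.Page" form is
only an assumption about what the local serialization looks like:

```java
public class HashCodeCollision {
    // True when two different strings share the same 32-bit hashCode.
    static boolean sameHash(String a, String b) {
        return !a.equals(b) && a.hashCode() == b.hashCode();
    }

    public static void main(String[] args) {
        // "Aa" and "BB" both hash to 2112, and an identical prefix keeps
        // the two hashes equal because the suffixes have the same length.
        System.out.println(sameHash("Main.Aa", "Main.BB")); // prints "true"
    }
}
```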
What I propose is:
1) use an MD5 hash, which is particularly good at distributing values.
2) truncate the hash to its first 64 bits, since the XWD_ID column is a
64-bit long.
3) use a better string representation as the source of hashing
Based on previous discussion, points 1) and 2) have already been agreed,
and this vote is in particular about the string used for 3).
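Points 1) and 2) could be sketched as follows. This is only an
illustration, not the patch itself: the class and method names are
hypothetical, and the UTF-8 encoding and big-endian byte order are
assumptions.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class TruncatedMd5 {
    // Hash the key with MD5 and keep the first 8 of the 16 digest bytes.
    static long id(String key) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(key.getBytes(StandardCharsets.UTF_8));
            // ByteBuffer reads big-endian by default (assumed byte order)
            return ByteBuffer.wrap(digest).getLong();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(Long.toHexString(id("7:mySpace5:myDoc0:")));
        System.out.println(Long.toHexString(id("7:mySpace5:myDoc2:fr")));
    }
}
```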
I propose it in 2 steps:
1) before locales are fully supported in document references, use this
format:
<lengthOfLastSpaceName>:<lastSpaceName><lengthOfDocumentName>:<documentName><lengthOfLanguage>:<language>
where language would be an empty string for the default document, so it
would look like 7:mySpace5:myDoc0: and its French translation would be
7:mySpace5:myDoc2:fr
2) when locales are included in references, we will replace the
implementation with a reference serializer that produces the same kind
of representation, but includes all spaces (not only the last one), to be
prepared for the future.
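The step 1) format above could be produced by something like this minimal
sketch. The class and method names are hypothetical, and measuring lengths
in UTF-16 code units (String.length()) is an assumption:

```java
public class LocalKey {
    // Builds <len>:<space><len>:<document><len>:<language>.
    // language is the empty string for the default document.
    static String localKey(String space, String document, String language) {
        StringBuilder sb = new StringBuilder();
        for (String part : new String[] {space, document, language}) {
            sb.append(part.length()).append(':').append(part);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(localKey("mySpace", "myDoc", ""));   // 7:mySpace5:myDoc0:
        System.out.println(localKey("mySpace", "myDoc", "fr")); // 7:mySpace5:myDoc2:fr
    }
}
```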
While doing so, I also propose to fix the cache key issue by using the same
representation, prefixed by <lengthOfWikiName>:<wikiName>, so the previous
examples will have the following keys in the document cache:
5:xwiki7:mySpace5:myDoc0: and 5:xwiki7:mySpace5:myDoc2:fr
Using such a key (compared to the usual serialization) has the following
advantages:
- ensures uniqueness of the reference without requiring a complex escaping
algorithm, which is unneeded here.
- potentially reversible
- faster than the usual serialization
- supports language
- independent of the current serialization, which may evolve on its own,
so it will be stable over time. This is really important when it is used
as the base for the hashing algorithm behind document ids stored in the
database.
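The uniqueness and reversibility claims can be illustrated with a small
sketch. The names encode and decode are hypothetical, the dot-joined form
only stands in for an unescaped serialization, and malformed keys are not
handled:

```java
import java.util.ArrayList;
import java.util.List;

public class LengthPrefixedKey {
    // Each part is written as <length>:<part>, so no escaping is needed.
    static String encode(String... parts) {
        StringBuilder sb = new StringBuilder();
        for (String part : parts) {
            sb.append(part.length()).append(':').append(part);
        }
        return sb.toString();
    }

    // Reverses encode(); assumes the key is well formed.
    static List<String> decode(String key) {
        List<String> parts = new ArrayList<>();
        int pos = 0;
        while (pos < key.length()) {
            int colon = key.indexOf(':', pos);
            int len = Integer.parseInt(key.substring(pos, colon));
            parts.add(key.substring(colon + 1, colon + 1 + len));
            pos = colon + 1 + len;
        }
        return parts;
    }

    public static void main(String[] args) {
        // Unescaped dot-joining is ambiguous: both pairs give "a.b.c"
        System.out.println(String.join(".", "a.b", "c"));
        System.out.println(String.join(".", "a", "b.c"));
        // Length prefixes keep them apart: 3:a.b1:c versus 1:a3:b.c
        System.out.println(encode("a.b", "c"));
        System.out.println(encode("a", "b.c"));
        // The cache key example decodes to [xwiki, mySpace, myDoc, fr]
        System.out.println(decode("5:xwiki7:mySpace5:myDoc2:fr"));
    }
}
```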
I would like to introduce this as early as possible, which means as soon
as we are confident with the migration mechanism recently introduced.
Since the migration of ids will convert 32-bit hashes into 64-bit ones, the
risk of collision is really low, and to be careful, I have written a
migration algorithm that handles such collisions (unless they cause a
circular reference collision, but this is really unexpected). However,
changing ids again later, if we change our mind, would be much riskier
and the migration difficult to implement, so it is really important that
we agree on the way we compute these ids, once and for all.
Here is my +1,
--
Denis Gervalle
SOFTEC sa - CEO
eGuilde sarl - CTO
_______________________________________________
devs mailing list
devs(a)xwiki.org
http://lists.xwiki.org/mailman/listinfo/devs