Hi Devs,
Since I fell into this well-known XWiki caveat recently, I would like to
improve it.
Currently, XWikiDocument.getId() is almost equivalent to
String.hashCode() applied to fullName:lang. Since this is a really poor
hashing method for short strings with small differences, the risk that two
documents with similar names of the same length collide is really high. This
id is used by the Hibernate store as the document key, so it really needs to
be unique, while fitting into a 64-bit numeric at the same time. Currently we
use only 32 bits, since Java hash codes are limited to 32-bit integers.
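To make the risk concrete, here is a small demo (the document keys are made
up, but the collisions are real, since String.hashCode() is just a 31-based
positional sum):

    public class HashCollisionDemo {
        public static void main(String[] args) {
            // "Aa" and "BB" are the classic String.hashCode() collision:
            // both hash to 2112.
            System.out.println("Aa".hashCode());             // 2112
            System.out.println("BB".hashCode());             // 2112
            // The collision survives any shared prefix and suffix, so
            // realistic-looking document keys collide just as easily:
            System.out.println("Main.AaPage:en".hashCode());
            System.out.println("Main.BBPage:en".hashCode()); // same value
        }
    }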
The ideal would be to have no link at all between ids and the document's
fullName:lang, but converting the current implementation is not really easy,
which is probably why XWIKI-4396 has never been fixed. Therefore, my current
goal is to reduce the likelihood of a collision by choosing a better hashing
method, taking into account the fact that document full names are short
strings and that the number of unique ids required is very limited compared
to the 64-bit integer range (uniqueness is only expected within a given XWiki
database). As a back-of-the-envelope check: with a uniformly distributed
64-bit hash, the birthday bound puts the collision probability for a million
documents at about (10^6)^2 / 2^65, i.e. around 3e-8.
So we need to choose a better algorithm, and here are IMHO the potential
options:
A) use a simple but more efficient non-cryptographic hashing function that
runs on 64 bits. I was thinking of the algorithm by Professor Daniel J.
Bernstein (DJB), since it is a well-known, widely used, easy-to-implement
algorithm with a good distribution on short strings (a sketch follows after
the pros and cons).
Pro: no dependency; fast; 64 bits, better than hashCode
Cons: probably more risk of collision compared to MD5 or SHA, but far less
than now; requires a DB migration of all document keys
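For illustration, a minimal 64-bit sketch of what A) could look like (the
method name is mine; 5381 and 33 are Bernstein's constants):

    // DJB2 extended to 64 bits: start from 5381, then hash * 33 + c
    // for each character. Overflow simply wraps, which is fine for a hash.
    public static long djb2(String key) {
        long hash = 5381L;
        for (int i = 0; i < key.length(); i++) {
            hash = hash * 33 + key.charAt(i);
        }
        return hash;
    }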
B) use MD5, or an even stronger SHA-1 or SHA-256, from JCA, truncating the
digest to its lower 64 bits (see the sketch after the pros and cons below).
Note that oldcore already uses MD5 for hashing a whole XWikiDocument to
provide the API with getVersionHashcode(), and for the validation hash used
by the persistent login manager. The first uses Object.hashCode() as a
fallback, which is really bad and defeats the purpose. The second does not
provide any fallback and may fail unexpectedly. For our case, if we really
want a fallback, we need to store the hashing algorithm used in the database
at creation time, and in any case fail when that algorithm is not available.
Pro: already used in oldcore; probably fewer collisions; with the fallback,
really flexible, since it would be possible to choose the algorithm at
creation time, and it does not require a full migration of existing databases.
Cons: requires at least a DB schema change to add the hashing algorithm,
probably as a column of xwikidbversion; if this config value is broken, the
whole DB is broken
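To give an idea of B) without the fallback machinery, a sketch (the method
name is mine, and whether we keep the first or last 8 bytes of the digest is
an open detail):

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    public final class DigestIdSketch {
        // Digest fullName:lang with a JCA algorithm and keep 64 bits of it.
        public static long digestId(String key) {
            try {
                MessageDigest md = MessageDigest.getInstance("MD5"); // or "SHA-256"
                byte[] digest = md.digest(key.getBytes(StandardCharsets.UTF_8));
                long id = 0;
                for (int i = 0; i < 8; i++) {
                    id = (id << 8) | (digest[i] & 0xFF); // fold 8 bytes into a long
                }
                return id;
            } catch (NoSuchAlgorithmException e) {
                // The open question discussed above: fail here, or fall back
                // to an algorithm recorded in the database at creation time?
                throw new IllegalStateException("MD5 not available", e);
            }
        }
    }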
C) use our own MD5 implementation when the JCA provider is missing it. I was
thinking of integrating something like
http://twmacinta.com/myjava/fast_md5.php (non-native version), which is LGPL.
This would ensure the availability of the hashing algorithm while keeping a
rather strong one (see the wiring sketch below).
Pro: no dependency; could also provide MD5 to getVersionHashcode() and the
persistent login manager
Cons: requires a DB migration of all document keys
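The wiring for C) could then look like this (BundledMD5 is a hypothetical
wrapper around the fast_md5 class, whose exact API I have not checked):

    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    // Prefer the JCA provider; fall back to the bundled pure-Java
    // implementation so MD5 is always available.
    public static byte[] md5(byte[] input) {
        try {
            return MessageDigest.getInstance("MD5").digest(input);
        } catch (NoSuchAlgorithmException e) {
            return BundledMD5.digest(input); // hypothetical fast_md5 wrapper
        }
    }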
A) is really quick to implement, simple, and the least risky, but some may
find it insufficient. Caleb?
Obviously, B) with fallback is really nice, but I wonder if it is not
overkill?
I am worried about B) without fallback, but maybe I am asking for too much
flawlessness.
C) is rather solid while staying simple, but maybe overkill too.
I am really undecided.
WDYT?
--
Denis Gervalle
SOFTEC sa - CEO
eGuilde sarl - CTO