Since nobody seems to have any remarks or opinions, I started doing A), but
it was still insufficient for my requirements. So I have now done B) without
a fallback, using MD5, and if no one stops me, I will probably do C) soon.
Since the storage is entirely broken by this change, I need startup to be
stopped on migration failure, and migration to be mandatory. Until now,
migration errors were ignored, but this is wrong IMO: it would mean that the
code had to be both backward compatible with old data and failsafe against
any migration failure. Moreover, migration can also be disabled by config,
which is not really safe IMO either.
Currently I have added code that throws if a migration fails, preventing the
wiki from starting (see the sketch below). Should I go further? Can we
expect users to take care of migration requirements themselves, or should we
distinguish between mandatory and optional migrations?
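To make the fail-fast behaviour concrete, here is a minimal sketch of what I
mean; the class and method names are illustrative only, not the actual
oldcore API:

    import java.util.List;

    // Illustrative sketch: abort wiki startup when any migration fails.
    public class MigrationManager {
        public interface DataMigration {
            void migrate() throws Exception;
        }

        private final List<DataMigration> pending;

        public MigrationManager(List<DataMigration> pending) {
            this.pending = pending;
        }

        public void startMigrations() {
            for (DataMigration migration : pending) {
                try {
                    migration.migrate();
                } catch (Exception e) {
                    // Fail hard: a partially migrated store must not be used.
                    throw new IllegalStateException(
                        "Database migration failed; refusing to start the wiki", e);
                }
            }
        }
    }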
I am also wondering whether I should target release 3.2 or 3.3 for this
commit, since the name of the migrator and the DB version will depend on it.
Hoping to get some comments this time.
On Mon, Sep 19, 2011 at 12:54, Denis Gervalle <dgl(a)softec.lu> wrote:
Hi Devs,
Since I ran into this well-known XWiki caveat recently, I would like to
improve it.
Currently, XWikiDocument.getId() is almost equivalent to
String.hashCode(fullName:lang). Since this is a really poor hashing method
for small variations, the risk that two documents with similar names of the
same length collide is really high. This id is used by the Hibernate store
as the document's unique key, so it really needs to be unique, while fitting
into a 64-bit numeric at the same time. Currently we only use 32 bits, since
Java hash codes are limited to 32-bit integers.
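To illustrate how fragile this is, here is a small self-contained example;
the id computation shown is a simplification of what getId() does, not the
exact oldcore code:

    public class HashCollisionDemo {
        // Simplified version of the current scheme: id = hashCode of "fullName:lang".
        static long documentId(String fullName, String lang) {
            return (fullName + ":" + lang).hashCode();
        }

        public static void main(String[] args) {
            // "Aa" and "BB" are the classic String.hashCode collision pair, so
            // any two document names differing only by such a fragment collide.
            System.out.println(documentId("Main.PageAa", "")); // same id...
            System.out.println(documentId("Main.PageBB", "")); // ...as this one
        }
    }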
Ideally there would be no link at all between ids and the documents'
fullName:lang, but converting the current implementation is not really easy;
this is probably why XWIKI-4396 has never been fixed. Therefore, my current
goal is to reduce the likelihood of a collision by choosing a better hashing
method, taking into account the fact that document full names are short
strings and that the number of unique ids required is very small (since
uniqueness is only expected within a given XWiki database) compared to the
64-bit integer range.
So we need to choose a better algorithm, and here are IMHO the potential
options:
A) Use a simple but more efficient non-cryptographic hashing function that
runs on 64 bits. I was thinking about the algorithm by Professor Daniel J.
Bernstein (DJB), since it is a well-known, widely used, easy-to-implement
algorithm with a good distribution on small strings (see the sketch below).
Pro: no dependency; fast; 64 bits is better than hashCode().
Cons: probably more risk of collision compared to MD5 or SHA, but far less
than now; requires a DB migration of all document keys.
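For reference, a 64-bit variant of the DJB2 hash is only a few lines; this
is a generic sketch of the algorithm, not a proposed final implementation:

    public final class Djb2Hash {
        // Classic DJB2 (hash = hash * 33 + c, seeded with 5381),
        // accumulated in a 64-bit long instead of a 32-bit int.
        public static long hash64(String input) {
            long hash = 5381L;
            for (int i = 0; i < input.length(); i++) {
                hash = hash * 33 + input.charAt(i);
            }
            return hash;
        }
    }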
B) Use MD5, or an even stronger SHA-1 or SHA-256 algorithm, from JCA,
truncating the result to the lower 64 bits. Note that oldcore already uses
MD5 for hashing a whole XWikiDocument to provide the API with
getVersionHashcode(), and for the validation hash used by the persistent
login manager. The first uses Object.hashCode() as a fallback, which is
really bad and defeats the purpose. The second does not provide any fallback
and may fail unexpectedly. For our case, if we really want a fallback, we
need to store the hashing algorithm used in the database at creation time,
and fail anyway when it is not available.
Pro: already used in oldcore; probably fewer collisions; with a fallback,
really flexible, since it would be possible to choose the algorithm at
creation time and no full migration of existing databases is required.
Cons: requires at least a DB schema change to add the hashing algorithm,
probably as a column of xwikidbversion; if this config value is broken, the
whole DB is broken.
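A minimal sketch of the JCA-based id computation, assuming we keep the first
8 bytes of the digest as the 64-bit id (class and method names are
illustrative):

    import java.nio.ByteBuffer;
    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    public final class DigestId {
        // Hash "fullName:lang" with MD5 and keep 64 bits of the digest.
        public static long documentId(String fullName, String lang) {
            try {
                MessageDigest md5 = MessageDigest.getInstance("MD5");
                byte[] digest = md5.digest(
                    (fullName + ":" + lang).getBytes(StandardCharsets.UTF_8));
                // Fold the first 8 of the 16 digest bytes into a long.
                return ByteBuffer.wrap(digest).getLong();
            } catch (NoSuchAlgorithmException e) {
                // Without a fallback (the case discussed above) we must fail hard.
                throw new IllegalStateException("MD5 unavailable from JCA", e);
            }
        }
    }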
C) Use our own MD5 implementation when the JCA provider is missing it. I was
thinking about integrating something like
http://twmacinta.com/myjava/fast_md5.php (the non-native version), which is
LGPL. This would ensure the availability of the hashing algorithm while
keeping a rather strong one (see the fallback sketch below).
Pro: no dependency; could also provide MD5 to getVersionHashcode() and the
persistent login manager.
Cons: requires a DB migration of all document keys.
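The wiring for C) could look like the following sketch; bundledMd5() is a
hypothetical hook standing in for whatever embedded implementation we would
ship:

    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    public final class Md5Provider {
        public static byte[] md5(byte[] input) {
            try {
                // Prefer the standard JCA provider when it offers MD5.
                return MessageDigest.getInstance("MD5").digest(input);
            } catch (NoSuchAlgorithmException e) {
                // Fall back to the bundled implementation, e.g. fast_md5.
                return bundledMd5(input);
            }
        }

        // Placeholder: in practice this would delegate to the integrated library.
        private static byte[] bundledMd5(byte[] input) {
            throw new UnsupportedOperationException("plug in the bundled MD5 here");
        }
    }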
A) is really quick to implement, simple, and the least risky, but some may
find it insufficient. Caleb?
Obviously, B) with a fallback is really nice, but I wonder whether it is not
overkill?
I am worried about B) without a fallback, but maybe I want it too flawless.
C) is rather solid while staying simple, but maybe overkill too.
I am really undecided.
WDYT?
--
Denis Gervalle
SOFTEC sa - CEO
eGuilde sarl - CTO
--
Denis Gervalle
SOFTEC sa - CEO
eGuilde sarl - CTO