On 09/20/2011 01:32 PM, Denis Gervalle wrote:
> Since nobody seems to have remarks or opinions, I have started to do
> A), but it was still insufficient for my requirements. So I have now
> done B) using MD5, without a fallback, and if no one stops me, I will
> probably do C) soon.
Could you easily switch to SHA-1 instead?
> Since the storage is entirely broken by this change, I need the
> startup to be stopped on migration failure, and migration to be
> mandatory. Migration errors were ignored until now, but this is wrong
> IMO. It would mean that the code is both backward compatible with old
> data and failsafe against any migration failure. Moreover, migration
> can also be disabled by config, but this is not really safe IMO.
I agree that a migration failure should stop the wiki from starting,
provided the failure message is very clear about the error, not just a
dumb "database bad, good bye, and good luck!".
I don't fully agree that they should be mandatory, though. Since
migrations are enabled by default, a sysadmin competent enough to
disable them should also be able to decide when to run them and when to
skip them. It should be documented that migrations must be run during
an upgrade, but for performance it's better to be able to disable this
extra check, which could take a while on large farms.
> Currently I have added code that throws if a migration fails,
> preventing the wiki from starting. Should I improve this further?
> Can we expect users to take care of migration requirements, or should
> we have both mandatory and optional migrations?
I don't think that there are any optional migrations currently...
They're all important.
> I am also wondering if I should target committing this in release 3.2
> or 3.3, since the name of the migrator and the db version will depend
> on this.
We're getting close to the 3.2 release, so the codebase should be moving
towards a lower entropy point. If you fully trust your changes and can
commit them during the next couple of days, you have my +0.
> Hoping to have some comments this time.
> On Mon, Sep 19, 2011 at 12:54, Denis Gervalle <dgl(a)softec.lu> wrote:
> Hi Devs,
>
> Since I ran into this well-known XWiki caveat recently, I would like
> to improve this.
> Currently, XWikiDocument.getId() is almost equivalent to
> String.hashCode(fullName:lang). Since this is a really poor hashing
> method for small changes, the risk that two documents with similar
> names of the same length collide is really high. This id is used by
> the Hibernate store for document unicity, so it really needs to be
> unique, while being at most a 64-bit numeric. Currently we use only
> 32 bits, since Java hash codes are limited to 32-bit integers.
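
Just to make that weakness concrete, here is a tiny standalone demo
(the page names are made up; the fullName:lang key format is the one
described above):

    public class HashCodeCollisionDemo {
        public static void main(String[] args) {
            // "Aa" and "BB" are the classic String.hashCode() collision:
            // 'A' * 31 + 'a' == 'B' * 31 + 'B' == 2112
            System.out.println("Aa".hashCode()); // 2112
            System.out.println("BB".hashCode()); // 2112

            // Such collisions compose, so realistic-looking document
            // keys of the same length collide just as easily:
            System.out.println("Main.AaAa:en".hashCode()
                == "Main.BBBB:en".hashCode()); // true
        }
    }
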
> The ideal would be to not have this link between ids and the
> documents' fullName:lang at all, but converting the current
> implementation is not really easy. This is probably why XWIKI-4396
> has never been fixed. Therefore, my current goal is to reduce the
> likelihood of a collision by choosing a better hashing method, taking
> into account the fact that document full names are short strings and
> that the number of unique ids required is very limited (since unicity
> is only expected within a given XWiki database) compared to the
> 64-bit integer range.
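
A back-of-the-envelope check of that claim, assuming an ideal 64-bit
hash: by the birthday bound, the collision probability among n
documents is roughly n^2 / 2^65, i.e. about 3 * 10^-8 for a wiki with
one million documents, whereas a 32-bit hashCode() already reaches
even odds at around 77,000 documents.
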
> So we need to choose a better algorithm, and here are IMHO the
> potential options:
> A) use a simple but more efficient non-cryptographic hashing function
> that runs on 64 bits. I was thinking about using the algorithm by
> Professor Daniel J. Bernstein (DJB), since it is a well-known, widely
> used, easy to implement algorithm with a good distribution on small
> strings.
> Pro: no dependency; fast; 64 bits is better than hashCode().
> Cons: probably more risk of collision compared to MD5 or SHA, but far
> less than now; requires a db migration of all document keys.
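
For concreteness, a minimal sketch of what A) could look like (the
class and method names are made up; seed 5381 and multiplier 33 are
the classic DJB choices, here run over a Java long instead of an int):

    /** Hypothetical helper: 64-bit DJB hash of a document key. */
    public final class Djb2Hash {
        private Djb2Hash() {
        }

        /** Hash a "fullName:lang" key into the full 64-bit long range. */
        public static long hash64(String key) {
            long hash = 5381L;
            for (int i = 0; i < key.length(); i++) {
                // hash = hash * 33 + c, via the usual shift-and-add trick
                hash = ((hash << 5) + hash) + key.charAt(i);
            }
            return hash;
        }
    }

The two keys that collide under String.hashCode() in the demo above
get distinct ids under this function.
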
> B) use MD5, or an even stronger SHA-1 or SHA-256, from the JCA,
> truncating to the lower 64 bits. Note that oldcore already uses
> MD5 for hashing a whole XWikiDocument to provide the API with
> getVersionHashcode(), and for the validation hash used by the
> persistent login manager. The first uses Object.hashCode() as a
> fallback, which is really bad and defeats the purpose. The second
> does not provide any fallback and may fail unexpectedly. For our
> case, if we really want a fallback, we need to store the hashing
> algorithm used in the database at creation time, and fail anyway
> when it is not available.
> Pro: already used in oldcore; probably fewer collisions; with a
> fallback, really flexible, since it would be possible to choose the
> algorithm at creation time, and it does not require a full migration
> of existing databases.
> Cons: requires at least a DB schema change to add the hashing
> algorithm, probably as a column of xwikidbversion; if this config
> value is broken, the whole DB is broken.
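
And a sketch of B) without the fallback, using the standard JCA
MessageDigest and folding the first 8 digest bytes into a long (one
possible reading of "the lower 64 bits"; the names below are
illustrative, not the actual oldcore API):

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    public final class DocumentIdHasher {
        private DocumentIdHasher() {
        }

        /** 64-bit id from a JCA digest, e.g. "MD5" or "SHA-1". */
        public static long hash64(String key, String algorithm) {
            try {
                MessageDigest md = MessageDigest.getInstance(algorithm);
                byte[] digest =
                    md.digest(key.getBytes(StandardCharsets.UTF_8));
                // Fold the first 8 of the 16 (MD5) or 20 (SHA-1)
                // digest bytes into a 64-bit id.
                long id = 0L;
                for (int i = 0; i < 8; i++) {
                    id = (id << 8) | (digest[i] & 0xFF);
                }
                return id;
            } catch (NoSuchAlgorithmException e) {
                // Without a stored fallback algorithm, the only safe
                // reaction is to fail hard, as discussed above.
                throw new IllegalStateException(
                    algorithm + " is unavailable", e);
            }
        }
    }

Note that switching the algorithm changes every id, which is exactly
why the chosen algorithm would have to be recorded per database, as
the Cons above point out.
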
> C) use our own MD5 implementation when the JCA provider is missing
> it. I was thinking about integrating something like
> http://twmacinta.com/myjava/fast_md5.php (non-native version), which
> is LGPL. This would ensure the availability of the hashing algorithm
> while having a rather strong one.
> Pro: no dependency; could also provide MD5 to getVersionHashcode()
> and the persistent login manager.
I don't think C is needed; we depend on cryptography in many other
places, like the password hashes, not to mention the whole crypto
module.
> Cons: requires a db migration of all document keys.
> A) is really quick to implement, simple, and the least risky, but
> some may find it insufficient. Caleb?
> Obviously, B) with a fallback is really nice, but I wonder if it is
> not overkill?
> I am worried about B) without a fallback, but maybe I am aiming for
> something too flawless.
> C) is rather solid, while staying simple, but maybe overkill too.
> I am really undecided.
> WDYT?
--
Sergiu Dumitriu
http://purl.org/net/sergiu/