On 09/20/2011 01:32 PM, Denis Gervalle wrote:
> Since nobody seems to have remarks or opinions, I have started to do
> A), but it was still insufficient for my requirements. So I have now
> done B) using MD5, without a fallback, and if no one stops me, I will
> probably do C) soon.
Could you easily switch to SHA-1 instead?
> Since the storage is entirely broken by this change, I need the
> startup to be stopped on migration failure, and migration to be
> mandatory. Migration errors were ignored until now, but this is wrong
> IMO. It would mean that the code is both backward compatible with old
> data and failsafe against any migration failure. Moreover, migration
> can also be disabled by config, but this is not really safe IMO.
I agree that a migration failure should stop the wiki from starting,
provided the failure message is very clear about the error, not just a
dumb "database bad, good bye, and good luck!".
I don't fully agree that they should be mandatory, though. Since
migrations are enabled by default, a sysadmin competent enough to
disable them should also be able to decide when to run them and when to
skip them. It should be documented that migrations must be run during
an upgrade, but for performance it's better to be able to disable this
extra check, which could take a while on large farms.
> Currently I have added code that throws if a migration fails,
> preventing the wiki from starting. Should I improve this further?
> Can we expect users to take care of migration requirements, or should
> we have both mandatory and optional migrations?
I don't think that there are any optional migrations currently...
They're all important.
> I am also wondering if I should target committing this in release 3.2
> or 3.3, since the name of the migrator and the db version will depend
> on this.
We're getting close to the 3.2 release, so the codebase should be moving
towards a lower entropy point. If you fully trust your changes and can
commit them during the next couple of days, you have my +0.
> Hoping to have some comments this time.
> On Mon, Sep 19, 2011 at 12:54, Denis Gervalle <dgl(a)softec.lu> wrote:
> Hi Devs,
>
> Since I ran into this well-known XWiki caveat recently, I would like
> to improve this.
> Currently, XWikiDocument.getId() is almost equivalent to
> String.hashCode(fullName:lang). Since this is a really poor hashing
> method for small changes, the risk that two documents with similar
> names of the same length collide is really high. This id is used by
> the Hibernate store for document unicity, so it really needs to be
> unique, while being at most a 64-bit numeric. Currently we use only
> 32 bits, since Java hash codes are limited to 32-bit integers.
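
Just to make that weakness concrete, here is a tiny standalone demo
(the page names are made up; the fullName:lang key format is the one
described above):

    public class HashCodeCollisionDemo {
        public static void main(String[] args) {
            // "Aa" and "BB" are the classic String.hashCode() collision:
            // 'A' * 31 + 'a' == 'B' * 31 + 'B' == 2112
            System.out.println("Aa".hashCode()); // 2112
            System.out.println("BB".hashCode()); // 2112

            // Such collisions compose, so realistic-looking document
            // keys of the same length collide just as easily:
            System.out.println("Main.AaAa:en".hashCode()
                == "Main.BBBB:en".hashCode()); // true
        }
    }
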
> The ideal would be to not have this link between ids and the
> documents' fullName:lang at all, but converting the current
> implementation is not really easy. This is probably why XWIKI-4396
> has never been fixed. Therefore, my current goal is to reduce the
> likelihood of a collision by choosing a better hashing method, taking
> into account the fact that document full names are short strings and
> that the number of unique ids required is very limited (since unicity
> is only expected within a given XWiki database) compared to the
> 64-bit integer range.
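
A back-of-the-envelope check of that claim, assuming an ideal 64-bit
hash: by the birthday bound, the collision probability among n
documents is roughly n^2 / 2^65, i.e. about 3 * 10^-8 for a wiki with
one million documents, whereas a 32-bit hashCode() already reaches
even odds at around 77,000 documents.
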
> So we need to choose a better algorithm, and here are IMHO the
> potential options:
> A) use a simple but more efficient non-cryptographic hashing function
> that runs on 64 bits. I was thinking about using the algorithm by
> Professor Daniel J. Bernstein (DJB), since it is a well-known, widely
> used, easy to implement algorithm with a good distribution on small
> strings.
> Pro: no dependency; fast; 64 bits is better than hashCode().
> Cons: probably more risk of collision compared to MD5 or SHA, but far
> less than now; requires a db migration of all document keys.
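
For concreteness, a minimal sketch of what A) could look like (the
class and method names are made up; seed 5381 and multiplier 33 are
the classic DJB choices, here run over a Java long instead of an int):

    /** Hypothetical helper: 64-bit DJB hash of a document key. */
    public final class Djb2Hash {
        private Djb2Hash() {
        }

        /** Hash a "fullName:lang" key into the full 64-bit long range. */
        public static long hash64(String key) {
            long hash = 5381L;
            for (int i = 0; i < key.length(); i++) {
                // hash = hash * 33 + c, via the usual shift-and-add trick
                hash = ((hash << 5) + hash) + key.charAt(i);
            }
            return hash;
        }
    }

The two keys that collide under String.hashCode() in the demo above
get distinct ids under this function.
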
> B) use MD5, or an even stronger SHA-1 or SHA-256, from the JCA,
> truncating to the lower 64 bits. Note that oldcore already uses
> MD5 for hashing a whole XWikiDocument to provide the API with
> getVersionHashcode(), and for the validation hash used by the
> persistent login manager. The first uses Object.hashCode() as a
> fallback, which is really bad and defeats the purpose. The second
> does not provide any fallback and may fail unexpectedly. For our
> case, if we really want a fallback, we need to store the hashing
> algorithm used in the database at creation time, and fail anyway
> when it is not available.
> Pro: already used in oldcore; probably fewer collisions; with a
> fallback, really flexible, since it would be possible to choose the
> algorithm at creation time, and it does not require a full migration
> of existing databases.
> Cons: requires at least a DB schema change to add the hashing
> algorithm, probably as a column of xwikidbversion; if this config
> value is broken, the whole DB is broken.
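
And a sketch of B) without the fallback, using the standard JCA
MessageDigest and folding the first 8 digest bytes into a long (one
possible reading of "the lower 64 bits"; the names below are
illustrative, not the actual oldcore API):

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    public final class DocumentIdHasher {
        private DocumentIdHasher() {
        }

        /** 64-bit id from a JCA digest, e.g. "MD5" or "SHA-1". */
        public static long hash64(String key, String algorithm) {
            try {
                MessageDigest md = MessageDigest.getInstance(algorithm);
                byte[] digest =
                    md.digest(key.getBytes(StandardCharsets.UTF_8));
                // Fold the first 8 of the 16 (MD5) or 20 (SHA-1)
                // digest bytes into a 64-bit id.
                long id = 0L;
                for (int i = 0; i < 8; i++) {
                    id = (id << 8) | (digest[i] & 0xFF);
                }
                return id;
            } catch (NoSuchAlgorithmException e) {
                // Without a stored fallback algorithm, the only safe
                // reaction is to fail hard, as discussed above.
                throw new IllegalStateException(
                    algorithm + " is unavailable", e);
            }
        }
    }

Note that switching the algorithm changes every id, which is exactly
why the chosen algorithm would have to be recorded per database, as
the Cons above point out.
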
> C) use our own MD5 implementation when the JCA provider is missing
> it. I was thinking about integrating something like
> http://twmacinta.com/myjava/fast_md5.php (non-native version), which
> is LGPL. This would ensure the availability of the hashing algorithm
> while having a rather strong one.
> Pro: no dependency; could also provide MD5 to getVersionHashcode()
> and the persistent login manager.
I don't think C is needed; we depend on cryptography in many other
places, like the password hashes, not to mention the whole crypto
module.
> Cons: requires a db migration of all document keys.
> A) is really quick to implement, simple, and the least risky, but
> some may find it insufficient. Caleb?
> Obviously, B) with a fallback is really nice, but I wonder if it is
> not overkill?
> I am worried about B) without a fallback, but maybe I am aiming for
> something too flawless.
> C) is rather solid, while staying simple, but maybe overkill too.
> I am really undecided.
> WDYT?
--
Sergiu Dumitriu
http://purl.org/net/sergiu/