Caleb and Sergiu,

If I have a preference for MD5, it is not about performance but a matter of
dependency. Currently, oldcore only really depends on MD5 for the persistence
of authentication, which is an optional feature, so I do not agree that there
is already a serious dependency on the JCA providers. The reason I was looking
at C) is only that there will be no workaround if no JCA provider for the
chosen algorithm is available, and document ids are essential to the core.
Therefore, having our own built-in implementation would be a safety net,
keeping the core working when no JCA provider is available.

I currently stay with B) without fallback, but I am now more worried about the
migration part, and I will open another thread for it.
On Fri, Sep 23, 2011 at 08:35, Sergiu Dumitriu <sergiu(a)xwiki.com> wrote:
On 09/22/2011 04:30 PM, Caleb James DeLisle wrote:
Sorry for not replying sooner.
I am -1 to A because I have worked with DJB's ciphers before, and one thing I
can say about them is that they are unforgiving. For example: Salsa20 is a
hash, but it is not collision resistant; if you try to use it to hash
passwords you will be disappointed.
B seems like a good decision; we generally are not trying to protect against
an adversary, so cryptographic strength is not tremendously important.
However, I would caution against using deprecated hash functions to try and
squeeze out a little more performance on each page load. The performance
penalty only applies when pages are loaded from the DB or saved to it, so it
is of little concern compared to even removing a single query from a lookup
or converting a LIKE constraint to an = constraint.
Regarding C, I think it is counterproductive, since it will grow the binary
size for no measurable improvement. Also, I recommend testing MD5 against
SHA1 and SHA256 before deciding that one is faster than another; just because
MD5 is old doesn't make it fast. It might also be that the average document
name is longer than 16 bytes, so it requires 2 MD5 rounds where a single SHA1
round would be sufficient. Another interesting piece of information is, on
average, how many times this function gets called per hour in a given wiki.
As I said on IRC, I'd use SHA256 and not worry about performance, since I
think it is so rarely called that it makes no measurable difference.
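For anyone who wants to check Caleb's point about relative speed, a minimal
timing sketch along these lines would do (hypothetical test code, not from the
thread, using only the standard JCA MessageDigest API):

    import java.security.MessageDigest;

    public class DigestTiming {
        public static void main(String[] args) throws Exception {
            // A typical short document key of the kind discussed here.
            byte[] name = "Main.WebHome:en".getBytes("UTF-8");
            for (String alg : new String[] { "MD5", "SHA-1", "SHA-256" }) {
                MessageDigest md = MessageDigest.getInstance(alg);
                long start = System.nanoTime();
                for (int i = 0; i < 1000000; i++) {
                    md.update(name);
                    md.digest(); // digest() also resets the MessageDigest
                }
                long ms = (System.nanoTime() - start) / 1000000;
                System.out.println(alg + ": " + ms + " ms for 1M hashes");
            }
        }
    }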
I second Caleb on the performance thing. There are much slower things
going on in a request, a couple more SHA computations won't make any
difference. Plus, the ID is not needed that often, since documents that
are in the cache don't need to have their ID recomputed.
About C, I also said something in my previous mail in case it wasn't
noticed (see below).
Caleb
On 09/22/2011 04:56 AM, Denis Gervalle wrote:
> On Thu, Sep 22, 2011 at 09:19, Denis Gervalle <dgl(a)softec.lu> wrote:
>
>> Hi Sergiu,
>>
>> Many thanks for your feedback, I am no longer alone in the dark ;)
>>
>> On Thu, Sep 22, 2011 at 02:02, Sergiu Dumitriu <sergiu(a)xwiki.com> wrote:
>>
>> On 09/20/2011 01:32 PM, Denis Gervalle wrote:
>>>> Since nobody seems to have any remarks or opinions, I have started to do A), but it was still insufficient for my requirements. So I have now done B) without fallback using MD5, and if no one stops me, I will probably do C) soon.
>>>
>>> Could you easily switch to SHA-1 instead?
>>>
>>
>> I can, but I do not feel this would be a huge improvement, since we keep only the lower 64 bits.
>> Since oldcore already uses MD5 for other things (login persistence and hashcoding of document content for comparison), I would prefer not to introduce new dependencies on a different algorithm without a real need.
>>
>>
>>>
>>>> Since the storage is entirely broken by this change, I need the startup to be stopped on migration failure, and migration to be mandatory. Migration errors were ignored until now, but this is wrong IMO: it would mean that the code was both backward compatible with old data and failsafe against any migration failure. Moreover, migration can also be disabled by config, but this is not really safe IMO.
>>>
>>> I agree that migration failure should stop the wiki from starting,
>>> provided the failure is very clear about the error, not just a dumb
>>> "database bad, good bye, and good luck!".
>>>
>>
>> The failure is in two steps:
>> 1) a "migration failure on database x" with the exception returned by
the
>> migrator attached
>> 2) a "could not continue due to x migration failures"
>>
>>
>>>
>>> I don't fully agree that they should be mandatory, though. If the
>>> sysadmin is competent enough to disable migrations, since they are
>>> enabled by default, he should be able to decide when to run them and
>>> when to disable them. It should be known that during an upgrade
>>> migrations should be run, but it's better for performance to disable
>>> this extra check that could take a while on large farms.
>>>
>>
>> Probably I should have been clearer, but for me it means a simple step which ensures that the database is at a high enough version to be supported by the running core.
>> This means that even if migrations are disabled (meaning I do not want my database to be migrated without my explicit authorization), we still need to check the migration level, and stop if any migration has not been applied.
>> I agree that this could be a performance issue at startup of large farms; maybe we could delay the check until a real wiki is set on a given database (I have not checked if this is complex or not). What scares me is the following:
>>
>
> (I have checked now, and I think that this could be done quite easily in
> XWiki#updateDatabase)
>
>
>>
>> 1) admin has disabled migrations
>> 2) admin upgrades XWiki, and reuses his old config
>> 3) admin starts the new version, which uses the changed ids, and the wiki starts creating many duplicate documents, like in an empty DB
>> 4) the admin's only option, if he understands what happened, is to restore all wikis that have been accessed during 3)
>>
>>
>>>> Currently I have added code that throws if a migration fails, preventing the wiki from being started. Should I improve this further? Could we expect users to take care of migration requirements, or should we have mandatory migrations and optional ones?
>>>
>>> I don't think that there are any optional migrations currently...
>>> They're all important.
>>>
>>
>> That was my feeling; simply, some have fewer consequences than others when not applied.
>>
>>
>>>
>>>> I am also wondering if I should target committing in release 3.2 or 3.3, since the name of the migrator and the db version will depend on this.
>>>
>>> We're getting close to the 3.2 release, so the codebase should be moving towards a lower entropy point. If you fully trust your changes and can commit them during the next couple of days, you have my +0.
>>>
>>
>> The time to commit depends mainly on this discussion.
>> I trust my code, but a review would be welcome.
>>
>>
>>>> Hoping to have some comments this time.
>>>>
>>>> On Mon, Sep 19, 2011 at 12:54, Denis Gervalle <dgl(a)softec.lu> wrote:
>>>>>
>>>>> Hi Devs,
>>>>> Since I fell into this well-known XWiki caveat recently, I would like to improve this.
>>>>> Currently, XWikiDocument.getId() is almost equivalent to String.hashCode() of fullName:lang. Since this is a really poor hashing method for small changes, the risk that two documents with similar names of the same length collide is really high. This id is used by the Hibernate store for document unicity, and it really needs to be unique while being at most a 64-bit numeric. Currently we use only 32 bits, since Java hash codes are limited to 32-bit integers.
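To make the weakness concrete, here is a tiny demo (illustrative code, not
from oldcore): String.hashCode() is a base-31 polynomial, so "Aa" and "BB"
hash identically, and any name built from those two blocks collides as well.

    public class HashCodeCollision {
        public static void main(String[] args) {
            // String.hashCode() is s[0]*31^(n-1) + ... + s[n-1];
            // 'A'*31 + 'a' == 'B'*31 + 'B' == 2112, so these collide:
            System.out.println("Aa".hashCode());   // 2112
            System.out.println("BB".hashCode());   // 2112
            // Swapping "Aa"/"BB" blocks preserves the hash in longer names:
            System.out.println("AaAa".hashCode() == "BBBB".hashCode()); // true
            System.out.println("AaBB".hashCode() == "BBAa".hashCode()); // true
        }
    }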
>>>>> The ideal would be not to have this link between ids and the documents' fullName:lang, but converting the current implementation is not really easy. This is probably why XWIKI-4396 has never been fixed. Therefore, my current goal is to reduce the likeliness of a collision by choosing a better hashing method, taking into account the fact that document fullnames are short strings and that the number of unique ids required is very limited (since unicity is only expected for a given XWiki database) compared to the 64-bit integer range.
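As a quick sanity check on that claim (my own back-of-the-envelope figures,
not from the original mail): by the birthday bound, the chance of any
collision among n uniformly random 64-bit ids is roughly n^2 / 2^65, so even
a large wiki with n = 1,000,000 documents sits around 2.7e-8, which is indeed
negligible, provided the hash actually distributes well.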
>>>>> So we need to choose a better algorithm, and here are IMHO the potential options:
>>>>> A) use a simple but more efficient non-cryptographic hashing function that runs on 64 bits. I was thinking about using the algorithm produced by Professor Daniel J. Bernstein (DJB), since it is a well-known, widely used, easy to implement algorithm with a good distribution on small strings.
>>>>> Pro: no dependency; fast; 64 bits better than hashcode
>>>>> Cons: probably more risk of collision compared to MD5 or SHA, but far less than now; requires db migration of all document keys
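For reference, a 64-bit take on DJB's classic djb2 function is only a few
lines; a sketch under the assumption that we hash the UTF-16 chars of
fullName:lang directly (class and method names are illustrative):

    public final class Djb2Hash {
        // djb2: hash = hash * 33 + c, seeded with 5381; Java's long
        // arithmetic wraps on overflow, which is what we want here.
        public static long hash(String key) {
            long hash = 5381L;
            for (int i = 0; i < key.length(); i++) {
                hash = ((hash << 5) + hash) + key.charAt(i);
            }
            return hash;
        }
    }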
>>>>> B) use MD5, or an even stronger SHA1 or SHA256 algorithm, from JCA, truncating to the lower 64 bits. Note that oldcore already uses MD5 for hashcoding a whole XWikiDocument to provide the API with a getVersionHashcode(), and for the validation hash used by the persistent login manager. The first uses Object.hashCode() as a fallback, which is really bad and defeats the purpose. The second does not provide any fallback and may fail unexpectedly. For our case, if we really want a fallback, we need to store the hashing algorithm used in the database at creation time, and anyway fail when it was not available.
>>>>> Pro: already used in oldcore, probably fewer collisions; with fallback, really flexible, since it would be possible to choose the algorithm at creation time, and it does not require full migration of an existing database.
>>>>> Cons: requires at least a DB schema change to add the hashing algorithm, probably as a column of xwikidbversion; if this config value is broken, the whole DB is broken
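To make B) without fallback concrete, a minimal sketch of the id computation
could look like this (method name, byte selection, and error handling are my
own illustration, not the actual patch):

    import java.security.MessageDigest;

    public final class DocumentIdHasher {
        // Hash "fullName:lang" with MD5 and keep 64 bits of the digest
        // (here: the first 8 bytes, interpreted big-endian).
        public static long computeId(String key) {
            try {
                MessageDigest md5 = MessageDigest.getInstance("MD5");
                byte[] digest = md5.digest(key.getBytes("UTF-8"));
                long id = 0L;
                for (int i = 0; i < 8; i++) {
                    id = (id << 8) | (digest[i] & 0xFF);
                }
                return id;
            } catch (Exception e) {
                // No fallback: better to fail hard at startup than to
                // silently generate different ids with another algorithm.
                throw new RuntimeException("MD5 not available from JCA", e);
            }
        }
    }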
>>>>> C) use our own MD5 implementation when the JCA provider is missing it. I was thinking about integrating something like http://twmacinta.com/myjava/fast_md5.php (non-native version), which is LGPL. This will ensure availability of the hashing algorithm while having a rather strong one.
>>>>> Pro: no dependency, could also provide MD5 to getVersionHashcode and the persistent login manager
>>>
Here:
>>> I don't think C is needed, we depend on cryptography in many other
>>> places, like the password hashes, not to mention the whole crypto module.
>>>
>>>>> Cons: require db migration of all document keys
>>>>> A) is really quick to implement, simple, and the least risky, but some may find it insufficient. Caleb?
>>>>> Obviously, B) with fallback is really nice, but I wonder if it is not overkill?
>>>>> I am worried about B) without fallback, but maybe I want it too flawless.
>>>>> C) is rather solid, while staying simple, but maybe overkill too.
>>>>> I am really undecided.
>>>>> WDYT?
>>>>>
--
Sergiu Dumitriu
http://purl.org/net/sergiu/
_______________________________________________
devs mailing list
devs(a)xwiki.org
http://lists.xwiki.org/mailman/listinfo/devs