Hi Sergiu,
Many thanks for your feedback, I am no longer alone in the dark ;)
On Thu, Sep 22, 2011 at 02:02, Sergiu Dumitriu <sergiu(a)xwiki.com> wrote:
On 09/20/2011 01:32 PM, Denis Gervalle wrote:
Since nobody seems to have remarks or opinions, I have started to do A), but it was still insufficient for my requirements. So I have now done B) without fallback, using MD5, and if no one stops me, I will probably do C) soon.
Could you easily switch to SHA-1 instead?
I can, but I do not feel this would be a huge improvement, since we keep only the lower 64 bits.
Since oldcore already uses MD5 for other things (login persistence and hashcoding of document content for comparison), I would prefer not to introduce a dependency on a different algorithm without a real need.
Since the storage is entirely broken by this change, I need the startup to be stopped on migration failure, and migration to be mandatory. Migration errors were ignored until now, but this is wrong IMO: it would mean that the code is both backward compatible with old data and failsafe against any migration failure. Moreover, migration can also be disabled by config, but this is not really safe IMO.
I agree that migration failure should stop the wiki from starting, provided the failure is very clear about the error, not just a dumb "database bad, good bye, and good luck!".
The failure is reported in two steps:
1) a "migration failure on database x" with the exception returned by the migrator attached
2) a "could not continue due to x migration failures"
I don't fully agree that they should be mandatory, though. If the
sysadmin is competent enough to disable migrations, since they are
enabled by default, he should be able to decide when to run them and
when to disable them. It should be known that during an upgrade
migrations should be run, but it's better for performance to disable
this extra check that could take a while on large farms.
I probably should have been clearer, but what I mean is a simple step which ensures that the database version is high enough to be supported by the running core. This means that even if migrations are disabled (meaning: I do not want my database to be migrated without my explicit authorization), we still need to check the migration state and stop if any migration has not been applied (see the sketch after the scenario below). I agree that this could be a performance issue at startup of large farms; maybe we could delay the check until a real wiki is assigned to a given database (I have not checked whether this is complex or not). What frightens me is the following:
1) the admin has disabled migrations
2) the admin upgrades XWiki and reuses the old config
3) the admin starts the new version, which uses the changed ids, and the wiki starts creating many duplicate documents, as in an empty DB
4) the only option for the admin, if he understands what happened, is to restore all wikis that have been accessed during 3)
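To make the intent concrete, here is a minimal sketch of the kind of mandatory check I have in mind; the class and method names are purely illustrative, not the actual oldcore API:

    // Hypothetical illustration only, not the actual oldcore migration API.
    public final class DatabaseVersionCheck {
        /**
         * @param database the database/wiki name
         * @param dbVersion the version recorded in the database (e.g. in xwikidbversion)
         * @param requiredVersion the minimum version supported by the running core
         */
        public static void checkDatabaseVersion(String database, int dbVersion, int requiredVersion) {
            if (dbVersion < requiredVersion) {
                // Even with automatic migrations disabled, refuse to start on an outdated schema
                // instead of silently creating duplicate documents with the new ids.
                throw new IllegalStateException("Database [" + database + "] is at version "
                    + dbVersion + " but this core requires at least version " + requiredVersion
                    + "; run the migrations before starting the wiki.");
            }
        }
    }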
Currently I have added code that throws if a migration fails, preventing the wiki from being started. Should I improve this further? Could we expect users to take care of the migration requirement, or should we have mandatory migrations and optional ones?
I don't think that there are any optional migrations currently...
They're all important.
That was my feeling; simply, some have fewer consequences than others when not applied.
I am also wondering if I should target committing in release 3.2 or 3.3, since the name of the migrator and the db version will depend on this.
We're getting close to the 3.2 release, so the codebase should be moving
towards a lower entropy point. If you fully trust your changes and can
commit them during the next couple of days, you have my +0.
The time to commit depends mainly on this discussion. I trust my code, but a review would be welcome.
Hoping to have some comments this time.
On Mon, Sep 19, 2011 at 12:54, Denis Gervalle<dgl(a)softec.lu> wrote:
Hi Devs,
Since I fell into this well-known XWiki caveat recently, I would like to improve this.
Currently, XWikiDocument.getId() is almost equivalent to String.hashcode(fullName:lang). Since this is a really poor hashing method for small changes, the risk that two documents with similar names of the same length collide is really high. This id is used by the Hibernate store for document unicity, so it really needs to be unique, while being at most a 64-bit numeric at the same time. Currently we use only 32 bits, since Java hashcodes are limited to 32-bit integers.
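As a quick illustration of how easily String.hashCode() collides on short, similar strings (the names below are made up for the example, not real reported collisions):

    public class HashCollisionDemo {
        public static void main(String[] args) {
            // "Aa" and "BB" are a classic String.hashCode() collision:
            // 'A'*31 + 'a' = 65*31 + 97 = 2112 and 'B'*31 + 'B' = 66*31 + 66 = 2112.
            System.out.println("Aa".hashCode()); // 2112
            System.out.println("BB".hashCode()); // 2112
            // The collision carries over to longer, document-like names of the same length:
            System.out.println("Main.AaAa:en".hashCode() == "Main.BBBB:en".hashCode()); // true
        }
    }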
Ideally, there would be no link at all between ids and the document fullName:lang, but converting the current implementation is not really easy. This is probably why XWIKI-4396 has never been fixed. Therefore, my current goal is to reduce the likelihood of a collision by choosing a better hashing method, and by taking into account the fact that document fullnames are short strings and that the number of unique ids required is very limited (since unicity is only expected within a given XWiki database) compared to the 64-bit integer range.
So we need to choose a better algorithm, and here are IMHO the potential options:
A) use a simple but more efficient non-cryptographic hashing function that runs on 64 bits. I was thinking about using the algorithm by Professor Daniel J. Bernstein (DJB), since it is a well-known, widely used, easy to implement algorithm with a good distribution on small strings (a sketch follows below).
Pro: no dependency; fast; 64 bits is better than hashcode
Cons: probably more risk of collision compared to MD5 or SHA, but far less than now; requires a db migration of all document keys
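For reference, here is a minimal 64-bit sketch of the DJB2-style hash I have in mind (illustrative only, not necessarily the exact code I committed):

    public final class Djb2Hash {
        /**
         * DJB2-style hash extended to 64 bits: hash = hash * 33 + c, starting from 5381.
         * Illustrative sketch; the real migrator may differ in details (seed, charset, ...).
         */
        public static long hash64(String key) {
            long hash = 5381L;
            for (int i = 0; i < key.length(); i++) {
                hash = ((hash << 5) + hash) + key.charAt(i); // hash * 33 + current char
            }
            return hash;
        }

        public static void main(String[] args) {
            // e.g. hashing the fullName:lang key of a document
            System.out.println(hash64("Main.WebHome:en"));
        }
    }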
B) use MD5 or an even stronger SHA-1 or SHA-256 algorithm from the JCA, truncating to the lower 64 bits (see the sketch below). Note that oldcore already uses MD5 for hashcoding a whole XWikiDocument to provide the API with getVersionHashcode(), and for the validation hash used by the persistent login manager. The first uses Object.hashcode() as a fallback, which is really bad and defeats the purpose. The second does not provide any fallback and may fail unexpectedly. For our case, if we really want a fallback, we need to store the hashing algorithm used in the database at creation time, and fail anyway when it is not available.
Pro: already used in oldcore; probably fewer collisions; with the fallback, really flexible since it would be possible to choose the algorithm at creation time, and it does not require a full migration of existing databases.
Cons: requires at least a DB schema change to add the hashing algorithm, probably as a column of xwikidbversion; if this config value is broken, the whole DB is broken
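A minimal sketch of what B) boils down to, assuming the id is still derived from the fullName:lang key (illustrative, not the committed code):

    import java.nio.charset.Charset;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    public final class DocumentIdHash {
        /**
         * Hash the fullName:lang key with the given JCA algorithm (MD5, SHA-1, SHA-256)
         * and keep only 64 bits of the digest, here the first 8 bytes.
         */
        public static long hash64(String fullNameAndLang, String algorithm)
            throws NoSuchAlgorithmException {
            MessageDigest digest = MessageDigest.getInstance(algorithm);
            byte[] bytes = digest.digest(fullNameAndLang.getBytes(Charset.forName("UTF-8")));
            long id = 0L;
            for (int i = 0; i < 8; i++) {
                id = (id << 8) | (bytes[i] & 0xFF);
            }
            return id;
        }

        public static void main(String[] args) throws NoSuchAlgorithmException {
            System.out.println(hash64("Main.WebHome:en", "MD5"));
        }
    }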
C) use our own MD5 implementation when the JCA provider is missing it. I was thinking about integrating something like http://twmacinta.com/myjava/fast_md5.php (non-native version), which is LGPL. This would ensure availability of the hashing algorithm while keeping it rather strong.
Pro: no dependency; could also provide MD5 to getVersionHashcode() and the persistent login manager
I don't think C is needed, we depend on cryptography in many other
places, like the password hashes, not to mention the whole crypto module.
Cons: requires a db migration of all document keys
A) is really quick to implement, simple, and the least risky, but some may find it insufficient. Caleb?
Obviously, B) with fallback is really nice, but I wonder if it is not overkill?
I am worried about B) without fallback, but maybe I want it too flawless.
C) is rather solid, while staying simple, but maybe overkill too.
I am really undecided.
WDYT?
--
Sergiu Dumitriu
http://purl.org/net/sergiu/
--
Denis Gervalle
SOFTEC sa - CEO
eGuilde sarl - CTO