On Mon, Jan 9, 2012 at 12:20, Ludovic Dubost <ludovic(a)xwiki.com> wrote:
Hi,
I have one small concern which leads to a bigger concern, and I was wondering about something.
1/ Small concern: what did we do to verify the potential level of collisions, and whether there is a chance they happen in our case?
In theory there is a risk, the same as the one you have with current ids, but since the risk is much reduced, it would take really bad luck to hit it. Note that we move from a badly distributed 32-bit hash to a well-suited 64-bit one. So the risk is not zero, but a collision is really unexpected, since we double the number of hash bits (which squares the hash space) and use an algorithm with much better distribution.
I see we want to truncate the MD5 hash to 64 bits. I was wondering if there is not a risk of having more collisions.
Sure, using the lower 64 bits is not as good as using the full 128 bits. Using more bits would require a change in the mapping and schema (see below).
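For illustration, the truncation itself is tiny in Java (a sketch of the idea only, not the actual patch; toDocumentId and serializedReference are names I made up):

    import java.nio.ByteBuffer;
    import java.security.MessageDigest;

    // MD5 yields a 128-bit digest; keep 64 bits of it by reading the
    // first 8 bytes as a big-endian long.
    public static long toDocumentId(String serializedReference) throws Exception
    {
        byte[] digest = MessageDigest.getInstance("MD5")
            .digest(serializedReference.getBytes("UTF-8"));
        return ByteBuffer.wrap(digest).getLong();
    }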
My question here is: what did we do to verify the level of collisions on real data?
We could provide some XWiki SAS client DBs, including our intranet which is quite big, for testing, if there were a testing program.
That will be the purpose of the non-final release: have some of you check that their large DB works well with it. In particular, I would appreciate some tests in non-MySQL environments, especially Oracle...
2/ Bigger concern: wouldn't it be better to have a way to activate/deactivate the new feature? This would allow us to still upgrade and run tests on real-life data without risking being stuck in a corner.
We are already in a corner, since we have already hit an id collision, so I am only trying to push that corner further away until we can fully change this to a truly unique id. I know this is not easy, but we cannot stay stuck.
Providing both would require a two-way migrator, and this would also introduce more risk of mistakes that could cause database corruption. I have built a solid migrator that ensures you migrate properly and voluntarily before using the new core.
Note that if you run an old core against the new DB, it will corrupt it somewhat, by not seeing any documents and recreating some default initial documents using the old ids. This is the more concerning part: even if documents will not mix up, their objects partly will, which would cause really annoying issues. But what can we do? We cannot change old cores retroactively. So even if we provide a rock-solid solution, an old core will still corrupt data. We cannot prevent all administrator mistakes.
Also, providing both would mean that we do not trust our 64-bit hash to be better than the 32-bit one. As I said, there is no zero risk, but the probability is really close to zero.
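To put a rough number on "near zero" (my own back-of-the-envelope estimate, not a measurement on real data): with a well-distributed 64-bit hash, the expected number of colliding pairs among n documents is about n^2 / 2^65. For a million documents that is roughly 2^40 / 2^65 = 2^-25, i.e. about 3 chances in 100 million; with a 32-bit hash, the same estimate already reaches one expected collision around 90,000 documents.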
3/ Wondering: wouldn't it be better to use the real reference as the ID, and move to strings for it?
Given that in an XWiki database this part is really small (compared to attachments and the data itself), are there really any reasons to use numeric IDs for this reference? Wouldn't the use of a String be better in the end? We already use this for the join between xwikidoc and xwikiobjects and haven't seen any big problem with that, did we?
But objects still use an id (for Hibernate, and to link properties to objects), as you mention later. Maybe you mean that we use a mix of both for objects and their properties.
If we used that method, wouldn't it mean ZERO collisions?
Sure, it would have been zero if done from the beginning, but it was not, and this is now really difficult to change. The best would have been not to use a significant ID at all. Changing to strings now would require an external migration process that uses the old mapping to create the new ids, then another process that uses the new mapping and removes the old ids. This is really another job, best done when we fully review the model.
4/ Small additional stuff
There is also the migration of object IDs, right? The object IDs use the same system and also have a risk of collision (which would lead to property data being shared with completely unrelated documents).
Good catch! I had not seen it, since objects do not directly depend on document ids, but indirectly depend on similarly calculated ones. I need to look at these as well, since this could mean two objects of previously colliding documents would still collide.
Thanks,
Ludovic
2012/1/7 Denis Gervalle <dgl(a)softec.lu>
Now that the database migration mechanism has been improved, I would like to go ahead with my patch to improve document ids.
Currently, ids are simple String hashcodes of a locally serialized document reference, including the language for translated documents. The likelihood of having duplicates with Java's string hashing algorithm is really high.
What I propose is:
1) use an MD5 hash, which is particularly good at distributing
2) truncate the hash to the first 64 bits, since the XWD_ID column is a 64-bit long
3) use a better string representation as the source of the hashing
Based on previous discussions, points 1) and 2) have already been agreed on, and this vote is in particular about the string used for 3).
I propose it in 2 steps:
1) before locales are fully supported in document references, use this format:
<lengthOfLastSpaceName>:<lastSpaceName><lengthOfDocumentName>:<documentName><lengthOfLanguage>:<language>
where the language would be an empty string for the default document, so it would look like 7:mySpace5:myDoc0: and its French translation would be 7:mySpace5:myDoc2:fr
2) when locales are included in references, we will replace the implementation by a reference serializer that produces the same kind of representation, but that includes all spaces (not only the last one), to be prepared for the future.
While doing so, I also propose to fix the cache key issue by using the same reference, but prefixed by <lengthOfWikiName>:<wikiName>, so the previous examples will have the following keys in the document cache:
5:xwiki7:mySpace5:myDoc0: and 5:xwiki7:mySpace5:myDoc2:fr
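To make the format concrete, here is a minimal sketch of how such keys could be built (my own illustration, not the actual serializer; the method names are hypothetical):

    // Length-prefixed local key, e.g. localKey("mySpace", "myDoc", "")
    // gives "7:mySpace5:myDoc0:".
    public static String localKey(String space, String doc, String language)
    {
        return space.length() + ":" + space
            + doc.length() + ":" + doc
            + language.length() + ":" + language;
    }

    // Cache key: the same, prefixed with the wiki name, e.g.
    // cacheKey("xwiki", "mySpace", "myDoc", "fr") gives "5:xwiki7:mySpace5:myDoc2:fr".
    public static String cacheKey(String wiki, String space, String doc, String language)
    {
        return wiki.length() + ":" + wiki + localKey(space, doc, language);
    }

The local key is what would feed the MD5 hashing above; the wiki prefix is only needed for the cache key, presumably because ids only need to be unique within one wiki database.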
Using such a key (compared to the usual serialization) has the following advantages:
- it ensures uniqueness of the reference without requiring a complex escaping algorithm, which is unneeded here
- it is potentially reversible
- it is faster to compute than the usual serialization
- it supports the language
- it is independent of the current serialization, which may evolve separately; so it will be stable over time, which is really important since it serves as the base for the hashing algorithm used for the document ids stored in the database.
I would like to introduce this as early as possible, which means as soon as we are confident in the recently introduced migration mechanism.
Since the migration of ids will convert 32-bit hashes into 64-bit ones, the risk of collision is really low, and to be careful, I have written a migration algorithm that supports such collisions (unless they cause a circular reference collision, but this is really unexpected). However, changing ids again later, if we change our mind, would be much riskier and the migration difficult to implement, so it is really important that we agree on the way we compute these ids, once and for all.
Here is my +1,
--
Denis Gervalle
SOFTEC sa - CEO
eGuilde sarl - CTO
--
Ludovic Dubost
Founder and CEO
Blog: http://blog.ludovic.org/
XWiki: http://www.xwiki.com
Skype: ldubost GTalk: ldubost
--
Denis Gervalle
SOFTEC sa - CEO
eGuilde sarl - CTO