Re: [xwiki-devs] [VOTE] Change document id stored in the database to reduce the likelihood of duplicate id

8 Jan 2012

+1

Caleb

On 01/07/2012 04:39 PM, Denis Gervalle wrote:
...
 Now that the database migration mechanism has been improved, I would like
 to go ahead with my patch to improve document ids.

 Currently, ids are a simple string hash code of a locally serialized document
 reference, including the language for translated documents. The likelihood
 of having duplicates with Java's string hashing algorithm is really
 high.
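
 To illustrate the problem, here is a minimal sketch of how easily
 String.hashCode() collides. It assumes the local serialization looks like
 "Space.Page"; the page names and the class name are hypothetical and chosen
 only to trigger a well-known colliding pair.

```java
class HashCodeCollisionDemo {
    // True when two distinct strings share the same 32-bit String.hashCode().
    static boolean collides(String a, String b) {
        return !a.equals(b) && a.hashCode() == b.hashCode();
    }

    public static void main(String[] args) {
        // "Aa" and "BB" hash to the same value, and a shared prefix keeps
        // the collision, so two different (hypothetical) pages get the same id.
        System.out.println(collides("Main.Aa", "Main.BB"));
    }
}
```

 Because the hash is only 32 bits wide and built from a simple polynomial,
 any pair of names that differ only in such a colliding fragment will clash.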

 What I propose is:

  1) use MD5 hashing, which distributes values particularly well.
  2) truncate the hash to its first 64 bits, since the XWD_ID column is a
 64-bit long.
  3) use a better string representation as the source for hashing
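
 Points 1) and 2) could be sketched as follows. This is an illustration, not
 the actual patch: the class and method names are hypothetical, and both the
 UTF-8 encoding of the key and the big-endian reading of the first 8 bytes
 are assumptions that the proposal does not specify.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class Md5LongId {
    // Hashes the key with MD5 and keeps only the first 64 bits of the digest.
    static long idFor(String key) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            // Assumption: the key is encoded as UTF-8 before hashing.
            byte[] digest = md5.digest(key.getBytes(StandardCharsets.UTF_8));
            // Assumption: the first 8 of the 16 digest bytes, read big-endian.
            return ByteBuffer.wrap(digest).getLong();
        } catch (NoSuchAlgorithmException e) {
            // MD5 must be available in every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(idFor("7:mySpace5:myDoc0:"));
    }
}
```

 The result is a signed Java long, so roughly half of the generated ids will
 be negative, which is fine for a 64-bit integer column.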

 Based on previous discussion, points 1) and 2) have already been agreed, and
 this vote is in particular about the string used for 3).
 I propose to do it in 2 steps:

  1) before locales are fully supported in document references, use this
 format:

<lengthOfLastSpaceName>:<lastSpaceName><lengthOfDocumentName>:<documentName><lengthOfLanguage>:<language>
     where language would be an empty string for the default document, so it
 would look like 7:mySpace5:myDoc0: and its French translation would be
 7:mySpace5:myDoc2:fr
  2) when locales are included in references, we will replace the
 implementation with a reference serializer that produces the same kind
 of representation, but that includes all spaces (not only the last
 one), to be prepared for the future.
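
 The format of step 1) can be sketched like this. The class and method names
 are hypothetical, and measuring lengths in Java chars (UTF-16 units) is an
 assumption, since the proposal does not say how lengths are counted.

```java
class LocalKey {
    // Builds <length>:<value> for each of space, document and language.
    static String localKey(String space, String document, String language) {
        StringBuilder sb = new StringBuilder();
        for (String part : new String[] {space, document, language}) {
            // Assumption: length is the number of Java chars in the part.
            sb.append(part.length()).append(':').append(part);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(localKey("mySpace", "myDoc", ""));   // default document
        System.out.println(localKey("mySpace", "myDoc", "fr")); // French translation
    }
}
```

 The default document simply passes an empty language, which yields the
 trailing "0:" segment shown in the examples above.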

 While doing so, I also propose to fix the cache key issue by using the same
 representation, but prefixed by <lengthOfWikiName>:<wikiName>, so the previous
 examples will have the following keys in the document cache:
 5:xwiki7:mySpace5:myDoc0: and 5:xwiki7:mySpace5:myDoc2:fr
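
 A minimal sketch of the cache key, under the same assumptions as the format
 itself (hypothetical class and method names, lengths counted in Java chars):

```java
class DocumentCacheKey {
    // Appends one <length>:<value> segment to the builder.
    private static void appendSegment(StringBuilder sb, String value) {
        sb.append(value.length()).append(':').append(value);
    }

    // The wiki name comes first, followed by the same segments used for the id.
    static String cacheKey(String wiki, String space, String document, String language) {
        StringBuilder sb = new StringBuilder();
        appendSegment(sb, wiki);
        appendSegment(sb, space);
        appendSegment(sb, document);
        appendSegment(sb, language);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(cacheKey("xwiki", "mySpace", "myDoc", "fr"));
    }
}
```

 Only the cache key carries the wiki name; the string hashed for the
 database id stays local to the wiki.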

 Using such a key (compared to the usual serialization) has the following
 advantages:
  - ensures uniqueness of the reference without requiring a complex escaping
 algorithm, which is unneeded here.
  - potentially reversible
  - faster than the usual serialization
  - supports language
  - independent of the current serialization, which may evolve on its own,
 so it will be stable over time, which is really important when it is used as
 the basis for the hashing algorithm used for document ids stored in the
 database.
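
 The uniqueness point can be shown with a small sketch. The naive
 dot-joined form below is a deliberately simplified stand-in, not the real
 XWiki serializer (which escapes separators); all names are hypothetical and
 the language segment is left out for brevity.

```java
class KeyUniquenessDemo {
    // Simplified stand-in: joins with a dot and performs no escaping.
    static String naive(String space, String document) {
        return space + "." + document;
    }

    // Length-prefixed form: each segment announces its own length.
    static String lengthPrefixed(String space, String document) {
        return space.length() + ":" + space + document.length() + ":" + document;
    }

    public static void main(String[] args) {
        // Two different references whose names contain the separator.
        System.out.println(naive("a.b", "c").equals(naive("a", "b.c")));
        System.out.println(lengthPrefixed("a.b", "c").equals(lengthPrefixed("a", "b.c")));
    }
}
```

 Without escaping, the naive form maps both references to "a.b.c", while the
 length prefixes keep them apart and also let a reader split the key back
 into its segments.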

 I would like to introduce this as early as possible, which means as soon
 as we are confident in the migration mechanism recently introduced.
 Since the migration of ids will convert 32-bit hashes into 64-bit ones, the
 risk of collision is really low, and to be careful, I have written a
 migration algorithm that supports such collisions (unless one causes a
 circular reference collision, but this is really unexpected). However,
 changing ids again later, if we change our mind, would be much more risky
 and the migration difficult to implement, so it is really important that
 we agree on the way we compute these ids, once and for all.
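
 To give a rough sense of the difference between 32-bit and 64-bit ids, here
 is a sketch using the standard birthday-bound approximation. It assumes
 uniformly distributed hashes, which flatters the current String.hashCode()
 scheme, and the document count of 100000 is a hypothetical figure.

```java
class CollisionEstimate {
    // Birthday-bound approximation: p ~= 1 - exp(-n(n-1) / (2 * 2^bits)).
    static double collisionProbability(long documents, int bits) {
        double space = Math.pow(2.0, bits);
        double exponent = -((double) documents * (documents - 1)) / (2.0 * space);
        // expm1 keeps precision when the probability is very small.
        return -Math.expm1(exponent);
    }

    public static void main(String[] args) {
        long documents = 100_000L; // hypothetical wiki size
        System.out.println("32-bit: " + collisionProbability(documents, 32));
        System.out.println("64-bit: " + collisionProbability(documents, 64));
    }
}
```

 Under these assumptions, at least one collision is more likely than not
 with 32-bit ids at that size, while with 64-bit ids it stays far below one
 in a billion.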

 Here is my +1,
