On Mon, Jan 9, 2012 at 12:20, Ludovic Dubost <ludovic(a)xwiki.com> wrote:
Hi,
I have one small concern which leads to a bigger concern, and I was wondering about something.
1/ Small concern: what did we do to verify the potential level of collisions, and whether there is a chance they happen in our case?
In theory there is a risk, the same as the one you have with current ids, but since the risk is much reduced, it would take really bad luck to hit it. Note that we move from a badly distributed 32-bit hash to a well-suited 64-bit one. So the risk is not zero, but a collision is really unexpected, since we double the number of hash bits (which squares the hash space) and use an algorithm with much better distribution.
I see we want to truncate the MD5 hash to 64 bits. I was wondering if there is not a risk of having more collisions.
Sure, using the lower 64 bits is not as good as using the full 128 bits. Using more bits would require a change in the mapping and schema (see below).
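For illustration, the truncation itself is tiny in Java (a sketch of the idea only, not the actual patch; toDocumentId and serializedReference are names I made up):

    import java.nio.ByteBuffer;
    import java.security.MessageDigest;

    // MD5 yields a 128-bit digest; keep 64 bits of it by reading the
    // first 8 bytes as a big-endian long.
    public static long toDocumentId(String serializedReference) throws Exception
    {
        byte[] digest = MessageDigest.getInstance("MD5")
            .digest(serializedReference.getBytes("UTF-8"));
        return ByteBuffer.wrap(digest).getLong();
    }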
My question here is: what did we do to verify the level of collisions on real data?
We could provide some XWiki SAS client DBs, including our intranet which is quite big, for testing, if there were a testing program.
That will be the purpose of the non-final release: have some of you check that their large DB works well with it. In particular, I would appreciate some tests in non-MySQL environments, especially Oracle...
2/ Bigger concern: wouldn't it be better to have a way to activate/deactivate the new feature? This would allow us to still upgrade and run tests on real-life data without risking being stuck in a corner.
We are already in a corner, since we have already hit an id collision, so I am only trying to push that corner further away until we can fully change this to a truly unique id. I know this is not easy, but we cannot stay stuck.
Providing both would require a two-way migrator, and this would also introduce more risk of mistakes that could cause database corruption. I have built a solid migrator that ensures you migrate properly and voluntarily before using the new core.
Note that if you run an old core against the new DB, it will corrupt it somewhat, by not seeing any documents and recreating some default initial documents using the old ids. This is the more concerning part: even if documents will not mix up, their objects partly will, which would cause really annoying issues. But what can we do? We cannot change old cores retroactively. So even if we provide a rock-solid solution, an old core will still corrupt data. We cannot prevent all administrator mistakes.
Also, providing both would mean that we do not trust our 64-bit hash to be better than the 32-bit one. As I said, there is no zero risk, but the probability is really close to zero.
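To put a rough number on "near zero" (my own back-of-the-envelope estimate, not a measurement on real data): with a well-distributed 64-bit hash, the expected number of colliding pairs among n documents is about n^2 / 2^65. For a million documents that is roughly 2^40 / 2^65 = 2^-25, i.e. about 3 chances in 100 million; with a 32-bit hash, the same estimate already reaches one expected collision around 90,000 documents.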
3/ Wondering: wouldn't it be better to use the real reference as the ID, and move to strings for it?
Given that in an XWiki database this part is really small (compared to attachments and the data itself), are there really any reasons to use numeric IDs for this reference? Wouldn't the use of a String be better in the end? We already use this for the join between xwikidoc and xwikiobjects and haven't seen any big problem with that, did we?
But objects still use an id (for Hibernate, and to link properties to objects), as you mention later. Maybe you mean that we use a mix of both for objects and their properties.
If we used that method, wouldn't it mean ZERO collisions?
Sure, it would have been zero if done from the beginning, but it was not, and this is now really difficult to change. The best would have been not to use a significant ID at all. Changing to strings now would require an external migration process that uses the old mapping to create the new ids, then another process that uses the new mapping and removes the old ids. This is really another job, best done when we fully review the model.
4/ Small additional stuff
There is also the migration of object IDs, right? The object IDs use the same system and also have a risk of collision (which would lead to property data being shared with completely unrelated documents).
Good catch! I had not seen it, since objects do not directly depend on document ids, but indirectly depend on similarly calculated ones. I need to look at these as well, since this could mean two objects of previously colliding documents would still collide.
Thanks,
Ludovic
2012/1/7 Denis Gervalle <dgl(a)softec.lu>
Now that the database migration mechanism has been improved, I would like to go ahead with my patch to improve document ids.
Currently, ids are simple String hashcodes of a locally serialized document reference, including the language for translated documents. The likelihood of having duplicates with Java's string hashing algorithm is really high.
What I propose is:
1) use an MD5 hash, which is particularly good at distributing
2) truncate the hash to the first 64 bits, since the XWD_ID column is a 64-bit long
3) use a better string representation as the source of the hashing
Based on previous discussions, points 1) and 2) have already been agreed on, and this vote is in particular about the string used for 3).
I propose it in 2 steps:
1) before locales are fully supported in document references, use this format:
<lengthOfLastSpaceName>:<lastSpaceName><lengthOfDocumentName>:<documentName><lengthOfLanguage>:<language>
where the language would be an empty string for the default document, so it would look like 7:mySpace5:myDoc0: and its French translation would be 7:mySpace5:myDoc2:fr
2) when locales are included in references, we will replace the implementation by a reference serializer that produces the same kind of representation, but that includes all spaces (not only the last one), to be prepared for the future.
While doing so, I also propose to fix the cache key issue by using the same reference, but prefixed by <lengthOfWikiName>:<wikiName>, so the previous examples will have the following keys in the document cache:
5:xwiki7:mySpace5:myDoc0: and 5:xwiki7:mySpace5:myDoc2:fr
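To make the format concrete, here is a minimal sketch of how such keys could be built (my own illustration, not the actual serializer; the method names are hypothetical):

    // Length-prefixed local key, e.g. localKey("mySpace", "myDoc", "")
    // gives "7:mySpace5:myDoc0:".
    public static String localKey(String space, String doc, String language)
    {
        return space.length() + ":" + space
            + doc.length() + ":" + doc
            + language.length() + ":" + language;
    }

    // Cache key: the same, prefixed with the wiki name, e.g.
    // cacheKey("xwiki", "mySpace", "myDoc", "fr") gives "5:xwiki7:mySpace5:myDoc2:fr".
    public static String cacheKey(String wiki, String space, String doc, String language)
    {
        return wiki.length() + ":" + wiki + localKey(space, doc, language);
    }

The local key is what would feed the MD5 hashing above; the wiki prefix is only needed for the cache key, presumably because ids only need to be unique within one wiki database.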
Using such a key (compared to the usual serialization) has the following advantages:
- it ensures uniqueness of the reference without requiring a complex escaping algorithm, which is unneeded here
- it is potentially reversible
- it is faster to compute than the usual serialization
- it supports the language
- it is independent of the current serialization, which may evolve separately; so it will be stable over time, which is really important since it serves as the base for the hashing algorithm used for the document ids stored in the database.
I would like to introduce this as early as possible, which means as soon as we are confident in the recently introduced migration mechanism.
Since the migration of ids will convert 32-bit hashes into 64-bit ones, the risk of collision is really low, and to be careful, I have written a migration algorithm that supports such collisions (unless they cause a circular reference collision, but this is really unexpected). However, changing ids again later, if we change our mind, would be much riskier and the migration difficult to implement, so it is really important that we agree on the way we compute these ids, once and for all.
Here is my +1,
--
Denis Gervalle
SOFTEC sa - CEO
eGuilde sarl - CTO
--
Ludovic Dubost
Founder and CEO
Blog: http://blog.ludovic.org/
XWiki: http://www.xwiki.com
Skype: ldubost GTalk: ldubost
--
Denis Gervalle
SOFTEC sa - CEO
eGuilde sarl - CTO