Hi Karel,
On 24 Sep 2014 at 11:49:02, Karel Gardas (karel.gardas@centrum.cz) wrote:
Hello,
I'm trying to import the Wikipedia XML (English dumps without history) into
XWiki 6.0.1 running with PostgreSQL as the DB. I'm using mediawiki/1.0
syntax to ease the job on my side, especially since the task is to test whether
XWiki is able to hold just this amount of data and nothing more.
Interesting experiment :)
So far, the most critical issues found are:
1) Wikipedia's links are a little longer than expected. I'm afraid
this is usually a whole citation going into the link, so after
installing XWiki and letting Hibernate initialize the schema, I needed to shut it
down and alter the PostgreSQL table with:
alter table xwikilinks alter column xwl_link type varchar(4096);
This ensures that many more pages can be imported.
The xwikilinks table is the table containing all the backlinks for a given document.
Indeed the default is 255 chars for the “link” field, which contains a serialized reference
to the linked page (but without the wiki part if the wiki is the same as the wiki of the
document containing the link).
And “fullName” is also 255 chars by default and contains a serialized reference to
the document containing the link (without the wiki part).
So indeed 255 chars can quickly become insufficient if space and page names are a bit long.
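If you want to play it safe, you could widen that column the same way. A sketch mirroring your ALTER above; I'm assuming the column is named xwl_fullname as in the default schema:

-- assumption: the fullName field maps to the xwl_fullname column
alter table xwikilinks alter column xwl_fullname type varchar(4096);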
2) While importing, I hit an issue with duplication of the
xwikircs_pkey key.
It shows up as:
STATEMENT: insert into xwikircs (XWR_DATE, XWR_COMMENT, XWR_AUTHOR, XWR_DOCID, XWR_VERSION1, XWR_VERSION2) values ($1, $2, $3, $4, $5, $6)
ERROR: duplicate key value violates unique constraint "xwikircs_pkey"
DETAIL: Key (xwr_docid, xwr_version1, xwr_version2)=(3170339397610733377, 1, 1) already exists.
in the PostgreSQL console, and as:
2014-09-22 00:53:51,601 [http://localhost:8080/xwiki/rest/wikis/xwiki/spaces/Wikipedia/pages/Brecon_…] WARN o.h.u.JDBCExceptionReporter - SQL Error: 0, SQLState: 23505
2014-09-22 00:53:51,601 [http://localhost:8080/xwiki/rest/wikis/xwiki/spaces/Wikipedia/pages/Brecon_…] ERROR o.h.u.JDBCExceptionReporter - Batch entry 0 insert into xwikircs (XWR_DATE, XWR_COMMENT, XWR_AUTHOR, XWR_DOCID, XWR_VERSION1, XWR_VERSION2) values ('2014-09-22 00:53:51.000000 +02:00:00', '', 'XWiki.Admin', 3170339397610733377, 1, 1) was aborted. Call getNextException to see the cause.
2014-09-22 00:53:51,601 [http://localhost:8080/xwiki/rest/wikis/xwiki/spaces/Wikipedia/pages/Brecon_…] WARN o.h.u.JDBCExceptionReporter - SQL Error: 0, SQLState: 23505
2014-09-22 00:53:51,601 [http://localhost:8080/xwiki/rest/wikis/xwiki/spaces/Wikipedia/pages/Brecon_…] ERROR o.h.u.JDBCExceptionReporter - ERROR: duplicate key value violates unique constraint "xwikircs_pkey"
Detail: Key (xwr_docid, xwr_version1, xwr_version2)=(3170339397610733377, 1, 1) already exists.
in the xwiki/tomcat console.
I'm not able to solve this issue so far, as it looks like the key value
itself is generated by XWiki, probably from some other data, and I
haven't been able to find the related code yet.
The code is in XWikiDocument.getId().
There’s this caveat in the code:
// TODO: Ensure uniqueness of the generated id
// The implementation doesn't guarantee a unique id since it uses a hashing method which never guarantees
// uniqueness. However, the hash algorithm is really unlikely to collide in a given wiki. This needs to be
// fixed to produce a real unique id since otherwise we can have clashes in the database.
I don’t have many ideas except to rename the pages causing the problems, since the unique id
is computed based on the document reference; see the query sketch below to locate the clashing page.
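To find which already-imported page the failing one clashes with, you could look up the id from the error detail in the document table. A sketch, assuming the default xwikidoc schema (untested):

-- assumption: the default xwikidoc table with xwd_id/xwd_fullname columns;
-- the id value is taken from the "duplicate key" error detail above
select xwd_fullname from xwikidoc where xwd_id = 3170339397610733377;

Renaming either of the two pages that hash to the same id should then avoid the clash.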
Here’s the algorithm FYI:
/**
 * <p>
 * Serialize a reference into a unique identifier string within a wiki. It's similar to the
 * {@link UidStringEntityReferenceSerializer}, but is made appropriate for wiki-independent storage.
 * </p>
 * <p>
 * The string created looks like {@code 5:space3:doc} for the {@code wiki:space.doc} document reference,
 * and {@code 5:space3:doc15:xspace.class[0]} for the {@code wiki:space.doc^wiki:xspace.class[0]} object
 * (with {@code 5} being the length of the space name, i.e. the length of {@code space}, and {@code 3}
 * being the length of the page name, i.e. the length of {@code doc}).
 * </p>
 */
Denis might know better since he improved the uniqueness some time back.
Also, the question is, if this is a kind of hash function, whether I broke
it by making the links longer with the hack in (1).
No, it’s unrelated: the id is computed from the document’s reference, not from anything stored in the links table.
Thanks
-Vincent
Any comment on (1) and its correctness, and any idea on how to fix (2),
would be highly appreciated.
Thanks!
Karel