Hi devs,
my research over the past weeks on Lucene querying for Watch
(http://jira.xwiki.org/jira/browse/XWATCH-116) brought me to the following
results:
First of all, as I already mentioned, since Lucene is a fulltext indexing
engine while Watch data, although stored in the wiki, is quite structured (and
so are its queries), something feels wrong about trying to do this. But the
query speed improvements are, or should be, significant, so we could try to
work around it -- the XWiki Lucene plugin indexes object properties in
documents as fields, so a structured search is plausible (with the right type
of Lucene fields, etc.).
Even so, not all Watch SQL queries are fully translatable to Lucene queries
without Watch-specific Lucene querying code and / or Watch-specific Lucene
indexing.
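To make this concrete, here is a minimal sketch of what such a structured
search could look like, written against the raw Lucene API; the field names
(and the way object properties map to index fields) are assumptions for
illustration, not the plugin's actual schema:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

// Articles of one feed that are not trashed, expressed as exact matches
// on object property fields instead of a fulltext query parse.
public class StructuredWatchSearch
{
    public static Hits findFeedArticles(IndexSearcher searcher, String feedName)
        throws java.io.IOException
    {
        BooleanQuery query = new BooleanQuery();
        // Hypothetical field names -- the real ones depend on how the
        // plugin maps object properties to index fields.
        query.add(new TermQuery(
            new Term("XWiki.FeedEntryClass.feedname", feedName)), Occur.MUST);
        query.add(new TermQuery(
            new Term("XWiki.FeedEntryClass.trashed", "0")), Occur.MUST);
        return searcher.search(query);
    }
}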
Second, there is a problem with Lucene's reliability and "response time", as
mentioned in the JIRA issue comments:
* there is a delay between the moment a document is created / modified and the
moment the change is retrievable through Lucene queries, because it first needs
to be indexed by Lucene. This fuzziness, although acceptable in some situations
(for example, retrieving the list of articles to show to the user), is not
acceptable in situations like article property updates (star, read, trash) or
feed adding / deleting -- "caching" these changes until they become retrievable
through Lucene querying is not an option at all.
* the XWiki Lucene plugin seems quite buggy / unstable: a lot of exceptions (a
server restart can even be fatal, due to a hung file lock -- see the recovery
sketch after this list), sometimes duplicate documents in the results,
sometimes missing results; all of this is explainable and acceptable in the
"fuzzy" setting of a fulltext search engine, but not when trying to use it for
structured search.
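About the restart problem: it sounds like the stale write lock Lucene leaves
on disk when the process dies before the IndexWriter is closed. A minimal
recovery sketch (plain Lucene 2.x API; it assumes we can tell that no other
writer process is alive) could be:

import java.io.File;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// Clears a stale write lock left behind by a killed server, so the
// indexer can reopen the index after a restart.
public class IndexLockRecovery
{
    public static void clearStaleLock(String indexPath) throws java.io.IOException
    {
        Directory dir = FSDirectory.getDirectory(new File(indexPath));
        if (IndexReader.isLocked(dir)) {
            // Only safe when no other process is still writing to the index.
            IndexReader.unlock(dir);
        }
    }
}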
Despite all this, I wrote some code to help me test Lucene and compare it to
SQL in the real Watch situation, for article list retrieval. The results for
the time spent on the server (I assumed that the time taken to transport
documents and print them in the Reader is the same regardless of the querying
technique), for a MySQL server with ~60000 articles in ~200 feeds, are:
* for the initial loading of the articles, on a freshly started server, SQL can
take up to 30-40 seconds, while Lucene takes "only" up to 20 (usually 15-16)
* for the initial load of the interface, on a server that has been running for
a while, it takes ~15 seconds for SQL and ~4-5 seconds on average (but can go
up to 10) for Lucene
* for a click on the All group, which is basically the same query as for the
initial load of the article list, Lucene can go under a second while SQL takes
around 7-9 seconds
* for a click on a feed with 1023 articles (i.e. loading the list of articles
in a specific feed), SQL takes about 3 seconds while Lucene can take from under
a second to a couple of seconds, depending on the time taken to load the actual
documents corresponding to the search results
* for pagination navigation, Lucene takes a second on average and SQL 2-3 seconds.
Note that Lucene retrieval still uses the database and SQL access, because its
results are LuceneDocuments (which hold the names of XWikiDocuments), not
XWikiDocuments themselves.
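That second step looks roughly like this (a sketch; "fullname" as the field
holding the document name is an assumption about the plugin's index schema):

import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.document.Document;
import org.apache.lucene.search.Hits;
import com.xpn.xwiki.XWikiContext;
import com.xpn.xwiki.doc.XWikiDocument;

// Turns Lucene hits back into XWikiDocuments: the index only stores
// document names, so every hit still costs a database load.
public class LuceneResultLoader
{
    public static List<XWikiDocument> loadDocuments(Hits hits, XWikiContext context)
        throws Exception
    {
        List<XWikiDocument> docs = new ArrayList<XWikiDocument>();
        for (int i = 0; i < hits.length(); i++) {
            Document luceneDoc = hits.doc(i);
            // Hypothetical field name for the Space.Page name of the document.
            String fullName = luceneDoc.get("fullname");
            docs.add(context.getWiki().getDocument(fullName, context));
        }
        return docs;
    }
}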
All the tests above were made with a server running on my computer: MySQL
5.0.51a, XWiki 1.6-SNAPSHOT, Java 1.5.0_15, on an AMD Turion 64 X2 TL-56 with
1.8 GB RAM, and with other applications running too.
I feel that, overall, the Lucene querying improvements are not that
spectacular, especially because it cannot cover all situations -- we would
still need SQL querying in some cases -- and because of its relative
instability (which we could think about fixing, though).
The other option for performance improvements in Watch would be a Watch
specialized server (as we've already discussed), which would allow us, among
other things, to use Watch-specific SQL queries (as opposed to now, when we
use generic queries because we have to go through the XWiki GWT API) and to
optimize as much as possible at that level. I haven't yet tested how much this
would improve things but, since I think we might be able to drop some tables
from some of the SQL joins we're doing right now, it should be better. Of
course, this kind of approach requires heavy refactoring, and potentially a
complete rewrite and rearchitecting of some pieces.
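To give an idea of what dropping tables from the joins could mean (the class
name, property name, and watch_article table below are made up for
illustration, not an actual schema):

// Generic query through the XWiki object model: the feed name lives in
// a StringProperty row, so filtering articles by feed joins three tables.
public class WatchQueries
{
    public static final String GENERIC_HQL =
        "select doc.fullName from XWikiDocument as doc, BaseObject as obj, "
            + "StringProperty as prop where obj.name = doc.fullName "
            + "and obj.className = 'XWiki.FeedEntryClass' "
            + "and prop.id.id = obj.id and prop.id.name = 'feedname' "
            + "and prop.value = ?";

    // With articles denormalized into one Watch-specific table, the same
    // question needs no joins at all.
    public static final String SPECIALIZED_SQL =
        "select article_id from watch_article where feed = ? and trashed = 0";
}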
Despite the coding challenge, I'd go for the second approach. WDYT?
Happy coding,
Anca Luca