Fabio,
interesting document and challenging mission!
There's a whole lot to say about your document, but here are a few gut feelings:
- first, I think you should describe a few application scenarios in more detail; I think
you'd come to the conclusion that both an EmbeddedServer and a Solr web app (inside
the server or outside) make sense. It looks like this decision is not needed now, as
SolrJ offers you an abstraction.
- I feared you would not get all the benefits with the EmbeddedServer, in particular
caching and auto-warming, but that seems not to be the case:
http://lucene.472066.n3.nabble.com/Embedded-Server-Caching-Stats-page-updat…
- I find your query examples pretty hairy. Average users want to "just type"
and get relevance-ranked results. Solr supports this well with the DisMax query handler
(it allows you to put a higher weight on the title, for example, than on the body, than on
attachments...). I would say you need both (the Solr web app's default query handler
allows both with an extra prefix). Another major advantage, which the Lucene plugin missed,
is that you can have one field that is stemmed and a copy of it that is not. A
match in the exact field would then rank higher.
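As a sketch of the stemmed/exact idea (the field and type names below are my own, not taken from Fabio's document), the schema could index both a stemmed and an exact copy of the text, and a DisMax query could boost the exact field:

```xml
<!-- hypothetical schema.xml fragment: one stemmed and one exact copy of the text -->
<field name="text_stemmed" type="text_en" indexed="true" stored="false"/>
<field name="text_exact"   type="text_ws" indexed="true" stored="false"/>
<copyField source="text" dest="text_stemmed"/>
<copyField source="text" dest="text_exact"/>
```

A request with something like defType=dismax&qf=title^5 text_exact^3 text_stemmed would then rank exact matches above merely stemmed ones.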
- In all applications I've worked on, indexing pages when they change is not enough,
because there are pages that depend on others... this needs to be addressed at the
application level (think, e.g., about the dashboard, or about "book" pages that
enclose others): re-index triggers.
Another crucial aspect is to encourage anyone working on a particular schema to be
economical. The biggest flaw of the xwiki-lucene-module is that it indexed and stored
everything... which meant that a single result document was quite big. Storing is
typically not useful.
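To make the "be economical" point concrete (field names invented for illustration), a field can be indexed for matching without its content being stored, so that a single result document stays small:

```xml
<!-- hypothetical fragment: attachment text is searchable but never stored -->
<field name="attachment_content" type="text_en" indexed="true" stored="false"/>
<!-- only what the result list actually displays gets stored -->
<field name="title" type="text_en" indexed="true" stored="true"/>
```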
- particular scenarios will have particular UIs. Could you sketch one that would be the
default for 3.2? Would authors be facets? Spaces?
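For the facet question, a faceted request could look like this (assuming hypothetical author and space fields in the schema):

```
/solr/select?q=xwiki&facet=true&facet.field=author&facet.field=space
```

Solr would then return, next to the hits, document counts per author and per space, ready to render as clickable facets.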
- I would suggest adopting best practices as soon as possible: make evaluations possible
by default. A typical evaluation would be run by a content expert who knows his
documents; he would invent a few queries (e.g. by reading the logs) and check which
results are correct or incorrect. That would give mean precision and recall at each rank,
something you can then collect and tabulate to assess the "mean" quality of a
search engine (this paper:
http://www.oracleimg.com/technetwork/database/enterprise-edition/imt-qualit…
explains this well). I'm just back from a summer school on Information Retrieval and
there's a lot there.
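As a minimal sketch of such an evaluation (the document ids and relevance judgments below are invented; real ones would come from the content expert), precision and recall for one query can be computed like this:

```python
def precision_recall(retrieved, relevant):
    """Precision and recall of one query's result list against expert judgments."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = retrieved_set & relevant_set  # correctly returned documents
    precision = len(hits) / len(retrieved_set) if retrieved_set else 0.0
    recall = len(hits) / len(relevant_set) if relevant_set else 0.0
    return precision, recall

# one query, judged by the content expert
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
print(p, r)  # 0.5 and 2/3: half the results are relevant, two of three relevant docs found
```

Collecting these numbers over a handful of queries and averaging them gives the "mean" quality figure mentioned above.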
I am sorry I cannot offer much time but I would love to lend a little hand.
paul
On 6 Sept 2011, at 17:29, Fabio Mancinelli wrote:
Hi everybody,
for the 3.2 release cycle I said that I was going to investigate the SOLR
search engine a bit and look at how to use/integrate it into the current
platform.
I wrote a document that you can find here:
http://dev.xwiki.org/xwiki/bin/view/Design/SOLRIntegration about some
of the things I looked at.
There is a lot of room for discussion/improvement but I think the
document is already a good starting point.
Feedback is welcome.
Thanks,
Fabio
_______________________________________________
devs mailing list
devs@xwiki.org
http://lists.xwiki.org/mailman/listinfo/devs