On Fri, Sep 9, 2011 at 12:03 AM, Paul Libbrecht <paul(a)hoplahup.net> wrote:
Fabio,
interesting document and challenging mission!
Thanks Paul for your email.
There's a whole lot to tell about your document,
but here are a few gut feelings:
- first, I think you should describe a few application scenarios in more detail; I think
you'd come to the conclusion that both an EmbeddedServer and a Solr WebApp (inside
server or outside) make sense. It looks like this decision is not needed now as Solrj
offers you abstraction.
Well, this investigation was a starting point for understanding SOLR
and how it could possibly be used to improve XWiki search features.
I agree about the description of the application scenarios but I was
counting on the community to help on this as well :)
I just wrote in the document something interesting I found (I never
used SOLR before and I am discovering it now :))
- I fear you might not get all the benefits with
EmbeddedServer, in particular the caching and auto-warming, but that seems not to be the
case:
http://lucene.472066.n3.nabble.com/Embedded-Server-Caching-Stats-page-updat…
AFAIU the EmbeddedSolrServer is equivalent to the web application.
The point is that using the EmbeddedSolrServer is easier from an
integration point of view (it's just a matter of declaring some
dependencies in the pom.xml), while using the WAR version would mean
"merging" the SOLR web application with the XWiki one, which could be a
more difficult task.
So the first investigation focused on the EmbeddedSolrServer.
This, however, doesn't prevent using an external SOLR server in some
deployments:

    // Pseudocode: the constructor arguments are left out on purpose
    SolrServer server = getSolrServer();

    public SolrServer getSolrServer() {
        if (some options are specified in the xwiki.cfg) {
            return new CommonsHttpSolrServer();
        }
        return new EmbeddedSolrServer();
    }

The SolrJ APIs are the same for the two components so it's just a
matter of choosing the right implementation.
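
As a rough, self-contained sketch of that selection logic (plain Java, not SolrJ code): SearchBackend, RemoteBackend and EmbeddedBackend are hypothetical stand-ins for SolrServer, CommonsHttpSolrServer and EmbeddedSolrServer, and the "solr.remote.url" key is an assumed name for whatever option xwiki.cfg would end up exposing.

```java
import java.util.Properties;

// Hypothetical stand-in for SolrJ's SolrServer abstraction.
interface SearchBackend {
    String describe();
}

// Stand-in for a remote server reached over HTTP (CommonsHttpSolrServer in SolrJ).
class RemoteBackend implements SearchBackend {
    private final String url;

    RemoteBackend(String url) {
        this.url = url;
    }

    public String describe() {
        return "remote:" + url;
    }
}

// Stand-in for a server running inside the same JVM (EmbeddedSolrServer in SolrJ).
class EmbeddedBackend implements SearchBackend {
    public String describe() {
        return "embedded";
    }
}

class BackendFactory {
    // Assumed configuration key; the real option name in xwiki.cfg is not decided.
    static final String REMOTE_URL_KEY = "solr.remote.url";

    // Return a remote backend when a URL is configured, the embedded one otherwise.
    static SearchBackend create(Properties config) {
        String url = config.getProperty(REMOTE_URL_KEY);
        if (url != null && !url.trim().isEmpty()) {
            return new RemoteBackend(url.trim());
        }
        return new EmbeddedBackend();
    }
}
```

Callers only ever see the SearchBackend type, so switching a deployment from embedded to remote would be a configuration change and not a code change.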
- Your query examples are pretty hairy I find. Joe-bo
users want to "just type" and find relevance ranked results. Solr supports this
well with the DisMax query handler (it lets you rank a match in the title higher than one
in the body, and that higher than one in attachments...). I would say you need both (the solr web-app's
default query handler allows both with an extra prefix). Another major advantage, which
the lucene plugin missed is that you can have one field that is "stemmed" and a
copy of it that is not. A match in the exact field would rank higher.
Yes, I looked at it. The solrconfig.xml allows you to tune a lot of details.
My first idea was: "let's see what we can do with a minimal
solrconfig.xml, the one that could end up packaged with a standard XE
distribution if we decide to bundle solr"
It's clear that, given the power of SOLR, we will need at some point
to provide the user/administrator with mechanisms to tune the
configuration of SOLR (for example, a French site might be interested
in using different analyzers, tokenizers, etc. for the
analysis).
Though I think that it should be done in a way that the user interface
stays the same.
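
To make the DisMax field weighting mentioned above a bit more concrete, here is a small sketch that only builds the request parameters for a "just type" query. The defType and qf parameter names are the ones Solr uses for DisMax; the field names (title_exact, title, content, attachment) and the boost values are made up for illustration, with title_exact assumed to be an unstemmed copy of title.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

class DisMaxQuery {
    // Join field boosts into the space-separated "field^boost" list used by the qf parameter.
    static String queryFields(Map<String, Double> boosts) {
        StringBuilder qf = new StringBuilder();
        for (Map.Entry<String, Double> e : boosts.entrySet()) {
            if (qf.length() > 0) {
                qf.append(' ');
            }
            qf.append(e.getKey()).append('^').append(e.getValue());
        }
        return qf.toString();
    }

    // Build the URL query string for a plain user query handled by DisMax.
    static String toQueryString(String userInput, Map<String, Double> boosts) {
        try {
            return "q=" + URLEncoder.encode(userInput, "UTF-8")
                + "&defType=dismax"
                + "&qf=" + URLEncoder.encode(queryFields(boosts), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    // Hypothetical fields and weights: exact title > stemmed title > body > attachments.
    static Map<String, Double> defaultBoosts() {
        Map<String, Double> boosts = new LinkedHashMap<String, Double>();
        boosts.put("title_exact", 4.0);
        boosts.put("title", 3.0);
        boosts.put("content", 1.0);
        boosts.put("attachment", 0.5);
        return boosts;
    }
}
```

The user interface would keep passing the raw user input; only the boosts, which could equally live in solrconfig.xml, would be tuned per site.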
- In all applications I've worked on, indexing
pages when they change is not enough because there are pages that depend on others... this
needs to be addressed at the application level (think, e.g. about the dashboard, about
"book" pages that enclose others): re-index triggers.
This could be done in the component logic.
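
One possible shape for such re-index triggers, as a sketch only: the component keeps a map from each page to the pages that include or display it, and on a change it walks that map to collect everything that needs re-indexing. The ReindexPlanner class and the page names are hypothetical; nothing like this exists in the platform yet.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

class ReindexPlanner {
    // page -> pages that include or display it (e.g. a "book" page enclosing chapters)
    private final Map<String, Set<String>> dependents = new HashMap<String, Set<String>>();

    // Record that 'container' must be re-indexed whenever 'included' changes.
    void addDependency(String container, String included) {
        Set<String> containers = dependents.get(included);
        if (containers == null) {
            containers = new LinkedHashSet<String>();
            dependents.put(included, containers);
        }
        containers.add(container);
    }

    // Breadth-first walk: the changed page plus every page depending on it, directly or not.
    Set<String> pagesToReindex(String changedPage) {
        Set<String> result = new LinkedHashSet<String>();
        Deque<String> queue = new ArrayDeque<String>();
        result.add(changedPage);
        queue.add(changedPage);
        while (!queue.isEmpty()) {
            String page = queue.remove();
            Set<String> containers = dependents.get(page);
            if (containers == null) {
                continue;
            }
            for (String container : containers) {
                // add() returns false for pages already seen, so cycles terminate
                if (result.add(container)) {
                    queue.add(container);
                }
            }
        }
        return result;
    }
}
```

With a dashboard that shows a book page which in turn encloses a chapter, a change to the chapter would schedule all three pages.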
Another crucial aspect is to encourage anyone working
on a particular schema to be economical. The biggest flaw of the xwiki-lucene-module is that
it indexed and stored everything... that meant that a single result document was quite
big. Storing typically is probably not useful.
Yep. If you store everything you will duplicate your XWiki database :)
The schema.xml is a delicate point because once it's decided it should
be frozen, since the declared fields will then be used by other
components via the API to retrieve the returned information.
I found it interesting that you can declare dynamic fields,
which are associated with a given type using a prefix/suffix. This could
be used as a way to extend the schema at runtime if an application
needs to.
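
To illustrate the naming convention only (this is a toy matcher, not how SOLR resolves fields internally): a rule such as "*_s" or "attr_*" ties every field whose name has that suffix or prefix to a type, so an application can start sending new fields without touching the schema. The class name, the rules and the "first matching rule wins" order are assumptions made for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class DynamicFieldRules {
    // pattern -> type name; a pattern has exactly one '*', at its start or at its end
    private final Map<String, String> rules = new LinkedHashMap<String, String>();

    void addRule(String pattern, String type) {
        boolean leading = pattern.startsWith("*");
        boolean trailing = pattern.endsWith("*");
        if (pattern.length() < 2 || leading == trailing) {
            throw new IllegalArgumentException(
                "pattern needs one leading or trailing '*': " + pattern);
        }
        rules.put(pattern, type);
    }

    // Return the type of the first matching rule, or null when no rule applies.
    String typeFor(String fieldName) {
        for (Map.Entry<String, String> rule : rules.entrySet()) {
            String pattern = rule.getKey();
            boolean matches = pattern.startsWith("*")
                ? fieldName.endsWith(pattern.substring(1))
                : fieldName.startsWith(pattern.substring(0, pattern.length() - 1));
            if (matches) {
                return rule.getValue();
            }
        }
        return null;
    }
}
```

An application could then index a field called author_s or attr_color without any change to the frozen part of the schema, as long as it sticks to the agreed prefixes and suffixes.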
- particular scenarios will have particular UIs. Would
you sketch one that would be default for 3.2? Would authors be facets? spaces?
UI is another tricky point.
I am thinking about a "standard distribution", that is, how a UI
leveraging SOLR as a search engine should appear if SOLR is integrated
in XWiki by default.
So the basic scenario is: everything in the UI stays the
same and we just change the engine under the hood.
However the fact that SOLR has a lot of interesting features (e.g.,
facets) might drive the *standard* search UI towards some
improvements.
For example, as you suggested, spaces and authors could be interesting
facets, but I would say also dates.
This is an open discussion.
From your question I also understand that you are suggesting a way to
customize the UI in order to take into account particular search
scenarios.
This would be great but I have no idea, at this point, about how to do
it, or whether it's really worth adding this flexibility to the
standard distribution.
After all, if your scenario is that particular you can always write an
application that uses a custom solrconfig.xml, schema and UI :)
- I would suggest adopting best practices as soon as
possible: make evaluations possible by default. A typical evaluation would be run by a
content expert that would know his documents and would invent a few queries (e.g. reading
the logs) and check the correct or incorrect results, that'd give mean precision and
recall at each of the results, something you can then collect and tabulate to assess the
"mean" quality of a search engine (that paper:
http://www.oracleimg.com/technetwork/database/enterprise-edition/imt-qualit…
explains this well). I'm just back from a summer school on Information Retrieval and
there's a lot there.
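
The per-query numbers described above could be computed along these lines. This is a minimal sketch: the SearchEvaluation class is hypothetical, the document ids in the ranked list are assumed to be distinct, and when fewer than k results come back the precision is taken over the results actually returned (other conventions divide by k).

```java
import java.util.List;
import java.util.Set;

class SearchEvaluation {
    // Fraction of the top-k returned documents that the expert marked as relevant.
    static double precisionAt(int k, List<String> ranked, Set<String> relevant) {
        int cutoff = Math.min(k, ranked.size());
        if (cutoff <= 0) {
            return 0.0;
        }
        return (double) hits(cutoff, ranked, relevant) / cutoff;
    }

    // Fraction of all relevant documents that show up in the top-k results.
    static double recallAt(int k, List<String> ranked, Set<String> relevant) {
        if (relevant.isEmpty()) {
            return 0.0;
        }
        int cutoff = Math.max(0, Math.min(k, ranked.size()));
        return (double) hits(cutoff, ranked, relevant) / relevant.size();
    }

    // Count relevant documents among the first 'cutoff' results.
    private static int hits(int cutoff, List<String> ranked, Set<String> relevant) {
        int count = 0;
        for (int i = 0; i < cutoff; i++) {
            if (relevant.contains(ranked.get(i))) {
                count++;
            }
        }
        return count;
    }
}
```

Averaging these values over the expert's queries, at a few cutoffs, gives the kind of table that lets two configurations be compared.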
I see where you are heading :)
Well, this investigation was more modest.
As I said, it was just to try to understand if/how we could use SOLR
as the default search infrastructure for the default XWiki
distribution.
It's clear that indexing/searching should be tuned with respect to the
domain. It would be good to make the integration so flexible that
this tuning could be taken into account. Though for a first iteration
I think that would be too much :)
I am sorry I cannot offer much time but I would love
to lend a little hand.
Well, your mail has been very useful.
Thanks,
Fabio
paul
Le 6 sept. 2011 à 17:29, Fabio Mancinelli a écrit :
Hi everybody,
for the 3.2 release cycle I said that I was going to investigate a bit
the SOLR search engine and how to use/integrate it in the current
platform.
I wrote a document that you can find here:
http://dev.xwiki.org/xwiki/bin/view/Design/SOLRIntegration about some
of the things I looked at.
There is a lot of room for discussion/improvement but I think the
document is already a good starting point.
Feedback is welcome.
Thanks,
Fabio
_______________________________________________
devs mailing list
devs(a)xwiki.org
http://lists.xwiki.org/mailman/listinfo/devs