On 9 Sept. 2011, at 11:17, Fabio Mancinelli wrote:
- First, I think you should describe a few application scenarios in more detail; I think you'd come up with the [..]
Well, this investigation was a starting point for understanding SOLR and how it could possibly be used to improve XWiki's search features.
I agree about describing the application scenarios, but I was counting on the community to help with this as well :)
I just wrote down in the document the interesting things I found (I had never used SOLR before and am discovering it now :))
We should link to existing uses of SOLR in the wild and by the competition. Among others, it's the search engine of drupal.org ;-)
The problem is that using the EmbeddedSolrServer is easier from an integration point of view (it's just a matter of declaring some dependencies in the pom.xml), while using the WAR version would mean "merging" the SOLR web application with the XWiki one, which could be a more difficult task.
I would surely separate the wars!
So the first investigation focused on the EmbeddedSolrServer.
This, however, doesn't prevent using an external SOLR server in some deployments:
[...]
The SolrJ APIs are the same for the two components, so it's just a matter of choosing the right implementation.
I agree, no worry there.
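As a minimal sketch of that point (SolrJ 3.x era, matching this thread; the core name and URL are assumptions, not actual XWiki configuration), both servers sit behind the common SolrServer API, so the calling code doesn't care which one it gets:

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.core.CoreContainer;

public class SolrServerFactory
{
    /**
     * Returns either the in-process server or a client for an external
     * SOLR WAR; callers only ever see the shared SolrServer API.
     * The core name "xwiki" and the localhost URL are illustrative.
     */
    public static SolrServer createServer(boolean embedded) throws Exception
    {
        if (embedded) {
            CoreContainer container = new CoreContainer.Initializer().initialize();
            return new EmbeddedSolrServer(container, "xwiki");
        }
        return new CommonsHttpSolrServer("http://localhost:8983/solr");
    }
}
```

Either way, queries are then issued identically, e.g. `server.query(new SolrQuery("cheval"))`.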
- Your query examples are pretty hairy, I find. Joe Blow users want to "just type" and get relevance-ranked results. Solr supports this well with the DisMax query handler (it allows you to put a higher rank on the title, for example, than on the body, than on attachments...). I would say you need both (the Solr web app's default query handler allows both with an extra prefix). Another major advantage, which the Lucene plugin missed, is that you can have one field that is "stemmed" and a copy of it that is not. A match in the exact field would rank higher.
Yes, I looked at it. The solrconfig.xml allows you to tune a lot of details.
My first idea was: "let's see what we can do with a minimal solrconfig.xml, the one that could end up packaged with a standard XE distribution if we decide to bundle SOLR".
Then please make DisMax the default query type!
Include a checkbox for advanced queries so that you can change the query type and use all sorts of field-name details.
It's clear that, given the power of SOLR, we will at some point need to provide the user/administrator with mechanisms for tuning the configuration of SOLR (for example, a French site might be interested in using different analyzers, tokenizers, etc. for the analysis).
Though I think it should be done in a way that keeps the user interface the same.
Correct. I would like to help you on this very point, where "European" software is far better than American software: internationalization must be there from version zero on. I would like to provide basic language-dependent functionality so that a minimum of fuzziness is supported in the default query type but avoided if you make precise queries.
Here's the proposal. We'd make fields such as:
- text: full text, exact tokens (whitespace analyzer)
- text_standard: full text, standard analyzer (e.g. best for emails and URLs)
- text_fr: stemmed with the French analyzer (filled if the document is recognized as French)
- text_de: ...
- text_bits: makes any non-letter a token separator
Same with title_*.
And the dismax qf parameter would be something such as:
title^3 title_standard^2 title_fr^1.5 title_en^1.5 title_de^1.5 title_bits^1.2 text^3 text_standard^2 text_fr^1.5 text_en^1.5 text_de^1.5 text_bits^1.2
This way, if you search for chevaux, you find a document containing cheval, but documents containing chevaux come first, especially if the match is in the title. Documents with a URL that contains chevaux would also match, but after that.
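Wired into solrconfig.xml, that could look like the following sketch (the handler name and defaults are assumptions to be tuned, not a final configuration):

```xml
<!-- Sketch only: a dismax handler with the proposed boosts.
     Field names and boost values are assumptions, not a final config. -->
<requestHandler name="/select" class="solr.SearchHandler" default="true">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="qf">
      title^3 title_standard^2 title_fr^1.5 title_en^1.5 title_de^1.5 title_bits^1.2
      text^3 text_standard^2 text_fr^1.5 text_en^1.5 text_de^1.5 text_bits^1.2
    </str>
  </lst>
</requestHandler>
```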
Enabling debug and explanation gives you the details of each such match, which is useful for understanding the ranking.
- In all the applications I've worked on, indexing pages when they change is not enough, because there are pages that depend on others... This needs to be addressed at the application level (think, e.g., about the dashboard, or about "book" pages that enclose others): re-index triggers.
This could be done in the component logic.
What I am saying is that you need a way for application designers to add an "indexing listener" that would support that type of callback.
Another crucial aspect is to encourage anyone working on a particular schema to be economical. The biggest flaw of the xwiki-lucene-module is that it indexed and stored everything... That meant that a single result document was quite big. Storing is probably not useful in most cases.
Yep. If you store everything you will duplicate your XWiki database :)
The schema.xml is a delicate point because, once decided, it should be frozen: the declared fields will then be used by other components via the API to retrieve the returned information.
I found it interesting that you can declare dynamic fields which are associated with a given type using a prefix/suffix. This could be used as a way of extending the schema at runtime if an application needs to.
That seems useful only for language-dependent fields, as above, indeed.
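For reference, a hedged sketch of what such a rule looks like in schema.xml (the pattern and type name are illustrative assumptions, not our schema):

```xml
<!-- Sketch: any field whose name matches text_* gets this type at runtime,
     so per-language fields need no schema change. Type name is illustrative. -->
<dynamicField name="text_*" type="text_general" indexed="true" stored="false"/>
```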
- Particular scenarios will have particular UIs. Would you sketch one that would be the default for 3.2? Would authors be facets? Spaces?
The UI is another tricky point.
I am thinking about a "standard distribution", that is, how a UI leveraging SOLR as the search engine should appear if SOLR is integrated into XWiki by default.
So basically the basic scenario is: everything in the UI stays the same and we just change the engine under the hood.
I would add the two following checkboxes:
- advanced (then use qt=lucene)
- debug (then show little links on each result that show the result of the explanation: response.getExplainMap())
However, the fact that SOLR has a lot of interesting features (e.g., facets) might drive the *standard* search UI towards some improvements.
For example, as you suggested, spaces and authors could be interesting facets, but I would say dates as well.
True, but note that three facet dimensions is already big.
If time allows, highlighting is really quite fundamental as well for trust in the search engine, something which has been quite low in the XWiki community and tools since the switch to Lucene.
This is an open discussion.
From your question I also understand that you are suggesting a way to customize the UI in order to take particular search scenarios into account.
This would be great, but I have no idea, at this point, how to do it, or whether adding this flexibility to the standard distribution is really worthwhile.
Well, I think the right way to do that is to leave sufficiently many Java objects available and documented.
I've indicated above a listener for indexing decisions.
I think another required aspect is to let applications enrich, or strip down, the index documents before they go into the index.
Both of these tasks should be doable from Groovy.
My old suggestion would be to add a listener this way:
xwiki.solr.addIndexListener(xwiki.parseGroovyFromPage("MyApp.IndexListener"))
but maybe components do that better.
An index listener would implement an interface IndexListener with methods such as:

    // note: not re-entrant!
    // modifies the list of documents to be indexed
    void notifyDocumentsWillBeIndexed(List docFullNames)

    // modifies the SolrJ document
    void notifyDocumentBeingIndexed(Document solrDoc)
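A minimal self-contained sketch of that proposal (the interface and the "Sandbox" filtering rule are illustrative assumptions, not an existing XWiki API):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical interface from the proposal above; not an existing XWiki API.
interface IndexListener
{
    // Note: not re-entrant! May modify the list of documents to be indexed.
    void notifyDocumentsWillBeIndexed(List<String> docFullNames);
}

// Illustrative listener: keeps Sandbox pages out of the index.
class SandboxExcludingListener implements IndexListener
{
    public void notifyDocumentsWillBeIndexed(List<String> docFullNames)
    {
        docFullNames.removeIf(name -> name.startsWith("Sandbox."));
    }
}

public class IndexListenerDemo
{
    public static void main(String[] args)
    {
        List<String> docs =
            new ArrayList<>(Arrays.asList("Main.WebHome", "Sandbox.TestPage"));
        new SandboxExcludingListener().notifyDocumentsWillBeIndexed(docs);
        System.out.println(docs); // prints [Main.WebHome]
    }
}
```

The same callback could be registered from Groovy, as suggested above, since it only manipulates the list of document names before indexing.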
Another customization could happen at query time, but I am not sure it's that easy here (I had to write a dedicated query processor).
After all, if your scenario is that particular, you can always write an application that uses a custom solrconfig.xml, schema, and UI :)
Make sure that is possible without changes to the software!
Could you give details on how and where to install xwiki-platform-search-solr?
(I'm old-fashioned; these modern XWiki installs seem too easy to me)
paul