WDYT?
Thanks,
Eduard
On Wed, Nov 21, 2012 at 10:49 PM, Eduard Moraru <enygma2002(a)gmail.com> wrote:
I was counting on your feedback :)
On Wed, Nov 21, 2012 at 3:04 PM, Paul Libbrecht <paul(a)hoplahup.net>
wrote:
Hello Eduard,
it's nice to see you take this further.
> This issue has already been discussed [1] during the GSoC
> project, but I am not particularly happy with the chosen approach.
> When handling multiple languages, there are generally [2][3] 3 different
> approaches:
>
> 1) Indexing the content in a single field (like title, doccontent, etc.)
> - This has the advantage that queries are clear and fast.
> - The disadvantage is that you cannot run very well-tuned analyzers on the
> fields, having to resort to (at best) basic tokenization and lowercasing.
> 2) Indexing the content in multiple fields, one field for each language
> (like title_en, title_fr, doccontent_en, doccontent_fr, etc.)
> - This has the advantage that you can easily specify (as dynamic fields)
> that *_en fields are of type text_en (and analyzed by an English-centered
> chain of analyzers), *_fr fields of type text_fr (focused on French), etc.,
> thus making the results much better.
>
I would add one more kind of field here: title_ws and text_ws, where the full
text is analyzed just as words (using the whitespace tokenizer?).
A match there would even be preferred to a match in the language-specific
text fields.
(Maybe those would be called title and text?)
> - The disadvantage is that querying such a schema is a pain. If you want
> all the results in all languages, you end up with a big and expensive
> query.
>
Why is this an issue?
Dismax does it for you for free (thanks to the "qf" parameter, which
gives a weight to each of the fields).
This is an issue only if you start to have more than 100 languages or
so...
Lucene, the underlying engine of Solr, handles thousands of clauses in a
query without an issue. This is how prefix queries are handled: they are
expanded to a query for any of the terms that match the prefix, and a limit
set deep somewhere in the configuration (maxBooleanClauses, 1024 by default
in Solr) keeps this from exploding.
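
A minimal sketch of what such a dismax request could look like when the schema follows option 2). The field names (title, doccontent), the language list and the boost values are assumptions made up for illustration; only the defType, q and qf request parameters are Solr's own.

```python
from urllib.parse import urlencode

# Hypothetical boosts per logical field; not taken from any real XWiki setup.
DEFAULT_BOOSTS = {"title": 3.0, "doccontent": 1.0}


def dismax_params(user_query, languages, boosts=None):
    """Build request parameters for an (e)dismax query over per-language fields."""
    boosts = boosts or DEFAULT_BOOSTS
    # qf lists every field to search, each with its weight: field^boost
    qf = " ".join(
        f"{field}_{lang}^{boost}"
        for field, boost in boosts.items()
        for lang in languages
    )
    return {"defType": "edismax", "q": user_query, "qf": qf}


params = dismax_params("XWiki Open Source", ["en", "fr", "de"])
print(params["qf"])
print(urlencode(params))  # query string to append to the select URL
```

The user still types a plain query; the per-language fan-out lives entirely in qf, which is why the number of languages only starts to matter when the list gets very long.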
Sure, Solr is great when you want to do simple queries like "XWiki Open
Source". However, since in XWiki we also expose the Solr/Lucene query APIs
to the platform, there will be (as is currently the case with Lucene) a lot
of extensions wanting to do search using this API. These extensions (like
the search suggest, REST search, etc.) want to do something like
"title:'Open Source' AND type:document AND doccontent:XWiki". Because
option 2) is so messy in its fields, the extension would have to come up
with a query like "title_en:'Open Source' AND type:document AND
doccontent_en:XWiki" (assuming that it is limited to the current language,
English or whatever). What happens if it wants to do that no matter what
language? It will have to specify each possible combination, because we
can't use generic field names.
Solr's approach works when you use it for your web application's search
input, in a specific use case where you have precisely specified the default
search fields and their boosts inside your schema.xml. However, as a search
API, option 2) makes life really hard for anyone else wanting to use the
Solr search API. Also, your search application will work nicely when the
user enters a simple query in the input field, but an advanced user will
suffer the same fate when trying to write an advanced query, since that
query does not rely on the default query (computed by Solr based on
schema.xml).
Also, regarding your note above about improvements like title_ws and such:
again, all of these are very well suited for the search application use
case, together with the default query that you configure in schema.xml,
making the search results really good. However, what do all these fields
mean to another extension wanting to do search? Will it have to handle all
these implementation details to query for title, content and such? I'm not
sure how well this would work in practice.
Unrealistic idea(?): perhaps we should come up with an abstract search
language (Solr/Lucene clone) that parses the searched fields and hides the
complexities of all the indexed fields, allowing us to write simple queries
like "title:XWiki", while this gets translated to "title_en:XWiki OR
title_fr:XWiki OR title_de:XWiki..." :)
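
A rough sketch of that translation step, assuming a fixed language list and a known set of per-language fields (both hypothetical here). It uses a simple regular expression, not a real Lucene query parser, so ranges, escapes and nested field groups are not handled.

```python
import re

LANGUAGES = ["en", "fr", "de"]                 # assumed configured languages
MULTILINGUAL_FIELDS = {"title", "doccontent"}  # assumed per-language fields

# Matches field:"quoted phrase" or field:term
CLAUSE = re.compile(r'(\w+):("[^"]*"|[^\s()]+)')


def expand_query(query, languages=LANGUAGES):
    """Rewrite generic field clauses into an OR over per-language fields."""

    def expand(match):
        field, value = match.group(1), match.group(2)
        if field not in MULTILINGUAL_FIELDS:
            return match.group(0)  # e.g. type:document stays untouched
        alternatives = [f"{field}_{lang}:{value}" for lang in languages]
        return "(" + " OR ".join(alternatives) + ")"

    return CLAUSE.sub(expand, query)


print(expand_query('title:"Open Source" AND type:document AND doccontent:XWiki'))
```

An extension would keep writing title:XWiki, and only this layer would need to know which languages are indexed.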
Am I approaching this wrong by trying to have both a tweakable/tweaked
search application AND a search API? Are the two not compatible? Do we have
to sacrifice search result quality (no language-specific stuff) to be
able to have a usable API?
> If you want just one language, you have to read the right fields
> (e.g. title_en) instead of just getting a clear field name (title).
>
You have to be careful: this only applies if you want to be specific.
In that case, it is likely that you also do not want so much stemming.
My experience, which dates from before dismax on curriki.org, is that
any query that is a bit specific probably does not want stemming.
Can you please elaborate on this? I'm not sure I understand the problem.
> - Also, the schema.xml definition is a static one in this respect,
> requiring you to know beforehand which languages you want to support (for
> example when defining the default fields to search for). Adding a new
> language requires you to start editing the XML files by hand.
>
True, but the available languages are almost all hand-coded.
You could generate the schema.xml based on the available languages, if it
is not written by hand?
Basically I would have to output a zip with schema.xml, solrconfig.xml and
then all the resources specific to the selected languages (stopwords,
synonyms, etc.) for the languages that we can provide out of the box. For
other languages, the admin would have to get their hands dirty with the
XML files.
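
As a sketch of the generation step, the per-language part of schema.xml could be produced from the list of selected languages. This assumes that a text_<lang> field type is defined elsewhere in the schema for every selected language; the function name and the indexed/stored choices are illustrative only.

```python
import xml.etree.ElementTree as ET


def dynamic_fields(languages):
    """Return one <dynamicField> declaration per selected language."""
    declarations = []
    for lang in languages:
        element = ET.Element(
            "dynamicField",
            {
                "name": f"*_{lang}",     # matches title_en, doccontent_en, ...
                "type": f"text_{lang}",  # assumed to exist as a fieldType
                "indexed": "true",
                "stored": "true",
            },
        )
        declarations.append(ET.tostring(element, encoding="unicode"))
    return declarations


for line in dynamic_fields(["en", "fr", "de"]):
    print(line)
```

The language-specific resources (stopwords, synonyms) would still have to be bundled next to the generated file.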
There's one catch with this approach which is new to me but seems quite
important for implementing it: the idf should be modified (that is, the
Similarity class should be), so that the total number of documents is the
total number of documents in that language.
See:
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201211.mbox/%3Czarafa.509ccb61.698a.1d02345614818807@mail.openindex.io%3E
The solution sketched there sounds easy but I have not tried it.
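
To illustrate why the document count matters, here is a small numeric sketch. It assumes the classic Lucene TF-IDF formula, idf = 1 + ln(numDocs / (docFreq + 1)), and made-up document counts; it is not the solution from the linked thread.

```python
import math


def idf(doc_freq, num_docs):
    """Classic TF-IDF inverse document frequency (assumed formula)."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Made-up numbers: a wiki with mostly English and some French documents.
total_docs = 100_000
docs_per_language = {"en": 90_000, "fr": 10_000}
doc_freq = 500  # documents whose title_fr contains the term

global_idf = idf(doc_freq, total_docs)
per_language_idf = idf(doc_freq, docs_per_language["fr"])

print(f"idf against the whole index:     {global_idf:.3f}")
print(f"idf against French documents:    {per_language_idf:.3f}")
```

With the whole index as the denominator, a term in a minority-language field looks rarer than it really is within that language, so its matches are over-weighted compared with matches in the majority language.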
> 3) Indexing the content in different Solr cores (indexes), one for each
> language. Each core requires its own directory and configuration files.
> - The advantage is that queries are clean to write (like option 1) and
> that you have a nice separation.
> - The disadvantage is that it's difficult to get it right (administrative
> issues) and then you also have the (considerable) problem of having to fix
> the relevancy score of a query result that has entries from different
> cores; each core has its own relevancy computed and does not consider the
> others.
> - To make it even worse, it seems that you cannot [5] push the
> configuration files to a remote Solr instance when creating a new core
> programmatically. However, if we are running an embedded Solr instance, we
> could provide a way to generate the config files and write them to the data
> directory.
>
Post-processing results is very, very dangerous, as performance is at
risk (e.g. if a core does not answer)... I would tend to avoid that as much
as possible.
Not really related, but this reminds me of the post-processing that I
do for checking view rights over the returned results, but that's another
discussion that we will probably need to have :)
> Currently I have implemented option 1) in our existing Solr integration,
> which is also more or less compatible with our existing Lucene queries, but
> I would like to find a better solution that actually analyses the content.
>
> During GSoC, option 2) was preferred but the implementation did not
> consider practical issues like the ones described above (query complexity,
> user configuration, etc.)
>
True, Savitha explored the possibility of having different Solr documents
per language.
I still could not be sure that this would not show a document as matching
in only a single language.
However, indicating which language it matched in is probably useful...
Already doing that.
Funnily, cross-language retrieval is a mature research field, but
retrieval for multilingual users is not!
> On a related note, I have also watched an interesting presentation [3]
> about how Drupal handles its Solr integration and, particularly, a plugin
> [4] that handles the multilingual aspect.
> The idea seen there is that you have this UI that helps you generate
> configuration files, depending on your needs. For instance, you (admin)
> check that you need search for English, French and German and the
> UI/extension gives you a zip with the configuration you need to use in your
> (remote or embedded) Solr instance. The configuration for each language
> comes preset with the analyzers you should use for it and the additional
> resources (stopwords.txt, synonyms.txt, etc.).
> This approach helps avoid forcing admins to edit XML files and could also
> still be useful for other cases, not only option 2).
>
Generating sounds like an easy approach to me.
Yes, however I don't like the fact that we cannot do everything from the
webapp and that the admin needs to access the filesystem to install the
given configuration in the embedded/remote Solr directory. Lucene does not
have this problem now. It just works with XWiki and everything is done from
the XWiki UI. I feel that losing this convenience will not be very well
received by users who now have some new install steps to get XWiki running.
Well, of course, for the embedded Solr version, we could handle it like we
do now and push the files directly from the webapp to the filesystem. Since
embedded will be the default, it should be OK and avoid the extra install
step. Users with a remote Solr machine should have the option to get the zip
instead.
Not sure if we can apply the new configuration without a restart, but I'll
have to look more into it. I know the multi-core architecture supports
something like this but will have to see the details.
> All these problems basically come from the fact that there is no way to
> specify in the schema.xml that, based on the value of a field (like the
> field "lang" that stores the document language), you want to run this or
> that group of analyzers.
>
Well, this is possible with ThreadLocal but is not necessarily a good idea.
Also, it is very common that users formulate queries without stating
their language and thus you need to "or" the user's queries through
multiple languages (e.g. given by the browser).
> Perhaps a solution would be a custom kind of "AggregatorAnalyzer" that
> would call other analyzers at runtime, based on the value of the lang
> field. However, this solution could only be applied at index time, when you
> have the lang information (in the solrDocument to be indexed). When you
> perform the query, you cannot analyze the query text, since you do not know
> the language of the field you're querying (it was determined at runtime, at
> index time) and thus do not know what operations to apply to the query
> (to reduce it to the same form as the indexed values).
>
How would that look at query time?
That's what I was saying: at query time, the searched term will not
get analyzed by the right chain. When you search in a single language, you
could add that language as a query filter and then apply the right chain,
but when searching in 2 or more (or none, meaning all) languages
you are stuck.
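
The asymmetry can be shown with a toy model. The two "analyzers" below are deliberately naive stand-ins (lowercasing plus stripping a single suffix), not real Solr analysis chains, and all names are hypothetical.

```python
def strip_suffix(token, suffix):
    """Remove the suffix if present and something is left over."""
    if token.endswith(suffix) and len(token) > len(suffix):
        return token[: -len(suffix)]
    return token


def analyze_en(text):
    # Toy English chain: lowercase, strip "ing"
    return [strip_suffix(t, "ing") for t in text.lower().split()]


def analyze_fr(text):
    # Toy French chain: lowercase, strip "er"
    return [strip_suffix(t, "er") for t in text.lower().split()]


ANALYZERS = {"en": analyze_en, "fr": analyze_fr}


def index_terms(doc):
    """Index time: the document carries its language, so the chain is known."""
    return set(ANALYZERS[doc["lang"]](doc["title"]))


def query_terms(text, lang=None):
    """Query time: without a language, every chain has to be tried."""
    langs = [lang] if lang else list(ANALYZERS)
    terms = set()
    for code in langs:
        terms.update(ANALYZERS[code](text))
    return terms


indexed = index_terms({"lang": "fr", "title": "Parler"})
print(set(analyze_en("parler")) & indexed)  # wrong chain: no overlap
print(query_terms("parler", "fr") & indexed)  # language known: match
print(query_terms("parler") & indexed)        # all chains OR-ed: match
```

Trying every chain restores the match, but it also produces terms that may match unrelated words in other languages, which is the cost of not knowing the query language.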
> I have also read another interesting analysis [6] on this problem that
> elaborates on the complexities and limitations of each option. (Ignore the
> Rosette stuff mentioned there.)
>
> I have been thinking about this for some time now, but the solution is
> probably somewhere in between: finding an option that is acceptable while
> not restrictive. I will probably also send a mail to the Solr list to get
> some more input from there, but I get the feeling that whatever solution we
> choose, it will most likely require the users to at least copy (or even
> edit) some files into some directories (configurations and/or jars), since
> it does not seem to be easy/possible to do everything on-the-fly,
> programmatically.
>
The only hard step is when changing the supported languages, I think.
In this case, when automatically generating the index, you need to warn
the user.
The admin UI should have a checkbox "use generated schema" or a textarea
for the schema.
Please see above regarding configuration generation. Basically, since we
are going to support both embedded and remote Solr instances, we could
support things like editing the schema from XWiki only for the embedded
instance, but not for the remote one. We might end up having separate UIs
for each case, since we might want to exploit the flexibility of the
embedded one as much as possible.
Those who want particular fields and tunings need to write their own
schema.
The same UI could also let you choose whether to include a phonetic track
or not (which then requires reindexing).
hope it helps.
Yes, very helpful so far. I'm counting on your expertise with Lucene/Solr
on the details. My current approach is a practical one without previous
experience on the topic, so I'm still doing mostly guesswork in some
areas.
Thanks,
Eduard
paul
_______________________________________________
devs mailing list
devs(a)xwiki.org
http://lists.xwiki.org/mailman/listinfo/devs