Hello Eduard,
It's nice to see you take this further.
This issue has already been discussed [1] during the GSoC project, but I am
not particularly happy with the chosen approach.
When handling multiple languages, there are generally [2][3] three different
approaches:
1) Indexing the content in a single field (like title, doccontent, etc.)
- This has the advantage that queries are clear and fast
- The disadvantage is that you cannot run well-tuned analyzers on the
fields, and have to resort to (at best) basic tokenization and lowercasing.
2) Indexing the content in multiple fields, one field for each language
(like title_en, title_fr, doccontent_en, doccontent_fr, etc.)
- This has the advantage that you can easily specify (as dynamic fields)
that *_en fields are of type text_en (and analyzed by an English-centered
chain of analyzers), *_fr of type text_fr (focused on French), etc., thus
making the results much better.
I would add one more pair of fields here: title_ws and text_ws, where the full
text is analyzed just as words (using the whitespace tokenizer?).
A match there would even be preferred to a match in the language-specific text
fields.
(Maybe those would simply be called title and text?)
- The disadvantage is that querying such a schema is a pain. If you want
all the results in all languages, you end up with a big and expensive
query.
Why is this an issue?
Dismax does it for you for free (thanks to the "qf" parameter, which gives a
weight to each of the fields).
This is an issue only if you start to have more than 100 languages or so...
Lucene, the underlying engine of Solr, handles thousands of clauses in a query
without an issue. This is how prefix queries are handled: they are expanded
into a query for every term that matches the prefix, and a limit set deep in
the configuration, maxBooleanClauses (1024 by default), keeps this from
exploding.
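To make the dismax point concrete, here is a minimal sketch (in Python, only to
show the shape of the request) of how one query could be spread over the
per-language fields through the qf parameter. The field names title_* and
doccontent_* come from the list above; the helper name and the boost values are
made up for illustration and are not tuned.

```python
from urllib.parse import urlencode

def dismax_params(user_query, languages, boosts=None):
    """Build request parameters for a dismax query over per-language fields."""
    # Hypothetical default weights: a title match counts more than a body match.
    boosts = boosts or {"title": 2.0, "doccontent": 1.0}
    qf = " ".join(
        f"{field}_{lang}^{weight}"
        for lang in languages
        for field, weight in boosts.items()
    )
    # defType, q and qf are standard Solr request parameters.
    return {"defType": "dismax", "q": user_query, "qf": qf}

params = dismax_params("solr schema", ["en", "fr"])
print(urlencode(params))  # query string to append to the /select URL
```

The user still types a plain query; only the qf value grows with the number of
languages.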
If you want just some language, you have to read the right fields
(e.g. title_en) instead of just getting a clear field name (title).
Be careful: this only matters if you want to be specific, and in that case it
is likely that you do not want much stemming either.
In my experience on curriki.org (which predates dismax), any query that is at
all specific is unlikely to benefit from stemming.
- Also, the schema.xml definition is static in this respect, requiring you to
know beforehand which languages you want to support (for example when
defining the default fields to search). Adding a new language requires you
to start editing the XML files by hand.
True, but the available languages are almost all hand-coded.
Could you generate the schema.xml from the available languages when it is not
hand-written?
There's one catch with this approach, which is new to me but seems quite
important for implementing it: the idf should be modified (through the
Similarity class) so that the total number of documents is the number of
documents in that language.
See:
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201211.mbox/%3Cza…
The solution sketched there sounds easy but I have not tried it.
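To illustrate why the document count matters, here is a small sketch of the
arithmetic. It is not the solution from the linked thread, only the concern
behind it. It uses the classic Lucene idf formula, 1 + ln(numDocs / (docFreq + 1));
the document counts are invented for the example.

```python
import math

def classic_idf(doc_freq, num_docs):
    """Classic Lucene idf: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# Invented numbers: a large index in which French is a small minority.
total_docs = 100_000
french_docs = 1_000
doc_freq = 100  # documents whose title_fr contains the term

idf_whole_index = classic_idf(doc_freq, total_docs)
idf_french_only = classic_idf(doc_freq, french_docs)

# Counting every document makes a fairly common French term look rare.
print(f"idf over the whole index:  {idf_whole_index:.2f}")
print(f"idf over French documents: {idf_french_only:.2f}")
```

With the whole index as the denominator, terms in minority-language fields get
an inflated idf compared to the same kind of term in the majority language.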
3) Indexing the content in different Solr cores (indexes), one for each
language. Each core requires its own directory and configuration files.
- The advantage is that queries are clean to write (like option 1) and that
you have a nice separation
- The disadvantage is that it's difficult to get it right (administrative
issues) and then you also have the (considerable) problem of having to fix
the relevancy score of a query result that has entries from different
cores; each core has its own relevancy computed and does not consider the
others.
- To make it even worse, it seems that you also cannot [5] push the
configuration files to a remote Solr instance when creating a new core
programmatically. However, if we are running an embedded Solr instance, we
could provide a way to generate the config files and write them to the data
directory.
Post-processing results is very very very dangerous as performance is at risk (e.g. if a
core does not answer)... I would tend to avoid that as much as possible.
Currently I have implemented option 1) in our existing Solr integration,
which is also more or less compatible with our existing Lucene queries, but
I would like to find a better solution that actually analyses the content.
During GSoC, option 2) was preferred but the implementation did not
consider practical reasons like the ones described above (query complexity,
user configuration, etc.)
True, Savitha explored the possibility of having different Solr documents per
language.
I still could not be sure that this would not show a document as matching in
a single language only.
However, indicating which language it matched in is probably useful...
Funnily, cross-language retrieval is a mature research field, but retrieval
for multilingual users is not!
On a related note, I have also watched an interesting presentation [3]
about how Drupal handles its Solr integration and, particularly, a plugin
[4] that handles the multilingual aspect.
The idea seen there is that you have this UI that helps you generate
configuration files, depending on your needs. For instance, you (admin)
check that you need search for English, French and German and the
UI/extension gives you a zip with the configuration you need to use in your
(remote or embedded) Solr instance. The configuration for each language
comes preset with the analyzers you should use for it and the additional
resources (stopwords.txt, synonyms.txt, etc.).
This approach spares admins from having to edit XML files and could still be
useful for other cases, not only option 2).
Generating sounds like an easy approach to me.
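As a minimal sketch of what the generation step could look like: given the
list of languages the admin ticked, emit one dynamicField declaration per
language for schema.xml. The function name is made up, and the sketch assumes
that a field type named text_<lang> is already defined in the schema for each
language (which is not a given).

```python
from xml.sax.saxutils import quoteattr

def dynamic_field_declarations(languages):
    """Return one <dynamicField/> line per language for a schema.xml."""
    lines = []
    for lang in sorted(set(languages)):
        # Assumption: a fieldType called text_<lang> exists in the schema.
        name = quoteattr(f"*_{lang}")
        field_type = quoteattr(f"text_{lang}")
        lines.append(
            f"<dynamicField name={name} type={field_type} "
            f'indexed="true" stored="true"/>'
        )
    return "\n".join(lines)

print(dynamic_field_declarations(["fr", "en", "de"]))
```

The output would be pasted (or templated) into the fields section of the
generated schema.xml; changing the language list means regenerating and
reindexing.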
All these problems basically come from the fact that there is no way to
specify in the schema.xml that, based on the value of a field (like the
field "lang" that stores the document language), you want to run this or
that group of analyzers.
Well, this is possible with ThreadLocal but is not necessarily a good idea.
Also, it is very common that users formulate queries without stating their
language, and thus you need to "or" the user's query across multiple
languages (e.g. those given by the browser).
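For the "given by the browser" part, a rough sketch of how the candidate
languages could be taken from the Accept-Language header and restricted to the
languages the index supports. The function name is made up and the parsing is
deliberately simplified (only the primary subtag and the q value are used).

```python
def languages_from_accept_language(header, supported):
    """Return supported primary language codes, most preferred first."""
    ranked = []
    for part in header.split(","):
        piece = part.strip()
        if not piece:
            continue
        tag, _, params = piece.partition(";")
        quality = 1.0  # a missing q value means full preference
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0  # treat a malformed q value as "not wanted"
        primary = tag.strip().split("-")[0].lower()
        ranked.append((quality, primary))
    ranked.sort(key=lambda item: -item[0])  # stable: ties keep header order
    result = []
    for quality, lang in ranked:
        if quality > 0 and lang in supported and lang not in result:
            result.append(lang)
    return result

print(languages_from_accept_language("fr-CH, fr;q=0.9, en;q=0.8, de;q=0.7",
                                     {"en", "fr"}))
```

The resulting list is what the query would be or-ed across, one set of fields
per language.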
Perhaps a solution would be a custom kind of "AggregatorAnalyzer" that
would call other analyzers at runtime, based on the value of the lang
field. However, this solution could only be applied at index time, when you
have the lang information (in the solrDocument to be indexed), but when you
perform the query, you can not analyze the query text since you do not know
the language of the field you're querying (it was determined at runtime -
at index time) and thus do not know what operations to apply to the query
(to reduce it to the same form as the indexed values).
How would that look at query time?
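One possible answer, sketched with toy analyzers in Python rather than real
Lucene classes: at index time the lang value picks a single analyzer, while at
query time the text is run through the analyzer of every candidate language
and the resulting terms are or-ed. All names here are hypothetical and the
"analyzers" are crude stand-ins for real language chains.

```python
def analyze_ws(text):
    """Fallback: lowercase and split on whitespace."""
    return text.lower().split()

def analyze_en(text):
    # Toy stand-in for an English chain: strip a plural "s" from longer words.
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t
            for t in analyze_ws(text)]

def analyze_fr(text):
    # Toy stand-in for a French chain: strip the elided article "l'".
    return [t[2:] if t.startswith("l'") else t for t in analyze_ws(text)]

ANALYZERS = {"en": analyze_en, "fr": analyze_fr}

def analyze_for_index(doc):
    """Index time: the document's lang field selects one analyzer."""
    return ANALYZERS.get(doc["lang"], analyze_ws)(doc["title"])

def analyze_for_query(text, candidate_languages):
    """Query time: the language is unknown, so collect every variant."""
    variants = set()
    for lang in candidate_languages:
        variants.update(ANALYZERS.get(lang, analyze_ws)(text))
    return sorted(variants)

print(analyze_for_index({"lang": "en", "title": "Wiki Pages"}))
print(analyze_for_query("l'index pages", ["en", "fr"]))
```

The cost is that the query side matches terms produced by the "wrong"
analyzer too, which is the same loss of precision as or-ing over per-language
fields.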
I have also read another interesting analysis [6] on this problem that
elaborates on the complexities and limitations of each option. (Ignore the
Rosette stuff mentioned there)
I have been thinking about this for some time now, but the solution is
probably somewhere in between, finding an option that is acceptable while
not restrictive. I will probably also send a mail on the Solr list to get
some more input from there, but I get the feeling that whatever solution we
choose, it will most likely require the users to at least copy (or even
edit) some files into some directories (configurations and/or jars), since
it does not seem to be easy/possible to do everything on-the-fly,
programmatically.
The only hard step is when changing the supported languages, I think.
In this case, when automatically generating the index, you need to warn the user.
The admin UI should have a checkbox "use generated schema" or a textarea for the
schema.
Those that want particular fields and tunings need to write their own schema.
The same UI could also offer the choice of whether to include a phonetic track
(which would then require reindexing).
hope it helps.
paul