Hi devs,
This issue was already discussed [1] during the GSoC project, but I am
not particularly happy with the chosen approach.
When handling multiple languages, there are generally [2][3] 3 different
approaches:
1) Indexing the content in a single field (like title, doccontent, etc.)
- This has the advantage that queries are clear and fast.
- The disadvantage is that you cannot run well-tuned, language-specific
analyzers on the fields, having to resort to (at best) basic tokenization
and lowercasing (see the sketch below).
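To illustrate, here is a minimal sketch (assuming the Lucene 3.x Analyzer
API; the class name is made up) of the kind of language-neutral analysis
that option 1) is limited to:

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

// Language-neutral analysis: split into tokens and lowercase them.
// No stemming, no stopwords, no language-specific processing at all.
public class BasicMultilingualAnalyzer extends Analyzer
{
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader)
    {
        TokenStream stream = new StandardTokenizer(Version.LUCENE_35, reader);
        return new LowerCaseFilter(Version.LUCENE_35, stream);
    }
}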
2) Indexing the content in multiple fields, one field for each language
(like title_en, title_fr, doccontent_en, doccontent_fr, etc.)
- This has the advantage that you can easily specify (as dynamic fields)
that *_en fields are of type text_en (and analyzed by an English-centered
chain of analyzers), *_fr fields of type text_fr (focused on French), etc.,
thus making the results much better.
- The disadvantage is that querying such a schema is a pain. If you want
all the results in all languages, you end up with a big and expensive
query (see the sketch after this item). If you want just one language, you
have to read the right field (e.g. title_en) instead of just using a clean
field name (title).
-- Also, the schema.xml definition is static in this regard, requiring you
to know beforehand which languages you want to support (for example when
defining the default fields to search). Adding a new language requires you
to start editing the XML files by hand.
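To make the query-complexity problem concrete, here is a minimal SolrJ
sketch (the helper class and field names are hypothetical) of what the
"all languages" query from option 2) degenerates into:

import org.apache.solr.client.solrj.SolrQuery;

public class AllLanguagesQueryBuilder
{
    // Expand the user's input over every (field, language) combination.
    public static SolrQuery build(String userInput, String[] fields, String[] languages)
    {
        StringBuilder q = new StringBuilder();
        for (String field : fields) {
            for (String language : languages) {
                if (q.length() > 0) {
                    q.append(" OR ");
                }
                q.append(field).append('_').append(language)
                    .append(":(").append(userInput).append(')');
            }
        }
        return new SolrQuery(q.toString());
    }
}

For instance, build("wiki", new String[] {"title", "doccontent"},
new String[] {"en", "fr", "de"}) already produces 6 OR-ed clauses
(title_en:(wiki) OR title_fr:(wiki) OR ...), and the clause count grows
with every field and every supported language.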
3) Indexing the content in different Solr cores (indexes), one for each
language. Each core requires its own directory and configuration files.
- The advantage is that queries are clean to write (like option 1) and that
you get a nice separation.
- The disadvantage is that it is difficult to get right (administrative
issues) and you also have the (considerable) problem of having to fix the
relevancy score of a query result that has entries from different cores;
each core computes its own relevancy without considering the others.
- To make it even worse, it seems that you cannot [5] push the
configuration files to a remote Solr instance when creating a new core
programmatically (see the sketch below). However, if we are running an
embedded Solr instance, we could provide a way to generate the config
files and write them to the data directory.
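For reference, this is roughly what programmatic core creation looks like
through SolrJ's CoreAdmin API (a sketch assuming SolrJ 3.x; the core name
and path are illustrative), and it only works if the configuration already
exists on the server's disk:

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.CoreAdminRequest;

public class CreateLanguageCore
{
    public static void main(String[] args) throws Exception
    {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        // This only registers the core: the instance directory and its
        // conf/ files (schema.xml, solrconfig.xml, stopwords, ...) must
        // already be present on the server's machine; they cannot be
        // uploaded through this API.
        CoreAdminRequest.createCore("wiki_fr", "/path/to/solr/wiki_fr", server);
    }
}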
Currently I have implemented option 1) in our existing Solr integration,
which is also more or less compatible with our existing Lucene queries, but
I would like to find a better solution that actually analyzes the content.
During GSoC, option 2) was preferred, but the implementation did not
consider practical issues like the ones described above (query complexity,
user configuration, etc.).
On a related note, I have also watched an interesting presentation [3]
about how Drupal handles its Solr integration and, in particular, a plugin
[4] that handles the multilingual aspect.
The idea seen there is that you have a UI that helps you generate
configuration files, depending on your needs. For instance, you (the
admin) check that you need search for English, French and German, and the
UI/extension gives you a zip with the configuration you need to use in
your (remote or embedded) Solr instance. The configuration for each
language comes preset with the analyzers you should use for it and the
additional resources (stopwords.txt, synonyms.txt, etc.).
This approach avoids forcing admins to edit XML files by hand and could
also still be useful in other cases, not only for option 2). On our side,
a generator could look something like the sketch below.
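This is a rough sketch of the generation step (everything here is
hypothetical: the class name, the bundled "presets" layout and the
resource paths), just to show that it is mostly a matter of copying preset
per-language resources into a config directory:

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class LanguageConfigGenerator
{
    // Copy the preset resources for each selected language into the conf/
    // directory that the (embedded or remote) Solr instance will use.
    public static void generate(File confDir, String[] languages) throws IOException
    {
        for (String language : languages) {
            copyPreset("/presets/" + language + "/stopwords.txt",
                new File(confDir, "stopwords_" + language + ".txt"));
            copyPreset("/presets/" + language + "/synonyms.txt",
                new File(confDir, "synonyms_" + language + ".txt"));
            // The matching field type definitions would still have to be
            // merged into schema.xml in a similar way.
        }
    }

    private static void copyPreset(String resourcePath, File target) throws IOException
    {
        InputStream in = LanguageConfigGenerator.class.getResourceAsStream(resourcePath);
        if (in == null) {
            throw new IOException("No preset bundled at " + resourcePath);
        }
        OutputStream out = new FileOutputStream(target);
        try {
            byte[] buffer = new byte[4096];
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
        } finally {
            out.close();
            in.close();
        }
    }
}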
All these problems basically come from the fact that there is no way to
specify in schema.xml that, based on the value of a field (like a "lang"
field that stores the document language), you want to run one group of
analyzers or another.
Perhaps a solution would be a custom kind of "AggregatorAnalyzer" that
would call other analyzers at runtime, based on the value of the lang
field (see the sketch below). However, this solution could only be applied
at index time, when you have the lang information (in the SolrDocument to
be indexed). When you perform the query, you cannot analyze the query
text, since you do not know the language of the field you are querying (it
was determined at runtime - at index time) and thus do not know which
operations to apply to the query (to reduce it to the same form as the
indexed values).
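A very rough sketch of the idea (hypothetical class, Lucene 3.x style
API): since an Analyzer cannot see the other fields of the document on its
own, the lang value would have to be pushed in from the indexing code,
which is exactly why this only helps at index time:

import java.io.Reader;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;

public class AggregatorAnalyzer extends Analyzer
{
    // Would have to be set by the indexing code (e.g. an update processor)
    // before each document is analyzed.
    private static final ThreadLocal<String> CURRENT_LANGUAGE = new ThreadLocal<String>();

    // Per-language delegates, e.g. "en" -> an English-centered chain.
    private final Map<String, Analyzer> delegates;

    private final Analyzer fallback;

    public AggregatorAnalyzer(Map<String, Analyzer> delegates, Analyzer fallback)
    {
        this.delegates = delegates;
        this.fallback = fallback;
    }

    public static void setCurrentLanguage(String language)
    {
        CURRENT_LANGUAGE.set(language);
    }

    @Override
    public TokenStream tokenStream(String fieldName, Reader reader)
    {
        Analyzer delegate = delegates.get(CURRENT_LANGUAGE.get());
        return (delegate != null ? delegate : fallback).tokenStream(fieldName, reader);
    }
}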
I have also read another interesting analysis [6] of this problem that
elaborates on the complexities and limitations of each option. (Ignore the
Rosette stuff mentioned there.)
I have been thinking about this for some time now, but the solution is
probably somewhere in between: finding an option that is acceptable while
not being restrictive. I will probably also send a mail to the Solr list
to get some more input from there, but I get the feeling that whatever
solution we choose will most likely require users to at least copy (or
even edit) some files into some directories (configurations and/or jars),
since it does not seem to be easy/possible to do everything on the fly,
programmatically.
Any input on this would be highly appreciated, especially if others have
more experience with Solr setups.
Thanks,
Eduard
----------
[1] http://markmail.org/message/kaxaka7lsbgo57ms
[2] http://lucidworks.lucidimagination.com/display/lweug/Multilingual+Indexing+…
[3] http://drupalcity.de/session/language-specific-and-multilingual-full-text-s…
[4] http://drupal.org/project/apachesolr_multilingual
[5] http://stackoverflow.com/questions/4064880/create-new-core-directories-in-s…
[6] http://info.basistech.com/blog/bid/171842/Indexing-Strategies-for-Multiling…