Re: [xwiki-devs] [DISCUSSION] Handling document translations in Solr Search

27 Nov 2012

Hi Edy,
On 11/26/2012 07:25 PM, Eduard Moraru wrote:
...
  Hi devs,
 Any other input on this matter?
 To summarize a bit, if we go with the multiple fields for each language, we
 end up with an index like:
 English version:
 id: xwiki:Main.SomeDocument_en
 language: en
 space: Main
 title_en: XWiki document
 doccontent_en: This is some content
 French version:
 id: xwiki:Main.SomeDocument_fr
 language: fr
 space: Main
 title_fr: XWiki document
 doccontent_fr: This is some content
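The index layout above can be sketched as a small helper that builds one entry per translation. This is only an illustration: the function name `build_solr_document` and its parameters are hypothetical, and the result is a plain dictionary rather than a real Solr client object.

```python
def build_solr_document(wiki, space, page, language, title, content):
    """Return one index entry for a single translation of a page.

    Hypothetical helper: field names follow the layout in the summary
    (language-suffixed title and doccontent fields).
    """
    return {
        "id": f"{wiki}:{space}.{page}_{language}",
        "language": language,
        "space": space,
        f"title_{language}": title,
        f"doccontent_{language}": content,
    }


doc_en = build_solr_document(
    "xwiki", "Main", "SomeDocument", "en", "XWiki document", "This is some content"
)
doc_fr = build_solr_document(
    "xwiki", "Main", "SomeDocument", "fr", "XWiki document", "This is some content"
)
```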
 The Solr configuration is generated by some XWiki UI that returns a zip
 that the admin has to unpack in his (remote) Solr instance. This could be
 automated for the embedded instance. 
IMHO this is a "must", not a "could" for embedded instances.
...
    This operation is to be performed each
 time an admin changes the indexed languages (rarely or even only once). 
There could be a reminder in the admin language UI.
...

 Querying such a schema is a bit tricky when you are interested in more than
 one language, because you have to add all the clauses (title_en, title_fr,
 etc.) specific to the languages you are interested in. 
But that's an exotic use case already I think. The common case is to
query for the context language only.
...

 Some extra fields might also be added like title_ws (for whitespace
 tokenization only) that have various approaches to the indexing operation,
 with the aim of improving the relevancy.
 One solution to simplify the query for API clients would be to use fields
 like "title" and "doccontent" and to put as values very lightly (or not at
 all) analyzed content, as Paul suggested. This would allow applications to
 write simple (and backwards compatible maybe) queries that will still work,
 but will not catch some of the nuances of specific languages. As far as
 I've seen until now, applications are not very interested in nuances, but
 rather in filtering the results, a task for which this solution might be
 well suited. Of course, nothing stops applications from using the *new* and
 more expressive fields that are properly analyzed.
 Thus, the search application will be the major beneficiary of these
 analyzed fields (title_en, title_fr, etc.), while still allowing
 applications to get their job done (through generic, but less/not analyzed
 fields like "title", "doccontent", etc.). 
I think applications would benefit from having this complexity and these
implementation details hidden behind a "query builder" interface of some
sort, WDYT?
Jerome
...

 WDYT?
 Thanks,
 Eduard
 On Wed, Nov 21, 2012 at 10:49 PM, Eduard Moraru <enygma2002(a)gmail.com> wrote:
  Hi Paul,
 I was counting on your feedback :)
 On Wed, Nov 21, 2012 at 3:04 PM, Paul Libbrecht <paul(a)hoplahup.net> wrote:
  Hello Eduard,
 it's nice to see you take this further.
  This issue has already been previously [1] discussed during the GSoC
 project, but I am not particularly happy with the chosen approach.
 When handling multiple languages, there are generally[2][3] 3 different
 approaches:
 1) Indexing the content in a single field (like title, doccontent, etc.)
 - This has the advantage that queries are clear and fast
 - The disadvantage is that you can not run very well tuned analyzers on the
 fields, having to resort to (at best) basic tokenization and lowercasing.
 2) Indexing the content in multiple fields, one field for each language
 (like title_en, title_fr, doccontent_en, doccontent_fr, etc.)
 - This has the advantage that you can easily specify (as dynamic fields)
 that *_en fields are of type text_en (and analyzed by an English-centered
 chain of analyzers); *_fr of type text_fr (focused on French, etc.), thus
 making the results much better.

 I would add one more field here: title_ws and text_ws where the full text
 is analyzed just as words (using the whitespace-tokenizer?).
 A match there would even be preferred to a match in the below text-fields.
 (maybe that would be called title and text?)

 - The disadvantage is that querying such a schema is a pain. If you want
 all the results in all languages, you end up with a big and expensive
 query.

 Why is this an issue?
 Dismax does it for you for free (thanks to the "qf" parameter that
 gives weight to each of the fields).
 This is an issue only if you start to have more than 100 languages or
 so...
 Lucene, the underlying engine of Solr, handles thousands of clauses in a
 query without an issue (this is how prefix queries are handled: they are
 expanded to a query for any of the terms that match the prefix, and a
 setting deep somewhere, which is about 2000, keeps this from exploding).
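A minimal sketch of how a client could let dismax spread one user query over the per-language fields. `defType` and `qf` are standard Solr request parameters; the helper name `dismax_params`, the field list and the boost values are assumptions chosen for illustration.

```python
from urllib.parse import urlencode


def dismax_params(user_query, languages, boosts=None):
    """Build request parameters that search every per-language field.

    The default boosts (title weighted twice as much as doccontent) are
    an arbitrary assumption, not values taken from this thread.
    """
    boosts = boosts or {"title": 2.0, "doccontent": 1.0}
    qf = " ".join(
        f"{field}_{lang}^{boost}"
        for lang in languages
        for field, boost in boosts.items()
    )
    return {"defType": "dismax", "q": user_query, "qf": qf}


params = dismax_params("open source", ["en", "fr"])
# URL-encoded form, ready to append to a /select request.
query_string = urlencode(params)
```

The user still types a plain query; only the `qf` list grows with the number of indexed languages.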
  Sure, Solr is great when you want to do simple queries like "XWiki Open
 Source", however, since in XWiki we also expose the Solr/Lucene query APIs
 to the platform, there will be (as it is currently with Lucene) a lot of
 extensions wanting to do search using this API. These extensions (like the
 search suggest for example, rest search, etc) want to do something like
 "title:'Open Source' AND type:document AND doccontent:XWiki". Because
 option 2) is so messy in its fields, it would mean that the extension
 would have to come up with a query like "title_en:'Open Source' AND
 type:document AND doccontent_en:XWiki" (assuming that it is only limited to
 the current -- english or whatever -- language; what happens if it wants to
 do that no matter what language? It will have to specify each combination
 possible because we can't use generic field names).
 Solr's approach works for using it in your web application's search input,
 in a specific usecase, where you have precisely specified the default
 search fields and their boosts inside your schema.xml. However, as a search
 API, using option 2) you are making the life of anyone else wanting to use
 the Solr search API really hard. Also, your search application will work
 nicely when the user enters a simple query in the input field, but an
 advanced user will suffer the same fate when trying to write an advanced
 query, thus not relying on the default query (computed by solr based on
 schema.xml).
 Also, based on your note above regarding improvements like title_ws and
 such, again, all of these are very well suited for the search application
 use case, together with the default query that you configure in schema.xml,
 making the search results perform really well. However, what do all these
 fields mean to another extension wanting to do search? Will it have to
 handle all these implementation details to query for title, content and
 such? I'm not sure how well this would work in practice.
 Unrealistic idea(?): perhaps we should come up with an abstract search
 language (Solr/Lucene clone) that parses the searched fields and hides the
 complexities of all the indexed fields, allowing to write simple queries
 like "title:XWiki", while this gets translated to "title_en:XWiki OR
 title_fr:XWiki OR title_de:XWiki..." :)
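The translation step described here could look roughly like the sketch below. It is a naive regular-expression rewrite, not a real query parser: it handles only simple `field:term` and `field:"phrase"` clauses, and the names `LOCALIZED_FIELDS` and `expand_query` are hypothetical.

```python
import re

# Generic field names that exist only as per-language variants (assumption).
LOCALIZED_FIELDS = {"title", "doccontent"}

# field:"quoted phrase" or field:term (term stops at whitespace or parentheses)
CLAUSE = re.compile(r'(\w+):("[^"]*"|[^\s()]+)')


def expand_query(query, languages):
    """Rewrite generic field clauses into an OR over per-language fields."""

    def expand(match):
        field, term = match.group(1), match.group(2)
        if field not in LOCALIZED_FIELDS:
            # Fields such as type or space are left untouched.
            return match.group(0)
        clauses = [f"{field}_{lang}:{term}" for lang in languages]
        return "(" + " OR ".join(clauses) + ")"

    return CLAUSE.sub(expand, query)


expanded = expand_query("title:XWiki AND type:document", ["en", "fr", "de"])
```

A production version would need the real Lucene query grammar (ranges, escapes, nested groups), which this sketch deliberately ignores.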
 Am I approaching this wrong by trying to have both a tweakable/tweaked
 search application AND a search API? Are the two not compatible? Do we have
 to sacrifice search result performance (no language-specific stuff) to be
 able to have a usable API?
   If you want just some language, you have to read the right fields
 (ex title_en) instead of just getting a clear field name (title).

 You have to be careful, this is really only if you want to be specific.
 In this case, it is likely that you also do not want so much stemming.
 My experience, which was before dismax on curriki.org, has made it so
 that any query that is a bit specific is likely to not desire stemming.

  Can you please elaborate on this? I'm not sure I understand the problem.
   -- Also, the schema.xml definition is a static one in this concern,
 requiring you to know beforehand which languages you want to support (for
 example when defining the default fields to search for). Adding a new
 language requires you to start editing the xml files by hand.

 True but the available languages are almost all hand-coded.
 You could generate the schema.xml based on the available languages if not
 hand-generated?

  Basically I would have to output a zip with schema.xml, solrconfig.xml and
 then all the resources specific to all the selected languages (stopwords,
 synonyms, etc) for the languages that we can provide out of the box. For
 other languages, the admin would have to get dirty with the xmls.
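As a rough sketch of the generation step, the per-language part of schema.xml could be produced from the list of selected languages. The `dynamicField` element and its attributes are standard Solr schema syntax; the helper name `dynamic_fields` is hypothetical, and the sketch assumes that a matching `text_<lang>` field type is defined elsewhere in the schema.

```python
def dynamic_fields(languages):
    """Return schema.xml dynamicField lines for the given language codes.

    Assumes a text_<lang> field type exists for every language passed in.
    """
    lines = []
    for lang in languages:
        lines.append(
            f'<dynamicField name="*_{lang}" type="text_{lang}" '
            f'indexed="true" stored="true"/>'
        )
    return "\n".join(lines)


fragment = dynamic_fields(["en", "fr"])
```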
  There's one catch with this approach, which is new to me but seems quite
 important for implementing it: the idf should be modified (through the
 Similarity class) so that the total number of documents is the total
 number of documents having that language.
 See:
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201211.mbox/%3Cza…
 The solution sketched there sounds easy but I have not tried it.
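To see why the document count matters, here is a small worked example. It assumes the classic Lucene TF-IDF formula, idf = 1 + ln(numDocs / (docFreq + 1)); the index sizes and term frequency are made-up numbers, and this is not the solution from the linked message.

```python
import math


def idf(doc_freq, num_docs):
    """Classic Lucene-style inverse document frequency."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


total_docs = 10000   # all translations in the index (made-up number)
french_docs = 500    # only the French translations (made-up number)
doc_freq = 49        # French documents containing the term

# Counting every document makes a French term look rarer than it really is.
global_idf = idf(doc_freq, total_docs)
# Counting only French documents gives the weight the term has in its language.
per_language_idf = idf(doc_freq, french_docs)
```

With these numbers the global value is noticeably larger than the per-language one, so terms from small languages would be over-weighted unless the count is restricted.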
  3) Indexing the content in different Solr cores (indexes), one for each
 language. Each core requires its own directory and configuration files.
 - The advantage is that queries are clean to write (like option 1) and that
 you have a nice separation
 - The disadvantage is that it's difficult to get it right (administrative
 issues) and then you also have the (considerable) problem of having to fix
 the relevancy score of a query result that has entries from different
 cores; each core has its own relevancy computed and does not consider the
 others.
 - To make it even worse, it seems that you can not [5] also push to a
 remote Solr instance the configuration files when creating a new core
 programmatically. However, if we are running an embedded Solr instance, we
 could provide a way to generate the config files and write them to the data
 directory.

 Post-processing results is very very very dangerous as performance is at
 risk (e.g. if a core does not answer)... I would tend to avoid that as much
 as possible.
  Not really related, but this reminds me about the post processing that I
 do for checking view rights over the returned result, but that's another
 discussion that we will probably need to have :)
   Currently I have implemented option 1) in our existing Solr integration,
 which is also more or less compatible with our existing Lucene queries, but
 I would like to find a better solution that actually analyses the content.
 During GSoC, option 2) was preferred but the implementation did not
 consider practical reasons like the ones described above (query complexity,
 user configuration, etc.)

 True, Savitha surfed the possibility of having different solr documents
 per language.
 I still could not be sure that this was not showing the document match
 single in one language.
 However, indicating which language it is matched into is probably
 useful...
  Already doing that.
  Funnily, cross-language-retrieval is a mature research field but
 retrieval for multilanguage users is not so!

 On a related note, I have also watched an interesting presentation [3]
 about how Drupal handles its Solr integration and, particularly, a plugin
 [4] that handles the multilingual aspect.
 The idea seen there is that you have this UI that helps you generate
 configuration files, depending on your needs. For instance, you (admin)
 check that you need search for language English, French and German and the
 ui/extension gives you a zip with the configuration you need to use in your
 (remote or embedded) solr instance. The configuration for each language
 comes preset with the analyzers you should use for it and the additional
 resources (stopwords.txt, synonyms.txt, etc.).
 This approach helps with avoiding the need for admins to be forced to edit
 xml files and could also still be useful for other cases, not only option
 2).

 Generating sounds like an easy approach to me.
  Yes, however I don't like the fact that we can not do everything from the
 webapp and the admin needs to access the filesystem to install the given
 configuration on the embedded/remote solr directory. Lucene does not have
 this problem now. It just works with XWiki and everything is done from
 XWiki UI. I feel that losing this commodity will not be very well received
 by users that now have some new install steps to get XWiki running.
 Well, of course, for the embedded solr version, we could handle it like we
 do now and push the files directly from the webapp to the filesystem. Since
 embedded will be default, it should be OK and avoid the extra install step.
 Users with a remote solr machine should have the option to get the zip
 instead.
 Not sure if we can apply the new configuration without a restart, but I'll
 have to look more into it. I know the multi-core architecture supports
 something like this but will have to see the details.
   All these problems basically come from the fact that there is no way to
 specify in the schema.xml that, based on the value of a field (like the
 field "lang" that stores the document language), you want to run this or
 that group of analyzers.

 Well, this is possible with ThreadLocal but is not necessarily a good
 idea.
 Also, it is very common that users formulate queries without formulating
 their language and thus you need to "or" the user's queries through
 multiple languages (e.g. given by the browser).
  Perhaps a solution would be a custom kind of "AggregatorAnalyzer" that
 would call other analyzers at runtime, based on the value of the lang
 field. However, this solution could only be applied at index time, when you
 have the lang information (in the solrDocument to be indexed), but when you
 perform the query, you can not analyze the query text since you do not know
 the language of the field you're querying (it was determined at runtime -
 at index time) and thus do not know what operations to apply to the query
 (to reduce it to the same form as the indexed values).

 How would that look at query time?
  That's what I was saying, that at query time, the searched term will not
 get analyzed by the right chain. When you search for a single language, you
 could add that language as a query filter and then you could apply the
 right chain, but when searching in 2 or more (or no, meaning all) languages
 you are stuck.
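The asymmetry between index time and query time can be illustrated with toy analyzers. This is only a sketch in plain Python, not Lucene's Analyzer API: the two "chains" are deliberately simplistic stand-ins, and every name in it is hypothetical.

```python
def analyze_en(text):
    """Toy English chain: lowercase and strip a plural "s"."""
    return [
        w[:-1] if w.endswith("s") and len(w) > 3 else w
        for w in text.lower().split()
    ]


def analyze_fr(text):
    """Toy French chain: lowercase and drop elided articles such as l'."""
    return [w.split("'", 1)[1] if "'" in w else w for w in text.lower().split()]


ANALYZERS = {"en": analyze_en, "fr": analyze_fr}


def analyze_for_index(document):
    """At index time the document carries its language, so dispatch is easy."""
    return ANALYZERS[document["lang"]](document["doccontent"])


def analyze_for_query(text, languages):
    """At query time the language is unknown unless the caller filters on it,
    so the text has to be run through every candidate chain."""
    return {lang: ANALYZERS[lang](text) for lang in languages}


indexed = analyze_for_index({"lang": "en", "doccontent": "Open documents"})
candidates = analyze_for_query("l'index", ["en", "fr"])
```

With a single language filter the query needs one chain; with two or more languages it produces one analyzed form per language, which is effectively the multi-field query of option 2).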
   I have also read another interesting analysis [6] on this problem that
 elaborates on the complexities and limitations of each option. (Ignore the
 Rosette stuff mentioned there)
 I have been thinking about this for some time now, but the solution is
 probably somewhere in between, finding an option that is acceptable while
 not restrictive. I will probably also send a mail on the Solr list to get
 some more input from there, but I get the feeling that whatever solution we
 choose, it will most likely require the users to at least copy (or even
 edit) some files into some directories (configurations and/or jars), since
 it does not seem to be easy/possible to do everything on-the-fly,
 programmatically.

 The only hard step is when changing the supported languages, I think.
 In this case, when automatically generating the index, you need to warn
 the user.
 The admin UI should have a checkbox "use generated schema" or a textarea
 for the schema.
  Please see above regarding configuration generation. Basically, since we
 are going to support both embedded and remote solr instances, we could
 support things like editing the schema from XWiki only for the embedded
 instance, but not for the remote one. We might end up having separate UIs
 for each case, since we might want to exploit the flexibility of the
 embedded one as much as possible.
  Those that want particular fields and tunings need to write their own
 schema.
 The same UI could also include whether to include a phonetic track or not
 (then require reindexing). 
  hope it helps.
  Yes, very helpful so far. I'm counting on your expertise with Lucene/Solr
 on the details. My current approach is a practical one without previous
 experience on the topic, so I'm still doing mostly guesswork in some areas.
 Thanks,
 Eduard
  paul
 _______________________________________________
 devs mailing list
 devs(a)xwiki.org
 http://lists.xwiki.org/mailman/listinfo/devs

