To summarize a bit: if we go with multiple fields per language, we end up with an index like:
English version:

  id: xwiki:Main.SomeDocument_en
  language: en
  space: Main
  title_en: XWiki document
  doccontent_en: This is some content

French version:

  id: xwiki:Main.SomeDocument_fr
  language: fr
  space: Main
  title_fr: XWiki document
  doccontent_fr: This is some content
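For illustration, a minimal Solr schema sketch that could back such an index; the field type names and analyzer chains below are my assumptions, not something that exists today:

  <!-- sketch only: one field type per language, names are illustrative -->
  <fieldType name="text_en" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.PorterStemFilterFactory"/>
    </analyzer>
  </fieldType>
  <fieldType name="text_fr" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SnowballPorterFilterFactory" language="French"/>
    </analyzer>
  </fieldType>
  <!-- any field suffixed with a language code picks up that language's analyzer -->
  <dynamicField name="*_en" type="text_en" indexed="true" stored="true"/>
  <dynamicField name="*_fr" type="text_fr" indexed="true" stored="true"/>

With dynamic fields, adding a language is a schema-only change; the indexer just has to emit title_xx/doccontent_xx for it.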
Careful Eduard, that means that searching for common words (e.g. "direction") would show you the same XWiki document as two different search results. I think you would want to combine the two into one Solr document. This is Savitha's old problem which, I think, she has not solved.
I've always believed that the fact that these two documents exist implies that they
both have the same meaning and are translations of each other.
I like the idea of carrying out the whole analysis in the language defined by the document language (it could also be an object field).
Some extra fields might also be added, like title_ws (whitespace tokenization only), that take different approaches to indexing, with the aim of improving relevancy.
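A sketch of what such a field could look like in the schema (the type and field names are illustrative only):

  <fieldType name="text_ws" class="solr.TextField">
    <analyzer>
      <!-- split on whitespace only, no stemming or lowercasing -->
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    </analyzer>
  </fieldType>
  <field name="title_ws" type="text_ws" indexed="true" stored="false"/>
  <!-- feed it from whichever localized title the document carries -->
  <copyField source="title_*" dest="title_ws"/>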
One solution to simplify the query for API clients would be to use fields like "title" and "doccontent" and to put in them very lightly (or not at all) analyzed content, as Paul suggested. This would allow applications to write simple (and maybe backwards-compatible) queries that will still work, but will not catch some of the nuances of specific languages.
It's a good idea to stay backwards compatible with behaviour as predictable as the whitespace analyzer's.
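Concretely, that could be as small as two copyField rules into lightly analyzed generic fields (again just a sketch, reusing the hypothetical text_ws type from above):

  <field name="title" type="text_ws" indexed="true" stored="true"/>
  <field name="doccontent" type="text_ws" indexed="true" stored="true"/>
  <!-- every localized field also lands in its generic counterpart -->
  <copyField source="title_*" dest="title"/>
  <copyField source="doccontent_*" dest="doccontent"/>

Existing queries against "title" and "doccontent" would keep working; they would just miss the language-specific analysis.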
On 27 Nov 2012, at 16:27, Jerome Velociter wrote:
Thus, the search application will be the major beneficiary of these analyzed fields (title_en, title_fr, etc.), while still allowing applications to get their job done (through generic, but less/not analyzed, fields like "title", "doccontent", etc.).
I think that for applications this complexity/these implementation details would benefit from being hidden behind a "query builder" interface of some sort, WDYT?
Absolutely. Note also that such a query expander (I believe this is the usual term) already exists within EDismax. I'd add the expand-along-language function (where, if the wiki is multilingual and the browser gives the languages en, ro, fr, text becomes text_en^3 text_ro^2 text_fr^1).
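For example, the expanded request could simply carry the per-language boosts in the edismax qf parameter (boost values taken from the illustration above, nothing more):

  q=sitting
  defType=edismax
  qf=text_en^3 text_ro^2 text_fr^1

The client keeps sending the plain query text; only qf would change with the browser's language list.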
On 27 Nov 2012, at 16:44, Ludovic Dubost wrote:
Maybe a solution would be to create one index per language and to index ALL content, regardless of its language, using the language analyzer of that index.
I fear that this would bring zillions of false positives, because many people have a long list of supported languages.
Stemming is quite aggressive sometimes... for example, searching for "sitting" will find every "sit", but this should not be the case when French alone is chosen (in French, only the gathering of people is meant).
If a browser indicates fr and de as languages and the user searches for "sitting", this would find documents with attachments that contain "sit" in any language, even though searching in English was not activated.
This latter solution would be the only one that would really work on file attachments, as we have no information about the specific language of file attachments (or even of XWiki objects), which are attached to the main document and not to the translated document.
It is not entirely true that object fields and attachments do not carry a language. But I agree that there may be installations where the admin would prefer that attachments and objects be considered multilingual. Note that your solution is also doable with the multi-field approach above and does not require several indices.
- There's the wiki language, which one could apply to attachments and objects.
- There could be object properties carrying it (e.g. Curriki has this); this would need to be customizable.
- There's the language of documents, or of sections, in several file formats (e.g. PDF or Word files): while the current extractors do not honour this (I think), it could be used to switch analyzers.
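As for doing it with the multi-field approach: the extracted attachment text could simply be fanned out to every configured language field at index time (attcontent is a made-up field name here, just to illustrate):

  <!-- one copy per configured language, each analyzed by its own chain -->
  <copyField source="attcontent" dest="attcontent_en"/>
  <copyField source="attcontent" dest="attcontent_fr"/>

That keeps a single index while still giving every language's analysis a chance to match.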
Again an option? ("index attachments in all languages", "index object fields in all languages")
paul