To summarize a bit: if we go with multiple fields per language, we end up with an index like:
English version:

  id: xwiki:Main.SomeDocument_en
  language: en
  space: Main
  title_en: XWiki document
  doccontent_en: This is some content

French version:

  id: xwiki:Main.SomeDocument_fr
  language: fr
  space: Main
  title_fr: XWiki document
  doccontent_fr: This is some content
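For illustration, a minimal Solr schema sketch that could back such an index; the field type names and analyzer chains below are my assumptions, not something that exists today:

  <!-- sketch only: one field type per language, names are illustrative -->
  <fieldType name="text_en" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.PorterStemFilterFactory"/>
    </analyzer>
  </fieldType>
  <fieldType name="text_fr" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SnowballPorterFilterFactory" language="French"/>
    </analyzer>
  </fieldType>
  <!-- any field suffixed with a language code picks up that language's analyzer -->
  <dynamicField name="*_en" type="text_en" indexed="true" stored="true"/>
  <dynamicField name="*_fr" type="text_fr" indexed="true" stored="true"/>

With dynamic fields, adding a language is a schema-only change; the indexer just has to emit title_xx/doccontent_xx for it.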
Careful Eduard, that means that searching for common words (e.g. "direction") would show you the same XWiki document as two different search results. I think you would want to combine the two into one Solr document. This is Savitha's old problem which, I think, she has not solved.
I've always believed that the fact that these two documents exist implies that they
both have the same meaning and are translations of each other.
I like the idea of carrying out the whole analysis in the language defined by the document language (it could also be an object field).
Some extra fields might also be added, like title_ws (whitespace tokenization only), that take different approaches to indexing, with the aim of improving relevancy.
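A sketch of what such a field could look like in the schema (the type and field names are illustrative only):

  <fieldType name="text_ws" class="solr.TextField">
    <analyzer>
      <!-- split on whitespace only, no stemming or lowercasing -->
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    </analyzer>
  </fieldType>
  <field name="title_ws" type="text_ws" indexed="true" stored="false"/>
  <!-- feed it from whichever localized title the document carries -->
  <copyField source="title_*" dest="title_ws"/>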
One solution to simplify the query for API clients would be to use fields like "title" and "doccontent" and to put in them very lightly (or not at all) analyzed content, as Paul suggested. This would allow applications to write simple (and maybe backwards-compatible) queries that will still work, but will not catch some of the nuances of specific languages.
It's a good idea to stay backwards compatible with behaviour as predictable as the whitespace analyzer's.
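Concretely, that could be as small as two copyField rules into lightly analyzed generic fields (again just a sketch, reusing the hypothetical text_ws type from above):

  <field name="title" type="text_ws" indexed="true" stored="true"/>
  <field name="doccontent" type="text_ws" indexed="true" stored="true"/>
  <!-- every localized field also lands in its generic counterpart -->
  <copyField source="title_*" dest="title"/>
  <copyField source="doccontent_*" dest="doccontent"/>

Existing queries against "title" and "doccontent" would keep working; they would just miss the language-specific analysis.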
On 27 Nov 2012, at 16:27, Jerome Velociter wrote:
Thus, the search application will be the major beneficiary of these analyzed fields (title_en, title_fr, etc.), while still allowing applications to get their job done (through generic, but less/not analyzed, fields like "title", "doccontent", etc.).
I think that for applications this complexity/these implementation details would benefit from being hidden behind a "query builder" interface of some sort, WDYT?
Absolutely. Note also that such a query expander (I believe this is the usual term) already exists within EDismax. I'd add the expand-along-language function (where, if the wiki is multilingual and the browser gives the languages en, ro, fr, text becomes text_en^3 text_ro^2 text_fr^1).
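For example, the expanded request could simply carry the per-language boosts in the edismax qf parameter (boost values taken from the illustration above, nothing more):

  q=sitting
  defType=edismax
  qf=text_en^3 text_ro^2 text_fr^1

The client keeps sending the plain query text; only qf would change with the browser's language list.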
On 27 Nov 2012, at 16:44, Ludovic Dubost wrote:
Maybe a solution would be to create one index per language and to index ALL content, regardless of its language, using the language analyzer of that index.
I fear that this would bring zillions of false positives, because many people have a long list of supported languages.
Stemming is quite aggressive sometimes... for example, searching for "sitting" will find every "sit", but this should not be the case when French alone is chosen (in French, only the gathering of people is meant).
If a browser indicates fr and de as languages and the user searches for "sitting", this would find documents with attachments that contain "sit" in any language, even though searching in English was not activated.
This latter solution would be the only one that would really work on file attachments, as we have no information about the specific language of file attachments (or even of XWiki objects), which are attached to the main document and not to the translated document.
It is not entirely true that object fields and attachments do not carry a language. But I agree that there may be installations where the admin would prefer that attachments and objects be considered multilingual. Note that your solution is also doable with the multi-field approach above and does not require several indices.
- There's the wiki language, which one could apply to attachments and objects.
- There could be object properties carrying it (e.g. Curriki has this); this would need to be customizable.
- There's the language of documents, or of sections, in several file formats (e.g. PDF or Word files): while the current extractors do not honour this (I think), it could be used to switch analyzers.
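As for doing it with the multi-field approach: the extracted attachment text could simply be fanned out to every configured language field at index time (attcontent is a made-up field name here, just to illustrate):

  <!-- one copy per configured language, each analyzed by its own chain -->
  <copyField source="attcontent" dest="attcontent_en"/>
  <copyField source="attcontent" dest="attcontent_fr"/>

That keeps a single index while still giving every language's analysis a chance to match.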
Again an option? ("index attachments in all languages", "index object fields in all languages")
paul