Hi Ludovic,
Thanks for the reply. Please read below...
On Tue, Nov 27, 2012 at 5:44 PM, Ludovic Dubost <ludovic(a)xwiki.com> wrote:
Hi Edy,
I'm not a huge fan of the title_fr title_en title_morelanguages approach as
indeed it seems to be quite complex at the query level. I was more leaning
towards multiple indexes if we can query them globally but I understand
this is complex too.
Now let's see the use cases that are hugely important:
1/ Make sure that if you decide your wiki is monolingual:
- the indexing uses the specific language analyzer
- make sure the query uses the specific language analyzer
- make sure the search looks in all content even if the language setting
of the document is wrongly set (consider all documents being of the
specific language)
You mean that, if the wiki is monolingual, we should ignore the language
filter and hardcode it to "All languages", right?
However, what would be the advantage of this? Why would we want to pollute
the results with irrelevant documents (caused, probably, by a recent
configuration change from multilingual to monolingual)? Wasn't that the
whole reason why the admin switched to monolingual?
2/ Allow, if a wiki is multi-lingual, to:
- search in the language you decide (maybe the UI should display a
language choice for the query)
We already support this, by using the "Filtered Search" option and
selecting the language.
- search in content that is analyzed in the proper language when the
content is declared in this language
- allow to specify if you want to restrict your search to documents
declared in the language of your query, versus searching more widely in all
documents across languages. If you search only in the language of the
query, only one document can show up, but it should point to the right
translation that matches; if you search in multiple languages, then you can
show individual translations.
- allow technical users to search for all documents
across all languages
(where the language analysis does not really matter)
Do you mean as an API?
What exactly do you mean by "language analysis does not really matter"? Any
example?
From an admin point of view it makes good sense to be able to specify in a
multilingual wiki which language analysis should be activated, and then
have this transmitted to Solr to properly configure the engine.
Reindexing is OK when changing the configuration.
I believe that, in the end, whether you use multiple fields with _fr/_en or
multiple Solr cores is more or less the same, as long as you can query
across Solr cores. If you cannot run a query merging multiple indexes, then
the first solution is absolutely necessary, as it would be the only one
allowing to search across all languages.
Maybe a solution would be to create one index per language and index ALL
content, regardless of its language, using the language analyzer of that
index. This would allow better results even when users have badly tagged
the language of a document, and it is then only the job of the UI to limit
the search to the language of the query, or to all documents.
So you could have a configuration in the admin that
says:
1/ Create an English Index
2/ Create an additional French index
The UI would allow to search in English and French, + would add a language
restriction for the documents.
Applying the language-specific analyzers (for Chinese, for example) to all
the documents will just create a mess for all the documents that do not
match the analyzer's language. I'm not sure the results for the
badly-indexed languages will make any sense to users.
Also, this is very similar to the multi-core approach (one core per
language), just that you also add documents that are indexed with the wrong
analyzers. We would have the same problem of merging relevance scores
across indexes (cores), which is a big turn-off for the original multi-core
approach.
In the future if we are able to "detect" the language of the documents we
could add a lucene field with the "detected" language instead of the
"provided" language of the documents, therefore increasing the quality of
searches only on documents of a specific language.
In the previous discussions (on the GSoC thread) we agreed that the
language in XWiki is known beforehand, so no recognition is required, at
least not at the document level.
This latter solution would be the only one that would really work on file
attachments, as we have no information about the specific language of file
attachments (or even XWiki objects), which are attached to the main
document and not to the translated document.
Yes, this is a problem right now. AFAIU, the plan [1] is to support
translated objects and maybe attachments as well. Until then, we could
either:
1) Use the original document's language to index the attachment's content
2) Use a language detection library to try to detect the attachment
content's language and index it accordingly.
The above could also be applied for objects and their properties.
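For option 2), something like Apache Tika's language identifier could
probably be used. Just a sketch of the idea (Tika here is an assumption on
my part, and the class and parameter names are purely illustrative):

    import org.apache.tika.language.LanguageIdentifier;

    /**
     * Sketch only: guess the language of an attachment's extracted text so it
     * can be indexed with the matching language analysis. Falls back to the
     * owning document's language (option 1) when the guess is not reliable.
     */
    public class AttachmentLanguageGuesser
    {
        public String guessLanguage(String extractedText, String documentLanguage)
        {
            LanguageIdentifier identifier = new LanguageIdentifier(extractedText);
            if (identifier.isReasonablyCertain()) {
                return identifier.getLanguage();
            }
            return documentLanguage;
        }
    }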
----------
[1]
This latter issue shows that a search on "only French content" should still
include the attachments, because we have no idea whether the attachments
are "French" or "English".
(The paragraphs below discuss what currently exists and what could be
done, ignoring the possible language detection mentioned above)
Right now a document also indexes the object's properties in a field called
"objcontent". I do this for all translations, thus duplicating the field's
value in all translations. I can do the same for attachments. The purpose
is, indeed, to be able to find document translations based on hits in their
objects/attachments. If a language filter is used and there is a hit in an
object, only one document is returned. If there are no language filters,
all translations will be returned.
However, if we search for the object/property/attachment itself, it will
only be assigned to one language: the language of the original document.
This means that if we search for all languages, the object itself will be
found too (there is no language filter used). If we add a language filter
that is different from the object/property/attachment's original document
language, the object/property/attachment will not be found.
Maybe we can come up with some processing of the query in the search
application that applies the language filter only for documents:
((-type:"OBJECT" OR -type:"OBJECT_PROPERTY" OR -type:"ATTACHMENT") OR lang:"<userSelectedLanguage>")
-- writing it like this because the default operator is AND in the query
filter clause that we use in the Search application.
The problem with this is that, when a language filter is used, the
objects/properties/attachments that are now included in the results might
not have the specified language and will pollute the results.
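Just to illustrate the mechanics, the Search application could attach such
a restriction as a filter query through SolrJ. A sketch, written in the
equivalent positive form, with the field and type names used above
(everything else is illustrative):

    import org.apache.solr.client.solrj.SolrQuery;

    public class LanguageFilterExample
    {
        /**
         * Sketch only: the user's query plus a filter that requires the selected
         * language for documents, while letting objects, properties and
         * attachments through regardless of their (unknown) language.
         */
        public SolrQuery withLanguageFilter(String userQuery, String language)
        {
            SolrQuery query = new SolrQuery(userQuery);
            query.addFilterQuery("type:\"OBJECT\" OR type:\"OBJECT_PROPERTY\""
                + " OR type:\"ATTACHMENT\" OR lang:\"" + language + "\"");
            return query;
        }
    }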
Thanks,
Eduard
Ludovic
2012/11/26 Eduard Moraru <enygma2002(a)gmail.com>
Hi devs,
Any other input on this matter?
To summarize a bit, if we go with the multiple fields for each language,
we
end up with an index like:
English version:
id: xwiki:Main.SomeDocument_en
language: en
space: Main
title_en: XWiki document
doccontent_en: This is some content
French version:
id: xwiki:Main.SomeDocument_fr
language: fr
space: Main
title_fr: XWiki document
doccontent_fr: This is some content
The Solr configuration is generated by some XWiki UI that returns a zip
that the admin has to unpack in his (remote) Solr instance. This could be
automated for the embedded instance. This operation is to be performed each
time an admin changes the indexed languages (rarely or even only once).
Querying such a schema is a bit tricky when you are interested in more than
one language, because you have to add all the clauses (title_en, title_fr,
etc.) specific to the languages you are interested in.
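For example, a client that wants "XWiki" in the title, in English and
French, would have to spell out something like this (a sketch; field names
as in the example above):

    import org.apache.solr.client.solrj.SolrQuery;

    public class MultiLanguageTitleQueryExample
    {
        // Illustration only: one clause per language of interest, plus one
        // more clause for every additional indexed language.
        public SolrQuery buildTitleQuery()
        {
            SolrQuery query = new SolrQuery("title_en:XWiki OR title_fr:XWiki");
            query.addFilterQuery("space:Main");
            return query;
        }
    }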
Some extra fields might also be added, like title_ws (for whitespace
tokenization only), that take different approaches to the indexing
operation, with the aim of improving the relevancy.
One solution to simplify the query for API clients would be to use fields
like "title" and "doccontent" and to put in them very lightly (or not at
all) analyzed content, as Paul suggested. This would allow applications to
write simple (and maybe backwards compatible) queries that will still work,
but will not catch some of the nuances of specific languages. As far as
I've seen until now, applications are not very interested in nuances, but
rather in filtering the results, a task for which this solution might be
well suited. Of course, nothing stops applications from using the *new* and
more expressive fields that are properly analyzed.
Thus, the search application will be the major beneficiary of these
analyzed fields (title_en, title_fr, etc.), while still allowing
applications to get their job done (through generic, but less/not analyzed
fields like "title", "doccontent", etc.).
WDYT?
Thanks,
Eduard
On Wed, Nov 21, 2012 at 10:49 PM, Eduard Moraru <enygma2002(a)gmail.com>
wrote:
Hi Paul,
I was counting on your feedback :)
On Wed, Nov 21, 2012 at 3:04 PM, Paul Libbrecht <paul(a)hoplahup.net>
wrote:
>
> Hello Eduard,
>
> it's nice to see you take this further.
>
> > This issue has already been previously [1] discussed during the GSoC
> > project, but I am not particularly happy with the chosen approach.
> > When handling multiple languages, there are generally [2][3] 3 different
> > approaches:
> >
> > 1) Indexing the content in a single field (like title, doccontent, etc.)
> > - This has the advantage that queries are clear and fast
> > - The disadvantage is that you can not run very well tuned analyzers on
> > the fields, having to resort to (at best) basic tokenization and
> > lowercasing.
> >
> > 2) Indexing the content in multiple fields, one field for each language
> > (like title_en, title_fr, doccontent_en, doccontent_fr, etc.)
> > - This has the advantage that you can easily specify (as dynamic fields)
> > that *_en fields are of type text_en (and analyzed by an english-centered
> > chain of analyzers); *_fr of type text_fr (focused on french, etc.), thus
> > making the results much better.
>
> I would add one more field here: title_ws and text_ws where the full text
> is analyzed just as words (using the whitespace-tokenizer?).
> A match there would even be preferred to a match in the below text-fields.
>
> (maybe that would be called title and text?)
>
>> > - The disadvantage is that querying such a schema is a pain. If you want
>> > all the results in all languages, you end up with a big and expensive
>> > query.
>>
>> Why is this an issue?
>> Dismax does it for you for free (thanks to the "qf" parameter that
>> gives weight to each of the fields).
>> This is an issue only if you start to have more than 100 languages or
>> so...
>> Lucene, the underlying engine of Solr, handles thousands of clauses in a
>> query without an issue (this is how prefix-queries are handled... they are
>> expanded to a query for any of the terms that match the prefix; a setting
>> deep somewhere, which is about 2000, avoids this exploding).
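A sketch of what using dismax over the per-language fields could look like
from SolrJ (field names and boosts are illustrative):

    import org.apache.solr.client.solrj.SolrQuery;

    public class DismaxQueryExample
    {
        // Sketch only: let dismax spread the user's keywords over the
        // per-language fields, with illustrative boosts.
        public SolrQuery buildUserQuery(String userInput)
        {
            SolrQuery query = new SolrQuery(userInput);
            query.set("defType", "dismax");
            query.set("qf", "title_en^3 title_fr^3 doccontent_en doccontent_fr");
            return query;
        }
    }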
Sure, Solr is great when you want to do simple queries like "XWiki Open
Source", however, since in XWiki we also expose the Solr/Lucene query APIs
to the platform, there will be (as it is currently with Lucene) a lot of
extensions wanting to do search using this API. These extensions (like the
search suggest for example, REST search, etc.) want to do something like
"title:'Open Source' AND type:document AND doccontent:XWiki". Because
option 2) is so messy in its fields, it would mean that the extension
would have to come up with a query like "title_en:'Open Source' AND
type:document AND doccontent_en:XWiki" (assuming that it is only limited
to the current -- English or whatever -- language; what happens if it wants
to do that no matter what language? It will have to specify each possible
combination because we can't use generic field names).
Solr's approach works for using it in your web application's search input,
in a specific use case, where you have precisely specified the default
search fields and their boosts inside your schema.xml. However, as a search
API, using option 2) you are making the life of anyone else wanting to use
the Solr search API really hard. Also, your search application will work
nicely when the user enters a simple query in the input field, but an
advanced user will suffer the same fate when trying to write an advanced
query, thus not relying on the default query (computed by Solr based on
schema.xml).
Also, based on your note above regarding improvements like title_ws and
such, again, all of these are very well suited for the search application
use case, together with the default query that you configure in schema.xml,
making the search results perform really well. However, what do all these
fields mean to another extension wanting to do search? Will it have to
handle all these implementation details to query for title, content and
such? I'm not sure how well this would work in practice.
Unrealistic idea(?): perhaps we should come up with an abstract search
language (Solr/Lucene clone) that parses the searched fields and hides the
complexities of all the indexed fields, allowing to write simple queries
like "title:XWiki", while this gets translated to "title_en:XWiki OR
title_fr:XWiki OR title_de:XWiki..." :)
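A naive sketch of what such a translation could look like (the field and
language lists are made up; it does not handle quoted phrases or nested
queries):

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    /**
     * Sketch only: rewrite generic field names into a disjunction over the
     * per-language variants, leaving everything else (type:, space:, boolean
     * operators...) untouched.
     */
    public class LanguageFieldRewriter
    {
        private static final Pattern GENERIC_FIELD = Pattern.compile("\\b(title|doccontent):(\\S+)");

        public String rewrite(String query, String[] languages)
        {
            Matcher matcher = GENERIC_FIELD.matcher(query);
            StringBuffer result = new StringBuffer();
            while (matcher.find()) {
                StringBuilder expanded = new StringBuilder("(");
                for (int i = 0; i < languages.length; i++) {
                    if (i > 0) {
                        expanded.append(" OR ");
                    }
                    expanded.append(matcher.group(1)).append('_').append(languages[i])
                        .append(':').append(matcher.group(2));
                }
                expanded.append(')');
                matcher.appendReplacement(result, Matcher.quoteReplacement(expanded.toString()));
            }
            matcher.appendTail(result);
            return result.toString();
        }
    }

    // rewrite("title:XWiki AND type:document", new String[] {"en", "fr", "de"})
    //   => (title_en:XWiki OR title_fr:XWiki OR title_de:XWiki) AND type:document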
Am I approaching this wrong by trying to have both a tweakable/tweaked
search application AND a search API? Are the two not compatible? Do we have
to sacrifice search result performance (no language-specific stuff) to be
able to have a usable API?
>
>> > If you want just some language, you have to read the right fields
>> > (ex. title_en) instead of just getting a clear field name (title).
>>
>> You have to be careful, this is really only if you want to be specific.
>> In this case, it is likely that you also do not want so much stemming.
>> My experience, which was before dismax on curriki.org, has made it so
>> that any query that is a bit specific is likely to not desire stemming.
>>
Can you please elaborate on this? I'm not sure I understand the problem.
>
>
>>
>> > -- Also, the schema.xml definition is a static one in this regard,
>> > requiring you to know beforehand which languages you want to support
>> > (for example when defining the default fields to search for). Adding a
>> > new language requires you to start editing the xml files by hand.
>
> True, but the available languages are almost all hand-coded.
> You could generate the schema.xml based on the available languages, if not
> hand-generated?
Basically I would have to output a zip with schema.xml, solrconfig.xml and
then all the resources specific to the selected languages (stopwords,
synonyms, etc.) for the languages that we can provide out of the box. For
other languages, the admin would have to get dirty with the XMLs.
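A rough sketch of the schema generation part (the dynamic field convention
from option 2); the rest of the zip (solrconfig.xml, stopwords, synonyms)
would be assembled similarly, and the file names here are just an example:

    import java.util.List;

    public class SchemaGenerator
    {
        /**
         * Sketch only: the per-language dynamic field declarations that would
         * go into the generated schema.xml; the corresponding text_<lang>
         * field types (analyzer chains, stopwords, synonyms) would be
         * generated alongside and packaged into the zip.
         */
        public String generateDynamicFields(List<String> languages)
        {
            StringBuilder xml = new StringBuilder();
            for (String language : languages) {
                xml.append("<dynamicField name=\"*_").append(language)
                    .append("\" type=\"text_").append(language)
                    .append("\" indexed=\"true\" stored=\"true\"/>\n");
            }
            return xml.toString();
        }
    }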
>
>
>>
>> There's one catch with this approach, which is new to me but seems to be
>> quite important for implementing it: the idf should be modified (the
>> Similarity class should be), so that the total number of documents used
>> is the total number of documents in that language.
>> See:
>> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201211.mbox/%3Cza…
>> The solution sketched there sounds easy but I have not tried it.
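For what it's worth, a very rough sketch of that idea, assuming the Lucene
3.x DefaultSimilarity API and that the per-language document count is
computed elsewhere (untested, as Paul says):

    import org.apache.lucene.search.DefaultSimilarity;

    /**
     * Rough sketch (Lucene 3.x API assumed, untested): compute idf against the
     * number of documents in one language instead of the whole index, so that
     * term rarity is judged per language.
     */
    public class PerLanguageSimilarity extends DefaultSimilarity
    {
        private final int languageDocCount;

        public PerLanguageSimilarity(int languageDocCount)
        {
            this.languageDocCount = languageDocCount;
        }

        @Override
        public float idf(int docFreq, int numDocs)
        {
            // Ignore the index-wide numDocs and use the per-language count.
            return super.idf(docFreq, this.languageDocCount);
        }
    }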
>
>> > 3) Indexing the content in different Solr cores (indexes), one for each
>> > language. Each core requires its own directory and configuration files.
>> > - The advantage is that queries are clean to write (like option 1) and
>> > that you have a nice separation
>> > - The disadvantage is that it's difficult to get it right
>> > (administrative issues) and then you also have the (considerable)
>> > problem of having to fix the relevancy score of a query result that has
>> > entries from different cores; each core has its own relevancy computed
>> > and does not consider the others.
>> > - To make it even worse, it seems that you can not [5] also push to a
>> > remote Solr instance the configuration files when creating a new core
>> > programmatically. However, if we are running an embedded Solr instance,
>> > we could provide a way to generate the config files and write them to
>> > the data directory.
>>
>> Post-processing results is very very very dangerous as performance is at
>> risk (e.g. if a core does not answer)... I would tend to avoid that as
>> much as possible.
>>
Not really related, but this reminds me about the post-processing that I
do for checking view rights over the returned results, but that's another
discussion that we will probably need to have :)
>
>> > Currently I have implemented option 1) in our existing Solr integration,
>> > which is also more or less compatible with our existing Lucene queries,
>> > but I would like to find a better solution that actually analyses the
>> > content.
>> >
>> > During GSoC, option 2) was preferred but the implementation did not
>> > consider practical reasons like the ones described above (query
>> > complexity, user configuration, etc.)
>>
>> True, Savitha explored the possibility of having different Solr documents
>> per language.
>> I still could not be sure that this was not showing the document match
>> only in a single language.
>>
>> However, indicating which language it is matched in is probably
>> useful...
>>
Already doing that.
>
>
>> Funnily, cross-language retrieval is a mature research field but
>> retrieval for multilanguage users is not so!
>>
>> > On a related note, I have also watched an interesting presentation [3]
>> > about how Drupal handles its Solr integration and, particularly, a
>> > plugin [4] that handles the multilingual aspect.
>> > The idea seen there is that you have this UI that helps you generate
>> > configuration files, depending on your needs. For instance, you (admin)
>> > check that you need search for the English, French and German languages
>> > and the ui/extension gives you a zip with the configuration you need to
>> > use in your (remote or embedded) Solr instance. The configuration for
>> > each language comes preset with the analyzers you should use for it and
>> > the additional resources (stopwords.txt, synonyms.txt, etc.).
>> > This approach helps with avoiding the need for admins to be forced to
>> > edit xml files and could also still be useful for other cases, not only
>> > option 2).
>>
>> Generating sounds like an easy approach to me.
>>
>
Yes, however I don't like the fact that we can not do everything from the
webapp and the admin needs to access the filesystem to install the given
configuration in the embedded/remote Solr directory. Lucene does not have
this problem now. It just works with XWiki and everything is done from the
XWiki UI. I feel that losing this convenience will not be very well
received by users that now have some new install steps to get XWiki
running.
Well, of course, for the embedded Solr version, we could handle it like we
do now and push the files directly from the webapp to the filesystem. Since
embedded will be the default, it should be OK and avoid the extra install
step. Users with a remote Solr machine should have the option to get the
zip instead.
Not sure if we can apply the new configuration without a restart, but I'll
have to look more into it. I know the multi-core architecture supports
something like this but I will have to see the details.
>
>> > All these problems basically come from the fact that there is no way to
>> > specify in the schema.xml that, based on the value of a field (like the
>> > field "lang" that stores the document language), you want to run this or
>> > that group of analyzers.
>>
>> Well, this is possible with ThreadLocal but is not necessarily a good
>> idea.
>> Also, it is very common that users formulate queries without stating
>> their language and thus you need to "or" the user's queries through
>> multiple languages (e.g. given by the browser).
>>
>> > Perhaps a solution would be a custom kind of "AggregatorAnalyzer" that
>> > would call other analyzers at runtime, based on the value of the lang
>> > field. However, this solution could only be applied at index time, when
>> > you have the lang information (in the solrDocument to be indexed), but
>> > when you perform the query, you can not analyze the query text since you
>> > do not know the language of the field you're querying (it was determined
>> > at runtime - at index time) and thus do not know what operations to
>> > apply to the query (to reduce it to the same form as the indexed
>> > values).
>>
>> How would that look at query time?
>>
>
That's what I was saying: at query time, the searched term will not get
analyzed by the right chain. When you search in a single language, you
could add that language as a query filter and then you could apply the
right chain, but when searching in 2 or more (or none, meaning all)
languages you are stuck.
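To make the index-time half of that idea concrete, here is a sketch of such
an aggregating analyzer, using the ThreadLocal trick Paul mentioned above
(Lucene 3.x-style API assumed; the names are made up, and as discussed it
does not solve the query-time side):

    import java.io.Reader;
    import java.util.Map;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;

    /**
     * Sketch only (Lucene 3.x-style API assumed): delegate analysis to the
     * analyzer registered for the language of the document currently being
     * indexed, communicated through a ThreadLocal set by the indexing code.
     * This only helps at index time.
     */
    public class AggregatorAnalyzer extends Analyzer
    {
        private static final ThreadLocal<String> CURRENT_LANGUAGE = new ThreadLocal<String>();

        private final Map<String, Analyzer> analyzersByLanguage;

        private final Analyzer fallback;

        public AggregatorAnalyzer(Map<String, Analyzer> analyzersByLanguage, Analyzer fallback)
        {
            this.analyzersByLanguage = analyzersByLanguage;
            this.fallback = fallback;
        }

        public static void setCurrentLanguage(String language)
        {
            CURRENT_LANGUAGE.set(language);
        }

        @Override
        public TokenStream tokenStream(String fieldName, Reader reader)
        {
            Analyzer delegate = this.analyzersByLanguage.get(CURRENT_LANGUAGE.get());
            return (delegate != null ? delegate : this.fallback).tokenStream(fieldName, reader);
        }
    }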
>
>>
>> > I have also read another interesting analysis [6] on this problem that
>> > elaborates on the complexities and limitations of each option. (Ignore
>> > the Rosette stuff mentioned there)
>> >
>> > I have been thinking about this for some time now, but the solution is
>> > probably somewhere in between, finding an option that is acceptable
>> > while not restrictive. I will probably also send a mail on the Solr list
>> > to get some more input from there, but I get the feeling that whatever
>> > solution we choose, it will most likely require the users to at least
>> > copy (or even edit) some files into some directories (configurations
>> > and/or jars), since it does not seem to be easy/possible to do
>> > everything on-the-fly, programmatically.
>>
>> The only hard step is when changing the supported languages, I think.
>> In this case, when automatically generating the index, you need to warn
>> the user.
>> The admin UI should have a checkbox "use generated schema" or a textarea
>> for the schema.
>>
>
Please see above regarding configuration generation. Basically, since we
are going to support both embedded and remote Solr instances, we could
support things like editing the schema from XWiki only for the embedded
instance, but not for the remote one. We might end up having separate UIs
for each case, since we might want to exploit the flexibility of the
embedded one as much as possible.
>
>> Those that want particular fields and tunings need to write their own
>> schema.
>>
>> The same UI could also include whether to include a phonetic track or not
>> (then require reindexing).
>>
>> hope it helps.
>>
Yes, very helpful so far. I'm counting on your expertise with Lucene/Solr
on the details. My current approach is a practical one without previous
experience on the topic, so I'm still doing mostly guesswork in some
areas.
Thanks,
Eduard
>> paul
--
Ludovic Dubost
Founder and CEO
Blog: http://blog.ludovic.org/
XWiki: http://www.xwiki.com
Skype: ldubost GTalk: ldubost
_______________________________________________
devs mailing list
devs(a)xwiki.org
http://lists.xwiki.org/mailman/listinfo/devs