[xwiki-devs] [GSoC] Solr multilingual support.


savitha sundaramurthy

4 Jul 2012

5:05 a.m.

Hi devs,

Here are my thoughts on the configuration for multilingual support.

Solr uses different analysers and stemmers to index wiki content. This is configured in an XML file, schema.xml.

Wiki content in English is indexed with the text_en field type, whereas French content is indexed with the text_fr field type. The language of the document is fetched and appended to the field name (fieldName + "_" + language: title_en, fulltext_en, space_en).

Configurations below:

    <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.SynonymFilterFactory"
                synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishMinimalStemFilterFactory"/>
      </analyzer>
    </fieldType>

    <fieldType name="text_fr" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.ElisionFilterFactory" ignoreCase="true"
                articles="lang/contractions_fr.txt"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="lang/stopwords_fr.txt" format="snowball"
                enablePositionIncrements="true"/>
        <filter class="solr.FrenchLightStemFilterFactory"/>
        <!-- more aggressive: <filter class="solr.SnowballPorterFilterFactory" language="French"/> -->
      </analyzer>
    </fieldType>

In the case of a document having multilingual text, say English and French, there is no way to find the list of languages used in the document. Is it good to use a language detection tool, http://code.google.com/p/language-detection/, to get the list of languages and, if there are more than two, use a multilingual field type?

title_ml, space_ml, fulltext_ml (ml for multilingual).

    <fieldType name="text_ml" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="lang/stopwords_fr.txt" format="snowball"
                enablePositionIncrements="true"/>
        <filter class="solr.EnglishMinimalStemFilterFactory"/>
        <filter class="solr.FrenchLightStemFilterFactory"/>
        <filter class="solr.SpanishLightStemFilterFactory"/>
      </analyzer>
    </fieldType>

The list of analysers should match the languages supported by the XWiki instance.

I am planning to use language detection only to check whether text from multiple languages exists. I will investigate whether it's possible to configure the analysers on the fly based on the languages returned by the language-detection tool.

Please suggest whether this is the right approach.

--
Thanks,
Savitha.s
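The suffix convention described above (fieldName + "_" + language) could be wired up with Solr dynamic fields. This is only a minimal sketch under assumptions: the `*_en` and `*_fr` patterns and the indexed/stored settings are hypothetical and not part of the proposal, and the fragment presumes the text_en and text_fr field types are defined in the same schema.xml.

```xml
<!-- Hypothetical schema.xml fragment (assumption, not from the thread):
     any field whose name ends in _en or _fr picks up the matching
     language-specific field type, e.g. title_en, fulltext_fr. -->
<dynamicField name="*_en" type="text_en" indexed="true" stored="true"/>
<dynamicField name="*_fr" type="text_fr" indexed="true" stored="true"/>
```

With patterns like these the indexer only has to build the field name from the document language; no per-field declaration is needed for each new language-specific field.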


Paul Libbrecht

4 Jul 2012

7:21 a.m.

Savitha,

Multilingual pages are expected to be made of document translations: each page's content is in one language, which the author indicates and your indexer can read. This should be your primary source of language detection, and you should not need an automatic language detector, which is highly error-prone.

Your analyzers seem to be correct, and I feel it is correct to index languages in different fields. I would recommend that you also use a default text field (text_intl) which is only mildly tokenized (whitespace, lowercase, ...) and that you add search into this field with a much lower boost.

As you say, you need "pre-processing of queries": I call this query expansion, but whatever the name, I fully agree this is a necessary step, one that is insufficiently documented (on the Solr side), and one that should be subclassable by applications.

A part of it which is nicely documented is the Edismax qf parameter. It can contain, for example:

    title_en^3 title_fr^2 title_es^1.8 title_intl^1.7 text_en^1.5 text_fr^1.4 text_es^1.3 text_intl^1

You configure it in solrconfig.xml, which should also be adjustable, I think.

I still fear that faceting by language is going to fail, because you need to consider an XWiki page in multiple languages as multiple documents in the search results, which the user does not want (and which would break the principle of being a translation).

Paul
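For illustration, the qf boosts listed above might be set as request-handler defaults in solrconfig.xml. This is a sketch under assumptions: the handler name `/select` and the choice to put the values under `defaults` are hypothetical, while the field names and boosts are the ones from the message.

```xml
<!-- Hypothetical solrconfig.xml fragment (assumption, not from the thread):
     the edismax parser searches every listed field and weights matches
     by the boost after the caret. -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">
      title_en^3 title_fr^2 title_es^1.8 title_intl^1.7
      text_en^1.5 text_fr^1.4 text_es^1.3 text_intl^1
    </str>
  </lst>
</requestHandler>
```

Because these are defaults rather than invariants, an application could still override qf per request, for example to raise the boost of the fields that match the user's context language.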


savitha sundaramurthy

5 Jul 2012

2:27 a.m.

Hello Paul,

I completely understand your point. But I'm wondering about indexing a wiki page which has multiple languages in it. For example:

http://ec2-50-19-181-163.compute-1.amazonaws.com:8080/xwiki/bin/view/Search…

I'm thinking of a way to find the list of languages used in the page, and if more than two languages exist, I could use a multilingual field type. Sample configuration snippet:

title_ml, space_ml, fulltext_ml (ml for multilingual).

    <fieldType name="text_ml" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="lang/stopwords_fr.txt" format="snowball"
                enablePositionIncrements="true"/>
        <filter class="solr.EnglishMinimalStemFilterFactory"/>
        <filter class="solr.FrenchLightStemFilterFactory"/>
        <filter class="solr.SpanishLightStemFilterFactory"/>
      </analyzer>
    </fieldType>

The list of analysers should match the languages supported by the XWiki instance.

If the language detection tool is ruled out, I'm quite lost on how to find out whether an XWiki document has two or more languages in it (not referring to translations of the wiki page).

Thanks a lot,
Savitha S.


_______________________________________________ devs mailing list devs(a)xwiki.org http://lists.xwiki.org/mailman/listinfo/devs


Paul Libbrecht

7:21 p.m.

Savitha,

I may have been evil in suggesting that page with the body:

This is a test page. We'd put some English words. Some typos as well: Eglish. Monday Tuesday Thursday Monday Monday Monday Et un peu de français pour embêter le monde. And a little greek: lambda in greek: λαμβδα

I think this is a pathological case and we could ignore it.

Why are you saying that "in this case I could use the multilingual analyzer"? I have the impression that the stemmer chain you suggest is very likely to have unexpected issues. However, a "neutral text field" (I called it a multilingual field) would make sense: no analysis beyond token separation and lowercasing. A dismax configuration would prefer a match in the neutral text field (thus preferring unstemmed matches) to a stemmed match.

What do others feel? Would it be useful to employ a strategy that would work for many languages within the same page, as opposed to a language per translation?

Thanks in advance,
Paul
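A neutral text field of the kind described above could look roughly like the following. This is a minimal sketch, assuming the name text_intl from the earlier message and assuming StandardTokenizerFactory for the token separation; neither choice is confirmed in the thread.

```xml
<!-- Hypothetical schema.xml fragment (assumption, not from the thread):
     language-neutral analysis, i.e. tokenization and lowercasing only,
     with no stop words and no stemming. -->
<fieldType name="text_intl" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Since no stemmer runs, a term matches in this field only in its exact (lowercased) form, which is what lets a dismax boost tell unstemmed matches apart from stemmed ones.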


Guillaume Lerouge

6 Jul 2012

8:45 a.m.

Hi Paul,

On Thu, Jul 5, 2012 at 9:21 PM, Paul Libbrecht <paul(a)hoplahup.net> wrote:

> Savitha, I may have been evil in suggesting that page with the body:
>
> I think this is a pathological case and we could ignore it.

I agree that this is not a representative use case. Better would be to create the same page in English only, with a couple of translations.

> Why are you saying that "in this case I could use the multilingual analyzer"?
> The stemmer you suggest below is very likely to have unexpected issues I have the impression.
> However a "neutral text field" (I called it a multilingual field) would make sense: no analysis beyond token-separation and lowercasing.
> A dismax configuration would prefer a match in the neutral-text-field (thus preferring unstemmed matches) to a stemmed match.
>
> What do others feel? Would it be useful to employ a strategy that would work for many languages within the same page as opposed to a language per translation?

Given the scope of a GSoC, I'd say no. The 2 use cases I see on projects are the following:

- Wiki in one language -> use the right stemmer (if the wiki is set up in French, use the French stemmer by default)
- Wiki with multilingual activated -> search documents that match the context language (with the right stemmer, obviously) and let the user expand to other languages if no match is found in the context language

The several-languages-in-one-page use case has been pretty much nonexistent in my experience.

Guillaume


savitha sundaramurthy

7:34 p.m.

Paul and Guillaume,

Thanks for pointing it out. The idea of a neutral text field also looks good. I'm implementing it.

