Re: [xwiki-devs] [GSoC] Solr multilingual support.

6 Jul 2012

 Paul and Guillaume,
 Thanks for pointing it out. The idea of a neutral text field also looks
 good. I'm implementing it.
On Fri, Jul 6, 2012 at 1:45 AM, Guillaume Lerouge <guillaume(a)xwiki.com> wrote:
...
  Hi Paul,
 On Thu, Jul 5, 2012 at 9:21 PM, Paul Libbrecht <paul(a)hoplahup.net> wrote:
  Savitha,
 I may have been evil in suggesting that page with this body:
  This is a test page.
 We'd put some English words.
 Some typos as well: Eglish.
 Monday Tuesday Thursday Monday Monday Monday
 Et un peu de français pour embêter le monde.
 And a little greek: lambda in greek: λαμβδα 
 I think this is a pathological case and we could ignore it.

 I agree that this is not a representative use case. Better would be to
 create the same page in English only, with a couple translations.
  Why are you saying that "in this case I could use the multilingual
 analyzer"?
 I have the impression that the stemmer chain you suggest below is very
 likely to have unexpected issues.
 However a "neutral text field" (I called it a multilingual field) would
 make sense: no analysis beyond token-separation and lowercasing. A dismax
 configuration would prefer a match in the neutral-text-field (thus
 preferring unstemmed matches) to a stemmed match.
 What do others feel?
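 A minimal sketch of such a neutral field type in schema.xml, assuming the
 name text_intl used elsewhere in this thread and assuming
 StandardTokenizerFactory for token separation (WhitespaceTokenizerFactory
 would be a stricter, whitespace-only alternative):

```xml
<!-- Neutral field: tokenization and lowercasing only, no stop words, no stemming.
     The name text_intl and the tokenizer choice are assumptions for illustration. -->
<fieldType name="text_intl" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

 Because no stemmer runs on this field, a query term only matches here when
 the lowercased token is identical to the indexed one.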
 Would it be useful to employ a strategy that would work for many languages
 within the same page, as opposed to a language per translation?

 Given the scope of a GSoC, I'd say no. The 2 use cases I see on projects
 are the following:
    - Wiki in one language -> use the right stemmer (if the wiki is set up in
    French, use the French stemmer by default)
    - Wiki with multilingual activated -> search documents that match the
    context language (with the right stemmer, obviously) and let the user
    expand to other languages if no match is found in the context language
 The several-languages-in-one-page use case has been pretty much nonexistent
 in my experience.
 Guillaume
 thanks in advance

 Paul
 On 5 Jul 2012, at 04:27, savitha sundaramurthy wrote:
  Hello Paul,
            I completely understand your point. But I'm wondering about
 indexing a wiki page which has multiple languages in it.
 For example:

http://ec2-50-19-181-163.compute-1.amazonaws.com:8080/xwiki/bin/view/Search…
 I'm thinking of a way to find the list of languages used in the page and,
 if more than two languages exist, I could use a multilingual field type.
 Sample configuration snippet:
 title_ml, space_ml, fulltext_ml, ml for multilingual.
 <!-- Multilingual -->
 <fieldType name="text_ml" class="solr.TextField" positionIncrementGap="100">
   <analyzer>
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <!-- removes l', etc -->
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true"
             words="lang/stopwords_fr.txt" format="snowball"
             enablePositionIncrements="true"/>
     <filter class="solr.EnglishMinimalStemFilterFactory"/>
     <filter class="solr.FrenchLightStemFilterFactory"/>
     <filter class="solr.SpanishLightStemFilterFactory"/>
   </analyzer>
 </fieldType>
 The list of analysers should match the languages supported by the XWiki
 instance.
 If a language detection tool is ruled out, I'm quite lost on how to find
 out whether an XWiki document has two or more languages in it (not
 referring to translations of the wiki page).
 Thanks a lot,
 Savitha S.
 On Wed, Jul 4, 2012 at 12:21 AM, Paul Libbrecht <paul(a)hoplahup.net> wrote:

> Savitha,
>
> Multilingual pages are expected to be made of document translations: each
> translation of the page content is in one language, which the author
> indicates and your indexer can read. This should be your primary source of
> language detection, and you should not need an automatic language detector,
> which is highly error-prone.
>
> Your analyzers seem to be correct, and I feel it is correct to index
> languages in different fields.
> I would recommend that you also use a default-text field (text_intl) which
> is only mildly tokenized (whitespace, lowercase, ...) and that you add
> search into this field with a much lower boost.
>
> As you say, you need "pre-processing of queries": I call this query
> expansion, but whatever the name, I fully agree this is a necessary step,
> one that is insufficiently documented (on the Solr side), and one that
> should be subclassable by applications.
>
> A part of it which is nicely documented is the Edismax qf parameter. It
> can contain, for example:
>  title_en^3 title_fr^2 title_es^1.8 title_intl^1.7 text_en^1.5
>  text_fr^1.4 text_es^1.3 text_intl^1
> You configure it in solrconfig.xml, which should also be adjustable, I
> think.
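>
> A minimal sketch of how that could look in solrconfig.xml, assuming a
> search handler named /select and assuming that every field listed in qf is
> declared in the schema (both are assumptions for illustration):
>
> ```xml
> <!-- Hypothetical handler: edismax with per-language boosts from the qf above -->
> <requestHandler name="/select" class="solr.SearchHandler">
>   <lst name="defaults">
>     <str name="defType">edismax</str>
>     <str name="qf">
>       title_en^3 title_fr^2 title_es^1.8 title_intl^1.7
>       text_en^1.5 text_fr^1.4 text_es^1.3 text_intl^1
>     </str>
>   </lst>
> </requestHandler>
> ```
>
> Values under "defaults" can still be overridden per request, so an
> application could send its own qf when the context language changes.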
>
> I still fear that faceting by language is going to fail, because you
> need to consider an XWiki page in multiple languages as multiple documents
> in the search results, which the user does not want (and which would break
> the principle of being a translation).
>
> Paul
>
> On 4 Jul 2012, at 07:05, savitha sundaramurthy wrote:
>
>> Hi devs,
>>
>> Here are my thoughts on the configuration for multilingual support.
>>
>> Solr uses different analysers and stemmers to index wiki content. This is
>> configured in an XML file, schema.xml.
>>
>> Wiki content in English is indexed with the text_en field type, whereas
>> French content uses the text_fr field type. The language of the document is
>> fetched and appended to the field name (fieldName + "_" + language:
>> title_en, fulltext_en, space_en).
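>>
>> A minimal sketch of schema.xml declarations that would support this naming
>> scheme, assuming dynamic fields are used to route any name ending in _en or
>> _fr to the matching field type (the indexed and stored settings are
>> assumptions for illustration):
>>
>> ```xml
>> <!-- Hypothetical: title_en, fulltext_en, space_en all resolve to text_en -->
>> <dynamicField name="*_en" type="text_en" indexed="true" stored="true"/>
>> <!-- Hypothetical: title_fr, fulltext_fr, space_fr all resolve to text_fr -->
>> <dynamicField name="*_fr" type="text_fr" indexed="true" stored="true"/>
>> ```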
>>
>> Configurations below:
>>
>> <!-- English -->
>> <fieldType name="text_en" class="solr.TextField"
>>            positionIncrementGap="100">
>>   <analyzer type="index">
>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>>             words="stopwords.txt" enablePositionIncrements="true"/>
>>     <filter class="solr.SynonymFilterFactory"
>>             synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
>>     <filter class="solr.LowerCaseFilterFactory"/>
>>     <filter class="solr.EnglishMinimalStemFilterFactory"/>
>>   </analyzer>
>> </fieldType>
>>
>> <!-- French -->
>> <fieldType name="text_fr" class="solr.TextField"
>>            positionIncrementGap="100">
>>   <analyzer>
>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>     <!-- removes l', etc -->
>>     <filter class="solr.ElisionFilterFactory" ignoreCase="true"
>>             articles="lang/contractions_fr.txt"/>
>>     <filter class="solr.LowerCaseFilterFactory"/>
>>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>>             words="lang/stopwords_fr.txt" format="snowball"
>>             enablePositionIncrements="true"/>
>>     <filter class="solr.FrenchLightStemFilterFactory"/>
>>     <!-- less aggressive: <filter class="solr.FrenchMinimalStemFilterFactory"/> -->
>>     <!-- more aggressive: <filter class="solr.SnowballPorterFilterFactory" language="French"/> -->
>>   </analyzer>
>> </fieldType>
>>
>> In the case of a document having multilingual text, say English and French,
>> there is no way to find the list of languages used in the document.
>> Would it be good to use a language detection tool,
>> http://code.google.com/p/language-detection/, to get the list of languages
>> and, if there are more than two, use a multilingual field type?
>>
>> title_ml, space_ml, fulltext_ml, ml for multilingual.
>>
>> <!-- Multilingual -->
>> <fieldType name="text_ml" class="solr.TextField"
>>            positionIncrementGap="100">
>>   <analyzer>
>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>     <!-- removes l', etc -->
>>     <filter class="solr.LowerCaseFilterFactory"/>
>>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>>             words="lang/stopwords_fr.txt" format="snowball"
>>             enablePositionIncrements="true"/>
>>     <filter class="solr.EnglishMinimalStemFilterFactory"/>
>>     <filter class="solr.FrenchLightStemFilterFactory"/>
>>     <filter class="solr.SpanishLightStemFilterFactory"/>
>>   </analyzer>
>> </fieldType>
>>
>> The list of analysers should match the languages supported by the XWiki
>> instance.
>>
>> I am planning to use language detection only to check whether text from
>> multiple languages exists. I will investigate whether it's possible to
>> configure the analysers on the fly based on the languages returned by the
>> language-detection tool.
>>
>> Please tell me whether this is the right approach.
>>
>> --
>> Thanks,
>> Savitha.s
> _______________________________________________
> devs mailing list
> devs(a)xwiki.org
> http://lists.xwiki.org/mailman/listinfo/devs

--
Thanks,
Savi
