Re: [xwiki-devs] [GSoC] Solr multilingual support.

5 Jul 2012

Hello Paul,
I completely understand your point, but I am wondering how to index a
wiki page that contains multiple languages.
For example:
http://ec2-50-19-181-163.compute-1.amazonaws.com:8080/xwiki/bin/view/Search…
I'm thinking of a way to find the list of languages used in the page; if
more than two languages exist, I could use a multilingual field type.
Sample configuration snippet, with the fields title_ml, space_ml and
fulltext_ml (ml for multilingual):
<!-- Multilingual -->
<fieldType name="text_ml" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="lang/stopwords_fr.txt" format="snowball"
            enablePositionIncrements="true"/>
    <!-- the stemmers below are applied one after another to every token -->
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
    <filter class="solr.FrenchLightStemFilterFactory"/>
    <filter class="solr.SpanishLightStemFilterFactory"/>
  </analyzer>
</fieldType>
The list of analysers should match the languages supported by the XWiki
instance.
If a language detection tool is ruled out, I'm quite lost on how to find
out whether an XWiki document has two or more languages in it (I am not
referring to translations of the wiki page).
Thanks a lot,
Savitha S.
On Wed, Jul 4, 2012 at 12:21 AM, Paul Libbrecht <paul(a)hoplahup.net> wrote:
...
  Savitha,
 Multilingual pages are expected to be made of document translations: each
 translation of the page content is in one language, which the author
 indicates and your indexer can read. This should be your primary source of
 language detection, and you should not need an automatic language detector,
 which is highly error-prone.
 Your analyzers seem to be correct, and I feel it is right to index
 languages in different fields.
 I would recommend that you also use a default text field (text_intl) which
 is only mildly tokenized (whitespace, lowercase, ...) and that you also
 search in this field with a much lower boost.
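 A mildly tokenized field type of this kind might look roughly like the
 sketch below. It assumes the name text_intl from the suggestion above and
 uses only whitespace tokenization and lowercasing; no stop words or
 stemming are applied, so it is a fallback rather than a replacement for
 the per-language fields.

```xml
<!-- Sketch only: language-neutral fallback field type (name is an assumption) -->
<fieldType name="text_intl" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- split on whitespace only, no language-specific rules -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- make matching case-insensitive -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```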
 As you say, you need "pre-processing of queries": I call this query
 expansion, but whatever the name, I fully agree this is a necessary step,
 one that is insufficiently documented (on the Solr side) and one that
 should be subclassable by applications.
 A part of it which is nicely documented is the Edismax qf parameter. It
 can contain, for example:
   title_en^3 title_fr^2 title_es^1.8 title_intl^1.7 text_en^1.5
   text_fr^1.4 text_es^1.3 text_intl^1
 You configure it in solrconfig.xml, which I think should also be
 adjustable.
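 As an illustration, the qf value above could be set as a default on a
 search handler in solrconfig.xml along the following lines. The handler
 name "/select" is an assumption made for this sketch; the boosts are
 copied from the example. Values under "defaults" can be overridden by
 request parameters, which is one way to keep them adjustable.

```xml
<!-- Sketch only: the handler name "/select" is an assumption -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- use the extended DisMax query parser -->
    <str name="defType">edismax</str>
    <!-- per-field boosts: language-specific fields first, text_intl last -->
    <str name="qf">title_en^3 title_fr^2 title_es^1.8 title_intl^1.7 text_en^1.5 text_fr^1.4 text_es^1.3 text_intl^1</str>
  </lst>
</requestHandler>
```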
 I still fear that faceting by language is going to fail, because you
 would need to treat an XWiki page in multiple languages as multiple documents
 in the search results, which the user does not want (and which would break
 the principle of being a translation).
 Paul
 On 4 Jul 2012 at 07:05, savitha sundaramurthy wrote:
  Hi devs,
 Here are my thoughts on the configuration for multilingual support.
 Solr uses different analysers and stemmers to index wiki content. These are
 configured in an XML file, schema.xml.
 Wiki content in English is indexed with the text_en field type,
 whereas French content is indexed with the text_fr field type. The language
 of the document is fetched and appended to the field name
 (fieldName + "_" + language: title_en, fulltext_en, space_en).
 Configurations below:
 <!-- English -->
 <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
   <analyzer type="index">
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true"
             words="stopwords.txt" enablePositionIncrements="true"/>
     <filter class="solr.SynonymFilterFactory"
             synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.EnglishMinimalStemFilterFactory"/>
   </analyzer>
 </fieldType>
 <!-- French -->
 <fieldType name="text_fr" class="solr.TextField" positionIncrementGap="100">
   <analyzer>
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <!-- removes l', etc -->
     <filter class="solr.ElisionFilterFactory" ignoreCase="true"
             articles="lang/contractions_fr.txt"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true"
             words="lang/stopwords_fr.txt" format="snowball"
             enablePositionIncrements="true"/>
     <filter class="solr.FrenchLightStemFilterFactory"/>
     <!-- less aggressive: <filter class="solr.FrenchMinimalStemFilterFactory"/> -->
     <!-- more aggressive: <filter class="solr.SnowballPorterFilterFactory" language="French"/> -->
   </analyzer>
 </fieldType>
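 One possible way to wire the fieldName_language naming to the text_en and
 text_fr field types above is with dynamic fields in schema.xml. This is
 only a sketch: the indexed and stored settings are assumptions, and the
 actual field definitions may differ.

```xml
<!-- Sketch only: any field ending in _en or _fr picks up the matching type -->
<!-- e.g. title_en, fulltext_en, space_en -->
<dynamicField name="*_en" type="text_en" indexed="true" stored="true"/>
<!-- e.g. title_fr, fulltext_fr, space_fr -->
<dynamicField name="*_fr" type="text_fr" indexed="true" stored="true"/>
```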
 In the case of a document containing multilingual text, say English and
 French, there is no way to find the list of languages used in the document.
 Would it be good to use a language detection tool,
 http://code.google.com/p/language-detection/, to get the list of languages
 and, if there are more than two, use a multilingual field type?
 The fields would be title_ml, space_ml and fulltext_ml (ml for multilingual).
 <!-- Multilingual -->
 <fieldType name="text_ml" class="solr.TextField" positionIncrementGap="100">
   <analyzer>
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true"
             words="lang/stopwords_fr.txt" format="snowball"
             enablePositionIncrements="true"/>
     <!-- the stemmers below are applied one after another to every token -->
     <filter class="solr.EnglishMinimalStemFilterFactory"/>
     <filter class="solr.FrenchLightStemFilterFactory"/>
     <filter class="solr.SpanishLightStemFilterFactory"/>
   </analyzer>
 </fieldType>
 The list of analysers should match the languages supported by the XWiki
 instance.

 I am planning to use language detection only to check whether text from
 multiple languages exists. I will investigate whether it is possible to
 configure the analysers on the fly based on the languages returned by the
 language-detection tool.
 Please tell me whether this is the right approach.
 --
 Thanks,
 Savitha.s
 _______________________________________________
 devs mailing list
 devs(a)xwiki.org
 http://lists.xwiki.org/mailman/listinfo/devs 
--
Thanks,
Savi
