Hi devs,
Here are my thoughts on the configuration for multi lingual support.
Solr uses different analysers and stemmers to index wiki content. This is
configured in a XML file, schema.xml.
The wiki content with english language is indexed with text_en field type
whereas french with text_fr field type. The language of the document is
fetched and appended to the field. ( fieldName +"_"+ language : title_en,
fulltext_en, space_en ).
Configurations below:
<!-- English -->
<fieldType name="text_en" class="solr.TextField"
positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt" enablePositionIncrements="true" />
<filter class="solr.SynonymFilterFactory"
synonyms="index_synonyms.txt" ignoreCase="true"
expand="false"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishMinimalStemFilterFactory"/>
</analyzer>
</fieldType>
<!-- French -->
<fieldType name="text_fr" class="solr.TextField"
positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<!-- removes l', etc -->
<filter class="solr.ElisionFilterFactory" ignoreCase="true"
articles="lang/contractions_fr.txt"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true"
words="lang/stopwords_fr.txt" format="snowball"
enablePositionIncrements="true"/>
<filter class="solr.FrenchLightStemFilterFactory"/>
<!-- less aggressive: <filter
class="solr.FrenchMinimalStemFilterFactory"/> -->
<!-- more aggressive: <filter
class="solr.SnowballPorterFilterFactory"
language="French"/> -->
</analyzer>
</fieldType>
In the case of a document having multilingual text, say english and french.
There is no way to find the list of languages used in the document.
Is it good to use a language detection tool,
http://code.google.com/p/language-detection/ to get the list of languages,
if they are more than two use a multilingual field type ?
title_ml, space_ml, fulltext_ml, ml for multilingual.
<!-- Multilingual -->
<fieldType name="text_ml" class="solr.TextField"
positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<!-- removes l', etc -->
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true"
words="lang/stopwords_fr.txt" format="snowball"
enablePositionIncrements="true"/>
<filter class="solr.EnglishMinimalStemFilterFactory"/>
<filter class="solr.FrenchLightStemFilterFactory"/>
<filter class="solr.SpanishLightStemFilterFactory"/>
</analyzer>
</fieldType>
The list of analysers should match the languages supported by XWik instance.
Am planning to use language detection only to check whether text from
multiple languages exist. Will investigate if its possible to configure the
analysers on the fly based on the languages returned by the
language-detection tool.
Please suggest,if this is a right approach ?
--
Thanks,
Savitha.s