Hello Paul,
I completely understand your point, but I'm wondering about indexing a
single wiki page that contains several languages.
For example:
I'm thinking of finding the list of languages used in the page and, if
more than two languages exist, using a multilingual field type.
Sample configuration snippet, for the fields title_ml, space_ml and
fulltext_ml (ml stands for multilingual):
<!-- Multilingual -->
<fieldType name="text_ml" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- stop words: only the French list is configured here -->
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="lang/stopwords_fr.txt" format="snowball"
            enablePositionIncrements="true"/>
    <!-- one stemmer per supported language, applied in sequence -->
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
    <filter class="solr.FrenchLightStemFilterFactory"/>
    <filter class="solr.SpanishLightStemFilterFactory"/>
  </analyzer>
</fieldType>
The list of analysers should match the languages supported by the XWiki
instance.
If a language detection tool is ruled out, I'm quite lost on how to find
out whether an XWiki document contains two or more languages (I am not
referring to translations of the wiki page).
Thanks a lot,
Savitha S.
On Wed, Jul 4, 2012 at 12:21 AM, Paul Libbrecht <paul(a)hoplahup.net> wrote:
Savitha,
Multilingual pages are expected to be made of document translations: each
translation of the page content is in one language, which the author
indicates and your indexer can read. This should be your primary source of
language information, and you should not need an automatic language
detector, which is highly error-prone.
Your analyzers seem to be correct and I feel it is correct to index
languages in different fields.
I would recommend that you also use a default text field (text_intl) which
is only mildly tokenized (whitespace, lowercase, ...) and that you also
search this field, with a much lower boost.
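A minimal sketch of such a lightly analyzed field type, for illustration
only. The type name text_intl is borrowed from the field name mentioned
above, and the choice of exactly these two analysis steps is an assumption
based on the "whitespace, lowercase" description:

```xml
<!-- Sketch only: language-neutral field type with light analysis.
     The type name "text_intl" is an assumption for illustration. -->
<fieldType name="text_intl" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- split on whitespace only, no language-specific rules -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- lowercase tokens so matching is case-insensitive -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Because no stemming or stop word removal is applied, this field only
matches the exact (lowercased) word forms, which is why a lower boost than
the language-specific fields makes sense.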
As you say, you need "pre-processing of queries". I call this query
expansion, but whatever the name, I fully agree this is a necessary step,
one that is insufficiently documented (on the Solr side) and one that
should be subclassable by applications.
A part of it that is nicely documented is the edismax qf parameter. It
can contain, for example:
title_en^3 title_fr^2 title_es^1.8 title_intl^1.7 text_en^1.5
text_fr^1.4 text_es^1.3 text_intl^1
You configure it in solrconfig.xml, which I think should also be
adjustable.
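As a rough sketch of how this could look in solrconfig.xml: the handler
name /search is hypothetical, and the boosts are simply the example values
from above. Values under "defaults" can still be overridden by request
parameters, which is one way to keep them adjustable:

```xml
<!-- Sketch only: the handler name "/search" is an assumption. -->
<requestHandler name="/search" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- use the extended dismax query parser -->
    <str name="defType">edismax</str>
    <!-- fields to search, each with its boost -->
    <str name="qf">
      title_en^3 title_fr^2 title_es^1.8 title_intl^1.7
      text_en^1.5 text_fr^1.4 text_es^1.3 text_intl^1
    </str>
  </lst>
</requestHandler>
```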
I still fear that faceting by language is going to fail, because you
would need to treat an XWiki page in multiple languages as multiple
documents in the search results, which the user does not want (and which
would break the principle of being a translation).
Paul
On 4 Jul 2012, at 07:05, savitha sundaramurthy wrote:
Hi devs,
Here are my thoughts on the configuration for multilingual support.
Solr uses different analysers and stemmers to index wiki content. These
are configured in an XML file, schema.xml.
Wiki content in English is indexed with the text_en field type, whereas
French content is indexed with the text_fr field type. The language of the
document is fetched and appended to the field name
(fieldName + "_" + language: title_en, fulltext_en, space_en).
Configurations below:
<!-- English -->
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.SynonymFilterFactory"
            synonyms="index_synonyms.txt" ignoreCase="true"
            expand="false"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
  </analyzer>
</fieldType>
<!-- French -->
<fieldType name="text_fr" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- removes l', etc -->
    <filter class="solr.ElisionFilterFactory" ignoreCase="true"
            articles="lang/contractions_fr.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="lang/stopwords_fr.txt" format="snowball"
            enablePositionIncrements="true"/>
    <filter class="solr.FrenchLightStemFilterFactory"/>
    <!-- less aggressive: <filter class="solr.FrenchMinimalStemFilterFactory"/> -->
    <!-- more aggressive: <filter class="solr.SnowballPorterFilterFactory"
         language="French"/> -->
  </analyzer>
</fieldType>
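The fieldName + "_" + language naming described above could be wired to
these field types with dynamic fields in schema.xml. This is only a sketch:
the indexed and stored settings are assumptions, and the text_en and
text_fr types are the ones defined above:

```xml
<!-- Sketch only: any field ending in _en (title_en, fulltext_en, space_en)
     is analyzed with text_en, and likewise _fr with text_fr.
     indexed/stored values are assumptions for illustration. -->
<dynamicField name="*_en" type="text_en" indexed="true" stored="true"/>
<dynamicField name="*_fr" type="text_fr" indexed="true" stored="true"/>
```

With this in place the indexer only has to pick the field name from the
document language; no per-field declaration is needed for each new field.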
In the case of a document containing multilingual text, say English and
French, there is no way to find the list of languages used in the
document.
Would it be good to use a language detection tool,
http://code.google.com/p/language-detection/, to get the list of
languages and, if there are more than two, use a multilingual field type?
The fields would be title_ml, space_ml and fulltext_ml (ml stands for
multilingual).
<!-- Multilingual -->
<fieldType name="text_ml" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- stop words: only the French list is configured here -->
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="lang/stopwords_fr.txt" format="snowball"
            enablePositionIncrements="true"/>
    <!-- one stemmer per supported language, applied in sequence -->
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
    <filter class="solr.FrenchLightStemFilterFactory"/>
    <filter class="solr.SpanishLightStemFilterFactory"/>
  </analyzer>
</fieldType>
The list of analysers should match the languages supported by the XWiki
instance.
I am planning to use language detection only to check whether text from
multiple languages exists. I will investigate whether it is possible to
configure the analysers on the fly based on the languages returned by the
language-detection tool.
Please let me know whether this is the right approach.
--
Thanks,
Savitha.s
_______________________________________________
devs mailing list
devs(a)xwiki.org
http://lists.xwiki.org/mailman/listinfo/devs