Paul and Guillaume,
Thanks for pointing it out. The idea of a neutral text field also looks good; I'm implementing it.
On Fri, Jul 6, 2012 at 1:45 AM, Guillaume Lerouge <guillaume(a)xwiki.com> wrote:
Hi Paul,
On Thu, Jul 5, 2012 at 9:21 PM, Paul Libbrecht <paul(a)hoplahup.net> wrote:
Savitha,
I may have been evil in suggesting that page with this body:
This is a test page.
We'd put some English words.
Some typos as well: Eglish.
Monday Tuesday Thursday Monday Monday Monday
Et un peu de français pour embêter le monde.
And a little greek: lambda in greek: λαμβδα
I think this is a pathological case and we could ignore it.
I agree that this is not a representative use case. It would be better to
create the same page in English only, with a couple of translations.
Why are you saying that "in this case I could use the multilingual analyzer"?
I have the impression that the stemmer you suggest below is very likely to
have unexpected issues.
However a "neutral text field" (I called it a multilingual field) would
make sense: no analysis beyond token-separation and lowercasing. A dismax
configuration would prefer a match in the neutral-text-field (thus
preferring unstemmed matches) to a stemmed match.
What do others feel?
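For illustration, the neutral field described above could be declared roughly as follows. This is only a sketch: the type name text_intl is borrowed from the earlier message quoted further down, while the field name fulltext_intl and the indexed/stored settings are assumptions. The analyzer chain is just a tokenizer plus lowercasing, with no stop words and no stemming.

```xml
<!-- Neutral text: token separation and lowercasing only (sketch, names assumed) -->
<fieldType name="text_intl" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical field using the neutral type -->
<field name="fulltext_intl" type="text_intl" indexed="true" stored="false"/>
```

Because nothing is stemmed, a query term matches this field only when the lowercased tokens are identical, which is what would let a dismax boost on this field favor unstemmed matches.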
Would it be useful to employ a strategy that would work for many languages
within the same page as opposed to a language per translation?
Given the scope of a GSoC, I'd say no. The 2 use cases I see on projects
are the following:
- Wiki in one language -> use the right stemmer (if the wiki is set up in
French, use the French stemmer by default)
- Wiki with multilingual activated -> search documents that match the
context language (with the right stemmer, obviously) and let the user
expand to other languages if no match is found in the context language
The several-languages-in-one-page use case has been pretty much nonexistent
in my experience.
Guillaume
thanks in advance
Paul
On 5 Jul 2012, at 04:27, savitha sundaramurthy wrote:
Hello Paul,
I completely understand your point. But I'm wondering about indexing a wiki
page which has multiple languages in it.
For example:
http://ec2-50-19-181-163.compute-1.amazonaws.com:8080/xwiki/bin/view/Search…
I'm thinking of a way to find the list of languages used in the page, and
if more than two languages exist, I could use a multilingual field type.
Sample configuration snippet:
title_ml, space_ml, fulltext_ml, ml for multilingual.
<!-- Multilingual -->
<fieldType name="text_ml" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- removes l', etc -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="lang/stopwords_fr.txt" format="snowball"
            enablePositionIncrements="true"/>
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
    <filter class="solr.FrenchLightStemFilterFactory"/>
    <filter class="solr.SpanishLightStemFilterFactory"/>
  </analyzer>
</fieldType>
The list of analysers should match the languages supported by the XWiki
instance.
If the possibility of a language detection tool is ruled out, I'm quite lost
on how to find out whether an XWiki document has two or more languages in it
(not referring to translations of the wiki page).
Thanks a lot,
Savitha S.
On Wed, Jul 4, 2012 at 12:21 AM, Paul Libbrecht <paul(a)hoplahup.net> wrote:
Savitha,

Multilingual pages are expected to be made of document translations: each
translation of the page content is in one language, which the author
indicates and your indexer can read. This should be your primary source of
language detection, and you should not need an automatic language detector,
which is highly error-prone.

Your analyzers seem to be correct and I feel it is correct to index
languages in different fields.
I would recommend that you also use a default-text field (text_intl) which
is only mildly tokenized (whitespace, lowercase, ...) and that you add
search into this field with a much lower boost.

As you say, you need "pre-processing of queries": I call this query
expansion, but whatever the name, I fully agree this is a necessary step,
one that is insufficiently documented (on the Solr side), and one that
should be subclassable by applications.

A part of it which is nicely documented is the Edismax qf parameter. It can
contain, for example:
title_en^3 title_fr^2 title_es^1.8 title_intl^1.7 text_en^1.5
text_fr^1.4 text_es^1.3 text_intl^1
You configure it in solrconfig.xml, which should also be adjustable, I
think.
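
As a sketch of where that qf value could live, the fragment below sets it as a default on a search handler in solrconfig.xml. The handler name /select is an assumption; the field names and boosts are copied from the example above.

```xml
<!-- solrconfig.xml fragment (sketch): handler name is assumed -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- use the extended dismax query parser -->
    <str name="defType">edismax</str>
    <!-- fields to search, each with its boost -->
    <str name="qf">
      title_en^3 title_fr^2 title_es^1.8 title_intl^1.7
      text_en^1.5 text_fr^1.4 text_es^1.3 text_intl^1
    </str>
  </lst>
</requestHandler>
```

Values under defaults apply only when the request does not supply its own, so an application can still pass a different qf as a request parameter.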

I still fear that faceting by language is going to fail, because you need
to consider an XWiki page in multiple languages as multiple documents in
the search results, which the user does not want (and which would break
the principle of being a translation).

Paul

On 4 Jul 2012, at 07:05, savitha sundaramurthy wrote:

Hi devs,

Here are my thoughts on the configuration for multilingual support.

Solr uses different analysers and stemmers to index wiki content. This is
configured in an XML file, schema.xml.

Wiki content in English is indexed with the text_en field type, whereas
French content is indexed with the text_fr field type. The language of the
document is fetched and appended to the field name (fieldName + "_" +
language: title_en, fulltext_en, space_en).

Configurations below:

<!-- English -->
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.SynonymFilterFactory"
            synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
  </analyzer>
</fieldType>

<!-- French -->
<fieldType name="text_fr" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- removes l', etc -->
    <filter class="solr.ElisionFilterFactory" ignoreCase="true"
            articles="lang/contractions_fr.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="lang/stopwords_fr.txt" format="snowball"
            enablePositionIncrements="true"/>
    <filter class="solr.FrenchLightStemFilterFactory"/>
    <!-- less aggressive: <filter class="solr.FrenchMinimalStemFilterFactory"/> -->
    <!-- more aggressive: <filter class="solr.SnowballPorterFilterFactory" language="French"/> -->
  </analyzer>
</fieldType>
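
One way the fieldName + "_" + language naming could be wired to the field types above is with dynamic fields, so that title_en, fulltext_en and space_en pick up text_en without each being declared individually. The patterns and the indexed/stored settings below are assumptions for illustration, not the actual schema.

```xml
<!-- schema.xml fragment (sketch): one dynamic field pattern per language suffix -->
<!-- matches title_en, fulltext_en, space_en, ... -->
<dynamicField name="*_en" type="text_en" indexed="true" stored="true"/>
<!-- matches title_fr, fulltext_fr, space_fr, ... -->
<dynamicField name="*_fr" type="text_fr" indexed="true" stored="true"/>
```

Adding a language would then mean adding one field type and one pattern, with the indexer choosing the suffix from the document's language.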

In the case of a document having multilingual text, say English and French,
there is no way to find the list of languages used in the document.
Is it good to use a language detection tool,
http://code.google.com/p/language-detection/, to get the list of languages
and, if there are more than two, use a multilingual field type?

title_ml, space_ml, fulltext_ml (ml for multilingual).

<!-- Multilingual -->
<fieldType name="text_ml" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- removes l', etc -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="lang/stopwords_fr.txt" format="snowball"
            enablePositionIncrements="true"/>
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
    <filter class="solr.FrenchLightStemFilterFactory"/>
    <filter class="solr.SpanishLightStemFilterFactory"/>
  </analyzer>
</fieldType>

The list of analysers should match the languages supported by the XWiki
instance.

I am planning to use language detection only to check whether text from
multiple languages exists. I will investigate whether it's possible to
configure the analysers on the fly based on the languages returned by the
language-detection tool.

Please tell me whether this is the right approach.

--
Thanks,
Savitha.s
_______________________________________________
devs mailing list
devs(a)xwiki.org
http://lists.xwiki.org/mailman/listinfo/devs
--
Thanks,
Savi