Hi Paul,
On Thu, Jul 5, 2012 at 9:21 PM, Paul Libbrecht <paul(a)hoplahup.net> wrote:
Savitha,
I may have been evil in suggesting that page, with this body:
This is a test page.
We'd put some English words.
Some typos as well: Eglish.
Monday Tuesday Thursday Monday Monday Monday
Et un peu de français pour embêter le monde.
And a little greek: lambda in greek: λαμβδα
I think this is a pathological case and we could ignore it.
I agree that this is not a representative use case. It would be better to
create the same page in English only, with a couple of translations.
Why are you saying that "in this case I could use the multilingual
analyzer"?
I have the impression that the stemmer you suggest below is very likely to
have unexpected issues.
However a "neutral text field" (I called it a multilingual field) would
make sense: no analysis beyond token-separation and lowercasing. A dismax
configuration would prefer a match in the neutral-text-field (thus
preferring unstemmed matches) to a stemmed match.
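
A minimal sketch of such a neutral field type, assuming the name text_intl that appears elsewhere in this thread and the stock Solr tokenizer and lowercase factories. The title_intl field declaration is only illustrative:

```xml
<!-- Neutral ("multilingual") type: token separation and lowercasing only, no stemming -->
<fieldType name="text_intl" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Illustrative field that uses the neutral type -->
<field name="title_intl" type="text_intl" indexed="true" stored="true"/>
```

Since nothing is stemmed, a query term only matches this field when the page contains the same lowercased form, so the weight given to it in the dismax configuration decides how strongly such matches are preferred over stemmed ones.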
What do others think?
Would it be useful to employ a strategy that would work for many languages
within the same page as opposed to a language per translation?
Given the scope of a GSoC, I'd say no. The 2 use cases I see on projects
are the following:
- Wiki in one language -> use the right stemmer (if the wiki is set up in
French, use the French stemmer by default)
- Wiki with multilingual activated -> search documents that match the
context language (with the right stemmer obviously) and let the user expand
to other languages if no match is found in context language
The several-languages-in-one-page use case has been pretty much nonexistent
in my experience.
Guillaume
thanks in advance
Paul
On 5 Jul 2012, at 04:27, savitha sundaramurthy wrote:
Hello Paul,
I completely understand your point, but I'm wondering about indexing a wiki
page that contains multiple languages.
For example:
http://ec2-50-19-181-163.compute-1.amazonaws.com:8080/xwiki/bin/view/Search…
I'm thinking of a way to find the list of languages used in the page; if
two or more languages exist, I could use a multilingual field type.
Sample configuration snippet for title_ml, space_ml and fulltext_ml ("ml"
stands for multilingual):
<!-- Multilingual -->
<fieldType name="text_ml" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- only the French stop word list is applied here -->
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="lang/stopwords_fr.txt" format="snowball"
            enablePositionIncrements="true"/>
    <!-- every token passes through all three stemmers in sequence -->
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
    <filter class="solr.FrenchLightStemFilterFactory"/>
    <filter class="solr.SpanishLightStemFilterFactory"/>
  </analyzer>
</fieldType>
The list of analysers should match the languages supported by the XWiki
instance.
If a language detection tool is ruled out, I'm quite lost on how to find
out whether an XWiki document contains two or more languages (not
referring to translations of the wiki page).
Thanks a lot,
Savitha S.
On Wed, Jul 4, 2012 at 12:21 AM, Paul Libbrecht <paul(a)hoplahup.net> wrote:
> Savitha,
>
> Multilingual pages are expected to be made of document translations: each
> translation of the page is in one language, which the author indicates and
> your indexer can read. This should be your primary source of language
> detection, and you should not need an automatic language detector, which
> is highly error-prone.
>
> Your analyzers seem to be correct and I feel it is correct to index
> languages in different fields.
> I would recommend that you also use a default-text field (text_intl) which
> is only mildly tokenized (whitespace, lowercase, ...) and that you also
> search in this field, with a much lower boost.
>
> As you say, you need "pre-processing of queries": I call this query
> expansion, but whatever the name, I fully agree this is a necessary step,
> one that is insufficiently documented (on the Solr side) and one that
> should be subclassable by applications.
>
> One part of it that is nicely documented is the edismax qf parameter. It
> can contain, for example:
> title_en^3 title_fr^2 title_es^1.8 title_intl^1.7 text_en^1.5
> text_fr^1.4 text_es^1.3 text_intl^1
> You configure it in solrconfig.xml, which I think should also be
> adjustable.
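
As a sketch of where that qf value could live, the following solrconfig.xml fragment sets it as a handler default. The handler name /search is a hypothetical choice; the field names and boosts are the ones quoted above:

```xml
<!-- solrconfig.xml: "/search" is a hypothetical handler name -->
<requestHandler name="/search" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- use the extended dismax query parser -->
    <str name="defType">edismax</str>
    <!-- fields to search, each with its boost -->
    <str name="qf">
      title_en^3 title_fr^2 title_es^1.8 title_intl^1.7
      text_en^1.5 text_fr^1.4 text_es^1.3 text_intl^1
    </str>
  </lst>
</requestHandler>
```

Values placed under defaults can be overridden by passing a qf parameter on an individual request, which is one way to keep the boosts adjustable.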
>
> I still fear that faceting by language is going to fail, because you
> would need to treat an XWiki page in multiple languages as multiple
> documents in the search results, which the user does not want (and which
> would break the principle of being a translation).
>
> Paul
>
> On 4 Jul 2012, at 07:05, savitha sundaramurthy wrote:
>
>> Hi devs,
>>
>> Here are my thoughts on the configuration for multilingual support.
>>
>> Solr uses different analysers and stemmers to index wiki content. This is
>> configured in an XML file, schema.xml.
>>
>> Wiki content in English is indexed with the text_en field type, whereas
>> French content is indexed with the text_fr field type. The language of
>> the document is fetched and appended to the field name
>> (fieldName + "_" + language: title_en, fulltext_en, space_en).
Configurations below:
<!-- English -->
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.SynonymFilterFactory"
            synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
  </analyzer>
</fieldType>
<!-- French -->
<fieldType name="text_fr" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- removes l', etc -->
    <filter class="solr.ElisionFilterFactory" ignoreCase="true"
            articles="lang/contractions_fr.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="lang/stopwords_fr.txt" format="snowball"
            enablePositionIncrements="true"/>
    <filter class="solr.FrenchLightStemFilterFactory"/>
    <!-- less aggressive: <filter class="solr.FrenchMinimalStemFilterFactory"/> -->
    <!-- more aggressive: <filter class="solr.SnowballPorterFilterFactory" language="French"/> -->
  </analyzer>
</fieldType>
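
One way to bind the suffixed field names described above (title_en, fulltext_en, space_en) to the text_en and text_fr types is a pair of dynamicField declarations in schema.xml. This is only a sketch, assuming the indexer always appends the language code as a suffix; the indexed and stored settings are illustrative:

```xml
<!-- Any field whose name ends in _en or _fr picks up the matching analyzer -->
<dynamicField name="*_en" type="text_en" indexed="true" stored="true"/>
<dynamicField name="*_fr" type="text_fr" indexed="true" stored="true"/>
```

With declarations like these, supporting a further language would mean adding one field type and one dynamicField line rather than declaring each title, fulltext and space field by hand.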
In the case of a document containing multilingual text, say English and
French, there is no way to find the list of languages used in the document.
Would it be good to use a language detection tool such as
http://code.google.com/p/language-detection/ to get the list of languages
and, if there are two or more, use a multilingual field type?
title_ml, space_ml, fulltext_ml ("ml" stands for multilingual):
<!-- Multilingual -->
<fieldType name="text_ml" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- only the French stop word list is applied here -->
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="lang/stopwords_fr.txt" format="snowball"
            enablePositionIncrements="true"/>
    <!-- every token passes through all three stemmers in sequence -->
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
    <filter class="solr.FrenchLightStemFilterFactory"/>
    <filter class="solr.SpanishLightStemFilterFactory"/>
  </analyzer>
</fieldType>
The list of analysers should match the languages supported by the XWiki
instance.
I am planning to use language detection only to check whether text in
multiple languages exists. I will investigate whether it is possible to
configure the analysers on the fly, based on the languages returned by the
language-detection tool.
Please let me know whether this is the right approach.
--
Thanks,
Savitha.s
_______________________________________________
devs mailing list
devs(a)xwiki.org
http://lists.xwiki.org/mailman/listinfo/devs
--
Thanks,
Savi