There is 1 comment.
 
 
XWiki Platform / XWIKI-18419 Open

Simplified Chinese content is not supported by the Solr search

 
 


 
KevinGao on 21/Jun/24 10:41
 
IMPORTANT: Lucene [officially|https://solr.apache.org/guide/solr/9_4/indexing-guide/language-analysis.html#simplified-chinese] supports Chinese content search, and so does Solr.

 

In XWiki, you have to enable Chinese content search manually, with the following steps:

1. Find out which Lucene (Solr) version your XWiki uses.

Check {*}[permanentDirectory]/store/solr/search/conf/solrconfig.xml{*}; it contains a line like:

`<luceneMatchVersion>9.8.0</luceneMatchVersion>`
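If you prefer not to open the file by hand, the lookup can be scripted. This is only a minimal sketch in Python: the function name is mine, and it assumes solrconfig.xml parses as plain XML with a single `luceneMatchVersion` element.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def lucene_match_version(config_root: ET.Element) -> Optional[str]:
    """Return the text of <luceneMatchVersion>, or None if the element is absent."""
    text = config_root.findtext(".//luceneMatchVersion")
    return text.strip() if text else None


# Usage (replace the placeholder path with your own solrconfig.xml):
#   root = ET.parse("<path to solrconfig.xml>").getroot()
#   print(lucene_match_version(root))
```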

2. Download the smartcn jar that matches that Lucene version from [https://repo1.maven.org/maven2/org/apache/lucene]

*Lucene 9+: choose lucene-analysis-smartcn/*

*Lucene 4 to 8: choose lucene-analyzers-smartcn/*

_Example: for Lucene 9.8.0, I downloaded [https://repo1.maven.org/maven2/org/apache/lucene/lucene-analysis-smartcn/9.8.0/lucene-analysis-smartcn-9.8.0.jar]_
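The artifact choice above can be written down as a small helper. This is a sketch, not an official tool: the function name is hypothetical, and the URL layout for versions other than 9.8.0 is assumed to follow the same Maven repository pattern as the example link.

```python
MAVEN_BASE = "https://repo1.maven.org/maven2/org/apache/lucene"


def smartcn_jar_url(lucene_version: str) -> str:
    """Build the download URL of the smartcn jar for a given Lucene version."""
    major = int(lucene_version.split(".")[0])
    if major < 4:
        # Versions below 4 are not covered by the steps in this comment.
        raise ValueError("only Lucene 4 and later are covered here")
    # The artifact was renamed from "analyzers" to "analysis" in Lucene 9.
    artifact = "lucene-analysis-smartcn" if major >= 9 else "lucene-analyzers-smartcn"
    return f"{MAVEN_BASE}/{artifact}/{lucene_version}/{artifact}-{lucene_version}.jar"
```

Always check that the resulting link actually exists in the repository before relying on it.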

3. Put lucene-analysis-smartcn-X.X.X.jar in *[permanentDirectory]/store/solr/search/lib* (make sure the user that runs the application has read permission on the file).
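A quick way to confirm the jar is in place is to list the lib directory. The sketch below uses a hypothetical helper name and only checks readability for the user running the script, so run it as the same user as the application server for the result to be meaningful.

```python
import os
from pathlib import Path
from typing import List, Tuple


def smartcn_jars(lib_dir: str) -> List[Tuple[str, bool]]:
    """List smartcn jars in lib_dir and whether the current user can read each one."""
    # The pattern covers both the Lucene 9+ and the Lucene 4 to 8 artifact names.
    jars = sorted(Path(lib_dir).glob("lucene-anal*-smartcn-*.jar"))
    return [(jar.name, os.access(jar, os.R_OK)) for jar in jars]


# Usage (replace the placeholder with your permanent directory):
#   print(smartcn_jars("<permanentDirectory>/store/solr/search/lib"))
```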

4. Edit {*}[permanentDirectory]{*}/store/solr/search/conf/managed-schema.xml (in XWiki 15 and older the file may be named `managed-schema`, but it is still an XML file).

Add:

 
{code:xml}
    <!-- Dynamic fields for Chinese locales, analyzed with the smartcn field type -->
    <dynamicField name="*_zh"  type="text_smartcn"    indexed="true"  stored="true" multiValued="true" />
    <dynamicField name="*_zh_CN"  type="text_smartcn"    indexed="true"  stored="true" multiValued="true" />
    <dynamicField name="*_zh_TW"  type="text_smartcn"    indexed="true"  stored="true" multiValued="true" />

    <!-- Field type using the smartcn tokenizer -->
    <fieldType name="text_smartcn" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="org.apache.lucene.analysis.cn.smart.HMMChineseTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="org.apache.lucene.analysis.cn.smart.HMMChineseTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
{code}
Note: if you get an exception with a version I have not tested, check the official Lucene documentation to confirm the package location of HMMChineseTokenizerFactory in that version.
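Before restarting, it may help to check that the edited schema is still well-formed and that every field points at a field type that is actually defined. This is a rough sketch with a hypothetical function name; it assumes the schema uses the `fieldType`, `field` and `dynamicField` element names exactly as in the snippet above.

```python
import xml.etree.ElementTree as ET
from typing import List


def undefined_field_types(schema_root: ET.Element) -> List[str]:
    """Return "name -> type" entries for fields whose type has no <fieldType> definition."""
    defined = {ft.get("name") for ft in schema_root.iter("fieldType")}
    missing = []
    for tag in ("field", "dynamicField"):
        for field in schema_root.iter(tag):
            if field.get("type") not in defined:
                missing.append(f"{field.get('name')} -> {field.get('type')}")
    return missing


# Usage (ET.parse also fails loudly if the file is no longer well-formed XML):
#   root = ET.parse("<path to managed-schema.xml>").getroot()
#   print(undefined_field_types(root))
```

An empty list only means the references are consistent; it does not prove that Solr can load the tokenizer class from the jar.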

5. Finally, restart XWiki, then reindex the wiki in AdminPage/Search.

Chinese content search should now work (for the languages zh_CN and zh).

!image-2024-06-21-16-29-42-721.png|width=379,height=221!

 

Reference: [https://jeshs.github.io/2020/10/xwiki%E7%9A%84%E9%85%8D%E7%BD%AE%E5%92%8C%E6%8F%92%E4%BB%B6/]