Simplified Chinese content is not supported by the Solr search

1 comment

	KevinGao on 21/Jun/24 10:41

	IMPORTANT: Lucene [officially|https://solr.apache.org/guide/solr/9_4/indexing-guide/language-analysis.html#simplified-chinese] supports Chinese content search, and so does Solr. In XWiki, you have to enable Chinese content search manually with the following steps:

	1. Find the version of Lucene (Solr) used by your XWiki. Check [permanentDirectory]/store/solr/search/conf/solrconfig.xml; it contains a line like: `<luceneMatchVersion>9.8.0</luceneMatchVersion>`

	2. Download the smartcn jar that matches that Lucene version from [https://repo1.maven.org/maven2/org/apache/lucene]
	Lucene 9+: choose lucene-analysis-smartcn/
	Lucene 4 to 8: choose lucene-analyzers-smartcn/
	_Example: for Lucene 9.8.0, I downloaded [https://repo1.maven.org/maven2/org/apache/lucene/lucene-analysis-smartcn/9.8.0/lucene-analysis-smartcn-9.8.0.jar]_

	3. Put lucene-analysis-smartcn-X.X.X.jar in [permanentDirectory]/store/solr/search/lib (make sure the user running the application has read permission on it).

	4. Edit [permanentDirectory]/store/solr/search/conf/managed-schema.xml (in XWiki 15 and older the file may be named `managed-schema`, but it is still an XML file) and add:

	{code:xml}
	<!-- smartcn tokenizer -->
	<dynamicField name="*_zh" type="text_smartcn" indexed="true" stored="true" multiValued="true" />
	<dynamicField name="*_zh_CN" type="text_smartcn" indexed="true" stored="true" multiValued="true" />
	<dynamicField name="*_zh_TW" type="text_smartcn" indexed="true" stored="true" multiValued="true" />

	<!-- smartcn tokenizer -->
	<fieldType name="text_smartcn" class="solr.TextField" positionIncrementGap="100">
	  <analyzer type="index">
	    <tokenizer class="org.apache.lucene.analysis.cn.smart.HMMChineseTokenizerFactory"/>
	    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
	    <filter class="solr.LowerCaseFilterFactory"/>
	  </analyzer>
	  <analyzer type="query">
	    <tokenizer class="org.apache.lucene.analysis.cn.smart.HMMChineseTokenizerFactory"/>
	    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
	    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
	    <filter class="solr.LowerCaseFilterFactory"/>
	  </analyzer>
	</fieldType>
	{code}

	Note: if you get an exception on a version I have not tested, check the official Lucene documentation to confirm the package that contains HMMChineseTokenizerFactory in that version.

	5. Finally, restart XWiki, then reindex the wiki from the administration page, under Search.

	Chinese content search now works (for the languages zh_CN and zh).

	!image-2024-06-21-16-29-42-721.png|width=379,height=221!

	Reference: [https://jeshs.github.io/2020/10/xwiki%E7%9A%84%E9%85%8D%E7%BD%AE%E5%92%8C%E6%8F%92%E4%BB%B6/]
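
Steps 1 and 2 of the comment can also be scripted. The following is a minimal Python sketch, not a tested tool. The function names `read_lucene_version` and `smartcn_jar_url` are hypothetical. It assumes that `luceneMatchVersion` is a dotted release number such as 9.8.0 and that a smartcn artifact with exactly that version is published on Maven Central, which may not hold for every XWiki release.

```python
import xml.etree.ElementTree as ET

# Maven Central directory named in step 2 of the comment.
MAVEN_BASE = "https://repo1.maven.org/maven2/org/apache/lucene"


def read_lucene_version(solrconfig_xml: str) -> str:
    """Return the text of <luceneMatchVersion> from solrconfig.xml content."""
    root = ET.fromstring(solrconfig_xml)
    node = next(root.iter("luceneMatchVersion"), None)
    if node is None or not (node.text or "").strip():
        raise ValueError("no <luceneMatchVersion> value found")
    return node.text.strip()


def smartcn_jar_url(version: str) -> str:
    """Build the smartcn jar URL for a dotted Lucene version (assumption)."""
    major = int(version.split(".")[0])
    if major < 4:
        # The comment only covers Lucene 4 and later.
        raise ValueError("Lucene versions below 4 are not covered")
    # Lucene 9+ uses lucene-analysis-smartcn, Lucene 4 to 8 uses
    # lucene-analyzers-smartcn, as described in step 2.
    artifact = "lucene-analysis-smartcn" if major >= 9 else "lucene-analyzers-smartcn"
    return f"{MAVEN_BASE}/{artifact}/{version}/{artifact}-{version}.jar"
```

To use it, read [permanentDirectory]/store/solr/search/conf/solrconfig.xml into a string, pass it to `read_lucene_version`, and pass the result to `smartcn_jar_url`. Check that the resulting URL exists before downloading, since the version in the configuration is only assumed to match a published jar.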
