Hi guys,
I just noticed (while updating the screenshots for the Solr Search UI
documentation [1]) that searching for "blog" doesn't match "Blog.News"
from the category of BlogIntroduction any more, as shown in [2].
The debug mode view shows me that "Blog.News" is indexed as "blog.new",
which means the text is not split into "blog" and "news" as I would have
expected in this case (the trailing "s" is presumably dropped later by
the stemmer).
After checking the Solr schema configuration, I understood that this is
normal, considering that we use the Standard Tokenizer [3] for English
text, which has this exception:
"Periods (dots) that are not followed by whitespace are kept as part
of the token, including Internet domain names."
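For reference, the text_en type is defined roughly like this in the stock
Solr example schema (our copy may use a slightly different filter chain,
so take this as a sketch, not the exact definition we ship):

<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- Keeps "Blog.News" as a single token because the dot is not
         followed by whitespace. -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="lang/stopwords_en.txt"/>
    <!-- "Blog.News" -> "blog.news" -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <!-- "blog.news" -> "blog.new" (the final "s" is stripped) -->
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>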
Further investigation showed that before 6.0M1 we used the Word
Delimiter Filter [4] for English text, but I changed this with
XWIKI-8911 when upgrading to Solr 4.7.0.
I then noticed that the Solr schema has both text_en and
text_en_splitting types, the latter with this comment:
A text field with defaults appropriate for English, plus aggressive
word-splitting and autophrase features enabled. This field is just
like text_en, except it adds WordDelimiterFilter to enable splitting
and matching of words on case-change, alpha numeric boundaries, and
non-alphanumeric chars. This means certain compound word cases will
work, for example query "wi fi" will match document "WiFi" or
"wi-fi".
So if someone wants to use this type for English text instead, they need
to change the type of the English dynamic field:

<dynamicField name="*_en" type="text_en" indexed="true" stored="true"
              multiValued="true" />
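
i.e. switch it to something like:

<dynamicField name="*_en" type="text_en_splitting" indexed="true" stored="true"
              multiValued="true" />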
The question is whether we should use this type by default or not. As
explained in the comment above, there are downsides.
Thanks,
Marius
[1]
http://extensions.xwiki.org/xwiki/bin/view/Extension/Solr+Search+Application
[2]
http://extensions.xwiki.org/xwiki/bin/download/Extension/Solr+Search+Applic…
[3]
https://cwiki.apache.org/confluence/display/solr/Tokenizers#Tokenizers-Stan…
[4]
https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#Filter…