Re: [xwiki-devs] [Solr] Word delimiter filter on English text

5 May 2015

Hi,
The question is about content fields (document content, textarea content,
etc.) and not about the document's space name and document name fields,
which will still match in both approaches, right?
As far as I've understood it, text_en gets fewer matches than
text_en_splitting, but text_en has better support for cases where in
text_en_splitting you would have to use a phrase query to get the match
(e.g. "Blog.News", "xwiki.com", etc.).
IMO, text_en_splitting sounds better suited to real-life use and to the
fuzziness of user queries. If we want explicit matches for "xwiki.com" or
"Blog.News" within a document's content, phrase queries can still be used,
right? (i.e. quoting the explicit string).
Thanks,
Eduard
On Tue, May 5, 2015 at 12:55 PM, Marius Dumitru Florea
<mariusdumitru.florea(a)xwiki.com> wrote:
...
  Hi guys,
 I just noticed (while updating the screen shots for the Solr Search UI
 documentation [1]) that searching for "blog" doesn't match "Blog.News"
 from the category of BlogIntroduction any more as indicated in [2].
 Debug mode view shows me that "Blog.News" is indexed as "blog.new"
 which means the text is not split into "blog" and "news" as I would have
 expected in this case.
 After checking the Solr schema configuration I understood that this is
 normal considering that we use the Standard Tokenizer [3] for English
 text which has this exception:
 "Periods (dots) that are not followed by whitespace are kept as part
 of the token, including Internet domain names."
 Further investigation showed that before 6.0M1 we used the Word
 Delimiter Filter [4] for English text but I changed this with
 XWIKI-8911 when upgrading to Solr 4.7.0.
 I then noticed that the Solr schema has both text_en and
 text_en_splitting types, the latter with this comment:
 A text field with defaults appropriate for English, plus aggressive
 word-splitting and autophrase features enabled. This field is just
 like text_en, except it adds WordDelimiterFilter to enable splitting
 and matching of words on case-change, alpha numeric boundaries, and
 non-alphanumeric chars. This means certain compound word cases will
 work, for example query "wi fi" will match document "WiFi" or "wi-fi".
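For readers unfamiliar with how such a type is put together, here is a rough sketch of a field type matching the comment above. It is modeled on the stock Solr example schema rather than copied from XWiki's schema, so the attribute values and the exact filter chain are assumptions and may differ from what XWiki ships.

```xml
<!-- Sketch only: modeled on the stock Solr example schema, not taken from XWiki. -->
<fieldType name="text_en_splitting" class="solr.TextField"
           positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer>
    <!-- Split on whitespace only, so "Blog.News" reaches the filter as one token. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Split on case changes, letter/digit boundaries and non-alphanumeric chars. -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1" catenateAll="0"
            splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
```

With a chain along these lines, "Blog.News" would be expected to produce separate "blog" and "news" terms, which is what allows a plain query for "blog" to match.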
 So if someone wants to use this type instead for English text, they
 need to change the type in:
 <dynamicField name="*_en" type="text_en" indexed="true" stored="true"
 multiValued="true" />
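As a minimal sketch, the changed declaration might look like the following. It only swaps the type attribute and assumes that text_en_splitting is already defined in the same schema. Existing documents would also need to be reindexed before the new analysis takes effect.

```xml
<!-- Sketch: same dynamic field, now pointing at the splitting field type. -->
<dynamicField name="*_en" type="text_en_splitting" indexed="true" stored="true"
              multiValued="true" />
```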
 The question is whether we should use this type by default or not. As
 explained in the comment above, there are downsides.
 Thanks,
 Marius
 [1] http://extensions.xwiki.org/xwiki/bin/view/Extension/Solr+Search+Application
 [2] http://extensions.xwiki.org/xwiki/bin/download/Extension/Solr+Search+Applic…
 [3] https://cwiki.apache.org/confluence/display/solr/Tokenizers#Tokenizers-Stan…
 [4] https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#Filter…
 _______________________________________________
 devs mailing list
 devs(a)xwiki.org
 http://lists.xwiki.org/mailman/listinfo/devs
