[xwiki-devs] [Solr] Word delimiter filter on English text - xwiki-devs@xwiki.org

Eduard Moraru

5 May

2:12 p.m.

Hi,

The question is about content fields (document content, textarea content, etc.) and not about the document's space name and document name fields, which will still match in both approaches, right?

As far as I've understood it, text_en gets fewer matches than text_en_splitting, but text_en has better support for cases where in text_en_splitting you would have to use a phrase query to get the match (e.g. "Blog.News", "xwiki.com", etc.).

IMO, text_en_splitting sounds better adapted to real-life use and to the fuzziness of user queries. If we want explicit matches for "xwiki.com" or "Blog.News" within a document's content, phrase queries can still be used, right? (i.e. quoting the explicit string).

Thanks,
Eduard

On Tue, May 5, 2015 at 12:55 PM, Marius Dumitru Florea <mariusdumitru.florea(a)xwiki.com> wrote:


Hi guys,

I just noticed (while updating the screenshots for the Solr Search UI documentation [1]) that searching for "blog" doesn't match "Blog.News" from the category of BlogIntroduction any more as indicated in [2]. Debug mode view shows me that "Blog.News" is indexed as "blog.new" which means the text is not split into "blog" and "news" as I would have expected in this case.

After checking the Solr schema configuration I understood that this is normal considering that we use the Standard Tokenizer [3] for English text which has this exception:

"Periods (dots) that are not followed by whitespace are kept as part of the token, including Internet domain names."

Further investigation showed that before 6.0M1 we used the Word Delimiter Filter [4] for English text but I changed this with XWIKI-8911 when upgrading to Solr 4.7.0.

I then noticed that the Solr schema has both text_en and text_en_splitting types, the latter with this comment:

"A text field with defaults appropriate for English, plus aggressive word-splitting and autophrase features enabled. This field is just like text_en, except it adds WordDelimiterFilter to enable splitting and matching of words on case-change, alpha numeric boundaries, and non-alphanumeric chars. This means certain compound word cases will work, for example query "wi fi" will match document "WiFi" or "wi-fi"."

So in case someone wants to use this type instead for English text, they need to change the type in:

<dynamicField name="*_en" type="text_en" indexed="true" stored="true" multiValued="true" />

The question is whether we should use this type by default or not. As explained in the comment above, there are downsides.

Thanks,
Marius

[1] http://extensions.xwiki.org/xwiki/bin/view/Extension/Solr+Search+Application
[2] http://extensions.xwiki.org/xwiki/bin/download/Extension/Solr+Search+Applic…
[3] https://cwiki.apache.org/confluence/display/solr/Tokenizers#Tokenizers-Stan…
[4] https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#Filter…

_______________________________________________
devs mailing list
devs(a)xwiki.org
http://lists.xwiki.org/mailman/listinfo/devs
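The tokenization difference described in the message above can be illustrated with a toy sketch. The two functions below are rough approximations written for this illustration only; they are not Solr's Standard Tokenizer or Word Delimiter Filter, their names are made up, and stemming (which is what turns "news" into "new" in the debug output) is left out.

```python
import re


def standard_like_tokens(text):
    """Toy stand-in for the quoted rule: a dot between two alphanumeric
    runs (no whitespace after it) stays inside the token."""
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+(?:\.[A-Za-z0-9]+)*", text)]


def word_delimiter_like_tokens(text):
    """Toy stand-in for aggressive splitting: break on lower-to-upper case
    changes, on letter/digit boundaries and on non-alphanumeric characters."""
    text = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)
    text = re.sub(r"(?<=[A-Za-z])(?=[0-9])|(?<=[0-9])(?=[A-Za-z])", " ", text)
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", text) if t]


if __name__ == "__main__":
    for sample in ("Blog.News", "xwiki.com", "WiFi", "wi-fi"):
        print(sample, standard_like_tokens(sample), word_delimiter_like_tokens(sample))
```

With the first function "Blog.News" stays a single token, so a query for "blog" has nothing to match; with the second it becomes two tokens, which is the behaviour the thread expects from text_en_splitting.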


Sergiu Dumitriu

3:57 p.m.

I agree with Paul.

The way I usually do searches is:

- each field gets indexed several times, including:
  -- exact matches ^5n (field == query)
  -- prefix matches ^1.5n (field ^= query)
  -- same spelling ^1.8n (query words in field)
  -- fuzzy matching ^n (aggressive tokenization and stemming)
  -- stub matching ^.5n (query tokens are prefixes of indexed tokens)
  -- and three catch-all fields where every other field gets copied, with spelling, fuzzy and stub variants
- where n is a factor based on the field's importance: page title and name have the highest boost, a catch-all field has the lowest boost
- search with edismax, pf with double the boost (2n) on exact,prefix,spelling,fuzzy and qf on spelling,fuzzy,stub

On 05/05/2015 08:28 AM, Paul Libbrecht wrote:


Eddy,

We want both, or? Does the query not use edismax? If yes, we should make it search the field text_en with a higher weight than text_en_splitting by setting the boost parameter to

text_en^2 text_en_splitting^1

Or?

Paul
-- fat fingered on my z10 --
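Paul's suggestion can be sketched as a set of request parameters. In edismax, per-field weights are normally passed through the `qf` parameter, so the sketch uses that. The field names `content_en` and `content_en_splitting` are hypothetical placeholders (text_en and text_en_splitting are field types in the schema, not field names), and the helper function is made up for this illustration.

```python
from urllib.parse import urlencode


def build_edismax_params(user_query, field_boosts):
    """Build edismax request parameters with one boost per queried field."""
    qf = " ".join(f"{field}^{boost:g}" for field, boost in field_boosts.items())
    return {"q": user_query, "defType": "edismax", "qf": qf}


# Hypothetical field names: the exact-ish variant weighs twice as much
# as the aggressively split variant.
params = build_edismax_params("blog", {"content_en": 2, "content_en_splitting": 1})

if __name__ == "__main__":
    print(urlencode(params))
```

Searching both variants means a document that matches only after splitting is still found, while a match on the unsplit text ranks higher.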

-- Sergiu Dumitriu http://purl.org/net/sergiu/
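The boosting scheme from Sergiu's message can be sketched as follows. This is one reading of the description, not his actual configuration: the `<field>_<variant>` naming is an assumption, "pf with double the boost" is interpreted as multiplying each variant boost by 2, and the catch-all fields are left out for brevity.

```python
# Boost multipliers per index variant, as listed in the message.
VARIANT_FACTORS = {"exact": 5.0, "prefix": 1.5, "spelling": 1.8, "fuzzy": 1.0, "stub": 0.5}
PF_VARIANTS = ("exact", "prefix", "spelling", "fuzzy")  # used for phrase boosting
QF_VARIANTS = ("spelling", "fuzzy", "stub")             # used for term matching


def boosted(field, variants, n, scale=1.0):
    """Return 'field_variant^boost' entries for one logical field."""
    return [f"{field}_{v}^{VARIANT_FACTORS[v] * n * scale:g}" for v in variants]


def build_params(query, importance):
    """importance maps a logical field name to its factor n."""
    qf, pf = [], []
    for field, n in importance.items():
        qf += boosted(field, QF_VARIANTS, n)
        pf += boosted(field, PF_VARIANTS, n, scale=2.0)  # assumed meaning of "2n"
    return {"defType": "edismax", "q": query, "qf": " ".join(qf), "pf": " ".join(pf)}


if __name__ == "__main__":
    # Hypothetical importance factors: the title matters more than the content.
    for key, value in build_params("wifi", {"title": 4.0, "content": 1.0}).items():
        print(key, "=", value)
```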


Sergiu Dumitriu

8 May

9:39 a.m.

Well, my use case is not the same, since I'm indexing ontologies and the end purpose is to find the best matching terms. A few numbers though:

- 4MB ontology with 11k terms ends up as 16M index (including spellcheck, and most fields are also stored), searches take ~40ms including the XWiki overhead, ~10ms just in Solr
- 180MB ontology with 24k terms -> 100M index, ~15ms Solr search time

For smaller indexes, it does seem to use more disk space than the source, but Lucene is good at indexing larger data sets, and after a while the index grows more slowly than the data.

For me it is worth the extra disk space, since every user is amazed by how good the search is at finding the relevant terms, overcoming typos, synonyms, and abbreviations, plus autocomplete while typing.

In XWiki, not all fields should be indexed in all the ways, since it doesn't make sense to expect an exact match on a large textarea or the document content.

On 05/07/2015 09:57 AM, Marius Dumitru Florea wrote:


Hi Sergiu,

Can you tell us the effect on the index size (and speed in the end) if each field (e.g. document title, a String or TextArea property) is indexed in 5 different ways (5 separate fields in the index)? Is it worth having this configuration by default?

Thanks,
Marius

-- Sergiu Dumitriu http://purl.org/net/sergiu/
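As a quick check of the figures quoted in this message, the index-to-source size ratio can be computed directly. The sketch assumes that "16M" and "100M" mean megabytes, which the message does not state explicitly.

```python
def index_ratio(source_mb, index_mb):
    """Index size divided by source size; above 1 the index is larger than the data."""
    return index_mb / source_mb


# (description, source size in MB, index size in MB), taken from the message.
SAMPLES = [
    ("4MB ontology, 11k terms", 4, 16),
    ("180MB ontology, 24k terms", 180, 100),
]

if __name__ == "__main__":
    for label, source_mb, index_mb in SAMPLES:
        print(f"{label}: index is {index_ratio(source_mb, index_mb):.2f}x the source size")
```

The small ontology yields an index about four times its size, while the large one yields an index a little over half its size, which is consistent with the remark that the index grows more slowly than the data.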


Marius Dumitru Florea

11:22 a.m.

On Fri, May 8, 2015 at 10:39 AM, Sergiu Dumitriu <sergiu(a)xwiki.org> wrote:

> Well, my use case is not the same, since I'm indexing ontologies and the end purpose is to find the best matching terms. A few numbers though:
>
> - 4MB ontology with 11k terms ends up as 16M index (including spellcheck, and most fields are also stored), searches take ~40ms including the XWiki overhead, ~10ms just in Solr
> - 180MB ontology with 24k terms -> 100M index, ~15ms Solr search time
>
> For smaller indexes, it does seem to use more disk space than the source, but Lucene is good at indexing larger data sets, and after a while the index grows more slowly than the data.
>
> For me it is worth the extra disk space, since every user is amazed by how good the search is at finding the relevant terms, overcoming typos, synonyms, and abbreviations, plus autocomplete while typing.

Do you do this for multiple languages or just for English? In other words, do you have text_fr_splitting, text_es_splitting, etc.?

Thanks Sergiu, I'll definitely take this into account.

Marius



Marius Dumitru Florea

7 May

3:49 p.m.

On Tue, May 5, 2015 at 3:12 PM, Eduard Moraru <enygma2002(a)gmail.com> wrote:

> Hi,
>
> The question is about content fields (document content, textarea content, etc.) and not about the document's space name and document name fields, which will still match in both approaches, right?

The question is about the fields that are indexed depending on the document locale.

> As far as I've understood it, text_en gets fewer matches than text_en_splitting, but text_en has better support for cases where in text_en_splitting you would have to use a phrase query to get the match (e.g. "Blog.News", "xwiki.com", etc.).

With text_en_splitting a search for "Blog.News" will also match "blog news" because the phrase from the query is analyzed in the same way it would have been indexed.
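The point about query-time analysis can be shown with a small sketch: when the query phrase and the document go through the same splitting analyzer, "Blog.News" and "blog news" end up as the same token sequence. The `analyze` function is a toy approximation written for this illustration, not Solr's analysis chain, and `phrase_matches` is a hypothetical helper.

```python
import re


def analyze(text):
    """Toy analyzer: split on lower-to-upper case changes and on
    non-alphanumeric characters, then lowercase the pieces."""
    text = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", text) if t]


def phrase_matches(query, document):
    """True if the analyzed query tokens occur consecutively in the analyzed document."""
    q, d = analyze(query), analyze(document)
    return bool(q) and any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))


if __name__ == "__main__":
    print(phrase_matches("Blog.News", "read the blog news today"))
    print(phrase_matches("Blog.News", "news from the blog"))
```

In this sketch the first call matches and the second does not, because the phrase requires the tokens to be adjacent and in order.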

