[xwiki-devs] [Solr] What do we search for?


Marius Dumitru Florea

11 Oct 2013

11:55 a.m.

Hi devs,

This is a very important question, so think carefully. Let me explain.

In XWiki (model) we have a few entity types. There are *wikis* which have *spaces* which have *documents*. A document can have *objects* and *attachments*. A document can also define a *class*.

At the same time we like to say that in XWiki "everything is a document" because everything revolves around documents. The document is the central notion.

We can query the database (using HQL or XWQL) for any of the previously mentioned entities, but what should a Solr query return (semantically)? In other words:

* are you searching for an object without caring about the document that holds the object? Same for an object property.
* how often are you searching for an attachment without caring about the document that holds the attachment?
* are you searching for a class or for the document that defines that class?
* are you searching for a wiki without caring about the documents it contains? Same for a space.

IMO the result of a Solr query should be, semantically, a list of documents. But maybe I'm wrong.

-----------------------
Technical Details
-----------------------

Unlike a relational database, a Solr/Lucene index has a single 'table'. So normally you index a single entity type. Each row in the index represents an entity of that type. As a consequence, the result of a Solr query is semantically a list of entities of that type. In our case the entity type is (naturally) *document*.

If you want to index more entity types (e.g. index attachments and objects _separately_, not as part of a document) then, since there is only one 'table' in the index, you need to add a 'type' column that specifies the type of entity you have on each row (e.g. type=document, type=attachment, type=object etc.). The result of a Solr query is now, semantically, a list of different entity types, unless you filter by a specific type. It smells like a hack to me.

Let's imagine what happens if we want to search for blog posts that have a specific tag. With the first approach (one row per document) this is easy because all the (indexed) information is on a single row. With the second approach (one row per entity) this is considerably more complex because the information is spread over multiple rows:

* one row with type=document for the blog post document
* one row with type=object for the blog post object
* one row with type=object for the tag object

In a relational database, when you have the information spread in multiple places (tables) you do joins. Fortunately (you would say) Solr supports joins. In this particular case we would have to perform 2 joins, which means:

index X index X index

where X represents the cartesian product. The document name would be the join key. Pretty complex even before trying to write this in Solr query syntax.

So basically the question becomes: is it worth indexing more entities _separately_ instead of indexing just documents (with info about their objects and attachments), considering the complexity that it brings in writing Solr queries? Do we search for objects and attachments alone, as separate entities, often enough to justify this complexity? My answer is no.

Thanks, Marius
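To make the comparison between the two layouts concrete, here is a minimal in-memory sketch in Python. It is not Solr: the rows are plain dictionaries, and the field names (`doc`, `classes`, `tags`, `type`, `class`) and the sample data are assumptions made up for illustration.

```python
# Approach 1: one row per document; object data is flattened onto the row.
doc_rows = [
    {"doc": "Blog.Post1", "classes": ["Blog.BlogPostClass", "XWiki.TagClass"], "tags": ["news"]},
    {"doc": "Blog.Post2", "classes": ["Blog.BlogPostClass"], "tags": []},
    {"doc": "Main.WebHome", "classes": ["XWiki.TagClass"], "tags": ["news"]},
]

def posts_with_tag_single_row(rows, tag):
    """One filter over one row is enough to answer the question."""
    return sorted(r["doc"] for r in rows
                  if "Blog.BlogPostClass" in r["classes"] and tag in r["tags"])

# Approach 2: one row per entity, distinguished by a 'type' column.
typed_rows = [
    {"type": "document", "doc": "Blog.Post1"},
    {"type": "object", "doc": "Blog.Post1", "class": "Blog.BlogPostClass"},
    {"type": "object", "doc": "Blog.Post1", "class": "XWiki.TagClass", "tags": ["news"]},
    {"type": "document", "doc": "Blog.Post2"},
    {"type": "object", "doc": "Blog.Post2", "class": "Blog.BlogPostClass"},
    {"type": "document", "doc": "Main.WebHome"},
    {"type": "object", "doc": "Main.WebHome", "class": "XWiki.TagClass", "tags": ["news"]},
]

def posts_with_tag_typed_rows(rows, tag):
    """The same question needs three row sets joined on the document name."""
    documents = {r["doc"] for r in rows if r["type"] == "document"}
    blog_posts = {r["doc"] for r in rows
                  if r["type"] == "object" and r["class"] == "Blog.BlogPostClass"}
    tagged = {r["doc"] for r in rows
              if r["type"] == "object" and r["class"] == "XWiki.TagClass"
              and tag in r.get("tags", [])}
    # Two joins: documents x blog post objects x tag objects.
    return sorted(documents & blog_posts & tagged)
```

Both functions return the same single document for the tag "news", but the second one has to combine three sets of rows to get there, which is the complexity the message is worried about.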


Thomas Mortagne

11 Oct

12:05 p.m.

On Fri, Oct 11, 2013 at 11:55 AM, Marius Dumitru Florea <mariusdumitru.florea(a)xwiki.com> wrote:

...

Sounds good in theory, but storing several entities in the same entry has complexity of its own that needs to be discussed before deciding. How do you plan to store several tag objects of the same document in a single document entry, for example?

...

Thanks, Marius _______________________________________________ devs mailing list devs(a)xwiki.org http://lists.xwiki.org/mailman/listinfo/devs

-- Thomas Mortagne

Marius Dumitru Florea

12:14 p.m.

On Fri, Oct 11, 2013 at 1:05 PM, Thomas Mortagne <thomas.mortagne(a)xwiki.com> wrote:

...

On Fri, Oct 11, 2013 at 11:55 AM, Marius Dumitru Florea <mariusdumitru.florea(a)xwiki.com> wrote:

...

Sounds good in theory, but storing several entities in the same entry has complexity of its own that needs to be discussed before deciding.

I agree.

...

How do you plan to store several tag objects of the same document in a single document entry, for example?

I haven't thought about it very much, but I was thinking about using multiValued="true", maybe in combination with dynamic fields. I need to think about this. Thanks, Marius
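As a rough illustration of the multi-valued idea, the Python sketch below flattens all objects of one document into a single entry, so that several tag objects end up as several values of the same field. The function name, the `ClassName.propertyName` field naming and the sample data are hypothetical; this is not the XWiki indexing code.

```python
from collections import defaultdict

def build_document_entry(doc_name, objects):
    """Flatten all objects of a document into one index entry.

    Each property becomes a multi-valued field named after its class and
    property, so several objects of the same class share one field.
    """
    fields = defaultdict(list)
    for class_name, properties in objects:
        for prop_name, value in properties.items():
            fields[f"{class_name}.{prop_name}"].append(value)
    entry = {"doc": doc_name}
    entry.update(fields)
    return entry

# Two tag objects and one blog post object on the same (made-up) document.
entry = build_document_entry("Blog.Post1", [
    ("XWiki.TagClass", {"tags": "news"}),
    ("XWiki.TagClass", {"tags": "solr"}),
    ("Blog.BlogPostClass", {"title": "Hello"}),
])
```

One consequence of this flattening is that the entry no longer records which object each value came from; only the class and property names survive.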

...


Guillaume Lerouge

12:09 p.m.

Hi,

I don't want to answer this too broadly (I don't have the technical chops to make a really informed comment). Here's however what I can state from my experience with XWiki projects:

- When searching for an attached file, we always want to know (and display) the document to which that file is attached
- When searching for an object, we're always looking for the document which the object is part of, especially since we don't have an "object-only" or "property-only" view anyway
- When searching for a class, again, we don't have a displayer for that class in view mode outside of the document holding the class
- We don't really search for a space right now since technically it's just a collection of pages anyway
- Searching for a wiki would be done through the wiki index, other than that you're just searching for documents (some of which might happen to be in a wiki)

All of which would tend to agree with Marius' suggestion.

In terms of UX impact, I think this would mean that documents should always be returned in search results, with attachments indented under the document itself (instead of having separate entries for attachments and documents as we do now).

Guillaume

On Fri, Oct 11, 2013 at 11:55 AM, Marius Dumitru Florea <mariusdumitru.florea(a)xwiki.com> wrote:

...

Marius Dumitru Florea

12:17 p.m.

On Fri, Oct 11, 2013 at 1:09 PM, Guillaume Lerouge <guillaume(a)xwiki.com> wrote:

...

In terms of UX impact, I think this would mean that documents should always be returned in search results, with attachments indented under the document itself (instead of having separate entries for attachments and documents as we do now).

Yes, the results would always be documents, but for each result we would display where the search term has been matched:

* in document title
* in document content
* in the attachment name
* in the attachment content
* in an object property
* etc.

Thanks, Marius
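A small Python sketch of this "where did it match" idea follows. It uses naive case-insensitive substring matching on a dictionary as a stand-in for real text analysis, and the field names (`title`, `content`, `attachment_name`, and so on) are assumptions for the example only.

```python
def match_locations(entry, term):
    """Return the names of the fields of one document entry that contain the term."""
    term = term.lower()
    locations = []
    for field, value in entry.items():
        # Treat single-valued and multi-valued fields the same way.
        values = value if isinstance(value, list) else [value]
        if any(term in str(v).lower() for v in values):
            locations.append(field)
    return locations

# One hypothetical document entry with flattened attachment and object data.
entry = {
    "title": "Release notes",
    "content": "Solr search was improved.",
    "attachment_name": ["solr-schema.xml", "logo.png"],
    "attachment_content": ["<schema/>", ""],
    "object_property": ["search", "release"],
}
```

Here `match_locations(entry, "solr")` reports the document content and the attachment name, so the result stays a single document while the UI can still say where the term was found.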

...


Ludovic Dubost

12:26 p.m.

Hi,

...

From my point of view we usually search mostly for two types of things:

- documents
- attachments

But we should be able to filter these results on multiple property values of any object. This is true for documents and also for attachments.

It is also interesting to be able to present results differently depending on the document we get (if it's a meeting document or a user document, we display things differently).

Being able to search for attachments separately is very important.

As for objects, most of the time we search for documents that have a specific object. There is however a use case where it could be interesting to search in individual objects. For example, this is the case for comments: it could be interesting to search in all comments. Another example could be tasks. Suppose you add tasks inside documents, associated to some content of the document (like annotations). You might want to run a search on all the tasks and then display a link to the document in which the task is, but not the other way around.

Now I think this use case could be optional, so we don't necessarily need to index all objects of all classes. We could have some config which says to build an index for all comment objects or all task objects. I think we already had an object index in Lucene and I don't remember if we have ever used it.

I don't think we need an index on all properties.

Ludovic

2013/10/11 Marius Dumitru Florea <mariusdumitru.florea(a)xwiki.com>

...

-- Ludovic Dubost Founder and CEO Blog: http://blog.ludovic.org/ XWiki: http://www.xwiki.com Skype: ldubost GTalk: ldubost

Ludovic Dubost

12:28 p.m.

As for the searching for "wiki" example, this is a good case where really we just need to search for documents, since one wiki = one document. The only case where we could need to search for objects is when 1 document = multiple objects. That is not that common, and it is even less common that we need to make full-text searches on the object entities separately.

Ludovic

2013/10/11 Ludovic Dubost <ludovic(a)xwiki.com>

...


Marius Dumitru Florea

2:48 p.m.

On Fri, Oct 11, 2013 at 1:26 PM, Ludovic Dubost <ludovic(a)xwiki.com> wrote:

...

Being able to search for attachments separately is very important.

This was possible with the old Lucene index because we were indexing attachments in separate rows _but_ we were duplicating all document fields on the attachment row. So if you had a document with 2 attachments then you had 3 rows associated in the Lucene index: one for the document itself and 2 for the attachments but the document fields were duplicated twice. Of course we can say we don't care about the index size (do we? :) ) but we must be careful to not get duplicated results because of the duplicated information in the index. Thanks, Marius
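The duplicate-result problem can be pictured with a short Python sketch. The hit structure (`doc`, `row`) and the sample data are hypothetical; the point is only that rows carrying copies of the document fields all match the same query and have to be collapsed back to one result per document.

```python
def collapse_by_document(hits):
    """Keep one result per document, remembering which rows matched."""
    collapsed = {}
    for hit in hits:
        collapsed.setdefault(hit["doc"], []).append(hit["row"])
    return collapsed

# A document with 2 attachments has 3 rows that all carry the document title,
# so a query on the title matches all three rows.
hits = [
    {"doc": "Main.Report", "row": "document"},
    {"doc": "Main.Report", "row": "attachment:a.pdf"},
    {"doc": "Main.Report", "row": "attachment:b.pdf"},
    {"doc": "Main.Other", "row": "document"},
]

results = collapse_by_document(hits)
```

After collapsing, `results` has two keys (one per document) instead of four hits, and the list under "Main.Report" still records the three rows that matched.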

...


Eduard Moraru

13 Oct

11:27 p.m.

Hi,

The initial idea with modelling things this way in the Solr index was to allow devs to search for entities other than documents (objects and their properties) using Solr instead of having to use (HS/XW)QL.

An interesting use case from my POV was when a dev wanted to search for a piece of code (but it could be just a word/string) that is stored in an XWiki object. We keep assuming that everybody stores one object in one document, but that's obviously not always the case, and some app devs might want to store their application's entities as multiple objects inside a single page (since we always have that dilemma when starting to develop an application). Another example was mentioned by Ludovic with XWikiComments.

The problem in the previous mapping was that we would find the document where the object is stored, but we would not know in which actual object the result was found, since we indexed property values in the form ClassName.propertyName:value. So telling the dev that the string/code/etc. he's looking for is somewhere in that document does not help much.

Using Solr, we might now know which field (propertyName) that was (with the highlighting component), but we would probably need to add some object ID/number information in there (like ClassName.0.propertyName:value) and that might mess up the query syntax. We would have to write queries like ClassName.*.propertyName:searchedWord and I don't remember if Solr supports the usage of wildcards in field names (AFAIR, only in field values, except for the catch-all *:* construction). AFAIR, this was the main reason why we had to use dynamic fields and the whole multilingual work.

I also had my doubts about mapping properties as first-class Lucene documents (or "tables" as previously referred to), but, now that I think of it, it provides a solution for the example above. Maybe using just objects would have sufficed as well.

I don't know, more examples may exist or not, but it's good that we've started talking about it. Marius, if you're interested, I'm available this coming week for some brainstorming on the subject, if that would help. Just let me know.

Thanks, Eduard
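The object-number idea can be sketched in Python as follows. Since the message leaves open whether Solr accepts wildcards in field names, this sketch resolves the wildcard on the client side with `fnmatch` before looking at values. The helper names and the sample comments are hypothetical.

```python
from fnmatch import fnmatchcase

def numbered_fields(class_name, objects):
    """Build field names of the form ClassName.<objectNumber>.propertyName."""
    fields = {}
    for number, properties in enumerate(objects):
        for prop_name, value in properties.items():
            fields[f"{class_name}.{number}.{prop_name}"] = value
    return fields

def find_objects(fields, pattern, word):
    """Resolve a wildcard field pattern and report which object numbers match."""
    matches = []
    for name, value in fields.items():
        if fnmatchcase(name, pattern) and word in value:
            # The object number is the second-to-last dot-separated segment.
            matches.append(int(name.split(".")[-2]))
    return matches

# Three comment objects stored on a single (made-up) page.
fields = numbered_fields("XWiki.XWikiComments", [
    {"comment": "Looks good"},
    {"comment": "Please fix the velocity macro"},
    {"comment": "The macro is fixed"},
])
```

With this naming, `find_objects(fields, "XWiki.XWikiComments.*.comment", "macro")` can tell the developer that objects 1 and 2 contain the word, not just that the page does.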

...


Marius Dumitru Florea

14 Oct

6 p.m.

...

Ludovic Dubost

13 Nov

7:08 p.m.

Hi Marius, I have a quick question after starting to read your proposal. I don't see anything about multi-language indexing. I remember that in the current Solr implementation there are separate fields for each language. Would there be a field for each language, indexed for each property? Ludovic 2013/10/14 Marius Dumitru Florea <mariusdumitru.florea(a)xwiki.com>

...


Marius Dumitru Florea

14 Nov

5:28 p.m.

On Wed, Nov 13, 2013 at 8:08 PM, Ludovic Dubost <ludovic(a)xwiki.com> wrote:

...

Yes. Right now I'm struggling to find a way to define an alias for a group of dynamic fields. For the document title we have this in solrconfig.xml:

    <str name="f.title.qf">title__ title_ar title_bg title_ca ...</str>

which makes 'title' an alias for all its translations and allows us to write title:text in the search query. I need to do the same, but dynamically, for each object property:

    property_Blog.BlogPostClass_title = property_Blog.BlogPostClass_title__, property_Blog.BlogPostClass_title_en, property_Blog.BlogPostClass_title_fr, ...

I'll keep you posted.

Thanks, Marius

...


Paul Libbrecht

15 Nov

12:02 a.m.

Marius,

I would suggest generating the schema and config, reloading every time there's a class change.

Alternatively, and that's how solr-drupal works, you would define the fields by prefix, but I am not sure the aliasing would work.

I believe that the query-expansion step, from title:x to title-en:x title-fr:x, etc., is best controlled early so that applications can change it somehow. In curriki, this is done with a custom query-component which uses the query-parser (with a default-field which does not exist) and then rewrites the query objects (which is a fairly easy game).

Hope it helps.

On 14 Nov 2013, at 17:28, Marius Dumitru Florea <mariusdumitru.florea(a)xwiki.com> wrote:

...


Marius Dumitru Florea

6:48 a.m.

On Fri, Nov 15, 2013 at 1:02 AM, Paul Libbrecht <paul(a)hoplahup.net> wrote:

...

Marius,

...

I would suggest to generate the schema and config, reloading every time there's a class change.

That would mean re-indexing everything, right? It would take too much time.

...

Alternatively, and that's how solr-drupal works, you would define the fields by prefix, but I am not sure the aliasing would work.

...

I believe that the query-expansion step, from title:x to title-en:x title-fr:x, etc., is best controlled early so that applications can change it somehow. In curriki, this is done with a custom query-component which uses the query-parser (with a default-field which does not exist) and then rewrites the query objects (which is a fairly easy game).

That's actually what I'm currently investigating. I'll try to extend the ExtendedDismaxQParserPlugin, let it do its query parsing and then expand the query with more query objects when the "field" name matches some pattern (e.g. property_*) Thanks, Marius
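The expansion step can be illustrated with a short Python sketch. Note that it rewrites the query string with a regular expression, which is a simplification: the approach described above works on parsed query objects inside a Solr query parser. The language list, the pattern and the function name are assumptions; the `__` suffix for the language-neutral variant follows the field names quoted earlier in the thread.

```python
import re

LANGUAGES = ["en", "fr"]  # assumption: the real list of languages is much longer

# Matches "property_<something>:<value>" where the value has no spaces or parentheses.
ALIAS_PATTERN = re.compile(r"\b(property_[^\s:()]+):([^\s()]+)")

def expand_aliases(query, languages=LANGUAGES):
    """Rewrite alias:value into an OR over the per-language fields."""
    def expand(match):
        field, value = match.group(1), match.group(2)
        variants = [f"{field}__:{value}"]
        variants += [f"{field}_{lang}:{value}" for lang in languages]
        return "(" + " OR ".join(variants) + ")"
    return ALIAS_PATTERN.sub(expand, query)

expanded = expand_aliases("title:solr AND property_XWiki.TagClass_tags:news")
```

Fields that do not match the `property_` pattern, such as `title` in this example, are left untouched, so existing static aliases keep working as before.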

...

On 2013/10/14, Marius Dumitru Florea <mariusdumitru.florea(a)xwiki.com> wrote:

I started writing http://dev.xwiki.org/xwiki/bin/view/Design/SolrSchema . I need help with two things:

* test cases http://dev.xwiki.org/xwiki/bin/view/Design/SolrSchema#HTestCases
* if time permits, review the proposal, especially http://dev.xwiki.org/xwiki/bin/view/Design/SolrSchema#HAMixedApproach .

Thanks, Marius

Paul Libbrecht

8:01 a.m.

Hello Marius,

...

I would suggest to generate the schema and config, reloading every time there's a class change.

That would mean re-indexing everything, right? It would take too much time.

No, not in most cases. A Lucene index is "just" a heap of "terms". If you change the schema by adding a new field, the impact on the index is zero. If you rename a field, you need to reindex. If you change the type of a field (or its analyzer), then you have to reindex. If you delete a field, you leave some dirt behind; you'd have to reindex only if you later reuse that field name.
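These rules can be written down as a small decision helper in Python. The change names (`add`, `rename`, `change_type`, `change_analyzer`, `delete`) and the function signature are invented for this sketch; only the rules themselves come from the message above.

```python
# Whether each kind of schema change forces a reindex on its own.
REINDEX_RULES = {
    "add": False,
    "rename": True,
    "change_type": True,
    "change_analyzer": True,
    "delete": False,  # leaves stale terms behind, harmless until the name is reused
}

def needs_reindex(changes, deleted_earlier=frozenset()):
    """Decide whether a list of (kind, field_name) schema changes needs a reindex.

    deleted_earlier holds field names that were deleted in the past without a
    reindex; adding one of them again means reusing a "dirty" field name.
    """
    for kind, field in changes:
        if REINDEX_RULES[kind]:
            return True
        if kind == "add" and field in deleted_earlier:
            return True
    return False
```

Under these rules, a class change that only adds properties (the common case when a new XClass property appears) would not force a full reindex.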

...

I am not sure it's best practice, but as an application developer, I would enjoy if this code was in a Groovy page. paul

Marius Dumitru Florea

28 Nov

4:40 p.m.

Here's a short summary of what I implemented in the end:

* I'm using an encoding scheme similar to URL-encoding to support special characters in field names. I didn't use URL-encoding directly because '+' (plus) and '%' (percent) have special meaning in the Solr query syntax. Also, I didn't want to encode Unicode letters. E.g. "Somé Spâce.Bob's Claß" is encoded as "Somé$20Spâce.Bob$27s$20Claß".

* I wanted to be able to extract the class and property reference from a field name in order to display the location where the search text has been found. I couldn't use the default class / property reference serialization syntax because '\' and '^' have special meaning in the Solr query syntax. So I implemented a simple serialization syntax that uses only '.' as entity separator, and the dot is escaped by repeating it. E.g. "wiki:Some\.Space.My\.Class^color" is serialized as "wiki.Some..Space.My..Class.color".

* I added the following fields to a document's index:

    object : all types of objects found on the indexed document
    object.Space.Class : collects values from all Space.Class properties
    property.Space.Class.propName : indexes the values of Space.Class^propName (multiple values if there are multiple objects of type Space.Class)

* object.* and property.* are multilingual fields, so they are indexed in multiple languages. I added support for dynamic aliases (for dynamic fields) so we can write

    object:Blog.BlogPostClass AND property.Blog.BlogPostClass.title:text AND object.XWiki.TagClass:news

  and it will be expanded into

    object:Blog.BlogPostClass AND (property.Blog.BlogPostClass.title_en:text OR property.Blog.BlogPostClass.title_fr:text OR ...) AND (object.XWiki.TagClass_en:news OR object.XWiki.TagClass_fr:news OR ...)

NOTE: Solr doesn't support dynamic fields as default fields, i.e. as fields that are matched when you search for free text (without field:value in the query). This is not a problem for the search results, as dynamic fields like object.* and property.* are copied and aggregated in 'objcontent', which is a default field. The issue is that we can't know exactly which XClass property was matched; we just know that the free search text was found inside an object.

WDYT? I can still make adjustments before 5.3 final if you think something is wrong.

Thanks, Marius
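The two naming schemes from the summary can be approximated in Python as below. This is a guess at the rules based only on the two examples given: the set of characters left unencoded (letters, digits, '.' and '_'), the use of UTF-8 bytes for the `$XX` codes, and all function names are assumptions, not the actual XWiki implementation.

```python
def encode_field_name(name):
    """Encode special characters as $XX (hex of each UTF-8 byte), keeping letters."""
    out = []
    for ch in name:
        if ch.isalnum() or ch in "._":
            out.append(ch)  # Unicode letters and digits are kept as they are
        else:
            out.extend(f"${byte:02X}" for byte in ch.encode("utf-8"))
    return "".join(out)

def serialize_reference(parts):
    """Join reference parts with '.', escaping a dot inside a part by doubling it."""
    return ".".join(part.replace(".", "..") for part in parts)

def parse_reference(text):
    """Inverse of serialize_reference for parts that do not start or end with a dot."""
    parts, current, i = [], [], 0
    while i < len(text):
        if text[i] == ".":
            if i + 1 < len(text) and text[i + 1] == ".":
                current.append(".")  # doubled dot: a literal dot inside a part
                i += 2
                continue
            parts.append("".join(current))  # single dot: separator between parts
            current = []
        else:
            current.append(text[i])
        i += 1
    parts.append("".join(current))
    return parts
```

For the examples in the summary, `encode_field_name("Somé Spâce.Bob's Claß")` gives "Somé$20Spâce.Bob$27s$20Claß", and `serialize_reference(["wiki", "Some.Space", "My.Class", "color"])` gives "wiki.Some..Space.My..Class.color", which `parse_reference` turns back into the four parts. Names that begin or end with a dot would be ambiguous in this simple parser.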

...


participants (6)

Eduard Moraru
Guillaume Lerouge
Ludovic Dubost
Marius Dumitru Florea
Paul Libbrecht
Thomas Mortagne