Here's a short summary of what I implemented in the end:
* I'm using an encoding scheme similar to URL-encoding to support
special characters in field names. I didn't use URL-encoding directly
because '+' (plus) and '%' (percent) have special meaning in Solr
query syntax. Also, I didn't want to encode Unicode letters.
E.g. "Somé Spâce.Bob's Claß" is encoded as
"Somé$20Spâce.Bob$27s$20Claß"
* I wanted to be able to extract the class and property reference from
a field name in order to display the location where the search text
has been found. I couldn't use the default class / property reference
serialization syntax because '\' and '^' have special meaning in the
Solr query syntax. So I implemented a simple serialization syntax that
uses only '.' as entity separator and the dot is escaped by repeating
it.
E.g. "wiki:Some\.Space.My\.Class^color" is serialized as
"wiki.Some..Space.My..Class.color"
* I added the following fields to a document's index:
  object : all types of objects found on the indexed document
  object.Space.Class : collects values from all Space.Class properties
  property.Space.Class.propName : indexes the values of
    Space.Class^propName (multiple values if there are multiple objects
    of type Space.Class)
* object.* and property.* are multilingual fields so they are indexed
in multiple languages. I added support for dynamic aliases (for
dynamic fields) so we can write
object:Blog.BlogPostClass AND property.Blog.BlogPostClass.title:text
AND object.XWiki.TagClass:news
and it will be expanded into
object:Blog.BlogPostClass AND
(property.Blog.BlogPostClass.title_en:text OR
property.Blog.BlogPostClass.title_fr:text OR ...) AND
(object.XWiki.TagClass_en:news OR object.XWiki.TagClass_fr:news OR
...)
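
The alias expansion above can be sketched roughly as follows. This is
only an illustration in Python that rewrites the query string with a
regular expression; the approach discussed in this thread works on
parsed query objects inside a Solr query parser instead. The function
name expand_aliases, the pattern and the language list are assumptions
made for the example, and quoted phrases or range queries are not
handled.

```python
import re

# Hypothetical pattern: a field starting with "object." or "property.",
# followed by ':' and a single unquoted term. The plain "object" field
# (no dot after it) is deliberately left alone.
ALIAS = re.compile(r"(?<![\w.])((?:object|property)\.[^\s:()]+):([^\s()]+)")


def expand_aliases(query, languages):
    """Expand each multilingual alias into an OR over language-specific fields."""

    def replace(match):
        field, value = match.group(1), match.group(2)
        clauses = [f"{field}_{lang}:{value}" for lang in languages]
        return "(" + " OR ".join(clauses) + ")"

    return ALIAS.sub(replace, query)


query = (
    "object:Blog.BlogPostClass AND "
    "property.Blog.BlogPostClass.title:text AND "
    "object.XWiki.TagClass:news"
)
print(expand_aliases(query, ["en", "fr"]))
```

With the languages "en" and "fr" this produces the expanded form shown
above, minus the trailing "..." for further languages.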
NOTE: Solr doesn't support dynamic fields as default fields, i.e. as
fields that are matched when you search for free text (without
field:value in the query). This is not a problem for the search
results, as dynamic fields like object.* and property.* are copied and
aggregated in 'objcontent' which is a default field. The issue is that
we can't know exactly which XClass property was matched; we just know
that the free search text was found inside an object.
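
For reference, the field name encoding and the reference serialization
from the first two points can be sketched like this. It is a minimal
Python illustration, not the actual implementation: the function
names, the set of characters left unencoded (letters, digits, '.' and
'_') and the use of UTF-8 bytes for the '$XX' escapes are assumptions
chosen so that the two examples above come out as written.

```python
def encode_field_name(name):
    """URL-like encoding that uses '$' instead of '%' and keeps Unicode letters."""
    out = []
    for ch in name:
        if ch.isalnum() or ch in "._":
            out.append(ch)  # assumed safe: letters, digits, '.', '_'
        else:
            # Everything else (including '$' itself) becomes $XX per UTF-8 byte.
            out.extend(f"${byte:02X}" for byte in ch.encode("utf-8"))
    return "".join(out)


def decode_field_name(encoded):
    """Inverse of encode_field_name."""
    raw = bytearray()
    i = 0
    while i < len(encoded):
        if encoded[i] == "$":
            raw.append(int(encoded[i + 1:i + 3], 16))
            i += 3
        else:
            raw.extend(encoded[i].encode("utf-8"))
            i += 1
    return raw.decode("utf-8")


def serialize_reference(parts):
    """Join entity names with '.', escaping a literal dot by doubling it."""
    return ".".join(part.replace(".", "..") for part in parts)


def parse_reference(serialized):
    """Split on single dots; a doubled dot is a literal dot in the name."""
    parts, current, i = [], [], 0
    while i < len(serialized):
        if serialized[i] == ".":
            if i + 1 < len(serialized) and serialized[i + 1] == ".":
                current.append(".")  # escaped dot
                i += 2
            else:
                parts.append("".join(current))  # entity separator
                current = []
                i += 1
        else:
            current.append(serialized[i])
            i += 1
    parts.append("".join(current))
    return parts


print(encode_field_name("Somé Spâce.Bob's Claß"))
print(serialize_reference(["wiki", "Some.Space", "My.Class", "color"]))
```

This naive parser reads a run of dots from left to right, so an entity
name that starts with a dot would not round-trip; the sketch makes no
attempt to handle that case.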
WDYT? I can still make adjustments before 5.3 final if you think
something is wrong.
Thanks,
Marius
On Fri, Nov 15, 2013 at 9:01 AM, Paul Libbrecht <paul(a)hoplahup.net> wrote:
Hello Marius,
I would suggest generating the schema and config, reloading every time there's a class
change.
That would mean re-indexing everything, right? It would take too much time.
Not in most cases.
A Lucene index is "just" a heap of "terms".
If you change the schema by adding a new field, the impact on the index is zero.
If you rename a field, you need to reindex.
If you change the type of a field (or its analyzer) then you have to reindex.
If you delete a field, you leave some dirt behind; you'd have to reindex if you reuse
this field name.
I believe that the query-expansion step, from title:x to title-en:x title-fr:x, etc., is
best controlled early so that applications can change it somehow. In Curriki, this is done
with a custom query component which uses the query parser (with a default field which does
not exist) and then rewrites the query objects (which is a fairly easy game).
That's actually what I'm currently investigating. I'll try to extend
the ExtendedDismaxQParserPlugin, let it do its query parsing and then
expand the query with more query objects when the "field" name matches
some pattern (e.g. property_*)
I am not sure it's best practice, but as an application developer, I would enjoy it if
this code were in a Groovy page.
paul