Hi Paul,
Thanks for your comments. I hope Savitha can chime in too at some point :)
See below.
On Aug 29, 2012, at 11:29 AM, Paul Libbrecht <paul(a)hoplahup.net> wrote:
Hello community,
here are my guts responses, I have not synched with Savitha.
Le 28 août 2012 à 14:04, Vincent Massol a écrit :
Search-api:
* Is the Search API supposed to be independent of SOLR?
I think this can be aimed at but it does not mean it will find a valid implementation for
each function.
* Search interface is strange, it has
implementation details such as: getImplementation(), initialize(),
These two methods are not implementation details to me.
Maybe initialize() is a natural thing to happen outside in the component lifecycle, so it
should be hidden.
It definitely should. The fact that such implementation needs some initialization and how
the initialization is done is completely implementation-dependent.
Did you mean getImplementation() should rather be a
cast?
The interface should have absolutely no knowledge if its implementation. When you use a
given implementation you know which one you're using.
BTW I've done a search and it's not used anywhere which proves it's not needed
;)
* It also has
other concerns such as getStatus(), getStatusAsJson(),
These are useful!
I never said they were not.
I am not sure a unified UI is possible but if yes,
they have to be there.
You're now talking about UI. But we're here talking about API not UI. This is
definitely the wrong place and wrong module for getStatusAsJson(). What if I want XML?
What you need is somewhere to return a Status object and then have a StatusSerializer
interface and then a JSONStatusSerializer implementation, or something like that.
Now re the getStatus() method, it feels wrong because this is stateful so it shouldn't
be part of the Search interface. What happens is that the Search interface should be used
to call search*() and that should return an object such as SearchResponse.
BTW the implementation also shows it's wrong:
@Override
public Map<String, AbstractDocumentIndexerStatus> getStatus()
{
return indexer.getStatus();
}
This shows that the status is actually the indexer status and if someone needs the status
he should get it from the indexer, not from the Search interface!
> getVelocityUtils(), getSearchRequest()
Same as above, they shouldn't be there.
getVelocityUtils() shouldn't exist. If we need some Velocity-specific code the
architecture we have is to create some Velocity Tool class and register them as Velocity
tools.
* Why do we
need a Search interface? Why not instead use the Query module and introduce a new query
type? (note return List from Query.execute() probably needs to be clarified). Replace
SearchRequest with Query impl
List is definitely a problem, one needs at least something that says the total match
count, the availability and parameters of a previous and next result page and probably
some debug information.
Yes but that's needed too for the Query module so it can be refactored.
I also note that the practice of using strings as
single entry point for queries is bad and that Savitha partially changed it.
It is bad because of the possibility of injecting queries aside which, for examples,
makes it so that there's no way in the current lucene plugin to force an added
condition (e.g. to create a UI to query all tasks by adding the query of the existence of
a task object).
I find the Lucene Query object very workable for this; that of Solr as well. I do not
think they are compatible.
Cool, we've also identified we need this for the Query module.
Also, the normal Solr way to do is actually use a
query-parser which will operate such transformations as the stemming of a query (you query
for "arbr" when searching for "arbres" for example). All search
engines I've worked on use a customizable query expansion mechanism where a part of
the user-query is turned into a decomposed query, then expanded into the various fields
(query for title with bigger boost, query for precise matches with bigger boost, allow
matches in other languages…).
Ok all this makes me think we definitely need to implement v2 of the Query module so that
it supports HQL, XWQL and SOLR queries.
I really think this is the right way and that we shouldn't invent a new way of doing
the same thing.
* Naming of
interfaces are a bit strange. For example: BuildIndex; should it be IndexBuilder instead?
What about DeleteIndex, should it be IndexDeleter?
* I don't think we need deleteDocumentIndex(), deleteWikiIndex(), deleteSpaceIndex(),
etc. We need a single deleteEntity(EntityReference reference, EntityType type). Same for
IndexBuilder.
* Why is there a DocumentIndexer interface?
I believe that the reason here is that it is of utmost importance to allow the indexing
process to be customized.
This can be done, on the one hand, in the Solr schema.
But that is not enough and it is quite common for an application to fetch related data
(for example to fetch the quality from an external source, for example to grab related
documents) and index it along. Thus an interface where, for each page, added data can be
injected sounds like a strict requirement to me.
That's not my question. I'm talking design here. I was asking why we need the
DocumentIndexer interface when we already can do the same thing with the other interfaces
such as IndexBuilder, IndexDeleter, etc.
Why is a
Document different from other entities? For ex I can see DocumentIndexer.deleteIndex() why
not IndexDeleter.deleteEntity(documentRef)?
* Why is there a need for RebuildIndex (which I assume is IndexRebuilder) and why cannot
we use the IndexBuilder?
An index Rebuilder object is necessary because index-rebuilding may take several days. It
needs to display a monitoring status (e.g the queue size) and probably added actions.
And why isn't this the case for an IndexBuilder which has exactly the same amount of
work when you start for the first time?
* Why the need
for SearchIndex?
Search-solrj:
* solrj server in embedded mode is used.
This should be flexible, either embedded or not.
Well that's my question because using solrj in non embedded mode is more complicated
to integrate and is not covered ATM. I'd like to see it covered in the Architecture
and define how it would integrate with XWiki Platform.
*
Shouldn't use system property but the xwiki configuration instead for the solrj home
(see below in misc)
A challenge, but hopefully responded.
That's easy.
[...]
(responded elsewhere)
Misc:
* all API to review and improve/stabilize
* typos to fix
* licenses to fix
* pom to fix
Maybe you can be more precise?
There are lots of issues which is why I didn't specify them. Let's say this is
just a reminder to review them.
Let me give you just one. In one pom file I can see:
<?xml version="1.0" encoding="UTF-8"?>
<!-- * * See the NOTICE file distributed with this work for additional *
information regarding copyright ownership. * * This is free software; you
can redistribute it and/or modify it * under the terms of the GNU Lesser
General Public License as * published by the Free Software Foundation; either
version 2.1 of * the License, or (at your option) any later version. * *
This software is distributed in the hope that it will be useful, * but WITHOUT
ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS
FOR A PARTICULAR PURPOSE. See the GNU * Lesser General Public License for
more details. * * You should have received a copy of the GNU Lesser General
Public * License along with this software; if not, write to the Free * Software
Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA * 02110-1301 USA,
or see the FSF site:
http://www.fsf.org. * -->
The good news is that this won't build at all since we have a license checker. If it
builds it means the build is not correct.
Another example is that there are a lot of stuff to remove from the pom.xml files since
those are already specified in the xwiki top level pom which these modules should depend
on.
Thanks
-Vincent
thanks for your comments.
Paul