On 08/05/2011 03:11 AM, Ludovic Dubost wrote:
2011/8/3 Caleb James DeLisle
<calebdelisle(a)lavabit.com>:
On 08/03/2011 04:53 AM, Ludovic Dubost wrote:
Hi Caleb,
Exciting news indeed. This looks great and there seem indeed to be
quite a few things already working.
I have one question concerning the mention of multiple XWiki nodes
in different locations connecting to multiple Cassandra nodes. This
would also mean some tweaking in the XWiki cache, or a new "cluster"
mode which allows WAN communication between instances.
Indeed the xwiki cluster code would need to be able to invalidate cache entries over a
WAN connection.
The Cassandra code already supports operating over a WAN and it is "eventually
consistent" which in practice means all nodes are up to date within a few seconds.
Otherwise you could be editing or viewing an older version than what's
really in the Cassandra store.
Have you looked at this already? If you have touched the XWiki cache,
maybe that's why you have performance issues. It is important to
cache the XWikiPreferences document, as it is highly requested. One of
the things I did on the Google Store work a while ago was to add a
special additional cache in the XWikiContext which made sure we didn't
keep re-checking the MemCache that contained the most recent version
number of each XWiki document. This allowed decent performance. Only
the first access to a document in a given HTTP request would trigger a
version number verification.
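A rough sketch of that idea (all names hypothetical, nothing from the actual code):

    import java.util.HashMap;
    import java.util.Map;

    // Per-request cache: the first access to a document within an HTTP
    // request asks the central store (MemCache in the Google work) for its
    // latest version number; later accesses in the same request reuse it.
    public class RequestVersionCache {
        interface VersionStore { long latestVersion(String docFullName); }

        private final Map<String, Long> seen = new HashMap<>();
        private final VersionStore central;

        public RequestVersionCache(VersionStore central) { this.central = central; }

        public long versionOf(String docFullName) {
            // only the first lookup per request hits the central store
            return seen.computeIfAbsent(docFullName, central::latestVersion);
        }
    }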
I have not needed to make any changes to the cache; caching in the context is an
interesting idea.
If you have a normal cache you don't need it. You need this if you
have another central service which acts as the absolute authority on
the "right" version of an XWikiDocument. This is what I did with
Google's cache service, which was fast enough for that but not fast
enough to be a "real" XWiki cache. With Cassandra, maybe it is
Cassandra itself that would do that. You will need this if Cassandra
nodes cannot guarantee to be up to date right away after an update. In
that case cache invalidation won't really work, so you'll need a cache
that has some "expiration time" and checks the Cassandra database
from time to time and for important operations (before modifications).
Cassandra is by its nature "eventually consistent" so it doesn't promise to be
up to date right away. Since "eventually" in practice usually means within a few
seconds, I judged Cassandra to be acceptable for XWiki. I think cache
invalidation over a WAN could be done the same way as long as TCP connections
were used and nodes flushed their caches if the TCP connection broke.
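Something along these lines (purely a sketch, names made up):

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.Socket;

    // Flush-on-disconnect invalidation: each line from the peer names a
    // document to evict; if the TCP connection breaks we may have missed
    // invalidations, so the only safe move is to flush the whole cache.
    public class InvalidationListener implements Runnable {
        interface DocumentCache { void evict(String docFullName); void flushAll(); }

        private final Socket peer;
        private final DocumentCache cache;

        public InvalidationListener(Socket peer, DocumentCache cache) {
            this.peer = peer;
            this.cache = cache;
        }

        public void run() {
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(peer.getInputStream()))) {
                String docFullName;
                while ((docFullName = in.readLine()) != null) {
                    cache.evict(docFullName); // a remote node changed this document
                }
            } catch (IOException e) {
                // fall through: the connection is gone
            }
            cache.flushAll(); // conservative: trust nothing once the link drops
        }
    }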
What I would most like to do is follow the line
of execution and find the biggest costs and mitigate them.
This would mean patches which could be merged back into master.
The system is quite fast when run on my desktop (which has a lot of RAM) so the
slowness seems to come from the resource constraints of the server.
What are you running it on ? XWiki SAS can provide you with a decent
VM to run the service if you think it's resource related.
I'm using an OpenVZ VPS with 1 gigabyte of RAM. It seems to be memory constrained
with two JVM processes running (Cassandra doesn't seem to start if run in the same
process as XWiki; to be investigated further), but I am hesitant to ask for more
since that is a lot of memory and I think we can do better.
I was wondering how you handle the queries used by the XWiki core and
the default XAR application ? In the end I believe we need to move
core queries to XWQL to have compatibility across stores.
What I have been doing so far is using named queries which I can implement both in HQL
and JDOQL. Whether the stores can be made to match up well enough that complex
queries can be dragged and dropped from one to the other, I can't say.
So right now there are some IFs to have the queries in JDOQL instead of HQL ?
No, the query executor is now chosen based on the selected main store; the JDOQL
query executor has a config file with its own set of named queries which share
the same names as the ones defined for the Hibernate store.
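For example (illustrative names and query text, not the actual config entries), a
"getSpaces" query might be registered once per executor, so callers never see the
dialect:

    // Illustrative only; the real named queries live in per-store config files.
    public class NamedQueries {
        // Hibernate executor's entry for "getSpaces":
        static final String GET_SPACES_HQL =
            "select distinct doc.space from XWikiDocument doc";

        // JDOQL executor's entry, registered under the same name:
        static final String GET_SPACES_JDOQL =
            "SELECT DISTINCT this.space FROM com.xpn.xwiki.doc.XWikiDocument";
    }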
I saw that wiki macros don't seem to work. This must be because of
missing objects queries.
The macros do seem to be working (I did some work on the queries used) but the list of
spaces was rather lonely with a broken tag cloud and activity stream so I removed it.
Right.. it's the tag cloud and activity stream that fail..
In terms of priorities I believe the following are important:
- assessment of which default XE features are not working and what it
would require to make them work (this would allow us to "define" what a
Cassandra XE version would be)
I think the general answer to this is: things which do not rely heavily on complex
queries, such as history, activity stream, and permissions, should be easy, while
things which do, such as some of the applications, will be difficult.
- basic XWQL querying with queries on objects
The next big step is going to be patching the store code so that it takes advantage of
NoSQL's flexibility in adding columns to single rows as a means of storing and
querying structured data without first knowing what the structure will be.
This is obviously not supported currently since DataNucleus attempts to support all
different data stores and this would simply not be possible with a relational store.
What does this mean? You need to write some non-DataNucleus code to
handle that? Does this invalidate the DataNucleus approach or just
require some "additional" specific stuff on top of using DataNucleus?
We would have to do that specific work also for other types of
storage using DataNucleus, right?
It means allowing queries which would otherwise only be possible if the schema
was more normalized. With a traditional RDBMS, if you want to have a document
which contains objects, you would have to put the objects into a different table
and join them. NoSQL stores make storing nested data trivial because you can just
convert each property of a nested object into a property of the document itself.
This means loading a document would require only one database lookup instead of
many; in a system where parts of the database might be physically far from the
wiki engine, this will be critical to performance.
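Concretely (the helper names are made up, and the HQL line is only roughly what one
would need):

    import java.util.Map;

    // With an RDBMS the objects live in their own table, so fetching a
    // document plus its objects means a join or a second query, e.g. in HQL:
    //
    //     select obj from BaseObject obj where obj.name = :docFullName
    //
    // With the single-row layout every object property is a column on the
    // document's own row, so one key lookup returns everything:
    public class SingleRowLoad {
        interface RowStore { Map<String, byte[]> readRow(String rowKey); }

        static Map<String, byte[]> loadDocument(RowStore store, String fullName) {
            return store.readRow(fullName); // one round trip, e.g. "Main.WebHome"
        }
    }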
As far as what it will require, I hope it can be done by modifying only the
Cassandra plugin. Even if not, the queries will all still be quite valid; they
will just be run against data which is stored differently.
Different NoSQL stores are not like different RDBMSs: they represent entirely
different philosophies from one another, so some tinkering to make them fit is
inevitable even without this change. The most compelling other store we would want
for this is the plain old RDBMS, and in that case we would be forced to use joins.
Unfortunately, most of the plugins which are provided by DataNucleus treat the storage
engine as a disk drive: they just load data into DataNucleus and let all querying be
done in memory, at a tremendous performance cost.
Caleb
Ludovic
Caleb
- history
- permissions
Also at some point
- performance comparison with a very large number of documents / very high load
Great stuff in any case..
Ludovic
2011/8/2 Caleb James DeLisle <calebdelisle(a)lavabit.com>:
I have an instance of XWiki finally running on
Cassandra.
http://kk.l.to:8080/xwikiOnCassandra/
Cassandra is a "NoSQL" database, unlike a traditional SQL database it cannot do
advanced queries but it can store data in a more flexible way eg: each row is like a
hashtable where additional "columns" can be added at will.
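A way to picture that data model (all names made up):

    import java.nio.charset.StandardCharsets;
    import java.util.Map;
    import java.util.TreeMap;

    // Minimal picture of the data model: a column family behaves like a map
    // from row key to a sorted map of column name -> value, and two rows
    // need not share the same columns.
    public class ColumnFamilyPicture {
        public static void main(String[] args) {
            Map<String, Map<String, byte[]>> documents = new TreeMap<>();

            Map<String, byte[]> home = new TreeMap<>();
            home.put("content", utf8("Welcome to the wiki"));
            documents.put("Main.WebHome", home);

            // another row grows an extra column at will, no schema change:
            Map<String, byte[]> user = new TreeMap<>();
            user.put("content", utf8("User profile"));
            user.put("property:age", utf8("30"));
            documents.put("XWiki.Caleb", user);
        }

        static byte[] utf8(String s) { return s.getBytes(StandardCharsets.UTF_8); }
    }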
The most important feature of Cassandra is that multiple Cassandra nodes can be connected
together into potentially very large "swarms" of nodes which reside in different
racks or even data centers continents apart, yet all of them represent the same database.
Cassandra was developed by Facebook and their swarm was said to be over 200 nodes
strong.
In its application with XWiki, each node can have an XWiki engine sitting on top of
it and users can be directed to the geographically closest node or to the node which is
most likely to have a cache of the page which they are looking for.
Where a traditional cluster is a group of XWiki engines sitting atop a single MySQL
engine, this allows for a group of XWiki engines to sit atop a group of Cassandra engines
in a potentially very scalable way.
In a cloud setting, one would either buy access to a provided NoSQL store such as
Google's BigTable or set up a number of XWiki/Cassandra stacks in a less
managed cloud such as Rackspace's or Amazon's.
How it works:
XWiki objects in the traditional Hibernate based storage engine are persisted by breaking
them up into properties which are then joined again when the object is loaded.
A user object which has a name and an age will occupy a row in each of three tables, the
xwikiobjects table, the xwikistrings table, and the xwikiintegers table.
The object's metadata will be in the xwikiobjects table while the name will be in a
row in the xwikistrings table and the age, a number, will go in the xwikiintegers table.
The NoSQL/DataNucleus based storage engine does this differently: the same object only
occupies space in the XWikiDocument table, where it takes advantage of Cassandra's
flexibility by simply adding a new column for each property.
NOTE: this is not fully implemented yet, objects are still stored serialized.
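To make the intended layout concrete (the Hibernate column names are abbreviated from
the real mapping; the "obj:class:number:property" Cassandra naming is only a sketch of
the plan, per the NOTE above):

    import java.util.Map;
    import java.util.TreeMap;

    // The same user object in both layouts. Hibernate side spans three tables,
    // reassembled on load:
    //
    //   xwikiobjects:  XWO_ID=42, XWO_CLASSNAME='XWiki.XWikiUsers', XWO_NUMBER=0
    //   xwikistrings:  XWS_ID=42, XWS_NAME='name', XWS_VALUE='Caleb'
    //   xwikiintegers: XWI_ID=42, XWI_NAME='age',  XWI_VALUE=30
    //
    // Intended Cassandra side: extra columns on the document's own row,
    // with a hypothetical naming scheme:
    public class PropertyColumns {
        static Map<String, String> documentRow() {
            Map<String, String> row = new TreeMap<>();
            row.put("obj:XWiki.XWikiUsers:0:name", "Caleb");
            row.put("obj:XWiki.XWikiUsers:0:age", "30");
            return row;
        }
    }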
What works
* Document storage
* Classes and Objects
* Attachments
* Links and Locks
* Basic querying with JDOQL (a sketch follows this list)
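For example, the flavor of JDOQL query that works now (the query itself is
illustrative):

    import java.util.Collection;
    import java.util.HashMap;
    import java.util.Map;
    import javax.jdo.PersistenceManager;
    import javax.jdo.Query;

    // Illustrative JDOQL against the real XWikiDocument class: fetch every
    // document in a given space.
    public class JdoqlExample {
        static Collection<?> docsInSpace(PersistenceManager pm, String space) {
            Query q = pm.newQuery(
                "SELECT FROM com.xpn.xwiki.doc.XWikiDocument WHERE space == :space");
            Map<String, Object> params = new HashMap<>();
            params.put("space", space);
            return (Collection<?>) q.executeWithMap(params);
        }
    }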
What doesn't work
* Querying inside of objects
* JPQL/XWQL queries
* Document history
* Permissions (requires unimplemented queries)
* The feature you want
I am interested in what the community thinks is the first priority. I can work on
performance, which will likely lead to patches being merged into master and will
benefit everyone, or I can work on more features, which will benefit people who want
to use XWiki as a traditional application wiki but on top of Cassandra.
You can reply here or add comments to the wiki ;)
Caleb
_______________________________________________
devs mailing list
devs(a)xwiki.org
http://lists.xwiki.org/mailman/listinfo/devs