Hi Clément,
2018-06-29 13:34 GMT+02:00 Clément Aubin <aubincleme(a)gmail.com>:
Hi,
With the complexity of the additive filtering system, it's kind of
normal to see OOM errors. AFAIR we already talked about that a year ago,
when looking at how we could implement this feature.
We could try to improve the situation a bit by "optimizing" the filters.
For example, we can currently end up with filters like these:
- filter events about SpaceA.PageB
- filter events about SpaceA.PageC
- filter events about SpaceA.
Obviously the first two filters can be removed, since the third one
already does their job. I've tried to minimize the number of unnecessary
filters by not creating a new one when the page is already covered by a
wider filter, but that won't fix existing filters.
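To make the idea concrete, here is a rough sketch of such an optimizer
over dotted location strings (a toy model: the real filter preferences
are stored as XWiki objects, so all names here are invented):

```java
import java.util.List;
import java.util.stream.Collectors;

// Toy filter optimizer: drop any page filter that is already covered by
// a wider space filter. Locations are dotted references, and a trailing
// dot (e.g. "SpaceA.") stands for "the whole space".
public class FilterOptimizer {
    public static List<String> optimize(List<String> locations) {
        return locations.stream()
            .filter(loc -> locations.stream()
                .noneMatch(other -> !other.equals(loc)
                    && other.endsWith(".")       // a space-wide filter...
                    && loc.startsWith(other)))   // ...already covering loc
            .collect(Collectors.toList());
    }
}
```

With the three filters above, optimize(List.of("SpaceA.PageB",
"SpaceA.PageC", "SpaceA.")) keeps only "SpaceA.".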
Another example is the watchlist bridge, which simply takes the list of
watched pages as it was in the previous version of XWiki and translates
it into the same number of filters. This may explain many of the OOM
problems. It is made worse by the fact that we had (and still have) the
"autowatch" mode enabled by default.
So a filter optimizer could be a great help.
On 06/29/2018 05:41 AM, Sergiu Dumitriu wrote:
The two main problems that I see are that
you're mixing two separate
things:
1. Tags which are stored in one place, and events which are stored in a
different place
2. Again tags, and document locations. They may seem similar, but they
are completely different as they are implemented. That's why saying "not
in this wiki, except for this tag" cannot ever be implemented in a sane
way since the exception is on a different plane.
The logical approach would be to also store the tags in the events, but
this is not a good thing to do. Where do we stop? For future-proofing
filters, we would have to store the entire document in the event, in
case someone wants to add a filter on "documents with a specific type of
object attached", or "documents with a specific value for a specific
property of a specific object", or "documents with attachments ending in
.extension".
On 06/28/2018 04:44 PM, Guillaume Delhumeau wrote:
> For the tags filter, I can also:
>
> * perform a query to fetch all documents that correspond to the given tags
>   (it doesn't need to be permanent, but it would be better to cache it)
> * add an HQL filter on these pages (OR document IN :list_of_pages).
>
> It's a little variation of solution A. It's ugly but it could do the job.
It's the sanest proposal so far, but still impossible.
Since "not in LOCATION except WITH TAG" doesn't work because it mixes
two types of information in the same query fragment, how about making
this a one-dimensional query as "only WITH TAG". And since this
information is in a different place, pre-querying it is needed. However,
if there are thousands or tens/hundreds of thousands of documents with a
tag, won't the query crash anyway?
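A rough sketch of that pre-query approach, assuming we first query the
documents carrying the tag and then inject a plain IN clause (names are
illustrative, not the real XWiki query API). The size concern is visible
immediately: the injected clause grows linearly with the number of
tagged documents:

```java
import java.util.List;
import java.util.stream.Collectors;

// Toy version of "pre-query the tagged documents, then inject an IN
// clause". The list would come from a separate query on the tag objects.
public class TagPreQuery {
    public static String inClause(String field, List<String> taggedDocs) {
        if (taggedDocs.isEmpty()) {
            // No document carries the tag: match nothing instead of
            // emitting "IN ()", which is invalid SQL.
            return "1 = 0";
        }
        String quoted = taggedDocs.stream()
            .map(d -> "'" + d.replace("'", "''") + "'")
            .collect(Collectors.joining(", "));
        return field + " IN (" + quoted + ")";
    }
}
```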
> 2018-06-27 12:58 GMT+02:00 Guillaume Delhumeau <guillaume.delhumeau(a)xwiki.com>:
>
>> Hi developers.
>>
>> I am trying to add a new filter to the notifications to be able to
>> follow pages that are marked with a given tag. And it leads me to some
>> questions about the technical implementation of the notifications.
>>
>> To give some context: notifications are computed on top of the events
>> recorded by the event stream (a.k.a. activity stream). We take events
>> from the event stream SQL table, apply some transformations on them,
>> and display them to the user.
>>
>> Then we have implemented the ability to filter these events: for
>> example "don't show events concerning the document A or the wiki B".
>> Filters are implemented in 2 distinct ways:
>>
>> 1/ SQL injection: each filter can add SQL elements to the query we
>> make to fetch the events from the event stream table. We made this
>> mechanism so we can let the database do most of the filtering work.
>> After all, that's its job and it's supposed to perform well. To be
>> precise, Clément has even created an Abstract Syntax Tree (AST) so
>> it's easier to inject some content into the query; it also creates an
>> abstraction over the SQL language, so we could even consider changing
>> the storage of the event stream someday.
>>
>> The bad thing is that some complex filters are difficult or even
>> impossible to write in SQL (even with the AST).
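For readers who have not seen that code, the AST idea can be reduced to
something like this (invented names, not the actual notification filter
AST classes): filters produce abstract condition nodes, and a separate
translator renders them, here straight to SQL, so the event storage
could change without breaking existing filters.

```java
// Toy condition AST: each node knows how to render itself to SQL, and a
// different renderer could target another storage backend.
public interface FilterNode {
    String toSql();

    static FilterNode notEqual(String field, String value) {
        return () -> field + " <> '" + value + "'";
    }

    static FilterNode and(FilterNode left, FilterNode right) {
        return () -> "(" + left.toSql() + " AND " + right.toSql() + ")";
    }
}
```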
>>
>> 2/ Post-filtering: after the events have been fetched from the
>> database, each filter can still decide to keep or reject them. This is
>> useful for complex filtering that cannot be expressed in SQL. It is
>> also needed by the real-time notification email sender, because it
>> takes the events immediately when they occur, without fetching them
>> from the database (so SQL filters are bypassed).
>>
>> The bad thing is that some events are loaded into memory only to be
>> rejected, and these filters can perform costly operations such as
>> loading documents.
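The post-filtering side boils down to running every remaining filter
against each fetched event in memory, which is where the cost comes from
(a toy model with predicates; the real filters are components that may
load documents, and "Event" is a stand-in record):

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Toy post-filtering pass: keep an event only if every filter accepts it.
public class PostFiltering {
    public record Event(String wiki, String page) {}

    public static List<Event> apply(List<Event> fetched,
            List<Predicate<Event>> filters) {
        return fetched.stream()
            .filter(event -> filters.stream().allMatch(f -> f.test(event)))
            .collect(Collectors.toList());
    }
}
```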
>>
>> Until now, this double mechanism has worked quite well, with each
>> mechanism compensating for the weaknesses of the other.
>>
>> However, we still have technical limitations in our design:
>>
>> 1/ Users who have a lot of filter preferences can end up with a giant
>> SQL query that is almost impossible for the database to execute. We
>> actually had a user complaining about an OutOfMemory problem in the
>> HQL to SQL translator!
>>
>> 2/ I cannot implement the tag filter!
>>
>> The tag filter is supposed to show events that concern pages that hold
>> a given tag, EVEN IF THE PAGE WAS EXCLUDED BY THE USER. Example of
>> use-case: "I don't want to receive notifications about wiki A except
>> for pages marked with the tag T".
>>
>> And it is not working. First, because it is difficult to write a SQL
>> query for that: it requires a join with the document and object
>> tables, which our SQL injection mechanism does not support. Even if it
>> were possible, a SQL join with the document table would de facto
>> filter out events that do not concern any page, or that concern pages
>> without objects, so many other filters would be broken. I also ruled
>> out a SQL subquery, since I think the whole query would become too
>> big. So I decided not to inject any SQL code for this filter and to
>> rely only on the post-filtering mechanism.
>>
>> But the other filter, "EXCLUDE WIKI A", generates a SQL injection such
>> as "WIKI <> 'WIKI A'", so the events concerning wiki A are never
>> fetched from the database. Consequence: the tag filter never sees the
>> events it is supposed to keep. It would actually be possible to bypass
>> the other SQL injections by injecting something like "OR 1=1", but
>> doing that amounts to dropping the whole SQL injection mechanism.
>>
>> I see some solutions for this problem:
>>
>> A/ For each tag, create a permanent list of pages that hold it. So I
>> can inject "OR document IN (that_list)". I think this is heavy.

Definitely not. You'll have to maintain this list in parallel with the
canonical storage for tags, the xwikidoc table itself, with all the
headaches that data duplication brings.
>> B/ Drop the SQL injection mechanism and rely only on the
>> post-filtering mechanism. It would require loading A LOT of events
>> from the database, but maybe we could cache this.

Nope, filtering millions of events by inspecting them one by one is a
performance nightmare.
>> C/ Don't drop the SQL injection mechanism completely, but use it as
>> little as possible (for example, do not use it for LOCATION filtering).
>
> Seems hard to determine when a filter should use this feature or not.

I think location filtering should still happen in the query itself,
since it is a lot faster.
>> D/ Don't implement the "tags" filter, since it is the root of the issue.
>
> But it is like sweeping dirt under the carpet!
Well, not implementing something is always the fastest, least intrusive
solution to any problem...
Humor aside, I think that in this case it is the best approach.
>> Since we have the OutOfMemory problem with the SQL injections becoming
>> too huge, I am more in favor of solution B or C. But I'm not sure for
>> now, since I do not know how much it would impact the performance and
>> scalability of the whole notifications feature.
>>
>> This is a complex topic, but I hope this message will inspire some
>> suggestions or point out things I have not seen myself.
>>
>> Thanks for your help,
>> Guillaume
Other options:
E/ Drop SQL altogether, move the event stream into a nosql database, and
do stream filtering; kind of like the B proposal
F/ Drop SQL and querying altogether, switch to a pub-sub mechanism where
a subscriber is created for every user and his filters, and matching
events are placed in a queue until they are all sent (consumed) in a
notification. Obviously, this only works for email notifications, not
for browsing past events in a webpage.
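Option F could look roughly like this (a toy queue per subscriber; in
practice the queue would have to be bounded and persisted, and all names
here are invented):

```java
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;
import java.util.function.Predicate;

// Toy pub-sub: one subscriber per user holds that user's filters and
// queues matching events until the next notification email drains it.
public class PubSub {
    public record Event(String wiki, String page) {}

    public static class Subscriber {
        private final Predicate<Event> filters;
        private final Queue<Event> pending = new ArrayDeque<>();

        public Subscriber(Predicate<Event> filters) {
            this.filters = filters;
        }

        public void publish(Event event) {
            if (filters.test(event)) {
                pending.add(event);
            }
        }

        // All pending events are consumed at once when the email is sent.
        public List<Event> drain() {
            List<Event> sent = List.copyOf(pending);
            pending.clear();
            return sent;
        }
    }
}
```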
Note that the notification filtering system isn't dependent on SQL:
filters implement their conditions through an AST that gets translated
to SQL by the default manager in charge of retrieving the notifications.
This means that switching from SQL to something else shouldn't break
backward compat of the notification filters already developed.
Might be harder than expected in practice, but it's indeed a great thing
that we have this AST.
@Sergiu: (also AFAIR) option E was considered but not implemented, as it
would have taken a lot of time to do; still a nice idea IMO.
A mix of options E and F could be the nicest IMO: adding a NoSQL DB
between the event stream and the UI, so that notifications are picked up
from the event stream, filtered, and put into the NoSQL DB in a separate
thread. This is AFAIK the most widely used solution on large platforms.
However, this solution needs to be considered wisely, as it could
increase the complexity of an XWiki install and/or significantly
increase the size of the distribution artifacts that we produce. (Also,
it takes time to implement, of course.)
Notifications are a complex "science" that would require almost the same
consideration as the search engine :)
Thanks for your help.
Note that we still have no answer for this problem.
The ugly solution "getting all pages that are marked with a given tag
before generating the query for notifications" is already in place in
AS, so at least we won't do worse!
Guillaume