For the tags filter, I can also:
* perform a query to fetch all documents that correspond to the given tags
(it doesn't need to be permanent, but it would be better to cache it)
* add a HQL filter on these pages (OR document IN :list_of_pages).
It's a little variation of solution A. It's ugly but it could do the job.
2018-06-27 12:58 GMT+02:00 Guillaume Delhumeau <
guillaume.delhumeau(a)xwiki.com>gt;:
  Hi developers.
 I am trying to add a new filter to the notifications to be able to follow
 pages
 that are marked with a given tag. And it leads me to some questions about
 the
 technical implementation of the notifications.
 To remind the context: notifications are computed on top of the events
 recorded
 by the event stream (a.k.a activity stream). We take events from the event
 stream SQL table, we apply some transformations on them, and we display
 them to
 the user.
 Then we have implemented the ability to filter on these events: for
 example
 "don't show events concerning the document A nor the wiki B". Filters are
 implemented with 2 distinct ways:
   1/ SQL injections: each filter can add SQL elements in the query we make
 to
      fetch the events from the event stream table. We made this mechanism
 so we
      can let the database do a lot of the filtering process. After all,
 it's its
      job and it's supposed to perform well. To be precise, Clement has
 even
      created an Abstract Syntax Tree (AST) so it's easier to inject some
 content
      in the query and it creates an abstraction over the SQL language so
 we can
      even consider to change the storage of the event stream someday.
      The bad thing is that some complex filtering are difficult to write
 with
      the SQL language (event with the AST) or even impossible.
   2/ Post-filtering: after the events have been fetched from the database,
 each
      filter can still decide to keep or filter them. This is useful for
      complex filtering that cannot be expressed with the SQL language. It
 is
      also needed by the real-time notification email sender, because it
 takes
      the events immediately when they occurs without fetching them in the
      database (so SQL filters are bypassed).
      The bad thing is that some events are loaded in the memory to finally
 be
      rejected, and these filters can perform costly operations such as
 loading
      documents.
 Until now, this double mechanism was working quite well, with each
 mechanism
 filling the lacks of the other.
 However, we still have technical limitations in our design:
   1/ Users who have a lot of filter preferences can end up with a giant
 SQL
      query that is almost impossible to perform by the database. Actually
 we had
      a user complaining about an OutOfMemory problem in the HQL to SQL
      translator !
   2/ I cannot implement the tag filter !
 The tag filter is supposed to show events that concern pages that hold a
 given
 tag, EVEN IF THE PAGE WAS EXCLUDED BY THE USER. Example of use-case: "I
 don't
 want to receive notifications about wiki A except for pages marked with
 the tag
 T".
 And it is not working. First because it is difficult to write a SQL query
 for
 that. It requires to make a join clause with the document and the object
 tables,
 which our SQL injection mechanism does not support. Even if it were
 possible,
 creating a SQL join with the document table will de-facto filter events
 that do
 not concern any pages or pages that do not have any objects: so many other
 filters will be broken. I also don't consider creating a SQL subquery, I
 think
 the whole query would became too big. So I decided to just not inject any
 SQL
 code for this filter and only implement the post-filtering mechanism.
 But the other filter "EXCLUDE WIKI A" generates a SQL injection such as
 "WIKI <> 'WIKI A'" so the events concerning the wiki A are not
fetched
 from the
 database. Consequence: the tag filter never see the events that it is
 supposed
 to keep. It would be actually possible to by-pass the first SQL injections
 by
 injecting something like "OR 1=1". But doing something like this is like
 dropping the all SQL injections mechanism.
 I see some solutions for this problem:
   A/ For each tag, create a permanent list of pages that hold it. So I can
      inject "OR document IN (that_list)". I think this is heavy.
   B/ Drop the SQL injection mechanism and only rely on the post-filtering
      mechanism. It would require to load from the database A LOT of events,
      but maybe we could cache this.
   C/ Don't drop the SQL injection mechanism completely but use it as
 little as
      possible (for example, do not use it for LOCATION filtering). Seems
 hard to
      determine when a filter should use this feature or not.
   D/ Don't implement the "tags" filter, since it is the root of the issue.
 But
      it is like sweeping dirt under the carpet!
 Since we have the OutOfMemory problem with the SQL injections becoming too
 huge,
 I am more in favor of solution B or C. But I'm not sure for now, since I
 do not
 know how much it would impact the performances and the scalability of the
 whole
 notifications feature.
 This is a complex topic, but I hope this message will inspire you some
 suggestions or things I have not seen with my own eyes.
 Thanks for your help,
 Guillaume
 
--
Guillaume Delhumeau (guillaume.delhumeau(a)xwiki.com)
Research & Development Engineer at XWiki SAS
Committer on the 
XWiki.org project