I resend this message that was accidentally sent to Sergiu only.
---------- Forwarded message ----------
From: Guillaume Delhumeau <guillaume.delhumeau(a)xwiki.com>
Date: 2018-07-02 16:06 GMT+02:00
Subject: Re: [xwiki-devs] [Brainstorm] Notifications filters capabilities
and performances.
To: Sergiu Dumitriu <sergiu(a)xwiki.org>
Hi and thanks you for your answers.
2018-06-29 5:41 GMT+02:00 Sergiu Dumitriu <sergiu(a)xwiki.org>rg>:
  The two main problems that I see are that you're
mixing two separate
 things:
 1. Tags which are stored in one place, and events which are stored in a
 different place
 2. Again tags, and document locations. They may seem similar, but they
 are completely different as they are implemented. That's why saying "not
 in this wiki, except for this tag" cannot ever be implemented in a sane
 way since the exception is on a different plane.
 The logical approach would be to also store the tags in the events, but
 this is not a good thing to do. 
Also, a tag could be added to a page AFTER the event has been triggered. In
this case, if you still want to fetch these pages, it won't work.
  Where do we stop? For future-proofing
 filters, we would have to store the entire document in the event, in
 case someone wants to add a filter on "documents with a specific type of
 object attached", or "documents with a specific value for a specific
 property of a specific object", or "documents with attachments ending in
 .extension".
 On 06/28/2018 04:44 PM, Guillaume Delhumeau wrote:
  For the tags filter, I can also:
 * perform a query to fetch all documents that correspond to the given 
 tags
  (it doesn't need to be permanent, but it
would be better to cache it)
 * add a HQL filter on these pages (OR document IN :list_of_pages).
 It's a little variation of solution A. It's ugly but it could do the job. 
 It's the sanest proposal so far, but still impossible.
 
Apparently it's the way it's done on Activity Stream:
https://github.com/xwiki/xwiki-platform/blob/553afe7287dcc4ce8b588276e639bf
283352e805/xwiki-platform-core/xwiki-platform-activitystream/xwiki-platform-
activitystream-ui/src/main/resources/Main/Activity.xml#L1008
 Since "not in LOCATION except WITH TAG" doesn't work because it mixes
 two types of information in the same query fragment, how about making
 this a one-dimensional query as "only WITH TAG". And since this
 information is in a different place, pre-querying it is needed. However,
 if there are thousands or tens/hundreds of thousands of documents with a
 tag, won't the query crash anyway?
 
Sure, it won't scale.
  2018-06-27 12:58 GMT+02:00 Guillaume Delhumeau
<
 guillaume.delhumeau(a)xwiki.com>gt;:
> Hi developers.
>
> I am trying to add a new filter to the notifications to be able to 
 follow
 > pages
> that are marked with a given tag. And it leads me to some questions 
 about
 > the
> technical implementation of the notifications.
>
> To remind the context: notifications are computed on top of the events
> recorded
> by the event stream (a.k.a activity stream). We take events from the 
 event
 > stream SQL table, we apply some
transformations on them, and we display
> them to
> the user.
>
> Then we have implemented the ability to filter on these events: for
> example
> "don't show events concerning the document A nor the wiki B". Filters
 are
 > implemented with 2 distinct ways:
>
>   1/ SQL injections: each filter can add SQL elements in the query we 
 make
 > to
>      fetch the events from the event stream table. We made this 
 mechanism
 > so we
>      can let the database do a lot of the filtering process. After all,
> it's its
>      job and it's supposed to perform well. To be precise, Clement has
> even
>      created an Abstract Syntax Tree (AST) so it's easier to inject some
> content
>      in the query and it creates an abstraction over the SQL language so
> we can
>      even consider to change the storage of the event stream someday.
>
>      The bad thing is that some complex filtering are difficult to write
> with
>      the SQL language (event with the AST) or even impossible.
>
>   2/ Post-filtering: after the events have been fetched from the 
 database,
 > each
>      filter can still decide to keep or filter them. This is useful for
>      complex filtering that cannot be expressed with the SQL language. 
 It
 > is
>      also needed by the real-time notification email sender, because it
> takes
>      the events immediately when they occurs without fetching them in 
 the
 >      database (so SQL filters are bypassed).
>
>      The bad thing is that some events are loaded in the memory to 
 finally
 > be
>      rejected, and these filters can perform costly operations such as
> loading
>      documents.
>
> Until now, this double mechanism was working quite well, with each
> mechanism
> filling the lacks of the other.
>
> However, we still have technical limitations in our design:
>
>   1/ Users who have a lot of filter preferences can end up with a giant
> SQL
>      query that is almost impossible to perform by the database. 
 Actually
 > we had
>      a user complaining about an OutOfMemory problem in the HQL to SQL
>      translator !
>
>   2/ I cannot implement the tag filter !
>
> The tag filter is supposed to show events that concern pages that hold a
> given
> tag, EVEN IF THE PAGE WAS EXCLUDED BY THE USER. Example of use-case: "I
> don't
> want to receive notifications about wiki A except for pages marked with
> the tag
> T".
>
> And it is not working. First because it is difficult to write a SQL 
 query
 > for
> that. It requires to make a join clause with the document and the object
> tables,
> which our SQL injection mechanism does not support. Even if it were
> possible,
> creating a SQL join with the document table will de-facto filter events
> that do
> not concern any pages or pages that do not have any objects: so many 
 other
 > filters will be broken. I also don't
consider creating a SQL subquery, I
> think
> the whole query would became too big. So I decided to just not inject 
 any
 > SQL
> code for this filter and only implement the post-filtering mechanism.
>
> But the other filter "EXCLUDE WIKI A" generates a SQL injection such as
> "WIKI <> 'WIKI A'" so the events concerning the wiki A are
not fetched
> from the
> database. Consequence: the tag filter never see the events that it is
> supposed
> to keep. It would be actually possible to by-pass the first SQL 
 injections
 > by
> injecting something like "OR 1=1". But doing something like this is like
> dropping the all SQL injections mechanism.
>
> I see some solutions for this problem:
>
>   A/ For each tag, create a permanent list of pages that hold it. So I 
 can
 >      inject "OR document IN
(that_list)". I think this is heavy. 
 Definitely not. You'll have to maintain this list in parallel to the
 canonical storage for tags, the xwikidoc table itself, with all the
 headaches that data duplication brings.
 
 
I agree.
 >   B/ Drop the SQL injection mechanism and
only rely on the 
 post-filtering
 >      mechanism. It would require to load from
the database A LOT of 
 events,
 >      but maybe we could cache this. 
 Nope, filtering millions of events by inspecting them one by one is a
 performance nightmare.
 >   C/ Don't drop the SQL injection
mechanism completely but use it as
> little as
>      possible (for example, do not use it for LOCATION filtering). Seems
> hard to
>      determine when a filter should use this feature or not. 
 I think location filtering should still happen in the query itself,
 since it is a lot faster.
 >   D/ Don't implement the "tags"
filter, since it is the root of the 
 issue.
 > But
>      it is like sweeping dirt under the carpet! 
 Well, not implementing something is always the fastest, least intrusive
 solution to any problem...
 Humor aside, I think that in this case it is the best approach.
 
 
Thanks for saying it :)
 > Since we have the OutOfMemory problem with
the SQL injections becoming 
 too
 > huge,
> I am more in favor of solution B or C. But I'm not sure for now, since I
> do not
> know how much it would impact the performances and the scalability of 
 the
   whole
 notifications feature.
 This is a complex topic, but I hope this message will inspire you some
 suggestions or things I have not seen with my own eyes.
 Thanks for your help,
 Guillaume
 
   Other options:
 E/ Drop SQL altogether, move the event stream into a nosql database, and
 do stream filtering; kind of like the B proposal
 
 
That's unfortunately out of the scope right now.
 F/ Drop SQL and querying altogether, switch to a pub-sub mechanism where
 a subscriber is created for every user and his filters, and matching
 events are placed in a queue until they are all sent (consumed) in a
 notification. Obviously, this only works for email notifications, not
 for browsing past events in a webpage.
 
Indeed it cannot replace Activity Stream. However, it was my first design
attempt:
http://design.xwiki.org/xwiki/bin/view/Proposal/
NotificationCenterforAppsImplementation#HIteration1
But then we decided to rely on what we had (the event stream) and to kill 2
birds with the same stones (having notifications + replacing activity
stream).
Thanks,
Guillaume
--
Guillaume Delhumeau (guillaume.delhumeau(a)xwiki.com)
Research & Development Engineer at XWiki SAS
Committer on the 
XWiki.org project
--
Guillaume Delhumeau (guillaume.delhumeau(a)xwiki.com)
Research & Development Engineer at XWiki SAS
Committer on the 
XWiki.org project