For the tags filter, I can also:
* perform a query to fetch all documents that correspond to the given tags
(it doesn't need to be permanent, but it would be better to cache it)
* add a HQL filter on these pages (OR document IN :list_of_pages).
It's a little variation of solution A. It's ugly but it could do the job.
2018-06-27 12:58 GMT+02:00 Guillaume Delhumeau <
guillaume.delhumeau(a)xwiki.com>gt;:
Hi developers.
I am trying to add a new filter to the notifications to be able to follow
pages
that are marked with a given tag. And it leads me to some questions about
the
technical implementation of the notifications.
To remind the context: notifications are computed on top of the events
recorded
by the event stream (a.k.a activity stream). We take events from the event
stream SQL table, we apply some transformations on them, and we display
them to
the user.
Then we have implemented the ability to filter on these events: for
example
"don't show events concerning the document A nor the wiki B". Filters are
implemented with 2 distinct ways:
1/ SQL injections: each filter can add SQL elements in the query we make
to
fetch the events from the event stream table. We made this mechanism
so we
can let the database do a lot of the filtering process. After all,
it's its
job and it's supposed to perform well. To be precise, Clement has
even
created an Abstract Syntax Tree (AST) so it's easier to inject some
content
in the query and it creates an abstraction over the SQL language so
we can
even consider to change the storage of the event stream someday.
The bad thing is that some complex filtering are difficult to write
with
the SQL language (event with the AST) or even impossible.
2/ Post-filtering: after the events have been fetched from the database,
each
filter can still decide to keep or filter them. This is useful for
complex filtering that cannot be expressed with the SQL language. It
is
also needed by the real-time notification email sender, because it
takes
the events immediately when they occurs without fetching them in the
database (so SQL filters are bypassed).
The bad thing is that some events are loaded in the memory to finally
be
rejected, and these filters can perform costly operations such as
loading
documents.
Until now, this double mechanism was working quite well, with each
mechanism
filling the lacks of the other.
However, we still have technical limitations in our design:
1/ Users who have a lot of filter preferences can end up with a giant
SQL
query that is almost impossible to perform by the database. Actually
we had
a user complaining about an OutOfMemory problem in the HQL to SQL
translator !
2/ I cannot implement the tag filter !
The tag filter is supposed to show events that concern pages that hold a
given
tag, EVEN IF THE PAGE WAS EXCLUDED BY THE USER. Example of use-case: "I
don't
want to receive notifications about wiki A except for pages marked with
the tag
T".
And it is not working. First because it is difficult to write a SQL query
for
that. It requires to make a join clause with the document and the object
tables,
which our SQL injection mechanism does not support. Even if it were
possible,
creating a SQL join with the document table will de-facto filter events
that do
not concern any pages or pages that do not have any objects: so many other
filters will be broken. I also don't consider creating a SQL subquery, I
think
the whole query would became too big. So I decided to just not inject any
SQL
code for this filter and only implement the post-filtering mechanism.
But the other filter "EXCLUDE WIKI A" generates a SQL injection such as
"WIKI <> 'WIKI A'" so the events concerning the wiki A are not
fetched
from the
database. Consequence: the tag filter never see the events that it is
supposed
to keep. It would be actually possible to by-pass the first SQL injections
by
injecting something like "OR 1=1". But doing something like this is like
dropping the all SQL injections mechanism.
I see some solutions for this problem:
A/ For each tag, create a permanent list of pages that hold it. So I can
inject "OR document IN (that_list)". I think this is heavy.
B/ Drop the SQL injection mechanism and only rely on the post-filtering
mechanism. It would require to load from the database A LOT of events,
but maybe we could cache this.
C/ Don't drop the SQL injection mechanism completely but use it as
little as
possible (for example, do not use it for LOCATION filtering). Seems
hard to
determine when a filter should use this feature or not.
D/ Don't implement the "tags" filter, since it is the root of the issue.
But
it is like sweeping dirt under the carpet!
Since we have the OutOfMemory problem with the SQL injections becoming too
huge,
I am more in favor of solution B or C. But I'm not sure for now, since I
do not
know how much it would impact the performances and the scalability of the
whole
notifications feature.
This is a complex topic, but I hope this message will inspire you some
suggestions or things I have not seen with my own eyes.
Thanks for your help,
Guillaume
--
Guillaume Delhumeau (guillaume.delhumeau(a)xwiki.com)
Research & Development Engineer at XWiki SAS
Committer on the
XWiki.org project