Hi Clément,
2018-06-29 13:34 GMT+02:00 Clément Aubin <aubincleme(a)gmail.com>:
Hi,
With the complexity of the additive filtering system, it's kind of
normal to see OOM errors. AFAIR we already talked about that a year ago,
when looking at how we could implement this feature.
We could try to improve the situation a bit by "optimizing" the filters.
For example, we can currently end up with filters like these:
- filter events about SpaceA.PageB
- filter events about SpaceA.PageC
- filter events about SpaceA.
Obviously the first two filters can be removed, since the third one
already does their job. I've tried to minimize the number of unnecessary
filters by not creating a new one when the page is already covered by a
wider filter, but that won't fix existing filters.
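To make the idea concrete, here is a rough sketch of such an optimizer
over dotted location strings (a toy model: the real filter preferences
are stored as XWiki objects, so all names here are invented):

```java
import java.util.List;
import java.util.stream.Collectors;

// Toy filter optimizer: drop any page filter that is already covered by
// a wider space filter. Locations are dotted references, and a trailing
// dot (e.g. "SpaceA.") stands for "the whole space".
public class FilterOptimizer {
    public static List<String> optimize(List<String> locations) {
        return locations.stream()
            .filter(loc -> locations.stream()
                .noneMatch(other -> !other.equals(loc)
                    && other.endsWith(".")       // a space-wide filter...
                    && loc.startsWith(other)))   // ...already covering loc
            .collect(Collectors.toList());
    }
}
```

With the three filters above, optimize(List.of("SpaceA.PageB",
"SpaceA.PageC", "SpaceA.")) keeps only "SpaceA.".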
Another example is the watchlist bridge, which simply takes the list of
watched pages as it was in the previous version of XWiki and translates
it into the same number of filters. This may explain many of the OOM
problems. It is made worse by the fact that we had (and still have) the
"autowatch" mode enabled by default.
So a filter optimizer could be a great help.
On 06/29/2018 05:41 AM, Sergiu Dumitriu wrote:
The two main problems that I see are that
you're mixing two separate
things:
1. Tags which are stored in one place, and events which are stored in a
different place
2. Again tags, and document locations. They may seem similar, but they
are completely different as they are implemented. That's why saying "not
in this wiki, except for this tag" cannot ever be implemented in a sane
way since the exception is on a different plane.
The logical approach would be to also store the tags in the events, but
this is not a good thing to do. Where do we stop? For future-proofing
filters, we would have to store the entire document in the event, in
case someone wants to add a filter on "documents with a specific type of
object attached", or "documents with a specific value for a specific
property of a specific object", or "documents with attachments ending in
.extension".
On 06/28/2018 04:44 PM, Guillaume Delhumeau wrote:
> For the tags filter, I can also:
>
> * perform a query to fetch all documents that correspond to the given tags
>   (it doesn't need to be permanent, but it would be better to cache it)
> * add an HQL filter on these pages (OR document IN :list_of_pages).
>
> It's a little variation of solution A. It's ugly but it could do the job.
It's the sanest proposal so far, but still impossible.
Since "not in LOCATION except WITH TAG" doesn't work because it mixes
two types of information in the same query fragment, how about making
this a one-dimensional query as "only WITH TAG". And since this
information is in a different place, pre-querying it is needed. However,
if there are thousands or tens/hundreds of thousands of documents with a
tag, won't the query crash anyway?
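A rough sketch of that pre-query approach, assuming we first query the
documents carrying the tag and then inject a plain IN clause (names are
illustrative, not the real XWiki query API). The size concern is visible
immediately: the injected clause grows linearly with the number of
tagged documents:

```java
import java.util.List;
import java.util.stream.Collectors;

// Toy version of "pre-query the tagged documents, then inject an IN
// clause". The list would come from a separate query on the tag objects.
public class TagPreQuery {
    public static String inClause(String field, List<String> taggedDocs) {
        if (taggedDocs.isEmpty()) {
            // No document carries the tag: match nothing instead of
            // emitting "IN ()", which is invalid SQL.
            return "1 = 0";
        }
        String quoted = taggedDocs.stream()
            .map(d -> "'" + d.replace("'", "''") + "'")
            .collect(Collectors.joining(", "));
        return field + " IN (" + quoted + ")";
    }
}
```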
> 2018-06-27 12:58 GMT+02:00 Guillaume Delhumeau <guillaume.delhumeau(a)xwiki.com>:
>
>> Hi developers.
>>
>> I am trying to add a new filter to the notifications to be able to
>> follow pages that are marked with a given tag. And it leads me to some
>> questions about the technical implementation of the notifications.
>>
>> To give some context: notifications are computed on top of the events
>> recorded by the event stream (a.k.a. activity stream). We take events
>> from the event stream SQL table, apply some transformations on them,
>> and display them to the user.
>>
>> Then we have implemented the ability to filter these events: for
>> example "don't show events concerning the document A or the wiki B".
>> Filters are implemented in 2 distinct ways:
>>
>> 1/ SQL injection: each filter can add SQL elements to the query we
>> make to fetch the events from the event stream table. We made this
>> mechanism so we can let the database do most of the filtering work.
>> After all, that's its job and it's supposed to perform well. To be
>> precise, Clément has even created an Abstract Syntax Tree (AST) so
>> it's easier to inject some content into the query; it also creates an
>> abstraction over the SQL language, so we could even consider changing
>> the storage of the event stream someday.
>>
>> The bad thing is that some complex filters are difficult or even
>> impossible to write in SQL (even with the AST).
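For readers who have not seen that code, the AST idea can be reduced to
something like this (invented names, not the actual notification filter
AST classes): filters produce abstract condition nodes, and a separate
translator renders them, here straight to SQL, so the event storage
could change without breaking existing filters.

```java
// Toy condition AST: each node knows how to render itself to SQL, and a
// different renderer could target another storage backend.
public interface FilterNode {
    String toSql();

    static FilterNode notEqual(String field, String value) {
        return () -> field + " <> '" + value + "'";
    }

    static FilterNode and(FilterNode left, FilterNode right) {
        return () -> "(" + left.toSql() + " AND " + right.toSql() + ")";
    }
}
```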
>>
>> 2/ Post-filtering: after the events have been fetched from the
>> database, each filter can still decide to keep or reject them. This is
>> useful for complex filtering that cannot be expressed in SQL. It is
>> also needed by the real-time notification email sender, because it
>> takes the events immediately when they occur, without fetching them
>> from the database (so SQL filters are bypassed).
>>
>> The bad thing is that some events are loaded into memory only to be
>> rejected, and these filters can perform costly operations such as
>> loading documents.
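The post-filtering side boils down to running every remaining filter
against each fetched event in memory, which is where the cost comes from
(a toy model with predicates; the real filters are components that may
load documents, and "Event" is a stand-in record):

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Toy post-filtering pass: keep an event only if every filter accepts it.
public class PostFiltering {
    public record Event(String wiki, String page) {}

    public static List<Event> apply(List<Event> fetched,
            List<Predicate<Event>> filters) {
        return fetched.stream()
            .filter(event -> filters.stream().allMatch(f -> f.test(event)))
            .collect(Collectors.toList());
    }
}
```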
>>
>> Until now, this double mechanism has worked quite well, with each
>> mechanism compensating for the weaknesses of the other.
>>
>> However, we still have technical limitations in our design:
>>
>> 1/ Users who have a lot of filter preferences can end up with a giant
>> SQL query that is almost impossible for the database to execute. We
>> actually had a user complaining about an OutOfMemory problem in the
>> HQL to SQL translator!
>>
>> 2/ I cannot implement the tag filter!
>>
>> The tag filter is supposed to show events that concern pages that hold
>> a given tag, EVEN IF THE PAGE WAS EXCLUDED BY THE USER. Example of
>> use-case: "I don't want to receive notifications about wiki A except
>> for pages marked with the tag T".
>>
>> And it is not working. First, because it is difficult to write a SQL
>> query for that: it requires a join with the document and object
>> tables, which our SQL injection mechanism does not support. Even if it
>> were possible, a SQL join with the document table would de facto
>> filter out events that do not concern any page, or that concern pages
>> without objects, so many other filters would be broken. I also ruled
>> out a SQL subquery, since I think the whole query would become too
>> big. So I decided not to inject any SQL code for this filter and to
>> rely only on the post-filtering mechanism.
>>
>> But the other filter, "EXCLUDE WIKI A", generates a SQL injection such
>> as "WIKI <> 'WIKI A'", so the events concerning wiki A are never
>> fetched from the database. Consequence: the tag filter never sees the
>> events it is supposed to keep. It would actually be possible to bypass
>> the other SQL injections by injecting something like "OR 1=1", but
>> doing that amounts to dropping the whole SQL injection mechanism.
>>
>> I see some solutions for this problem:
>>
>> A/ For each tag, create a permanent list of pages that hold it. So I
>> can inject "OR document IN (that_list)". I think this is heavy.

Definitely not. You'll have to maintain this list in parallel with the
canonical storage for tags, the xwikidoc table itself, with all the
headaches that data duplication brings.
>> B/ Drop the SQL injection mechanism and rely only on the
>> post-filtering mechanism. It would require loading A LOT of events
>> from the database, but maybe we could cache this.

Nope, filtering millions of events by inspecting them one by one is a
performance nightmare.
>> C/ Don't drop the SQL injection mechanism completely, but use it as
>> little as possible (for example, do not use it for LOCATION filtering).
>
> Seems hard to determine when a filter should use this feature or not.

I think location filtering should still happen in the query itself,
since it is a lot faster.
>> D/ Don't implement the "tags" filter, since it is the root of the issue.
>
> But it is like sweeping dirt under the carpet!
Well, not implementing something is always the fastest, least intrusive
solution to any problem...
Humor aside, I think that in this case it is the best approach.
>> Since we have the OutOfMemory problem with the SQL injections becoming
>> too huge, I am more in favor of solution B or C. But I'm not sure for
>> now, since I do not know how much it would impact the performance and
>> scalability of the whole notifications feature.
>>
>> This is a complex topic, but I hope this message will inspire some
>> suggestions or point out things I have not seen myself.
>>
>> Thanks for your help,
>> Guillaume
Other options:
E/ Drop SQL altogether, move the event stream into a nosql database, and
do stream filtering; kind of like the B proposal
F/ Drop SQL and querying altogether, switch to a pub-sub mechanism where
a subscriber is created for every user and his filters, and matching
events are placed in a queue until they are all sent (consumed) in a
notification. Obviously, this only works for email notifications, not
for browsing past events in a webpage.
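Option F could look roughly like this (a toy queue per subscriber; in
practice the queue would have to be bounded and persisted, and all names
here are invented):

```java
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;
import java.util.function.Predicate;

// Toy pub-sub: one subscriber per user holds that user's filters and
// queues matching events until the next notification email drains it.
public class PubSub {
    public record Event(String wiki, String page) {}

    public static class Subscriber {
        private final Predicate<Event> filters;
        private final Queue<Event> pending = new ArrayDeque<>();

        public Subscriber(Predicate<Event> filters) {
            this.filters = filters;
        }

        public void publish(Event event) {
            if (filters.test(event)) {
                pending.add(event);
            }
        }

        // All pending events are consumed at once when the email is sent.
        public List<Event> drain() {
            List<Event> sent = List.copyOf(pending);
            pending.clear();
            return sent;
        }
    }
}
```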
Note that the notification filtering system isn't dependent on SQL:
filters implement their conditions through an AST that gets translated
to SQL by the default manager in charge of retrieving the notifications.
This means that switching from SQL to something else shouldn't break
backward compat of the notification filters already developed.
Might be harder than expected in practice, but it's indeed a great thing
that we have this AST.
@Sergiu: (also AFAIR) option E was considered but not implemented, as it
would have taken a lot of time to do; still a nice idea IMO.
A mix of options E and F could be the nicest IMO: adding a NoSQL DB
between the event stream and the UI, so that notifications are picked up
from the event stream, filtered, and put into the NoSQL DB in a separate
thread. This is AFAIK the most widely used solution on large platforms.
However, this solution needs to be considered wisely, as it could
increase the complexity of an XWiki install and/or significantly
increase the size of the distribution artifacts that we produce. (Also,
it takes time to implement, of course.)
Notifications are a complex "science" that would require almost the same
consideration as the search engine :)
Thanks for your help.
Note that we still have no answer for this problem.
The ugly solution "getting all pages that are marked with a given tag
before generating the query for notifications" is already in place in
AS, so at least we won't do worse!
Guillaume