Hi Paul,
Many thanks for your contribution. I'll certainly look into Zabbix, although I must
confess to being aghast at what appears to be a large and complex tool for what I'd
hoped was quite simple. I hadn't realised these servers were so temperamental. Before
I lose myself in getting acquainted with a sophisticated new product, could you tell me
whether Zabbix (or something else) will help me identify the following:
- When are users suffering timeouts (doesn't have to be real time, happy to check
summary later)
- Where was the timeout occurring (network, Apache, Tomcat, Postgres)
- What was the cause of the timeout (too many connections, low memory, long Java
operation, long query, etc)
- What specific item (Java program, DB query) was responsible
I wonder whether all this should be discoverable in the logs, with the right
configuration.
I've seen JMX mentioned a lot for Tomcat monitoring, but I've shied away from
it since I wanted to start simple. Perhaps there is no simple ... ;-(
________________________________________
From: Paul Libbrecht [paul(a)hoplahup.net]
Sent: 01 November 2014 09:41
To: XWiki Users
Subject: Re: [xwiki-users] Monitoring an Xwiki stack
Hello all,
Here's my experience at monitoring XWikis.
With i2geo.net and with my private XWiki, I use a Zabbix server.
This PHP-based monitoring tool is quite easy to configure for HTTP monitoring, and with a
few more steps you get a mail notification when, e.g., a connection times out.
I had used HypericHQ for a while, a Java-based monitoring tool that was rather nice
to work with, but a machine name change broke everything, so I looked for something a tick
more modern.
At curriki.org, a site with lots of visitors, quite a few tools are used for
monitoring.
- First, for the safety and honesty of an outside system, alertsite.com is used. It is very
effective at detecting breakages, including potential internet backbone ones. We use
monitoring from three locations.
- Second, because the XWiki servers do indeed sometimes need a push, there used to be a
regular script that checked a basic page and, if that failed, auto-restarted the app server. For
us, this is a bit unsafe because we like to control things after a restart.
- Third, for a while, we have been running a "combined monitoring" which
combines a small graphical view synced with the logs of Apache, the app server,
thread dumps, and MySQL. This allowed us to catch "bad actions", which sometimes
happen when power users perform actions that trigger overly large queries which lock out others
(group deletions were one such action).
- Finally, we also added a Zabbix instance which collects HTTP monitoring as well as other
"classical" values (disks, memory, Apache stats, …).
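The check-and-restart script from the second point could look roughly like the sketch below. It is not the script used at curriki.org: the function names, the retry count and the idea of passing in the restart action are all assumptions. Because the restart is just a callable, it can be swapped for "send a mail" when an unattended restart is considered unsafe.

```python
import http.client
import urllib.request


def page_is_healthy(url, timeout=15, opener=urllib.request.urlopen):
    """Return True if the page answers with HTTP 200 within the timeout."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False


def watchdog(check, restart, attempts=3, pause=lambda: None):
    """Call restart() only if check() fails `attempts` times in a row."""
    for _ in range(attempts):
        if check():
            return "ok"
        pause()  # e.g. lambda: time.sleep(30) between attempts
    restart()
    return "restarted"


def run_once(url, restart):
    """One watchdog pass, meant to be called from cron or a similar scheduler."""
    return watchdog(lambda: page_is_healthy(url), restart)
```

The URL to probe and the restart action (a service restart command, or only a notification) are deliberately left to the caller, since they depend entirely on the installation.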
The rhythm at curriki is about a week… after a week, one of the two cluster nodes
(there are two currently) needs a restart because some memory gets exhausted and the GC
starts to fail. We generally get alertsite errors then.
The interest of running a monitoring infrastructure such as Zabbix is that you can
analyze the behavior of multiple variables and see whether there is a way to predict that things
are going wrong. It remains a matter of gut feeling, but it still gives you quite some
confidence.
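One very simple form of such a prediction, sketched here under stated assumptions: fit a straight line through recent samples of one variable (heap in use, say) and extrapolate to the moment it would reach its limit. The sample format, the function name and the choice of a plain least-squares line are illustrative assumptions; real memory curves are rarely this well behaved.

```python
def hours_until_limit(samples, limit):
    """Estimate hours from the last sample until a rising value reaches `limit`.

    samples: list of (hours_since_start, value) pairs, oldest first.
    Returns None when there is too little data or the trend is not rising.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None  # all samples taken at the same time
    # Ordinary least-squares slope and intercept of value over time.
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    if slope <= 0:
        return None  # flat or falling: no exhaustion predicted
    intercept = mean_v - slope * mean_t
    t_limit = (limit - intercept) / slope
    return max(0.0, t_limit - samples[-1][0])
```

With values exported from the monitoring tool, a result that keeps shrinking towards zero would be the cue to plan a restart rather than wait for the outside monitor to report errors.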
It would be really nice if we could converge on a set of JMX analysis "items"
for Zabbix, so that we could analyze the XWiki-relevant information more concretely
(in particular the cache behaviors) and start tuning so that we run out of memory less often.
paul