
Trend Monitoring Suites

I hate cacti. Sorry guys, there are good things about it: if you want to monitor just switch/router interface stats via snmp, and that’s *it*, it’s very easy. But when you want to plot technical data that you source through something other than snmp, working through the cacti template system is like wading through tar.

Step in some newer projects. Ganglia was really interesting, and a colleague found it thanks to a presentation that Flickr gave. I really liked how easy it was to configure. Set the agent up on a bunch of PCs, run the web interface on one, and bang, graphs. It’s that easy. We installed the agent on a couple of trial PCs and we had graphs. We then wrote some scripts to measure other metrics from custom applications. If we could write a script that produced a number, then we could graph that metric in ganglia, just by passing the number to the bundled ‘gmetric’ application. Brilliant. But what if we can’t run an agent on the device that we want to monitor, such as a switch? There has been talk on the ganglia developers list since 2004 about incorporating snmp support, but no real evidence of traction. So it won’t work for me.
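To make the "script that produces a number" idea concrete, here is a minimal sketch of a wrapper around gmetric. It assumes the gmetric binary is on the PATH and uses its long options (--name, --value, --type, --units); the metric name "app_queue_depth" and its value are made up for illustration.

```python
import shutil
import subprocess


def gmetric_command(name, value, metric_type="float", units=""):
    """Build the argument list for a single gmetric invocation."""
    cmd = [
        "gmetric",
        f"--name={name}",
        f"--value={value}",
        f"--type={metric_type}",
    ]
    if units:
        cmd.append(f"--units={units}")
    return cmd


def publish(name, value, metric_type="float", units=""):
    """Run gmetric if it is installed; return True if the metric was sent."""
    if shutil.which("gmetric") is None:
        return False  # No Ganglia tools on this machine, so nothing is sent.
    subprocess.run(gmetric_command(name, value, metric_type, units), check=True)
    return True


if __name__ == "__main__":
    # Hypothetical metric: the depth of some application's job queue.
    print(gmetric_command("app_queue_depth", 42, "uint32", "jobs"))
```

Run something like this from cron on a box that already has the agent, and the new metric should show up alongside the stock ones.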

So let me offer a golden rule of performance monitoring. If you are going to write a performance monitoring suite, make sure it supports SNMP on day one. If you are writing a monitoring layer for your application, make sure it uses SNMP.

In steps Zabbix. The best of both worlds. Here, there’s an agent again, so if you want to monitor the health over time of a server, you configure the agent and send back figures to a monitoring box. Figures appear. There’s also an snmp interface, so you point it at a router, tell it the community, and more figures appear.

No graphs yet, but that’s because you configure them yourself, and it’s really easy. Want to aggregate all of the exit ports on your router? Make a graph using those metrics! If you can imagine it, you can graph it with Zabbix. Some of it is quite clunky, e.g. configuring the snmp community for each device is a bit slow, but the back end is just MySQL, so you can change the community for a device with an “update items set snmp_community='xx' where hostid='yy';” instead of using the clunky interface. Also, to measure interface stats, you must change the ifInOctets and ifOutOctets items to store the delta as ‘speed over time’ rather than just accepting the raw counter value, otherwise your graphs will show nothing more than the port counting more data as time goes by.
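Why the ‘speed over time’ setting matters: ifInOctets and ifOutOctets are counters that only ever climb, so the useful figure is the difference between two samples divided by the time between them. The sketch below shows that arithmetic. It is not Zabbix’s own code; the function names are mine, and it assumes 32-bit counters that wrap at most once between samples.

```python
def counter_rate(prev_value, curr_value, interval_seconds, counter_bits=32):
    """Return the per-second rate between two samples of a wrapping counter."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    modulus = 2 ** counter_bits
    # Modular subtraction copes with a single wrap back past zero.
    # A device reboot also resets the counter and would be misread as a wrap.
    delta = (curr_value - prev_value) % modulus
    return delta / interval_seconds


def octets_to_bits_per_second(prev_octets, curr_octets, interval_seconds):
    """Convert two octet-counter samples into bits per second."""
    return counter_rate(prev_octets, curr_octets, interval_seconds) * 8


if __name__ == "__main__":
    # Made-up samples taken 60 seconds apart.
    print(octets_to_bits_per_second(1_000_000, 1_750_000, 60))
```

Plot the raw counter and you get a line that climbs forever; plot the rate and you get the traffic graph you were after.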

I strongly recommend Zabbix to anyone who finds cacti arduous to configure.

Common Event Expression.

I am getting quite excited about some of the material I have been reading on Common Event Expression (pdf). CEE is an effort to standardise the way that events are described. I can see this being of significant advantage to sysadmins who need to produce large scale monitoring systems.

We already all use syslog-ng or rsyslogd or similar to aggregate our logs centrally, but it would be great to be able to aggregate logs inside our monitoring systems in such a way that when we add servers to our networks, any issues that they raise, in the application layer, or in hardware, are described to monitoring systems in a common and expected way.

If the taxonomy of error handling were equivalent on, say, routing kit as well as desktop systems, sysadmins could deploy complex monitoring systems with less effort. Understand how to handle a fault on system-X, and every single system you deploy from then on benefits from tried and tested monitoring and management.
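As a rough illustration of what a shared taxonomy buys you, the sketch below maps two differently worded "link down" messages onto one common record, so a single alert rule covers both devices. Everything here is hypothetical: the message formats, the field names (host, category, status, object) and the parsers are invented for the example and are not taken from the CEE material.

```python
def from_router(host, raw):
    """Parse a made-up router message of the form 'LINK-DOWN <port>'."""
    code, _, port = raw.partition(" ")
    _, _, state = code.partition("-")
    return {"host": host, "category": "interface",
            "status": state.lower(), "object": port}


def from_desktop(host, raw):
    """Parse a made-up desktop message of the form '<nic>: link is <state>'."""
    nic, _, rest = raw.partition(": ")
    state = rest.rsplit(" ", 1)[-1]
    return {"host": host, "category": "interface",
            "status": state.lower(), "object": nic}


def needs_alert(event):
    """One rule, written once, for any device that emits the common record."""
    return event["category"] == "interface" and event["status"] == "down"


if __name__ == "__main__":
    events = [
        from_router("rtr1", "LINK-DOWN ge-0/0/1"),
        from_desktop("pc7", "eth0: link is down"),
    ]
    for event in events:
        print(event["host"], event["object"], needs_alert(event))
```

With a real standard, the two parsers disappear because the devices would describe the event in the common form themselves; only the rule remains.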

It’s early days for CEE, but I am optimistic about the benefits we could all realise if there was a desire to standardise logging. Looking forward to what happens next.