System monitoring - what are my options? (part 2)
In part one of this series on system monitoring libraries we checked out some popular libraries used to monitor servers. In this follow-up, we take a look at a few more options and make a recommendation to answer the question 'which of the many available monitoring tools is best for your environment?'
Diamond is a Python daemon for collecting system metrics and presenting them to Graphite. Diamond is good for scale and customization.
Extensibility - Diamond uses a straightforward model: add a collector to the configuration file to add new monitoring. This makes it low-friction to scale to dozens or even hundreds of servers, because each instance is the same and responsible for reporting its own metrics. Diamond can handle it, too - the project claims up to 3 million datapoints per minute on 1000 servers without undue CPU load.
Variety - Support extends to a range of operating systems with the documentation to back it up. Diamond comes with hundreds of collector plugins plus it lets you customize collectors and handlers with little effort for metrics gathering from nearly any source. Installation is easy, too.
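To give a feel for how small a custom collector can be, here is a rough sketch of the pattern. Diamond collectors subclass diamond.collector.Collector and call self.publish(name, value) from a collect() method; so that the example runs without Diamond installed, the Collector base class below is our own hypothetical stand-in, and the metric names are made up for illustration.

```python
import shutil


class Collector:
    """Hypothetical stand-in for Diamond's collector base class.

    In a real deployment you would subclass diamond.collector.Collector
    instead; this version only records what gets published.
    """

    def __init__(self):
        self.published = []

    def publish(self, name, value):
        # Diamond would pass the metric on to its configured handlers.
        # Here we just keep it in a list so the sketch is self-contained.
        self.published.append((name, value))


class DiskUsageCollector(Collector):
    """Publishes total, used and free bytes for one filesystem path."""

    def __init__(self, path="/"):
        super().__init__()
        self.path = path

    def collect(self):
        usage = shutil.disk_usage(self.path)
        # Metric names are illustrative, not Diamond's own naming scheme.
        self.publish("diskusage.bytes_total", usage.total)
        self.publish("diskusage.bytes_used", usage.used)
        self.publish("diskusage.bytes_free", usage.free)


if __name__ == "__main__":
    collector = DiskUsageCollector()
    collector.collect()
    for name, value in collector.published:
        print(name, value)
```

The point is the shape: one class, one collect() method, one publish() call per metric. Everything else (scheduling, naming prefixes, shipping to Graphite) is left to the daemon.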
Functionality - Collection is all Diamond does. It can talk directly to tools such as Graphite, but many setups still choose to use an aggregator like StatsD in addition to Diamond for their application metrics.
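For context on what "talking directly to Graphite" involves: Graphite's Carbon listener accepts a plaintext protocol, one metric per line in the form path, value, Unix timestamp, by default on TCP port 2003. The sketch below shows that format by hand. The function names, host and metric path are our own placeholders, not anything Diamond ships.

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    """Build one line of Graphite's plaintext protocol."""
    if timestamp is None:
        timestamp = time.time()
    return f"{path} {value} {int(timestamp)}\n"


def send_metric(path, value, host="localhost", port=2003, timestamp=None):
    """Send a single metric to a Carbon listener over TCP.

    host and port are assumptions; point them at your own Graphite setup.
    """
    line = format_metric(path, value, timestamp)
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))


if __name__ == "__main__":
    # Prints the line that would go over the wire, without sending it.
    print(format_metric("servers.web01.loadavg", 0.42), end="")
```

A collector like Diamond does essentially this for you, for every metric, on a schedule.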
Updates - Brightcove, the original developers, stopped working on Diamond and it graduated to a standalone open source project. That’s noticeably slowed its release cycle. Diamond is a mature and well-established project, though, so decide for yourself how much of an issue this is.
StatsD is a metrics aggregation daemon originally released by Etsy. It is not a collector like other tools on this list but instead crunches high-volume metrics data and forwards statistical views to your graphing tool.
Maturity - The simple, text-based protocol has spawned client libraries in nearly every popular language. You can plug into just about any monitoring or graphing back end. StatsD has been around a long time, and it shows.
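To make "crunches and forwards statistical views" a bit more concrete, here is a toy version of the idea: collect raw samples for a while, then flush a summary and start again. This is a simplified sketch of the concept, not StatsD's actual implementation. The class, the output names and the choice of statistics are our own, and real StatsD computes more (rates and percentiles, for example). The type codes "c" (counter) and "ms" (timer) are the ones StatsD uses.

```python
from collections import defaultdict


class Aggregator:
    """Toy aggregator: sums counters and summarises timers per flush."""

    def __init__(self):
        self.counters = defaultdict(float)
        self.timers = defaultdict(list)

    def record(self, name, value, metric_type):
        if metric_type == "c":
            # Counters accumulate until the next flush.
            self.counters[name] += value
        elif metric_type == "ms":
            # Timers keep every sample so they can be summarised later.
            self.timers[name].append(value)
        else:
            raise ValueError(f"unsupported metric type: {metric_type}")

    def flush(self):
        """Return a summary of everything seen so far, then reset."""
        summary = {}
        for name, total in self.counters.items():
            summary[f"counters.{name}"] = total
        for name, samples in self.timers.items():
            summary[f"timers.{name}.count"] = len(samples)
            summary[f"timers.{name}.mean"] = sum(samples) / len(samples)
            summary[f"timers.{name}.lower"] = min(samples)
            summary[f"timers.{name}.upper"] = max(samples)
        self.counters.clear()
        self.timers.clear()
        return summary


if __name__ == "__main__":
    agg = Aggregator()
    agg.record("page.views", 1, "c")
    agg.record("page.views", 1, "c")
    agg.record("db.query", 12.0, "ms")
    agg.record("db.query", 30.0, "ms")
    print(agg.flush())
```

Thousands of raw events go in, and a handful of numbers per metric come out each interval. That is what keeps the load on your graphing back end manageable.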
Free-Standing - The StatsD server sits outside your application - if it crashes, it doesn't take anything down with it. Listening for UDP packets is a good way to take in a lot of information without a latency hit to your application, or needing to worry about maintaining TCP connections.
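Here is roughly what that looks like from the application side. A StatsD metric is a short text line of the form name:value|type, sent as a UDP datagram (port 8125 is the usual default). The helper names, host and metric name below are placeholders of ours. In practice you would more likely pick one of the existing client libraries.

```python
import socket


def format_stat(name, value, metric_type):
    """Build a StatsD line, e.g. 'checkout.completed:1|c'."""
    return f"{name}:{value}|{metric_type}"


def send_stat(name, value, metric_type, host="127.0.0.1", port=8125):
    """Fire-and-forget: send one datagram and do not wait for a reply.

    host and port are assumptions; adjust them for your own StatsD server.
    """
    payload = format_stat(name, value, metric_type).encode("ascii")
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(payload, (host, port))
    finally:
        sock.close()


if __name__ == "__main__":
    # "c" is a counter, "ms" a timer in milliseconds, "g" a gauge.
    print(format_stat("checkout.completed", 1, "c"))
    print(format_stat("checkout.duration", 320, "ms"))
```

There is no connection to open and no response to wait for, which is why the application does not take a latency hit.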
Limited Functionality - StatsD is an aggregator more than a collector. You still need to instrument your application to send data to it. For thresholds or alerts, you’ll need to build in a backend or use something like Nagios.
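"Instrumenting your application" usually means wrapping the interesting bits of code so they report a number. As one possible sketch, the decorator below times a function and hands a StatsD-style timer line to whatever emit callable you give it. The decorator, the emit parameter and the metric name are all hypothetical, made up for this example.

```python
import functools
import time


def timed(name, emit):
    """Decorator that reports how long each call took, in milliseconds.

    emit is any callable that accepts one StatsD-formatted string, for
    example a function that sends it over UDP, or list.append in tests.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Runs whether the call succeeded or raised.
                elapsed_ms = (time.perf_counter() - start) * 1000
                emit(f"{name}:{elapsed_ms:.3f}|ms")
        return wrapper
    return decorator


if __name__ == "__main__":
    @timed("reports.build", print)
    def build_report(rows):
        return sum(rows)

    build_report(range(1000))
```

Keeping the sending side behind a plain callable makes it easy to swap in a real client later without touching the instrumented code.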
Data Reliability - Fire-and-forget UDP has downsides, too - for dispersed networks or essential data, packet loss is a risk. Also, if you are using vanilla Graphite and send too much to StatsD within its flush interval, those metrics are dropped and won’t be graphed. Hosted Graphite can handle it though. :)
Zabbix is an open source monitoring framework written in C. Zabbix positions itself as an enterprise-level solution and so offers all-in-one monitoring that handles collection, storage, visualization and alerting.
Range of Coverage - Zabbix tracks not just real-time data but trends and historical data, too. Monitoring extends to hardware, log files, and VMs. A small footprint and low maintenance mean you can fully exploit what Zabbix has to offer.
Convenience - Don’t like plug-ins? Nearly all the functionality you might want is built into Zabbix. If something is missing, there are extensive, simple customization options. Templates tell you what to collect and configurations are set centrally through the GUI.
New Alerts - Service discovery and configuring things like triggers are both more involved than they really ought to be. Tools like Graphite and StatsD can start to track a new metric just by referencing it in your application. Zabbix is not one of those tools.
Large Environments - Zabbix doesn’t do well on big networks - all that historical data is greedy for database space, PHP for the front end has performance limitations, and debugging is awkward. For an enterprise-level system, that’s a bit underwhelming.
So - what do we recommend? One set of criteria for running monitoring services in production goes like this:
- Operationally it should play well with others - no taking down a box if something goes wrong, no dependency hell, no glaring security issues.
- Extensible without hacking on source code - a library should have some mechanism for supporting the entire stack you're using without having to mutilate the code to get it working.
- Well supported by the community - Lots of users means that you know that bugs will be squashed, and updates will arrive to support new measuring scenarios or technologies.
- Works to its strengths - A library that tries to do everything itself ends up doing nothing particularly well. Opinionated design means that a library can focus on the important issues without adding in pointless frivolities.
So - if I were to pick based on these criteria, I'd recommend either Diamond or CollectD. They both handle the collection of your data admirably with extensible plugins, and can forward it on to a storage and visualization service like Hosted Graphite (we even give you pre-made dashboards for both services!). They're both well supported by the open-source community and play nicely with your systems.
If you're in the Java ecosystem there may be some natural attraction to Dropwizard, and to StatsD if you're using a PaaS such as Heroku - but if you're running your own servers or using AWS, CollectD or Diamond is a good fit.