Graphite and Metric collisions
One of the really useful things about Graphite (and maybe the standout feature behind its wide adoption) is that you can just fire a new metric at the collector and Graphite will happily accept it and give you useful graphs. Add some code to your app, or configure a plugin for collectd or Diamond, restart your app, and your new metrics quickly appear like magic!
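To make "fire a new metric at the collector" concrete, here is a minimal sketch using Graphite's plaintext protocol, where each line is a metric path, a value, and a Unix timestamp. The host, port, and the metric name `app.signups.count` are assumptions for illustration; 2003 is the usual plaintext listener port, but your setup may differ.

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    """Build one plaintext-protocol line: '<path> <value> <timestamp>\\n'."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_metric(path, value, host="localhost", port=2003, timestamp=None):
    """Send a single metric over TCP. Host and port are assumed defaults."""
    line = format_metric(path, value, timestamp)
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))


# Hypothetical usage: no schema or registration step, the name is simply created.
# send_metric("app.signups.count", 1)
```

Nothing in this exchange checks whether the name is sensible or already in use, which is exactly what makes it so convenient.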
There are two possible issues with this:
- A lack of basic controls - if someone commits a chunk of unreviewed code that puts a rapidly changing or random element in a metric name, you’re going to end up with a whole lot of junk in your system. Add a username to your metric name in a system with a few million unique users? Whoops!
- Metric name collisions - if you have more than one server sending any given metric name at the same time, it’s like the movie Highlander (with less Freddie Mercury / lightning effects): THERE CAN BE ONLY ONE!
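If you run your own Graphite, one way to soften both issues is to guard metric names on the client side. The sketch below is only an illustration under assumed conventions: it prefixes each name with the sending host so two servers never share a name, and it caps how many distinct names one process may create. The helper names (`namespaced_metric`, `NameBudget`) are hypothetical and not part of Graphite.

```python
import re

# Anything outside this set is replaced, so dots in a value cannot add hierarchy levels.
_INVALID = re.compile(r"[^A-Za-z0-9_-]+")


def sanitize_segment(segment):
    """Reduce one name segment to characters that are safe in a metric path."""
    cleaned = _INVALID.sub("_", str(segment)).strip("_")
    return cleaned or "unknown"


def namespaced_metric(host, *segments):
    """Prefix the metric with the sending host to avoid cross-server collisions."""
    return ".".join(sanitize_segment(s) for s in (host, *segments))


class NameBudget:
    """Refuse new metric names once an assumed per-process limit is reached."""

    def __init__(self, limit):
        self.limit = limit
        self.seen = set()

    def allow(self, name):
        if name in self.seen:
            return True
        if len(self.seen) >= self.limit:
            return False  # likely a runaway name, e.g. one metric per username
        self.seen.add(name)
        return True
```

The trade-off with host prefixes is that you then have to aggregate across servers yourself at query time.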
As Jason Dixon summed it up in a recent post on the Graphite-dev mailing list:
It’s assumed that you avoid namespace collisions in each backend cluster. Otherwise, whichever backend returns the query first, “wins”.
Let’s say someone accidentally uses the same metric name in a few different places. Picture trying to get useful information from two completely separate sets of data that have been interpolated side-by-side rather than collected together and processed.
Well, that sucks. When building out the backend for Hosted Graphite, we spent a lot of time trying to figure out the best and worst parts of Graphite so we could focus on the good and eliminate or mitigate the bad. In the usual love-hate relationship that people have with Graphite, the fast metric creation is great, and collisions are just an annoying behaviour.
In our setup, we have control over where we collect from, and we have removed any issue with metric collisions. We’re greedy! If you send the same metric name from multiple servers, we collect all the data. By default we show you the average, but we also keep all the data points and give you a true sum, minimum, and maximum, as well as a few more exotic views, like a random sampling of data to be used for percentiles.
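As a rough illustration of the idea (not our actual backend code), here is a sketch that keeps every data point sent under one metric name and derives the average, sum, minimum, and maximum per timestamp. The input shape, a list of `(timestamp, value)` pairs from any number of servers, is an assumption made for the example.

```python
from collections import defaultdict


def aggregate(points):
    """Group (timestamp, value) pairs by timestamp and summarise each group."""
    buckets = defaultdict(list)
    for timestamp, value in points:
        buckets[timestamp].append(value)  # keep every point, drop nothing

    summary = {}
    for timestamp, values in sorted(buckets.items()):
        summary[timestamp] = {
            "avg": sum(values) / len(values),
            "sum": sum(values),
            "min": min(values),
            "max": max(values),
            "count": len(values),
        }
    return summary


# Two servers report the same metric name at t=100, one server at t=110.
points = [(100, 2.0), (100, 4.0), (110, 5.0)]
result = aggregate(points)
```

Here the two points at timestamp 100 are combined rather than one overwriting the other, so the sum is a real total across servers.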
Not suffering from metric namespace collisions is particularly useful if you don’t want to pre-aggregate your data somewhere yourself, or you’re looking to count something quickly across servers. No weird interpolations, just data that does what it’s supposed to.