A victim of its own popularity: Scaling our CloudWatch integration
The trouble with success
Once upon a time, a customer asked us if we could help them pull their AWS CloudWatch data into Hosted Graphite, so they could store it long-term alongside other system and application metrics.
We put together a simple integration comprising an allocator, a job queue, and a worker pool using the Redis Queue (RQ) library. A job fetched all metrics configured for a single access key in a given time frame, typically 10 minutes.
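As a sketch of that original shape (helper names hypothetical; the real integration's code isn't shown here), each job covers an aligned 10-minute window, and the allocator enqueues one job per access key via RQ:

```python
from datetime import datetime, timedelta, timezone

def job_window(now: datetime, width_minutes: int = 10):
    """Floor `now` to the previous width_minutes boundary and return the
    (start, end) window a single fetch job should cover.

    Hypothetical helper: the post doesn't describe the exact alignment rules."""
    width = timedelta(minutes=width_minutes)
    floored = now - timedelta(
        minutes=now.minute % width_minutes,
        seconds=now.second,
        microseconds=now.microsecond,
    )
    return floored - width, floored

# With RQ, the allocator would then enqueue one job per configured
# access key, roughly like:
#
#   from redis import Redis
#   from rq import Queue
#   q = Queue(connection=Redis())
#   q.enqueue(fetch_cloudwatch_metrics, access_key, start, end)
```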
This, plus a nice configuration UI, got the job done with minimal setup and development. Customers loved it! We expanded the range of AWS services it could pull data for. Over time, the integration’s popularity grew and grew—and so did the resources required to run it.
This could result in slower throughput and delays in fetching metrics for customers. Arguably a nice problem to have: the initial, simple solution was outgrowing its (single) machine's resources.
We maintain a fairly homogeneous cohort of bare-metal servers, and try to avoid scaling vertically wherever we can. Our SRE team have Understandably Strong Feelings about single-homed services, too: while it was possible to catch up on CloudWatch data when we failed over to another machine, it required manual work from both the SRE and development teams. No-one should ever have to wake up to do something a computer could.
So this was a good opportunity both to split the work over multiple machines, and to make a more reliable service with automated failover.
What to do …
We needed three things:
- To split the work across a (changing) number of machines;
- To keep running with no impact should one or two of those machines fall over; and
- To run jobs "just once", as far as possible—re-running jobs can land customers with larger AWS bills, as well as ingesting duplicate data (visible in certain of our data views).
Alongside these, we wanted to hold our service level objectives invariant. Since we define these in terms of customer impact, they didn’t need to change as we scaled out, except that they became easier to meet. Without going into boring detail, we had SLOs for job delay, number of attempted jobs, and number of successful jobs.
Our initial design (allocator, job queue, worker pool) worked well, and was chosen partly because it should ease horizontal scaling. So the plan was to replicate that setup across a few machines, elect an allocating leader, and distribute jobs across all workers in the cluster.
… and how to do it
Simple enough, but the devil is as usual in the details. How do we determine the leader? How does the leader know which machines to allocate jobs to? What happens if a machine goes down? What happens if the leader goes down?
First up, we needed to know what machines were members of the cluster, and their health status. This is the classic service discovery problem, and it was easy for us to base a solution on etcd—we already used it for just this purpose. Each machine writes a short-TTL key into etcd to announce that it’s present and healthy.
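The pattern can be sketched with a toy in-memory registry (in production this is a lease-backed key in etcd; this stand-in just models the TTL semantics):

```python
import time

class ToyRegistry:
    """In-memory stand-in for etcd-backed membership keys: each machine
    re-announces itself with a short TTL, and anyone can ask which
    machines are currently healthy (i.e. whose key hasn't expired)."""

    def __init__(self):
        self._expiry = {}  # machine name -> expiry timestamp

    def announce(self, machine, ttl, now=None):
        now = time.time() if now is None else now
        self._expiry[machine] = now + ttl

    def healthy(self, now=None):
        now = time.time() if now is None else now
        return sorted(m for m, exp in self._expiry.items() if exp > now)
```

With etcd itself, the equivalent is a key written under a short-TTL lease that the machine keeps refreshing; if the machine dies, the lease expires and the key disappears.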
Now our leader is ready to allocate jobs! It maintains an RQ queue on each machine in the cluster, and assigns jobs by hashing over the set of healthy machines. It also maintains a mapping—shared across the whole cluster—of what jobs are allocated where, so that in the event of leader failover, we can easily figure out what jobs need to be reallocated.
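The post doesn't spell out the hash scheme; one natural choice is rendezvous (highest-random-weight) hashing, which keeps most assignments stable when the healthy set changes. A sketch:

```python
import hashlib

def assign(job_id: str, machines: list[str]) -> str:
    """Pick a machine for `job_id` by rendezvous hashing: score every
    (job, machine) pair and take the highest. When a machine leaves the
    healthy set, only its own jobs move; the rest stay put.

    Sketch only — the actual scheme used isn't described in the post."""
    def score(machine: str) -> int:
        digest = hashlib.sha256(f"{job_id}:{machine}".encode()).hexdigest()
        return int(digest, 16)
    return max(machines, key=score)
```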
We modified the behaviour of the job slightly, too: it reports completion times to every machine in the cluster. Then after a machine failure—leader or otherwise—it’s easy to see when a job was last completed and reschedule it, replaying any missing data from missed runs.
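Because every machine knows each job's last completion time, working out what to replay after a failure is a pure calculation (helper name hypothetical):

```python
from datetime import datetime, timedelta

def missed_windows(last_completed: datetime, now: datetime,
                   interval: timedelta = timedelta(minutes=10)):
    """Return the start times of every run that should have happened
    since `last_completed` — i.e. the windows a new owner needs to
    replay after taking over a job from a failed machine."""
    windows = []
    t = last_completed + interval
    while t <= now:
        windows.append(t)
        t += interval
    return windows
```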
Getting to production
That all sounds good, but it was a big shift in how our CloudWatch service operated: how could we do it without impacting our customers? Rather than choosing a flag day to cut over to a new solution, we went with an incremental approach—each key feature was rolled out gradually and in a safe, controlled way.
Deploying the service discovery part and verifying it was OK was easy: just making sure the right stuff ended up in etcd.
To gain confidence in the leader election part, we ran machines alongside the “real” service and deployed “fake” election capabilities: all machines would vie to acquire the lock, but actual leader behaviour was tucked behind a feature flag.
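The shape of that "fake" election can be sketched as a lock plus a feature flag: everyone vies for the lock, but winning it is a no-op until the flag is on. (Toy in-memory lock below; in production the lock lived in etcd.)

```python
class ToyLock:
    """In-memory stand-in for a distributed lock: the first caller to
    acquire holds it until release. Production used etcd, not this."""
    def __init__(self):
        self._holder = None

    def try_acquire(self, who: str) -> bool:
        if self._holder is None:
            self._holder = who
        return self._holder == who

def tick(machine: str, lock: ToyLock, leader_behaviour_enabled: bool) -> str:
    """Each cycle, every machine vies for the lock. Winning only makes
    you a real allocator once the feature flag turns the behaviour on."""
    if not lock.try_acquire(machine):
        return "follower"
    if not leader_behaviour_enabled:
        return "leader (flag off: observe only)"
    return "leader (allocating jobs)"
```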
Our next deployment synced leader context to followers. Next came “fake” job hashing, and then creation of RQ queues across the cluster. Finally, we deployed a mechanism to write back job results.
At this stage, all of the important components were working successfully in the background, while our original machine operated as normal.
All that remained was to flip our feature flag such that leader election & job distribution came into effect. So we flipped it and … it worked! What seemed like a potentially daunting deployment process ended up being quite easy.
Although this incremental approach added some development overhead, it was entirely worth it: we had no outages or customer impact at all during the rollout.
The steady state
We can now expand our CloudWatch integration—either in the number of AWS services we support, or the number of customers using it—without worrying about capacity. Growing is easy: we just provision a new machine for the cluster and it automatically joins and starts working.
We’ve watched (and intentionally triggered!) various leader failovers, and noticed them only in our graphs: failover happens as seamlessly and automatically as we’d hoped.