The Secrets of Load-balancing Long Lived TCP Connections
How do you deal with load-balancing customer traffic at the border of your infrastructure when you don’t own the network? Following a series of experiments, I implemented a service that leverages our internal Graphite monitoring to dynamically weight HAProxy backend servers based on some measurement of load.
In the early days, we relied on Route53 round-robin DNS for load-balancing data sent to our TCP and UDP endpoints, but the limitations of this approach became more obvious as we scaled. Although DNS is a simple and relatively cost-effective solution for load-balancing, it has several drawbacks: clients disobey DNS record TTLs, and we have no control over client-side DNS caching. On top of that, for services like ours with non-uniform per-connection data rates, there was no actual balancing being applied. As a result, as we added more and more servers, our DNS records continued to grow and inevitably became difficult to manage. We started to have regular issues with customers overloading servers with TCP traffic sent over a small number of connections. In such cases, rotating servers out of service to mitigate the impact was a manual process for our SREs and would not solve the issue quickly enough. We needed a better way to load-balance those long-lived TCP connections.
We decided to run HAProxy instances on each of our ingestion nodes to perform intra-cluster TCP load balancing. With this setup we had a traffic path like so:
We began experimenting with the round-robin and leastconn balancing methods; the latter, which sends each new connection to the node with the fewest active connections, gave the better results. At first, we were pretty happy with how this worked, but we knew it would be impossible to truly balance the load across the cluster in this way, as not all connections are created equal (5 data-points/s accepted for one connection…5,000 data-points/s for another). So, for a short period, we rolled with this limited load balancing and continued to rely on DNS for balancing UDP.
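To make that concrete, a minimal sketch of the intra-cluster proxy configuration would look something like the snippet below – the names, addresses and port are placeholders rather than our real setup:

```
frontend tcp_ingest
    mode tcp
    bind *:2003
    default_backend ingest_nodes

backend ingest_nodes
    mode tcp
    balance leastconn
    server ingest-01 10.0.0.11:2003 check
    server ingest-02 10.0.0.12:2003 check
    server ingest-03 10.0.0.13:2003 check
```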
Building an infrastructure border
At this point, we realised we desperately needed to find a way to effectively balance our UDP traffic across our ingestion layer, which sat at the border of our infrastructure. Client-side DNS caching was hurting us: some customers were sending huge amounts of UDP data to a small subset of our ingestion layer, causing overload and backlogged data that the SRE team were frequently getting paged to investigate. Further digging into our options threw little light on the problem, until a fellow SRE, who was leading the project, decided to build a new load-balancing layer which would act as the infrastructure border. With a border layer of load balancers in place, we could clean out our ever-growing DNS records. It also allowed us to continue load-balancing TCP connections with HAProxy, while UDP balancing would be handled by Linux Virtual Server (LVS). The LVS project is a high-performance software load balancer that has existed in the Linux kernel since 2001. Why we chose it as our UDP load-balancing solution is worthy of an entire blog post of its own!
Connecting the dots
We now had a border layer of load balancers that looked something like this:
This new load-balancing layer opened up another opportunity: to revisit our flawed balancing of TCP connections. As the engineer tasked with achieving TCP traffic convergence, I started by looking at how we could use the tools we already had to balance highly variable traffic rates over long-lived TCP connections. Usefully, HAProxy provides a method for dynamically weighting backend servers via its UDS (Unix Domain Socket) API, allowing new connections to be balanced across nodes based on their weighting – but what should determine the weights? The obvious answer was our own extensive service monitoring, which we had easy access to through our Graphite API.
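As a minimal sketch (not our production code), reading and setting weights over that socket boils down to the `get weight` and `set weight` commands; the socket path and backend/server names here are assumptions for illustration:

```python
import socket

HAPROXY_SOCKET = "/var/run/haproxy.sock"  # assumed path to HAProxy's admin socket


def haproxy_cmd(command):
    """Send one command to the HAProxy UDS API and return its reply."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(HAPROXY_SOCKET)
        sock.sendall(command.encode() + b"\n")
        return sock.makefile().read().strip()


# Inspect the current weight of a server, then assign a new one (1-256).
print(haproxy_cmd("get weight ingest_nodes/ingest-01"))
haproxy_cmd("set weight ingest_nodes/ingest-01 64")
```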
I implemented a new Python service to be run on each load balancer. This would handle the dynamic weighting of individual servers for a specific HAProxy backend based on the results of some measurement of load. In our case, the “measurement of load” was the overall TCP datapoint rate observed at each node in the ingestion layer – however, this could be anything! loadavg? Network bandwidth? Disk space? (Sure, why not!)
Leveraging our Graphite render API, we could weight the forwarding of new connections to each node based on the results of a render query something like:
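(The metric path below is illustrative rather than our real naming scheme.)

```
/render?target=aliasByNode(ingestion.*.tcp.datapoints_received_per_second,1)&from=-3min&format=json
```

Each series in the JSON response corresponds to one ingestion node, giving the weighter a fresh view of per-server load on every pass.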
How it works
So, how does this all work? The operation loop of a single weighter process is as follows:
- On initialisation, using the included UDS wrapper, haproxy_weighter fetches the weight currently set for each server present in the configured HAProxy backend.
- Every rebalance interval, the weighter renders the configured Graphite target to acquire some measurement of load currently observed on each relevant server.
- Weights are then calculated for each server by feeding the render results to one of two configurable weighting functions.
- The calculated weights are then set in HAProxy, again using the UDS API.
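Stitched together, and leaning on the `haproxy_cmd` helper from earlier, a simplified sketch of that loop could look like this – the Graphite endpoint, backend name and interval are assumptions, and mapping series names onto HAProxy server names is glossed over:

```python
import time

import requests

RENDER_URL = "http://graphite.internal/render"  # assumed internal render endpoint
BACKEND = "ingest_nodes"                        # assumed HAProxy backend name
REBALANCE_INTERVAL = 60                         # seconds between rebalances


def render_target(target):
    """Render the configured Graphite target and reduce each series to one datapoint."""
    resp = requests.get(
        RENDER_URL,
        params={"target": target, "from": "-3min", "format": "json"},
        timeout=10,
    )
    resp.raise_for_status()
    loads = {}
    for series in resp.json():
        values = [v for v, _ in series["datapoints"] if v is not None]
        if values:
            loads[series["target"]] = values[-1]  # leading edge; averaging the series also works
    return loads


def weighter_loop(target, weighting_function):
    """Every rebalance interval: render, calculate weights, push them to HAProxy."""
    while True:
        loads = render_target(target)
        weights = weighting_function(loads)  # e.g. the limited-proportions function below
        for server, weight in weights.items():
            haproxy_cmd("set weight %s/%s %d" % (BACKEND, server, weight))
        time.sleep(REBALANCE_INTERVAL)
```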
Each weighting function takes a single datapoint for each server, aggregated from the render results using either the leading edge or the average of the whole series, and returns a weight between 1 and 256. We currently have two options: a normalised exponential distribution, and limited proportions, which is what we use for balancing TCP connections to our ingestion layer. With limited proportions, the weight is determined by calculating the inverse proportion of each value in the list of datapoints, while limiting the minimum datapoint value to `1/LIMITING_FACTOR` of the average; this avoids lower-range outliers being assigned a maximum weight. An `AGGRESSIVENESS_FACTOR` constant can be raised or lowered to adjust the standard deviation between weightings.
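A rough interpretation of the limited-proportions function is sketched below – the exact normalisation in haproxy_weighter may differ, but the shape is the same: clamp low outliers, take inverse proportions, then scale into HAProxy's 1–256 range:

```python
LIMITING_FACTOR = 4        # assumed value: floor each datapoint at 1/4 of the average
AGGRESSIVENESS_FACTOR = 1  # raise to exaggerate weight differences, lower to dampen them


def limited_proportions(loads):
    """Map {server: load_datapoint} to {server: weight in 1..256}."""
    average = sum(loads.values()) / len(loads)
    if not average:
        return {server: 256 for server in loads}  # no observed load; weight everything evenly
    floor = average / LIMITING_FACTOR
    # Clamp lower-range outliers so a near-idle server can't swallow the whole weight range.
    clamped = {server: max(value, floor) for server, value in loads.items()}
    # Inverse proportions: the busier a server, the smaller its share of new connections.
    inverses = {server: (average / value) ** AGGRESSIVENESS_FACTOR for server, value in clamped.items()}
    top = max(inverses.values())
    return {server: max(1, round(256 * inv / top)) for server, inv in inverses.items()}
```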
Once the haproxy_weighter was deployed, we saw a significant improvement. Our DNS records are smaller and much easier to manage, and we no longer have outliers within our ingestion layer. Where before we would see servers receiving five times the average traffic due to a couple of particularly traffic-heavy connections, we now have nicely even per-server traffic rates. TCP traffic convergence was achieved. This system also gives us the ability to react quickly and automatically to traffic spikes, routing new connections away from overloaded servers and saving our SREs from getting paged in the depths of the night.