Deadlines, lies and videotape: The tale of a gRPC bug
TL;DR: If you use gRPC in your services, you’ll want to make sure you set a reasonable deadline for your RPC calls, upgrading to gRPC 1.16 as soon as possible is highly recommended. You should also enable client-side keepalive, and adjust the kernel setting for tcp_syn_retries (at least until the fix for this issue gets released).
Recently, we’ve been working on a project to decouple some parts of our ingestion and aggregation pipeline to make it more resilient against network failures. As a result, we needed a mechanism to transfer data as it’s processed to these new components. This is a system that’s subjected to high throughput, with reliability being an important factor. Due to these requirements we ended up settling on gRPC to drive traffic through this pipeline.
The component in question was due to be broken down into two pieces, which we’ll call layer0 and layer1. Layer0 would receive data directly from our ingestion pipeline, do a bit of preprocessing, and forward it to layer1 for further processing. Any delays in data reaching layer1 would also result in delays for our users to view their own data (we run a monitoring service). As a result, one of our design priorities was to make sure that a network failure, or a node (or set of nodes) experiencing issues wouldn’t delay processing. Any single layer0 node can have hundreds of layer1 destinations.
Conveniently, gRPC has the concept of a deadline, which allows us to be explicit in how much we’re willing to wait for a single RPC call to be handled by the server. It’s important to note that by default, this deadline is set to unlimited, which is almost certainly not what you want. We therefore highly recommend testing your application and setting a deadline that works for your service.
You think someone would do that, just go on the internet and experience timeouts?
When we started rolling out the new components, everything was working as expected–with one exception. Shortly after rollout, we started noticing some strange behaviour. At seemingly random intervals, our layer0 service would stop processing and forwarding data to all layer1 destinations. This would continue for what seemed like a random period of time, from a few seconds up to minutes. Our own health checking mechanisms detected this quickly and routed traffic around the affected nodes until they recovered, but this certainly warranted further investigation.
At this point, we didn’t know what triggered the issue and our own instrumentation wasn’t much help. It essentially showed our layer0 service freezing, and even stop reporting its own internal metrics, only to recover a few minutes later. As we didn’t know how to reproduce the problem, it was hard to make progress.
Shortly after, we noticed something interesting. During the course of an incident, network connectivity between some layer0 and layer1 nodes had been severely affected. Our system had been designed to handle this eventuality, but the resulting behaviour was not what we expected. If a single layer1 node experiences connectivity issues, we would only expect to see increased backlog to that particular destination, while others should continue processing traffic as normal. Instead, the whole layer0 would freeze completely. Once connectivity was restored, all layer0 nodes recovered and started working through their backlog as if nothing had ever happened.
Our expectation was that the layer0 service would notice that RPC calls to that particular layer1 destination were not succeeding within our deadline, and that our internal health checking mechanisms would disable that destination and route the flow of data around it.
The game’s afoot!
But at least now we had a clue. When network connectivity was broken between layer0 and layer1 nodes our gRPC deadlines did not seem to be enforced. That also gave us a way to reproduce the issue: we could just use iptables to break connectivity between two otherwise healthy nodes and the problem would appear.
We then suspected the issue to be within gRPC, but we didn’t have definitive proof, nor any useful information we could give the gRPC team. We needed to dig deeper. First, we started to trace our program with both strace and lptrace, in order to see if the blocking would happen within our own code or gRPC’s. The lptrace outputsuggested the blocking happened deep within gRPC. This revealed that the blocking call was whatever function call preceded the call to _check_call_error which, as the output suggests, is a function belonging to gRPC’s channel object.
It’s not you, it’s me
Once it emerged that the blocking itself wasn’t happening within our code, we needed a simple repro case before opening an issue with the gRPC project. After all, our own application is using gRPC in complex ways with many different destinations. It was therefore likely that the bug was being triggered by something we were doing on our end. Besides, we asked ourselves, if all it takes to trigger this issue is a break in network connectivity, why aren’t other people reporting it?
Turns out we just needed some minor modifications to the greeter client in their helloworld example to reproduce the issue…almost.
As it happened, our initial attempts to reproduce the issue with our modified greeter client were unsuccessful. The client would detect that the RPC calls were not succeeding within the stated deadline and fail accordingly, instead of blocking. As a result, we were concerned that the issue had a more subtle trigger that we were missing. We started making some modifications to the client code to make it more similar in behaviour to our own service (while still being a minimal repro case) and we found that the size of the message we were trying to send was relevant.
When we modified our client to send a larger message, we found that blocking happened almost immediately when connectivity was broken. It took much longer to reproduce the issue when sending smaller messages (which made the issue harder to detect during our initial tests).
At this point we already had enough information to make a reasonably detailed bug report with the gRPC project. However, we were still curious about this behaviour and wanted to understand the problem better, so we did some more tracing on our modified greeter client. The full results of our investigations are outlined in this issue comment–we found that the trigger for the blocking was the TCP send buffer filling up.
Every time the client tries to send a message it gets added to the TCP send buffer, so the sendmsg system call technically succeeds. This is normal behaviour: in a healthy system the kernel would make sure that the messages in the send buffer reach their destination. Seeing as in our case connectivity is broken, the send buffer will continue growing every time we attempt to send a new message (or retry sending a previous message), until the send queue is full. At that point, the sendmsg syscall will actually fail, and that’s when the blocking happens. It will stay that way until either connectivity is restored, or the kernel gives up on that particular connection, closing it.
This was an important find for us for two reasons:
- It provided more information for the gRPC team which would hopefully make it easier to identify and fix the underlying bug.
- It explained some of the behaviours we were experiencing, such as why the service will unblock itself after connectivity was restored, and why any blocking period wouldn’t last more than 20 minutes. This is because the number of retries the kernel is willing to do before giving up on a connection is governed by net.ipv4.tcp_retries2. The timeout between retries is calculated dynamically and is based on the RTT of the connection, but with our defaults results in around 15-20 minutes in total.
So you’re telling me there’s a chance…
The response from the gRPC team was great, and soon they had a patch due to be included in their 1.14 release. Unfortunately, the fix had to be reverted as it was causing some tests to fail and was therefore removed from the 1.14 release. It was then suggested that we could rely on keepalive to detect when a connection breaks. When keepalive detects that a channel’s connection is broken it will change the state of the channel, which we can use before sending data through it to decide to back off. This wouldn’t completely solve our problem, but would greatly reduce the likelihood of this issue affecting us.
In order to address this issue, the gRPC introduced the following proposal: TCP user timeout. This proposal would simply take advantage of the TCP_USER_TIMEOUT socket option when sending data. It controls the maximum amount of time that a socket can have unacknowledged data before the connection is forcibly closed by the kernel, allowing us to recover faster in this kind of scenario. We have been testing the new changes within our own services and, thanks to the combination of keepalive and TCP_USER_TIMEOUT, we could no longer reproduce the issue.
“Yes Mister Frodo, it’s over now”
Turns out, it wasn’t quite over.
All our testing around our original repro case indicated the original issue was gone so we rolled out this new version of gRPC along with our services, expecting the backlog buildups to be a thing of the past.
Not long after, we noticed that when connectivity would break, we were actually experiencing very similar behaviour to before, even with a fully upgraded gRPC. The only difference was that now our gRPC channels weren’t blocking forever (or up to twenty minutes), instead the channel would recover after (almost exactly) 127 seconds. Our first impulse was to verify that TCP_USER_TIMEOUT was correctly being set on our socket options when initiating a channel connection, which it was.
One thing that’s important to know about TCP_USER_TIMEOUT is that its value is only ever honoured for currently open connections, which means that we were probably timing out before the channel was able to establish a connection. An important kernel setting that can affect how long it takes to timeout when trying to establish a connection is tcp_syn_retries (and its sister socket option, TCP_SYNCNT). This value governs how many times we will try to retransmit a SYN packet before we give up on a connection attempt if our SYN attempts go unacknowledged.
When attempting to establish a connection, the kernel will retry sending SYN packets and exponentially backing off between retries. The first retry will happen 1 second after the first SYN attempt, with backoff time doubling for each subsequent retry (2 seconds for the second retry, 4 seconds for the third retry, etc). If we have sent tcp_syn_retries unacknowledged SYN retransmission attempts, then we give up on this connection attempt and close it.
The exponential backoff means that, with a higher number of retries, it will take much longer for a connection attempt to timeout. The current default on modern kernels is 6, which (as you can see on the table below) results in a connection timeout of 127 seconds.
tcp_syn_retriesMax connection timeout13s27s315s431s563s6127s
We experimented with different values for tcp_syn_retries and we could easily confirm that, setting a low enough number of retries for connection attempts greatly reduced the impact of a connection timeout on our own services. In our case a value of 2 or 3 are usually enough to avoid having noticeable impact, but the right value for you will depend on your own workload and services.
This was great news, and by changing this setting we’re now able to minimise impact of connectivity issues across our fleet. Unfortunately, this currently requires tweaking a kernel setting that affects your whole system and might not be an appropriate value for all services. The ideal solution would be to have enough control over the socket options gRPC sets so we can set the value of the TCP_SYNCNT socket option and leave the kernel defaults unchanged.
After some digging in the issues of the gRPC project, we found this bug report. The mention of gRPC “hanging” for 127 seconds when trying to send to unavailable hosts indicates that it’s a special case of our original bug report, that only affects channel connection attempts that are already unavailable. Hopefully this will result in the TCP_SYNCNT socket option being exposed through gRPC.
So what do I need to do?
If you already have client-side keepalive enabled, there’s no need to do anything special to avail of the new TCP timeout, gRPC will automatically enable and configure it for you. If you don’t have keepalive enabled in your gRPC applications, we highly recommend that you do it.
Here’s a modified version of the greeter client of gRPC’s own helloworld example, showing how you can enable keepalive and in turn set the TCP_USER_TIMEOUT socket option.
There are two channel options that you’ll need to set to enable keepalive on your gRPC client:
- grpc.keepalive_time_ms: How often keepalive pings should be sent by your client.
- grpc.keepalive_timeout_ms: How much we should wait for a keepalive ping to be acknowledge before considering this connection as unhealthy and closing it. TCP_USER_TIMEOUT will also be set to this value for the underlying socket.
The actual values for both settings will greatly depend on your own network and application. Shorter values will help you detect issues faster but you might end up flooding your own channel with ping messages, and get a bunch of false positives if latencies get a bit high. Find one that makes sense and works for your architecture.
If you’re worried about connection retries severely impacting your application, you might want to tweak the value of net.ipv4.tcp_syn_retries to something you’re comfortable with (see table above with what effective timeout values to expect for any value), but we recommend being very careful when modifying this value, and only doing so after testing extensively, as it can negatively affect your system if set too low.
TCP_USER_TIMEOUT has already been merged, and is available as part of the 1.16 gRPC release. Work on the second issue is still ongoing. We’d like to thank the gRPC team for their work on this project and their help with this particular issue.