"It's dead, Jim": How we write an incident postmortem
No matter how hard we try to offer an uninterrupted service, outages are inevitable. Fixing the underlying issue and notifying customers is crucial, but it shouldn’t end there. There must be a process in place to learn from what happened and make sure it doesn’t happen again.
In the penultimate part of our SRE process series, we look at how to write an incident postmortem–what it is, why it’s important, who should write it, and considerations to keep in mind before putting pen to paper.
What is a postmortem?
A postmortem is the written record of an incident, including its impact, the actions taken to mitigate it, and the lessons learned from it (including follow-up tasks). While the focus of our incident management is usually on mitigating an ongoing incident, the goal of the postmortem is to look forward and make sure we have learned as much as we can from a given incident, so we're in a better position than before to prevent similar issues from recurring. If we don't do this we'll be fighting fires every day, and that's no fun.
In other words, a postmortem is the process by which we learn from failure, and a way to document and communicate those lessons. The more we fail, the more learning opportunities we have.
Why are postmortems important?
There are several reasons postmortems are an incredibly important tool:
- They allow us to document the incident, ensuring that it won't be forgotten. A well-documented incident is invaluable because it includes not only a description of what happened but also of what actions we took and the things we believed to be true at the time, which can help inform our actions during future incidents.
- They are the most effective mechanism we can use to drive improvement in our infrastructure. There's nothing like seeing our services and processes fail in new and interesting ways to realise what areas need improvement.
- They help shift the focus from the immediate now ("we need to mitigate the impact from this incident now") to the future ("what can we do to improve our systems so this incident doesn't reoccur?").
- When they're posted publicly, they let our users know that we take every outage seriously, and that we're doing all we can to learn from each one and prevent any future disruptions to the service we provide.
What's the goal of a postmortem?
The number one goal of a postmortem is to learn things from it. It's not a great sign when you sit down to write a postmortem and you already know everything you're going to say. If we're not learning anything, we aren't digging deep enough.
The final postmortem document is just a (small) part of our postmortem process, so the goal of that process is not just to produce a document. The document's value lies in sharing the important lessons we have learned, both with the rest of the team and with the outside world. It is merely the conduit by which we share what we learned on our journey of discovery. In a way, you could say that the real postmortem was the friends we made along the way.
Why do we share our postmortems?
We believe in being open with our customers, and we take this very seriously with our customer communication during incidents, so publishing our lessons learned after an incident is just an extension of this. Our customers deserve to know why their service wasn't working the way they expect it to, and that when we tell them we'll do better in the future we're not just saying it: we have actual steps we'll take to ensure that's the case.
So we just had an incident. It probably required posting something on our status page, and now status page is giving you the option to write and publish a postmortem for this incident. This is where the fun begins.
Remember, a postmortem is not just a document
We already said that the goal of a postmortem process is not just to produce a document, but to learn from failure as much as we can. Part of this process involves asking some hard questions to extract as much learning juice as we can from failure. So if we feel we don't have much to say in a given postmortem, it could very well be because we haven't dug deeply enough into this particular incident and everything surrounding it.
For example, we shouldn't be satisfied with identifying what triggered an incident (after all, there is no root cause), but should use the opportunity to investigate all the contributing factors that made it possible, and/or how our automation might have been able to prevent this from ever happening. The lessons we learn from an incident only stop coming when we stop digging, so an incident with no lessons learned only means we didn't look hard enough.
When should I write a postmortem?
Postmortems are such a good learning opportunity that we should take every chance we get to write one, but the decision on whether to write one usually falls to the incident commander (normally the on-call engineer at the time). If we aren't sure if a given incident "deserves" a postmortem, it's never a bad choice to write one anyway. It's better to err on the side of oversharing than to give the impression that we don't care enough about communicating about our incidents, and we should always be happy to have another learning opportunity.
If you think an incident is "too common" to get its own postmortem that's a good indicator that there's a deeper issue that we need to address, and an excellent opportunity to apply our postmortem process to it. Sometimes a single instance of an incident can't give you enough information to get any meaningful lessons out of it, but when looking at a group of seemingly related incidents as an aggregate they might start to paint a clearer picture.
If we know that we'll want to write a postmortem before officially resolving the incident on our status page, it's always a good idea to tell our customers to expect a postmortem. A postmortem doesn't need to go out on the same day the incident happened, and there's certainly no expectation of staying until late or over the weekend writing one. Having a postmortem ready on the next business day after an incident is a good goal, but in some cases (such as particularly complex incidents, or times where we're still very busy dealing with the fallout) this could be delayed a bit more. Ideally, it should never take more than a week after the incident is resolved for the postmortem to be published.
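If you'd like a reminder for those targets, the date arithmetic is simple. Here is a minimal sketch in Python, assuming that "business day" means Monday to Friday (holidays are ignored) and that both dates are counted from the day the incident was resolved. The function names are made up for illustration and are not part of any tooling we use.

```python
from datetime import date, timedelta


def next_business_day(day: date) -> date:
    """Return the first Monday-to-Friday date strictly after `day`."""
    candidate = day + timedelta(days=1)
    while candidate.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        candidate += timedelta(days=1)
    return candidate


def postmortem_targets(resolved_on: date) -> tuple[date, date]:
    """Return (target date, latest acceptable date) for publishing.

    Assumption: the target is the next business day after resolution,
    and the hard limit is one calendar week after resolution.
    """
    target = next_business_day(resolved_on)
    latest = resolved_on + timedelta(days=7)
    return target, latest


# An incident resolved on a Friday gets a Monday target, so nobody
# is expected to write over the weekend.
target, latest = postmortem_targets(date(2024, 3, 1))
print(f"Aim for {target}, publish no later than {latest}")
```

Treat the target as a goal and not a deadline: complex incidents can take longer, as long as we stay within the week.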
It's also worth noting that not all postmortems need to be published on our status page or be tied to an actual incident. Sometimes we'll want to write a postmortem around near-incidents or incidents that didn't have enough of a visible impact to warrant updating our customers. A postmortem doesn't need to be published externally to be useful.
Who should be writing this postmortem?
It's usually up to the on-call engineer to write a postmortem for any incidents that happened during their watch, but as with many other things regarding on-call, this too can be delegated. The on-call engineer is still responsible for ensuring that we produced a postmortem and that it's shared both internally and publicly, but they don't necessarily need to write it themselves.
That said, just because one person is leading this process doesn't mean a postmortem is a one-person job. A postmortem is a team effort and you'll want input from everybody that was involved in the incident (and others that weren't). We all have a different perspective and a different mental model of what our systems look like, so only by combining them all will you get closer to the full picture of what really happened during an incident.
Is this going to be a finger-pointing exercise?
No. We could fill pages talking about blameless postmortems and how important they are, but the main takeaway is that we all make mistakes, and we're not here to point at those and say "our problem is that someone made a mistake, we'll try making zero mistakes next time". What we want is to learn why our processes allowed for that mistake to happen, to understand if the person that made a mistake was operating under wrong assumptions (and how our people can have the necessary information to make better decisions) or even why they were doing what they were doing in the first place (instead of that process being fully automated).
Nobody gets blamed when something goes wrong, but the more we share about these experiences, the more we'll learn about them. Despite not assigning blame, we can (and should!) explicitly identify times where a mistake was made.
It's important to call out mistakes, but our focus should be on the mistake itself and what can be learned from it, as opposed to the person making the mistake. That person becomes our leading expert on that particular mistake, so we'll want to learn from everything they have to teach us.
So where do I start?
The first (and most important) steps of the postmortem process don't require us to write a single word. Before we can put everything we've learned in a document we have to truly understand the incident.
- Compile a timeline of the incident. It's really useful to see what actions we took and when we took them, and the things we thought to be true at any point during the incident.
- Ask yourself (and others) a lot of questions. We know there is no (single) root cause, and that the story of an incident is composed of infinite hows, which means that a postmortem will only be useful if we continue digging and challenging any assumptions we have about our systems. Some examples of useful questions would include (but are certainly not limited to):
  - How did this failure go unnoticed for XX minutes? Do we not have alerts that cover this failure scenario? Did they work as expected?
  - Even in cases where something remains a mystery and we still don't know why it happened, what kind of instrumentation/diagnostics do we think we'd need to be able to identify it the next time it happens?
  - Did we accurately assess the impact originally? If we didn't, how can we make sure we do it better the next time?
  - Could the incident have been worse but maybe we got lucky somehow? What could happen if we don't have that kind of luck next time?
  - Did we get unlucky and an incident that shouldn't have been a major issue somehow became one? Then we need to dig into what the contributing factors were, since "have better luck next time" is not the best strategy.
  - Was the incident caused or made worse by something we did? What led us to believe that was the right course of action? Could our systems/tooling have prevented us from taking that action or mitigated its impact? Remember, our postmortems are blameless so this is not a finger-pointing exercise, but we need to be able to identify these instances so we can look at all the contributing factors.
  - Did we, at some point, make the wrong call? Did we have invalid/incomplete information at the time? Maybe our documentation was the issue?
  - What kind of information would we need to do better next time?
  - What was each of us thinking during the incident? How did we feel? Did we feel we had the right information/context at all times? The people involved in the incident are also a part of the system we're trying to learn about, and as such it's important not to overlook them.
While working through the timeline and asking questions you should be making a note of everything that makes you think "hmmm maybe this could have gone better if only we had X", as those will end up becoming our follow-up actions for this postmortem.
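If it helps to keep the timeline somewhere more structured than a notepad, it could look something like the sketch below. This is only an illustration in Python: the `TimelineEntry` fields, the helper functions, and the incident itself are all hypothetical, and a spreadsheet or shared doc would do the job just as well. The point is to record what we did, when we did it, and what we believed at the time, so that questions like "how long did this go unnoticed?" have a concrete answer.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class TimelineEntry:
    at: datetime
    action: str        # what we did or observed
    belief: str = ""   # what we thought was true at that moment


def build_timeline(entries):
    """Return the entries sorted by time, so the story reads in order."""
    return sorted(entries, key=lambda entry: entry.at)


def minutes_between(start: datetime, end: datetime) -> float:
    """Minutes elapsed between two points on the timeline."""
    return (end - start).total_seconds() / 60


# A made-up incident, purely for illustration. Entries can be added
# in any order as people remember them.
timeline = build_timeline([
    TimelineEntry(datetime(2024, 1, 10, 14, 32), "On-call engineer paged",
                  "Looks like a network blip"),
    TimelineEntry(datetime(2024, 1, 10, 14, 5), "Error rate starts climbing"),
    TimelineEntry(datetime(2024, 1, 10, 15, 10), "Rolled back the last deploy",
                  "The deploy is what triggered this"),
])

# "How did this failure go unnoticed for XX minutes?"
unnoticed = minutes_between(timeline[0].at, timeline[1].at)
print(f"Failure went unnoticed for {unnoticed:.0f} minutes")

# Every "maybe this could have gone better if only we had X" goes here.
follow_ups = ["Add an alert on error rate for this service"]
```

Keeping the `belief` field next to each action is the useful part: it preserves what we thought was true at the time, which is easy to lose once we know how the story ends.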
Next up, we’ll take a deeper look at the structure of a postmortem–section by section–with a helpful template, writing tips, and some pointers to keep in mind.
There are too many great resources out there to list, but the following should be considered required reading (or watching!) on the topic: