Post-mortems don’t have to be painful

How to run an effective and peaceful post-mortem and document what you’ve learned so that it doesn’t happen again

By Merlin Carter

Recently, our CTO Stephan Schulze talked about how to handle a critical incident in production. But what do you do after the emergency is over? This week we talk about how to pick up the pieces and learn from the experience.

And it’s not just about process. We also cover the emotional and human aspects of post-mortems and how to foster psychological safety.

If you prefer to read, there’s a summary of the discussion below.

How do you run a post-mortem?

Merlin: After you’ve resolved a critical incident, there are a few things you can do, such as running a root cause analysis or working through the “Five Whys”.

These processes are typically part of a post-mortem. The idea behind the post-mortem is that you document what you’ve learned so that the incident doesn’t happen again.

Stephan, you’ve been through a lot of these. What’s your general process for doing a post-mortem?

Create a timeline and share your findings with the whole company

Stephan: My approach is pretty simple. I try to create a detailed timeline of what happened together with the team.

We start by pinpointing exactly when the issue was first discovered and by whom. We then look at the steps taken to mitigate the problem. For example, maybe we deployed a short-term hotfix. Then we take some time to find the root cause of the issue and go through each event second by second.

To do this, you need to bring in everyone who was involved in the incident: the people who observed it and the people who fixed it.

For example, let’s say you had a 30-minute outage at 12 noon. What we’d do is create a very detailed log of what happened.

So you’d have something like this:

12:00:30  An issue was discovered
12:00:45  An automated alert was sent out to the team
12:15:00  After investigating the issue, a problem with the load balancer was identified
12:20:00  We tried to apply a fix, but it didn't work

And so on…

So you have a detailed account of what you tried and why. And at the end of your account, you hopefully have an entry explaining how you finally solved the problem.
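If you keep the timeline as structured data rather than free text, it becomes easy to sort the entries and see how much time passed between events. The following is only a minimal sketch in Python, not a tool we actually use: the `TimelineEntry` class, the `elapsed_since_first` helper, and the calendar date are all made up for illustration. Only the four entries come from the example log.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class TimelineEntry:
    """One line of the incident log (hypothetical structure)."""
    at: datetime  # when the event happened
    note: str     # what happened, stated as a fact with no names and no blame


def elapsed_since_first(entries: list[TimelineEntry]) -> list[tuple[timedelta, str]]:
    """Sort entries by time and pair each note with the time since the first entry."""
    ordered = sorted(entries, key=lambda e: e.at)
    if not ordered:
        return []
    start = ordered[0].at
    return [(e.at - start, e.note) for e in ordered]


# The four entries from the example log; the date itself is a placeholder.
day = datetime(2024, 1, 1)
timeline = [
    TimelineEntry(day.replace(hour=12, minute=0, second=30),
                  "An issue was discovered"),
    TimelineEntry(day.replace(hour=12, minute=0, second=45),
                  "An automated alert was sent out to the team"),
    TimelineEntry(day.replace(hour=12, minute=15, second=0),
                  "A problem with the load balancer was identified"),
    TimelineEntry(day.replace(hour=12, minute=20, second=0),
                  "A fix was applied, but it didn't work"),
]

for delta, note in elapsed_since_first(timeline):
    print(f"+{delta}  {note}")
```

Printing the offsets makes gaps visible at a glance. Here, for instance, roughly fourteen minutes pass between the automated alert and the load balancer being identified, which is the kind of gap the team can then ask “why?” about.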

After you’ve done your timeline, the second task is to detail the root cause, what you learned from the incident, and the steps you can take to avoid it happening in the future.

But that shouldn’t be the end of the story. People sometimes forget to communicate these findings outside of the tech organization.

You should make those details transparent to everyone in your company because the incident could have affected other teams. Your company might have lost some money.

So you need to send a clear signal that you’ve completely understood the problem and you know what to do in the future.

That kind of openness builds trust and confidence in your team throughout the organization.

Publicly available templates for post-mortems

Merlin: Yes, speaking of openness, I remember seeing a GitHub repo with a record of the post-mortems carried out by various companies such as Facebook and DataDog.

Screenshot: A GitHub repo with a record of the post-mortems carried out by Facebook, DataDog, and other companies
The owner, danluu, also did a bit of analysis on the common causes of these incidents

They’re usually well structured, but they all follow different document templates. Do you have any specific template that you use?

Communication is more important than templates

Stephan: Not really, except that I always have a “timeline and root cause” section. But those templates are a great resource if you’re doing a post-mortem for the very first time.

Generally, I think the way you communicate is more important than the template you use. What you definitely should not do is point your finger at someone and say, “Hey Peter, or Paula, what the hell did you do? It was your fault!”

If that starts to happen, you need to immediately step in and remind everyone that this isn’t the culture that you want in your team.

You need to foster an environment where everyone can openly speak about their problems, mistakes, and lessons learned.

So my main rule for post-mortems is no finger-pointing, no blaming.

Preventing discussions from getting too heated

Merlin: I have to admit, it’s easier said than done. I’ve sat in on a few post-mortems where things got pretty heated.

Typically, this is when one person feels that another person “should have known better”. Maybe they deployed without running a certain test or whatever. How do you prevent this from happening in the first place?

Do you try to give a pep talk beforehand so that people don’t get too emotional?

Set the rules of the game before you start

Stephan: Well, you normally agree on the rules of the game before you start the game, right?

Naturally, there are always going to be people who’ll break the rules anyway. That’s why you need to moderate the discussion closely. You need to stop finger-pointing as soon as it starts and stick to the facts.

However, emotions won’t just go away on their own. You need to provide a space for people to express their frustrations, but that’s not what a post-mortem is for. If tempers are high, schedule smaller one-to-one sessions with the people involved. Try to understand what’s causing these emotions.

In the post-mortem, you need to focus on what you can do to prevent the incident from happening again. Of course, if people are refusing to calm down, then perhaps you have a bigger problem with your team culture.

When psychological safety doesn’t extend to other teams

Merlin: Yes, on the subject of teams — what about the situation where your post-mortem involves several different teams?

For example, you have a small, tight-knit development team, and everyone is comfortable with providing and receiving constructive criticism.

But then you have to bring in a data team or another development team, and you don’t know them as well. Perhaps they’re more sensitive to criticism. But you have to tactfully point out that the action that triggered the incident was performed by someone on their team.

So how would you deal with a situation like this?

Always give people the benefit of the doubt

Stephan: Well, as I said, a post-mortem is not the right place to criticize other people or another team.

It’s for figuring out what caused the problem, not who.

You can just say, “we discovered that a specific asset wasn’t deployed.” Everybody knows who’s responsible for that asset, right? But even then, it doesn’t really matter.

Maybe the deployment for that asset just failed unexpectedly. Maybe you’re blaming a team for something they couldn’t have foreseen, and you’ve actually found a glitch in the process.

There are always good reasons why things didn’t happen. And your task is to find out what those reasons are. Not who is responsible.

Learning from experiences outside of work

Merlin: Yes, that sounds very sensible. I guess that’s the part where I’ve seen people struggling the most.

It would be interesting to see if the same approach would apply to something in your private life — in your household, for example. Say your wife comes home, and the house is a mess. The dishes were supposed to be done, the windows were left open, there were clothes on the floor, you name it.

It was your responsibility, and you didn’t do it. Wouldn’t it be understandable if she was angry at you?

Ask why before you assign the blame

Stephan: But you need to first ask, “Why?” Why didn’t that happen? As I said, there’s often a good explanation. Perhaps your kid had an accident, and you needed to take them to the doctor, and you only just got back.

It doesn’t make sense to start assigning blame before you find out why something happened.

Don’t assume the worst from people

Merlin: OK, I get you now. Your first assumption shouldn’t be that people are dropping the ball. You should keep an open mind until you hear their explanation of the incident. Even if something looks careless at first glance, there is usually a good reason behind it.

Most people want to do a good job

Stephan: Absolutely. You need to trust that your colleagues want to do a good job. And yes, people can be forgetful, but then you need to ask, “why do they tend to forget these steps?”

Maybe they’re overloaded with tasks already? Maybe a reminder system would help? You could look for ways that you could support them.

And if an incident was indeed caused by some kind of lapse in process, you can say, “Look, we fucked up, we’re taking responsibility for it, here’s what we’re going to do to make sure it doesn’t happen again.”

Merlin: Got it. People will always give you a chance if you can show that you’re willing to improve.

Don’t forget to get everything in writing

Stephan: Exactly. It all comes down to your attitude, and to being clear on the outcome and the results that you want to get out of the process.

You need to get this in writing. You need one person there who only has one job: take notes and record what everyone is saying. Like a stenographer. Or you can use some kind of automated transcription.

And I like that people are contributing these post-mortems to public GitHub repositories.

Those post-mortems can be a great resource for junior developers who need to find out how people typically solve these problems. It’s also the ultimate form of transparency.

Stephan: Thanks — by the way, we haven’t put any of our post-mortems into a GitHub repo, but it’s always an option. It depends on the nature of your company. It’s up to you.