By Merlin Carter
In this discussion, we go through the key decisions you need to make and what to do after you’ve resolved the incident.
This week, Fastly had an outage that made headlines across the world. One can only imagine what their SRE team went through to resolve it. So I talked to our CTO Stephan Schulze about how he handles critical incidents.
If you prefer to read rather than listen, I’ve included a written summary of the interview further down:
Merlin: When do you give up and roll back to the last reliable state?
Stephan, let’s talk about when things go wrong, like when you have some kind of error or bug in production and you need to fix it. When it gets too difficult to fix, you’re faced with a dilemma.
- Do you keep trying just that little bit longer to crack the problem?
- Or do you throw your hands in the air and say, “OK, we need to roll back to the last reliable state”?
How do you approach this dilemma?
Stephan: First, you need to check if data was affected
It’s a good question but a tricky one. It depends on the time frame between when you deploy and when you discover the bug.
If it’s short, then you should probably roll back rather than investing too much time in fixing it — at least for smaller, less critical bugs. These are typically bugs where no data was affected. And that’s a key issue. Generally, you need to ask yourself the question “did this bug affect any of my data?”
For example, imagine you’re deploying a new feature that requires changes to the database schema so the deployment includes a database migration. You need to ask yourself: “If something goes wrong, how do I roll back this change?”
Let’s say your deployment included a user database migration and you spot a problem shortly after you go live. Depending on what you changed, rolling back is probably the better strategy because nothing has happened yet, right? You deploy, you see something’s not working, you roll back straight away. That’s an easy decision.
But what if you’re deploying a high-traffic application that has already processed some data after the database migration? Maybe you discovered the bug 10, 15, or even 20 minutes after you deployed. That’s a much harder decision because doing a rollback could result in data loss or even more issues with your application.
Doing nothing is not an option because the longer you leave it, the worse things will get for your application. So you need to do an impact assessment:
- If your whole system is crashing, then rolling back is still the most sensible decision.
- If it’s just a service degradation, there might be other options.
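The decision logic described so far can be sketched as a small function. This is only an illustration of the reasoning, not a rule stated in the interview: the `Incident` fields, the `Action` names, and the order of the checks are assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    ROLL_BACK = "roll back"
    MITIGATE_AND_FIX = "mitigate and fix forward"


@dataclass
class Incident:
    # Hypothetical fields chosen for this sketch.
    data_written_since_deploy: bool  # has the new version already processed data?
    system_down: bool                # full outage rather than a degradation?


def choose_action(incident: Incident) -> Action:
    # Nothing has been processed yet, so rolling back is cheap and safe.
    if not incident.data_written_since_deploy:
        return Action.ROLL_BACK
    # The whole system is crashing: rolling back is still the most sensible
    # choice, even though it may cost some data.
    if incident.system_down:
        return Action.ROLL_BACK
    # Degraded but working, with data at stake: explore other options first.
    return Action.MITIGATE_AND_FIX
```

In practice the inputs are rarely clean booleans, so a function like this is better read as a prompt for the impact assessment than as a replacement for it.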
So let’s continue the example. Say you deploy your application at a time when there’s usually less traffic, less load. Perhaps in the early morning. At first, the new feature works fine. But later in the day, the traffic increases and you see that your database server is using too much CPU. And the number of read/write operations is much lower than normal. You see that the problem is obviously related to the feature you just deployed.
What do you do? In this case, a rollback isn’t the answer. The feature technically works and it has already processed some data, but it’s causing a serious service degradation.
So you need to solve the problem as fast as possible. There are different tactics you can use in this situation.
What I typically prefer to do is bring everyone into a call, a teleconference, or whatever works for the team. The main thing is to have some kind of audio channel that is always open and that everyone can hear. Then you nominate someone to take the lead and guide the other developers through the incident resolution process. This person should be experienced enough to bring some structure to the process and to keep everyone calm. A hectic scramble isn’t going to help anyone.
And then you go step by step through the options, keeping in mind how much each option could cost you.
- Maybe you just need to increase the storage capacity…
- Maybe you allocate more memory…
- Maybe you increase the number of workers…
All of these are short-term solutions that won’t solve the root cause, but they can reduce the pain and buy you more time.
But it’s especially important that whoever’s guiding the resolution process has the power to make real business decisions. If you decide to double the number of CPUs, for example, you’ll need to get the budget approved to incur that extra cost. You need someone who has the power to approve that budget themselves, without having to jump through any administrative hoops. You need to “staunch the bleeding” as soon as possible, and clean up later.
And as I said, getting everyone into a single call or Slack channel, or whatever you prefer, helps to keep all of these decisions transparent.
You can also pull new people into the channel once you find out more about the problem. For example, you might have a hypothesis that the bug is caused by part of a library that Jane wrote. You can then pull Jane into the call or channel and see whether she can help to fix the issue.
Merlin: Do checklists help?
It’s funny, there’s a part in the book “The Unicorn Project” which covers the exact scenario you just described. They created a “war room” where everyone was physically in the same room with their computers solving the problem together. Obviously, this doesn’t work for distributed teams, but an open audio channel serves the same purpose.
Another metaphor that comes to mind is the cockpit of a plane. When a plane is malfunctioning, all the crew and technical staff come together in the cockpit and work through predefined checklists to solve the problem.
In the aviation industry, there are very clear “playbooks” and checklists for diagnosing or resolving specific types of mechanical issues.
Do you use anything like that for critical application issues? Is the process “systematized” or is it different each time?
Stephan: It depends on your organization
I know there are companies that have these checklists, but they have a bit of experience already. If you’ve handled enough incidents, you start to develop a routine which informs these processes and checklists. You’ll have some broad categories of symptoms and their root causes, such as CPU issues, or network failures. So much like a doctor, you can find the right solution based on your list of symptoms and appropriate treatments.
But if you discover a new type of problem, you need to work as a team to mitigate it. If we go back to your cockpit metaphor, you also need a pilot. Someone who takes ultimate responsibility and coordinates the resolution process.
So to be clear, you can work through these checklists but this won’t always work. In either case, you still need a leader and a coordinated team.
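As a rough sketch of that idea, a checklist collection can be as simple as a lookup from symptom category to first steps, with a fallback to the team-based process for problems nobody has seen before. Every category name and step below is hypothetical and only illustrates the shape of such a playbook.

```python
# Hypothetical playbook: the categories and steps are illustrative only.
PLAYBOOK: dict[str, list[str]] = {
    "high_cpu": [
        "Check which queries or jobs started with the last deployment",
        "Add CPU or workers to buy time",
    ],
    "network_failure": [
        "Confirm the scope of the failure",
        "Fail over to a healthy zone if one exists",
    ],
}

# Fallback for new types of problems: coordinate as a team under one lead.
DEFAULT_STEPS = [
    "Open a shared call or channel",
    "Nominate an incident lead",
    "Assess the impact before choosing a fix",
]


def checklist_for(symptom: str) -> list[str]:
    """Return the known checklist for a symptom, or the team fallback."""
    return PLAYBOOK.get(symptom, DEFAULT_STEPS)
```

Calling `checklist_for("high_cpu")` returns the CPU checklist, while an unknown symptom returns `DEFAULT_STEPS`, which mirrors the point that a leader and a coordinated team are needed either way.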
Merlin: Do you try to prepare our startups for critical incidents?
Larger organizations tend to have more “tried and trusted” processes. But what about our portfolio companies and other early-stage startups?
Do you sit them down and say “look, you’re going to have an incident one day, it’s just a matter of time, here’s how you deal with it”? Do you proactively prepare CTOs and tech teams for when things go wrong?
Stephan: Sometimes, if there’s a considerable revenue risk
I don’t do it in any structured way. I have general discussions about technical risks and issues. Part of my job is to talk to people in my network and get a feel for where they fit in terms of their company’s maturity.
Given the size of the team, I need to assess the impact of any potential incident. An incident always costs money, but there are different ways it can cost you. It could have a direct impact on your revenue (due to customers not being able to use your application) or it could slow your growth because your team isn’t working on features.
So the larger the business, the more revenue you are generating, the more critical it is to avoid any kind of incident or outage. If your traffic and user numbers are still low, then an incident might only cost you €500 per hour. When you get larger, a serious incident might cost you €50,000 per hour.
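To make the difference concrete, here is a small worked calculation using the two hourly figures above. The 90-minute duration is an assumption picked for the example, and the function name is hypothetical.

```python
def incident_cost(cost_per_hour_eur: float, duration_minutes: float) -> float:
    """Direct cost of an incident, assuming the hourly cost stays constant."""
    return cost_per_hour_eur * duration_minutes / 60


# Same 90-minute incident (assumed duration) at two stages of growth.
early_stage = incident_cost(500, 90)       # 750.0
later_stage = incident_cost(50_000, 90)    # 75000.0
```

The same incident is a hundred times more expensive at the later stage, which is the point at which preparation starts to pay for itself.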
That’s when I’m more likely to ask about incident management and impact assessments. But I’m not going to tell people exactly how they should handle their incidents. There are already a lot of books and resources that you can refer to. Some O’Reilly titles come to mind, such as “Seeking SRE: Conversations About Running Production Systems at Scale” or “Site Reliability Engineering: How Google Runs Production Systems”. They both have chapters on incident resolution.
Also, senior staff have typically acquired this knowledge throughout their careers. They’re aware of the broader incident categories and resolutions.
But even if you have an incident that you’ve never encountered before, there are processes to learn from it. What you do after an incident is just as important as how you resolve it.
That’s why I advise people to always hold a post-mortem meeting. Even if you couldn’t find the root cause, you should still take the time to map the chronology of events. Create a timeline so that you understand precisely what happened, and when.
Then write down what you learned and identify other things that could go wrong in the future. That way, as an organization, you can prevent these issues or at least be prepared when they do occur.
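One lightweight way to map the chronology is to collect events as people remember them and sort them by timestamp afterwards. The sketch below assumes a hypothetical `TimelineEvent` record, and the events themselves are made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class TimelineEvent:
    at: datetime
    description: str


def build_timeline(events: list[TimelineEvent]) -> list[str]:
    """Sort events chronologically and render one line per event."""
    ordered = sorted(events, key=lambda event: event.at)
    return [f"{e.at:%H:%M} {e.description}" for e in ordered]


# Made-up events, entered in the order people remembered them.
notes = [
    TimelineEvent(datetime(2021, 6, 8, 14, 5), "Database CPU alert fires"),
    TimelineEvent(datetime(2021, 6, 8, 7, 30), "New feature deployed"),
    TimelineEvent(datetime(2021, 6, 8, 14, 20), "Incident call opened"),
]
timeline = build_timeline(notes)
```

Here `timeline` starts with the deployment at 07:30 and ends with the incident call at 14:20, which makes gaps such as the time between deployment and detection easy to see.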
That’s how I typically advise our portfolio companies.
Merlin: Great, thanks for the detailed and considered advice. You piqued my curiosity when you mentioned risk assessment. I’d like to know more about that, but that’s not for this session. We do have a follow-up session about running post-mortems though, so I’m looking forward to that.
Thanks again for your insights, Stephan.
Stephan: Thanks Merlin, I’m looking forward to the next session.