The 5 Pillars of a Good Runbook
If you’ve found your way here, you likely already know what a runbook is. As a quick summary, it’s a written-down set of steps for carrying out a process or workflow with cloud services. There are several types of runbooks, but one of the most common deals with incident response and incident remediation. If you need a runbook example, you can look at what GitLab open sourced. Typically, runbooks are created and used by SREs or level 1 support staff.
What are the 5 pillars?
- Consistency: The same actions are taken every time.
- Accessibility: The runbook needs to be easily found and used by the response team.
- Automation: The runbook should automate the mundane and time-consuming tasks.
- Observability: You need to record the inputs and outputs of every step for later analysis.
- Leveling Up: Based on the results (what you observed) the runbook should be dynamic and constantly improving.
Consistency
Everyone has heard some version of the adage “The definition of insanity is doing the same thing and expecting a different result.” It usually describes a situation where you actually want a different result, but with runbooks we want the opposite: once you’ve gotten good at creating runbooks, you want the same good results every time.
Level 1 support personnel are not the only ones you need to worry about. Very experienced cloud engineers may not know the particular system involved. Even those who do know it can be inconsistent in their approach, because they know it so well that they work from memory (or because it’s 3:00 am and they’re not at their best).
Accessibility
We’ve spoken with a lot of SRE/DevOps teams, and many still keep their runbooks on a wiki or in a folder on a shared drive. Imagine getting paged, going to the wiki, and trying to make sure you’ve found the correct runbook for the issue at hand. Some organizations also have different runbooks for different environments. Is this one for dev, test, or production? Is it for the correct region? Do I have all the permissions needed?
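One way to take the guesswork out of "which runbook is this?" is to make the lookup explicit. The sketch below is only an illustration: the `RunbookKey` fields, the `RUNBOOKS` registry, and the file paths are all invented for this example, not taken from any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunbookKey:
    """Everything a responder would otherwise have to check by hand."""
    alert_type: str
    environment: str
    region: str


# Hypothetical registry: the alert types, environments, regions, and paths
# are made up for illustration.
RUNBOOKS = {
    RunbookKey("disk_full", "production", "us-east-1"): "runbooks/prod/us-east-1/disk_full.md",
    RunbookKey("disk_full", "dev", "us-east-1"): "runbooks/dev/us-east-1/disk_full.md",
}


def find_runbook(alert_type: str, environment: str, region: str) -> str:
    """Return the one runbook registered for this exact combination."""
    key = RunbookKey(alert_type, environment, region)
    try:
        return RUNBOOKS[key]
    except KeyError:
        # Failing loudly is safer than silently falling back to a runbook
        # written for a different environment or region.
        raise LookupError(f"No runbook registered for {key}") from None


print(find_runbook("disk_full", "production", "us-east-1"))
```

The design choice here is that a missing entry raises an error instead of returning a near match, so a responder never runs a dev runbook against production by accident.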
Automation
ZK Research found that 90% of MTTR is spent finding and verifying the source of the problem. When an incident occurs, the responder often has to first acknowledge the alert (unless you don’t have monitoring and alerts, and you learn of outages from your users), then gather the appropriate system information, then find the runbook that should be used to start the repair. Automating these steps shrinks that 90% and so directly reduces resolution time. By tying together monitoring, alerting, and data collection, the SRE can have everything they need at their fingertips to make the call on the next remediation steps. Teams can go further and automate incident response communication, for example by spinning up a Slack channel or creating a Zoom meeting and adding the right people for the severity and type of issue.
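The three pre-repair steps described above (acknowledge, gather, find the runbook) can be sketched as a single function that always runs them in the same order. This is a minimal sketch under stated assumptions: `handle_alert` and the `fake_*` helpers are hypothetical stand-ins, and a real setup would call your own alerting and monitoring systems instead.

```python
from typing import Any, Callable, Dict

Alert = Dict[str, Any]


def handle_alert(
    alert: Alert,
    acknowledge: Callable[[str], None],
    gather_context: Callable[[Alert], Dict[str, Any]],
    select_runbook: Callable[[Alert, Dict[str, Any]], str],
) -> Dict[str, Any]:
    """Run the three pre-repair steps in a fixed order and bundle the results."""
    acknowledge(alert["id"])
    context = gather_context(alert)
    runbook = select_runbook(alert, context)
    return {"alert": alert, "context": context, "runbook": runbook}


# Stand-in implementations (hypothetical) so the sketch runs on its own.
acknowledged = []


def fake_acknowledge(alert_id: str) -> None:
    acknowledged.append(alert_id)


def fake_gather_context(alert: Alert) -> Dict[str, Any]:
    return {"host": alert["host"], "recent_deploy": False}


def fake_select_runbook(alert: Alert, context: Dict[str, Any]) -> str:
    return f"runbooks/{alert['type']}.md"


briefing = handle_alert(
    {"id": "A-1", "type": "disk_full", "host": "web-01"},
    fake_acknowledge,
    fake_gather_context,
    fake_select_runbook,
)
print(briefing["runbook"])
```

Passing the three steps in as functions keeps the ordering fixed while letting each team plug in whatever paging, monitoring, and runbook storage they already use.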
Observability
Doing post-mortems after a cloud incident is resolved is critical to understanding what went wrong, why it went wrong, and how to do better in the future. The only way to do this successfully is to have access to all the data, logs, and actions taken. Your runbook automation needs to log all the inputs to each step and the outputs from each step in an easy-to-consume way. You also need to be able to see which runbooks are executing and have a dashboard with insight into the current health of the system.
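Recording the inputs and outputs of every step can be as simple as wrapping each step in a small helper. The `RunLog` class below is a hypothetical sketch, not the API of any specific runbook tool; it keeps records in memory and serializes them to JSON for later analysis.

```python
import json
import time
from typing import Any, Callable, Dict, List


class RunLog:
    """Collects one record per runbook step: inputs, output or error, timing."""

    def __init__(self) -> None:
        self.records: List[Dict[str, Any]] = []

    def run_step(self, name: str, func: Callable[..., Any], **inputs: Any) -> Any:
        record: Dict[str, Any] = {
            "step": name,
            "inputs": inputs,
            "started_at": time.time(),
        }
        try:
            record["output"] = func(**inputs)
            record["status"] = "ok"
            return record["output"]
        except Exception as exc:
            # Failed steps are recorded too, then re-raised to the caller.
            record["status"] = "error"
            record["error"] = repr(exc)
            raise
        finally:
            record["finished_at"] = time.time()
            self.records.append(record)

    def to_json(self) -> str:
        # default=str keeps non-JSON values (e.g. datetimes) from breaking the dump.
        return json.dumps(self.records, indent=2, default=str)


def check_disk(host: str, threshold_pct: int) -> dict:
    # Placeholder step; a real one would query the host.
    return {"host": host, "over_threshold": False, "threshold_pct": threshold_pct}


log = RunLog()
log.run_step("check_disk", check_disk, host="web-01", threshold_pct=90)
print(log.to_json())
```

Because the record is appended in the `finally` block, a step that raises still leaves a trace, which is usually the step you most want to see in a post-mortem.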
Leveling Up
Once you’ve done your incident post-mortem and gained insights from the data about what happened and how your team responded, you can look for ways to improve. Adapting the runbook over time to make it more efficient, and modifying your system to be more stable, are both critical to achieving better uptime and service availability. Your runbooks need to be easy to adapt and change as needed, without obscure hard-coded steps that may have been created by someone no longer working at your company. I’ve looked at a lot of SRE profiles on LinkedIn, and most tend to have tenures of around a year at each company, so the author of a runbook may well be gone by the time it needs changing.
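One way to avoid obscure hard-coded steps is to keep a runbook as data, with its parameters declared up front. The sketch below is an assumption-laden illustration: the `RUNBOOK` structure and `render_steps` function are invented here, and it uses Python's standard `string.Template` for substitution.

```python
from string import Template

# Hypothetical runbook definition: the parameters live in one visible place
# instead of being buried inside each step.
RUNBOOK = {
    "name": "restart_service",
    "parameters": {"service": "web", "region": "us-east-1"},
    "steps": [
        "Check health of $service in $region",
        "Restart $service in $region",
    ],
}


def render_steps(runbook: dict, overrides: dict = None) -> list:
    """Fill in each step's placeholders, letting callers override the defaults."""
    params = {**runbook["parameters"], **(overrides or {})}
    # substitute() raises KeyError if a step uses an undeclared parameter,
    # so a typo in a step surfaces immediately.
    return [Template(step).substitute(params) for step in runbook["steps"]]


for line in render_steps(RUNBOOK, {"region": "eu-west-1"}):
    print(line)
```

With this shape, changing a region or service name after a post-mortem is a one-line edit that whoever inherits the runbook can make safely.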
Fylamynt has created the world’s first enterprise-ready low-code platform for building, running, and analyzing SRE cloud workflows. With Fylamynt, an SRE can automate the most time-consuming parts of a runbook, leaving them free to make decisions where their expertise is needed. Fylamynt can show you a runbook example and guide you through creating runbooks.