IR — Incident Response, Repair, Resolution or Remediation?
It seems the SRE world agrees that the I in IR stands for Incident, but what about the R?
What does IR stand for?
Many people read it as Incident Response. Here, an incident means that something is not operating correctly in a cloud environment. The issue could be small, such as a performance dip that doesn’t stop operations from succeeding. Or the impact could be massive: an outage that halts all operations and brings data loss, revenue loss, and a damaged reputation (consider the recent Facebook or Roblox outages).
When an incident occurs, an organization will typically have a monitoring tool that spots the issue. These tools define acceptable ranges for key metrics (think too many 500s, or responses that take too long). The monitoring tool, or a separate alerting tool, then triggers an alert. The alerting logic could be simple (page everyone), or sophisticated rules could page just the on-call team or the subject matter experts for that type of incident. Incident response communication is another factor to consider. Modern tools will automatically spin up a Zoom call, a Slack channel, and a Jira ticket so that the response team can simply jump in. Next comes the much harder part: fixing the problem.
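The detection and routing logic described above can be sketched in a few lines of Python. This is only an illustration: the threshold values, the metric names, and the team names in `ROUTES` are hypothetical and do not come from any particular monitoring or paging product.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Thresholds:
    """Acceptable ranges for two example metrics (values are assumptions)."""
    max_error_rate: float = 0.05        # fraction of responses that are 5xx
    max_p95_latency_ms: float = 800.0   # 95th percentile response time


def detect_breaches(error_rate: float, p95_latency_ms: float,
                    limits: Thresholds) -> List[str]:
    """Return the names of the metrics that are outside their acceptable range."""
    breaches = []
    if error_rate > limits.max_error_rate:
        breaches.append("error_rate")
    if p95_latency_ms > limits.max_p95_latency_ms:
        breaches.append("latency")
    return breaches


# Hypothetical mapping from breach type to the subject matter experts for it.
ROUTES: Dict[str, str] = {
    "error_rate": "backend-experts",
    "latency": "performance-experts",
}


def route_alert(breaches: List[str], on_call: str = "on-call") -> List[str]:
    """Page the on-call team plus the experts for each breach, not everyone."""
    if not breaches:
        return []
    recipients = [on_call]
    for breach in breaches:
        team = ROUTES.get(breach)
        if team and team not in recipients:
            recipients.append(team)
    return recipients
```

With these assumed limits, a 10% error rate at normal latency would page only the on-call team and the hypothetical backend experts, while healthy metrics page nobody.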
Incident Repair, Incident Resolution, Incident Remediation
These three terms all refer to the same basic result: the incident is fixed and operations are back to normal. Depending on the incident, resolution could be very simple or could take hours or days. The average time it takes to repair an incident is called mean time to repair (MTTR). Obviously, you want your MTTR to be as low as possible, and more advanced tools and methodologies can help you achieve that. Reducing MTTR is one of the key objectives of a site reliability engineer (SRE).
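MTTR is normally reported as an average over a set of incidents rather than a single repair time. A minimal sketch of that calculation, assuming each incident is recorded as a hypothetical pair of detection and resolution timestamps:

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def mean_time_to_repair(incidents: List[Tuple[datetime, datetime]]) -> timedelta:
    """Average the (resolved - detected) duration across all incidents."""
    if not incidents:
        raise ValueError("need at least one incident to compute MTTR")
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(),  # start value so the durations add up as timedeltas
    )
    return total / len(incidents)


# Two made-up incidents: one repaired in 30 minutes, one in 90 minutes.
example = [
    (datetime(2022, 1, 10, 9, 0), datetime(2022, 1, 10, 9, 30)),
    (datetime(2022, 1, 12, 14, 0), datetime(2022, 1, 12, 15, 30)),
]
mttr = mean_time_to_repair(example)  # timedelta of one hour
```

Where the clock starts (first alert, first page, or ticket creation) is a choice each team makes, so the same incidents can produce different MTTR figures under different definitions.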
How are incidents resolved?
Savvy organizations start by creating an incident runbook. An incident runbook is basically an instruction manual on what to do, and in what order, to remediate the incident. Simple incidents could be handled by level 1 support personnel, while multi-day outages will be all hands on deck. It also helps to distinguish a runbook from a playbook: a playbook is a higher-level, overarching response plan that can span multiple runbooks and personnel.
An incident runbook can have many steps, but a typical set of high-level steps is as follows:
- Type of incident, what services are affected
- How to collect the data and logs to verify the incident
- What to do to correct the incident (this could run to several pages)
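The three parts listed above map naturally onto a small data structure. The sketch below is one possible way to model a runbook in Python; the `Runbook` class, its field names, and the example incident are all hypothetical and are not taken from any specific product.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Runbook:
    """A runbook with the three high-level parts listed above."""
    incident_type: str             # what kind of incident this covers
    affected_services: List[str]   # which services are impacted
    verification_steps: List[str]  # how to collect data and logs to confirm it
    remediation_steps: List[str]   # what to do, in order, to correct it

    def checklist(self) -> List[str]:
        """Flatten the runbook into an ordered, numbered checklist."""
        header = f"{self.incident_type} (affects: {', '.join(self.affected_services)})"
        steps = self.verification_steps + self.remediation_steps
        return [header] + [f"{n}. {step}" for n, step in enumerate(steps, start=1)]


# A made-up example runbook for an elevated error rate.
runbook = Runbook(
    incident_type="High 5xx rate",
    affected_services=["checkout-api"],
    verification_steps=["Pull the last 15 minutes of gateway logs"],
    remediation_steps=["Roll back the latest deploy"],
)
```

Keeping verification and remediation as separate ordered lists preserves the "what to do, in what order" nature of a runbook and makes it easier to see which individual steps are candidates for automation.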
At Fylamynt, we call a runbook a workflow. Fylamynt has built the world’s first enterprise-ready low-code platform for building, running, and analyzing SRE cloud workflows. With Fylamynt, an SRE can automate the most time-consuming parts of a runbook, which frees them to make decisions where their expertise is needed and removes mistakes and simple errors. We also provide many runbook examples within the product so you don't have to start from scratch.