Incident Management At 2am

The clock strikes 2 am and the dreaded pager goes off for a DevOps engineer, with a message titled "Prod API - too many 500 errors".

Shobhit Gupta
February 21, 2020

In the modern age of DevOps, an on-call SRE engineer has multiple tools to increase their preparedness in response to an incident. For example, as soon as an incident occurs in the cloud environment, monitoring applications such as Datadog or Prometheus send a notification citing the possible causes along with analytics reports. The notification is received by a downstream application such as PagerDuty, which pages the on-call SRE engineer with details about the incident.

At 2 am, the on-call SRE engineer wakes up to the page on their phone. They frantically search for their laptop, boot it up, and with sleepy eyes start to go over the Jira ticket filed under their name to get details about the incident. They quickly move to the logs received from upstream tools such as Amazon CloudWatch and Terraform to understand the root cause of the incident from a plethora of incident types such as “Out of Memory VM”, “Failure of Deployment”, “Security Threat”, and so on. Each incident type has subject matter experts who will be pulled into a meeting (escalation) in case the on-call SRE engineer is not able to resolve the underlying issue.

It is at this moment of panic, stress, and loneliness that an SRE engineer in charge of the mission-critical job of running the infrastructure hunts for a “Runbook” - a step-by-step guide to detect, analyze, and remediate any DevOps management issue. Once they find a “Runbook” associated with the incident type, they start to make sense of the signals being received from monitoring tools such as Datadog. Consider an “Out of Memory” incident, in which a virtual machine instance exceeds its pre-assigned memory limit and effectively stops working. Following the runbook protocol for this incident, the SRE engineer escalates the issue over Slack or direct message and gets approval from a team member to run scripts in the production environment.
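To make the runbook lookup concrete, here is a minimal Python sketch of a registry that maps an incident type to its runbook and checks for approval before anything runs in production. Everything in it is hypothetical and for illustration only: the `Runbook` fields, the `RUNBOOKS` entries, the contact names, and the helper functions are assumptions, not part of any real tool.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Runbook:
    """Hypothetical runbook record: ordered steps plus an escalation contact."""
    incident_type: str
    steps: List[str]
    escalation_contact: str
    requires_approval: bool = True


# Illustrative registry keyed by incident type; the entries are made up.
RUNBOOKS: Dict[str, Runbook] = {
    "out_of_memory_vm": Runbook(
        incident_type="out_of_memory_vm",
        steps=[
            "back up the VM",
            "identify the faulty VM",
            "delete the faulty VM",
            "create a VM with more memory",
        ],
        escalation_contact="vm-experts",
    ),
    "failed_deployment": Runbook(
        incident_type="failed_deployment",
        steps=["roll back to the last good release"],
        escalation_contact="release-experts",
    ),
}


def find_runbook(incident_type: str) -> Optional[Runbook]:
    """Return the runbook for an incident type, or None if there is none."""
    return RUNBOOKS.get(incident_type)


def may_run_in_production(runbook: Runbook, approved_by: Optional[str]) -> bool:
    """A runbook that requires approval may only run once someone signed off."""
    if not runbook.requires_approval:
        return True
    return bool(approved_by)
```

When `find_runbook` returns `None`, the engineer has no documented path and would fall back to escalating to the subject matter experts.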

After approval is given, the resolution phase kicks in. The SRE engineer starts by running script 1 to take a backup of the VM, followed by script 2 to detect the faulty VM among hundreds of VMs, followed by script 3 to delete the faulty VM that was found. Finally, a new VM is instantiated with higher memory specifications, remediating the incident after a few hours of frantic messages, a barrage of Slack notifications, and intense collaboration between remote teams.
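The resolution steps described here could be chained into a single automated routine. The Python sketch below simulates that flow on an in-memory list of VMs, so it runs without any cloud account. The `VM` class, the function names, and the doubling of memory are assumptions made for illustration. The sketch also detects the faulty VM before backing it up, which is an assumption about ordering. A real version would call the cloud provider's API instead.

```python
from dataclasses import asdict, dataclass
from typing import Dict, List, Optional


@dataclass
class VM:
    """Hypothetical stand-in for a cloud VM instance."""
    name: str
    memory_limit_mb: int
    memory_used_mb: int


def find_faulty_vm(fleet: List[VM]) -> Optional[VM]:
    """Return the first VM that has used up its memory limit, if any."""
    for vm in fleet:
        if vm.memory_used_mb >= vm.memory_limit_mb:
            return vm
    return None


def back_up(vm: VM) -> Dict[str, object]:
    """Simulated backup: a plain snapshot of the VM's fields."""
    return asdict(vm)


def remediate_out_of_memory(
    fleet: List[VM], approved: bool, memory_factor: int = 2
) -> List[str]:
    """Detect, back up, delete, and replace the faulty VM; return an audit log."""
    if not approved:
        raise PermissionError("production changes need approval first")

    faulty = find_faulty_vm(fleet)
    if faulty is None:
        return []  # nothing to remediate

    log = [f"detected faulty VM {faulty.name}"]

    snapshot = back_up(faulty)
    log.append(f"backed up {snapshot['name']}")

    fleet.remove(faulty)
    log.append(f"deleted {faulty.name}")

    # The replacement gets more memory; the factor of 2 is an arbitrary choice.
    replacement = VM(
        name=f"{faulty.name}-replacement",
        memory_limit_mb=faulty.memory_limit_mb * memory_factor,
        memory_used_mb=0,
    )
    fleet.append(replacement)
    log.append(f"created {replacement.name} with {replacement.memory_limit_mb} MB")
    return log
```

The returned log doubles as the audit trail that would otherwise be typed into the ticket by hand, and the `approved` flag keeps the human sign-off in the loop.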

At 5 am, the SRE engineer goes back to the environment and tests the system to flush out any further vulnerabilities. Finally, once the issue is resolved, they log back into the Jira ticket, mark it as resolved, and upload the reports and details of the incident for auditing purposes. They send a Slack message to the manager and the rest of the team updating them about the resolution. All the stakeholders go back to sleep feeling safe about their infrastructure.

In the age of low-code development, DevOps teams need a simple way to build runbooks for every type of incident, as well as workflow automation software that can take the necessary action to respond to incidents 24x7 without waking teams up in a state of utter distress late at night.

Fun Fact - Thirty percent of companies cite a lack of automation for configurations and integrations across the complete delivery cycle as a top technical challenge. (Forrester Consulting, “Continuous Delivery Mandates Automation, Testing, And Full Pipeline Visibility”, October 2017)

In the next blog, the clock strikes 10 am and the DevOps teams carry out the blameless post-mortem at their standup meetings.

Blog image by Matthew Henry on Unsplash