Why Cloud-First Organizations Are Adopting a Modern Incident Response Approach
Gone are the chill days when an on-call engineer could take hours to fix a problem and restore service for their customers. According to Statista, more than 50% of enterprises reported losing $500K or more per hour of server downtime.
It doesn’t surprise me these days to hear about a software company publishing 15+ releases a day instead of a single waterfall release every six months. That’s typical when your customer base is growing and you need to keep up an agile software development pace.
This is certainly a painful scenario for the people who are on-call and managing incidents that day. But wait, what on-call team are we talking about here? Is there really a team for it? What do they do? The days when the operations or on-call team was the same one that built the code are gone; software engineers rarely have to worry about the cloud infrastructure itself anymore. When something breaks and your customers start noticing an unplanned disruption to their business operations, the situation is logged as an ‘incident’, and the on-call victim is responsible for spotting, troubleshooting and restoring the service.
The typical ‘Incident Response’ team would look something like this (please take this with a grain of salt, because overall company size determines the on-ground team structure :-p):
At times, service-specific SREs/DevOps/cloud engineers are also added to the mix, depending on the complexity of the incident and overall organization size (SRE stands for Site Reliability Engineer).
Distributed service development but integrated incident response: who do we blame?
I would blame it on digital transformation more than anything else. With cloud-native development, microservices, Kubernetes and agile practices, it’s easy to build a distributed system where each owner is responsible for their own piece of code, but the integrated whole is what has to work for the customers. I don’t want to take a lot of your time explaining what this really is (I’m sure you like the modern software development world already :-)); I just want to make sure we understand the complications involved when an incident occurs.
In simple words, downtime is a big no-no. It is just not acceptable to have even a slight hint of downtime, which means your incident response plan needs to work no matter what. The on-call engineer or SRE has to build a response system that is reliable, quick and well integrated.
SREs are smart people: they care about MTTR
They (this cruel world) say you (good people: SREs) need to keep your MTTR as low as possible. But what is MTTR? And why can’t we just pull it down by responding to incidents as fast as possible?
The average time it takes to mitigate an incident’s impact and restore the service to its original state is called Mean Time To Recover, and that is MTTR for you. And yes, you can respond to incidents as fast as possible, but responding quickly alone doesn’t pull the MTTR metric down.
MTTR is a recovery metric that combines MTTD (Mean Time To Detect the incident), MTTA (Mean Time To Acknowledge the incident) and remediation time. SREs are responsible for implementing solid incident engineering practices so that the time spent detecting and acknowledging an incident is as low as possible: strong alerting and monitoring capabilities to detect the issue faster, and inline incident management practices to triage, enrich and notify so the issue is acknowledged in time.
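To make the decomposition concrete, here is a minimal sketch of how the phases add up for a single incident. The timeline and field names are purely illustrative, not taken from any real tool; the means across many incidents give you MTTD, MTTA and MTTR.

```python
from datetime import datetime

# Hypothetical incident timeline; the field names are illustrative only.
incident = {
    "started":      datetime(2024, 5, 1, 2, 0),   # failure begins
    "detected":     datetime(2024, 5, 1, 2, 12),  # monitoring fires an alert
    "acknowledged": datetime(2024, 5, 1, 2, 20),  # on-call engineer picks it up
    "resolved":     datetime(2024, 5, 1, 3, 5),   # service restored
}

def minutes(a, b):
    return (b - a).total_seconds() / 60

ttd = minutes(incident["started"], incident["detected"])           # time to detect
tta = minutes(incident["detected"], incident["acknowledged"])      # time to acknowledge
fix = minutes(incident["acknowledged"], incident["resolved"])      # remediation time

# For one incident, recovery time is the sum of the three phases;
# MTTD/MTTA/MTTR are the averages of these values across many incidents.
total = ttd + tta + fix
print(f"detect={ttd:.0f}m ack={tta:.0f}m fix={fix:.0f}m total={total:.0f}m")
# → detect=12m ack=8m fix=45m total=65m
```

Notice that even a fast fix (45 minutes) is dragged down by slow detection and acknowledgement, which is exactly why SREs invest in alerting and triage before touching remediation speed.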
So what’s the problem then? Isn’t this approach modern enough for you? Detect, acknowledge, troubleshoot and then remediate. Use best-in-class tools like PagerDuty, Datadog, Instana, Dynatrace, Slack, Jira and ServiceNow, and remediate the incident as fast as possible to bring down MTTR. The process makes sense, and the level of response makes sense. Even leaving the shortage of skilled SREs out of the picture for a minute, you would still have a hard time justifying a decent MTTR for your organization.
The real reason behind this problem is a failure to unify the important pillars of incident response. You can’t just use niche tools and create silos, you can’t just run ad-hoc scripts to stitch together static API calls, you can’t just add information to a ticket and hope it covers all the necessary context, and most importantly you can’t keep switching screens from one tool to another. How long will you survive doing so?
As your company grows, the number of incidents will grow, and so will the ‘types’ of incidents. Firefighting will eat up a lot of your bandwidth. And to be honest, most of these alerts will be false positives. But you still have to deal with them, and that keeps you from prioritizing the streamlining work you should be doing.
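One small dent you can make in that alert noise is deduplication: suppressing repeat pages for the same underlying symptom within a time window. The sketch below is a toy version of the idea under assumed fingerprints; real tools such as PagerDuty or Prometheus Alertmanager implement far richer grouping and inhibition rules.

```python
class AlertDeduplicator:
    """Suppress repeat alerts with the same fingerprint inside a time window.

    A toy sketch of alert-noise reduction; the fingerprint strings and the
    5-minute window are illustrative assumptions, not a real tool's defaults.
    """

    def __init__(self, window_seconds=300):
        self.window = window_seconds
        self.last_seen = {}  # fingerprint -> timestamp of last occurrence

    def should_page(self, fingerprint, now):
        last = self.last_seen.get(fingerprint)
        self.last_seen[fingerprint] = now
        # Page only if this fingerprint is new or the quiet window has expired.
        return last is None or (now - last) > self.window

dedup = AlertDeduplicator(window_seconds=300)
print(dedup.should_page("db-cpu-high", now=0))    # True: first occurrence pages
print(dedup.should_page("db-cpu-high", now=60))   # False: repeat within 5 minutes
print(dedup.should_page("db-cpu-high", now=400))  # True: quiet window expired
```

Even this naive filter illustrates the trade-off: a longer window means fewer 2AM pages, but also a longer blind spot if the repeat alert was actually a new, distinct failure.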
This makes your incident response and cloud infrastructure maintenance processes chaotic. At times the symptoms can be drastic. But most importantly, your team’s sanity takes a hit, and you are no longer making the right decisions to keep a healthy cloud environment for your company. You are just dealing with whatever issues are in hand.
Eventually this impacts your MTTR, and as a result you might notice unhappy customers and broken SLAs, which means gambling with the ultimate goals of your organization.
In this case, how do you slowly turn towards the modern incident response approach? Let’s take a look.
What is so modern about the modern incident response approach?
Some of you might have guessed automation, or runbook automation to be specific. But no, that’s not entirely it. Creating runbooks and ad-hoc scripts is also a traditional, old way of responding to incidents, and headless automation can only take you so far.
After talking to a bunch of cloud professionals, I realized that SREs in the modern incident response era are building a unified platform that connects collaboration, orchestration, automation and remediation tools in one place and helps cloud engineers (on-call responders) perform their tasks. These tasks are often tied to the top-level business goals of maintaining high availability, performance and scale for the infrastructure and the applications running on top of it.
And to be honest, this is the big difference between the old way and the new: an end-to-end platform that streamlines your incident response, with automation built in.
Imagine your entire product stack integrated into a single platform with the necessary API actions surfaced and ready to use: a way to ingest an incident automatically, track it, enrich it and classify it for response. A workflow fires off for the specific incident category; collecting logs is automated, collecting monitoring information and running synthetic tests are automated, and all of it is added to the incident. Now it’s easy to decide whether the incident is a false positive or not. Also imagine the workflow adding the relevant stakeholders to the ticket for manual review, so even the collaboration aspect is simplified. And last but not least, imagine the remediation for that incident category being automated as well, with room for manual approval.
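The pipeline described above can be sketched end to end. Every function below is a hypothetical stub of my own invention; a real platform would call the APIs of your monitoring, ticketing and chat tools at each step instead.

```python
# Sketch of the ingest → enrich → classify → notify → remediate flow.
# All helpers and field names here are hypothetical stand-ins.

def ingest(alert):
    """Normalize a raw alert into an incident record."""
    return {"id": alert["id"], "service": alert["service"],
            "symptom": alert["symptom"], "context": {}, "approved": False}

def enrich(incident):
    """Attach logs, metrics and synthetic-test results to the incident."""
    incident["context"]["logs"] = f"tail of {incident['service']} logs"  # stub
    incident["context"]["error_rate"] = 0.12                             # stub
    incident["context"]["synthetic_check"] = "failing"                   # stub
    return incident

def classify(incident):
    """Decide whether this is actionable or a false positive."""
    actionable = incident["context"]["synthetic_check"] == "failing"
    incident["category"] = "service-degradation" if actionable else "false-positive"
    return incident

def notify_stakeholders(incident):
    """Loop in humans for review; remediation waits on their sign-off."""
    print(f"[{incident['id']}] paging owners of {incident['service']} for approval")
    incident["approved"] = True  # stand-in for the manual approval step
    return incident

def remediate(incident):
    """Run the automated fix for this category, gated on approval."""
    if incident["category"] != "false-positive" and incident["approved"]:
        print(f"[{incident['id']}] restarting {incident['service']}")  # stub action
        incident["status"] = "resolved"
    else:
        incident["status"] = "closed-no-action"
    return incident

alert = {"id": "INC-101", "service": "checkout-api", "symptom": "5xx spike"}
incident = remediate(notify_stakeholders(classify(enrich(ingest(alert)))))
print(incident["status"])  # → resolved
```

The point of the sketch is the shape, not the stubs: each stage reads and extends one shared incident record, which is exactly the context that gets lost when these steps live in separate, siloed tools.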
This platform is not just a headless automation tool; it drives a modern incident response approach by taking you through the various aspects of incident response that were siloed in the traditional way.
I know SRE life ain’t that easy, but slowly this transformation will yield a better response, reduce MTTR and spare you those pesky 2AM pages!