Our Top 3 Predictions for the 2022 SRE Ecosystem
As we head into 2022 and thankfully put a challenging 2021 behind us, we at Fylamynt are looking to the future and pondering what changes to the SRE (Site Reliability Engineering) ecosystem are likely to unfold. In 2021 we saw some headline-grabbing outages, from the historic Facebook DNS mistake to severe AWS outages three times in the same month. Cloud incidents are inevitable, but we believe a few shifts in thinking and focus will help. Our top 3 predictions for 2022 are:
- Modernization of Runbooks
- Incident Response Automation
- Shifting Firefighting to Prevention
Modernization of Runbooks
In 2021 most of the SRE and DevOps teams we spoke with still kept their runbooks in wikis, Google Drive, or even physical printouts. Some not only had no runbooks at all but hadn’t even heard the term before.
We believe 2022 is the year for sophistication and modernization of runbooks across many organizations. With every company facing fierce competition, none can afford cloud incidents, let alone a complete outage. Runbooks will be moved into tools designed to store them and allow appropriate execution when needed. For those just now emerging into the world of modern incident response, here are the basic steps needed:
- Implement monitoring tools (e.g. Datadog)
- Implement alerting tools (e.g. Splunk On-Call)
- Use Incident Management and Response tools (e.g. Squadcast)
- Use tools to create, store and execute your runbooks (e.g. Fylamynt)
Why create and store your runbooks in a tool? Consider the following possibilities. If your organization has a level 1 help desk, or junior SREs, they could make mistakes in identifying the true source of the cloud issue, in choosing which runbook to use, and in following the steps within the runbook itself. Conversely, you may have a very seasoned SRE who has responded and remediated so many times that they overlook steps. Consistency in response and remediation is key to ensuring that you minimize the impact of the incident.
Incident Response Automation
Once you’ve gotten your tooling in place, now comes the real savings: runbook automation. Today, even many of the most seasoned SRE teams are still hand-coding their runbook automation. For example, they write Python scripts that run when their monitoring tool triggers an incident and begin collecting data while the alerting tool does its thing. At this point the scripts may also make some API calls to the incident management system, so that when the response team jumps in they can see what’s happening and start diagnosing the problem.
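As a rough illustration of the kind of hand-written script described above, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the alert fields (`incident_id`, `service`), the list of checks, and the `post_note` callback that stands in for a real call to an incident management system's API. It is not taken from any particular vendor's tooling.

```python
import json
from datetime import datetime, timezone


def collect_diagnostics(alert):
    """Gather context for responders. Placeholder values stand in for real checks."""
    return {
        "service": alert["service"],
        "collected_at": datetime.now(timezone.utc).isoformat(),
        # Hypothetical check names; a real script would query logs, metrics, or cloud APIs.
        "checks": ["recent_deploys", "error_rate", "instance_health"],
    }


def build_incident_note(alert, diagnostics):
    """Format the diagnostics as a JSON note for the incident management system."""
    return json.dumps({
        "incident_id": alert["incident_id"],
        "summary": f"Automated diagnostics for {alert['service']}",
        "details": diagnostics,
    })


def handle_alert(alert, post_note):
    """Run when the monitoring tool fires: collect data, then attach it to the incident."""
    diagnostics = collect_diagnostics(alert)
    note = build_incident_note(alert, diagnostics)
    # post_note is injected by the caller; in practice it would wrap the vendor's API.
    post_note(note)
    return note


if __name__ == "__main__":
    # Hypothetical alert payload; printing stands in for the API call.
    sample_alert = {"incident_id": "INC-1", "service": "checkout"}
    handle_alert(sample_alert, post_note=print)
```

Even in this small form, the script shows why hand-coded automation is costly to maintain: every new check or tool integration means more custom code for the team to own.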
In 2022 this process will take a huge step forward. By using low-code/no-code tools to quickly and easily assemble your runbook, you can ensure that your steps not only execute automatically but also execute in the same way every time. Research has shown that 90% of the time spent in MTTR is identifying and verifying the source of the problem. By automating most or all of those tasks, you can reduce your MTTR by up to 90%.
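The arithmetic behind an "up to 90%" reduction can be sketched in a few lines of Python. The function name and parameters are hypothetical, and the model makes a simplifying assumption that automated diagnosis steps take effectively no time, so treat the result as an upper bound rather than a forecast.

```python
def projected_mttr(current_mttr_minutes, diagnosis_share, automated_fraction):
    """Estimate MTTR after automating part of the diagnosis work.

    Assumes automated diagnosis steps take negligible time (an idealization).
    """
    diagnosis = current_mttr_minutes * diagnosis_share
    everything_else = current_mttr_minutes - diagnosis
    return everything_else + diagnosis * (1 - automated_fraction)


if __name__ == "__main__":
    # A 60-minute MTTR where 90% of the time is spent on diagnosis.
    for fraction in (0.5, 1.0):
        after = projected_mttr(60, 0.9, fraction)
        saved = 1 - after / 60
        print(f"automated {fraction:.0%} of diagnosis: "
              f"MTTR about {after:.0f} min, reduced by about {saved:.0%}")
```

Under these assumptions, automating all of the diagnosis work on a 60-minute incident leaves roughly 6 minutes, which is the 90% ceiling; automating half of it leaves roughly 33 minutes.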
Shifting Firefighting to Prevention
Due to the changes already outlined, SRE teams will begin to see a shift away from constant reactive firefighting and will have the time and resources to dedicate to building out their systems with more resilience. Once firefighting is reduced, SREs can spend time on activities that add direct value to the business such as:
- Modifying systems for increased scalability
- Optimizing cloud spend
- Standardization across systems
- Hardening of availability
- Increased performance
We hope to see these shifts happen as the year progresses, allowing SREs to provide even more value than they already have, and best of all, letting them sleep through the night.
Sleep well, SREs.