5 Incident Management Best Practices
Incident management is the process that runs from the moment an incident occurs, through ticket creation, diagnosis of the issue and its cause, and remediation of the service, to post-mortem reporting.
What is Incident Management?
Incident management covers the full lifecycle of an incident: the incident occurs, a ticket is created, the issue and its cause are identified, the service is remediated, and the results are reported in a post-mortem.
The 5 best practices:
- Create Optimized Processes
- Multi-channel Communication
- Automation
- Prioritization & Escalation
- Collaboration & Reporting
Create Optimized Processes
Establish processes that minimize MTTR. They should cover identifying incidents, communicating with impacted stakeholders, assigning each incident to the correct on-call team, and letting users follow and document the incident's lifecycle.
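The lifecycle described above can be sketched as a small state tracker. This is a minimal illustration, not a reference implementation: the stage names, the `Incident` class, and the `mttr` helper are all hypothetical, and MTTR is taken here to mean the mean time from detection to remediation, which is only one common reading of the acronym.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from enum import Enum


class Stage(Enum):
    # Hypothetical stage names mirroring the lifecycle in the text.
    DETECTED = 1
    TICKETED = 2
    DIAGNOSED = 3
    REMEDIATED = 4
    REPORTED = 5


@dataclass
class Incident:
    title: str
    # Timestamp of each stage the incident has reached, in order.
    timeline: dict = field(default_factory=dict)

    def advance(self, stage: Stage, at: datetime) -> None:
        """Record a stage; stages must be reached in lifecycle order."""
        expected = len(self.timeline) + 1
        if expected > len(Stage):
            raise ValueError("incident has already completed its lifecycle")
        if stage.value != expected:
            raise ValueError(f"expected {Stage(expected).name}, got {stage.name}")
        self.timeline[stage] = at

    def time_to_repair(self) -> timedelta:
        """Elapsed time from detection to remediation."""
        return self.timeline[Stage.REMEDIATED] - self.timeline[Stage.DETECTED]


def mttr(incidents: list) -> timedelta:
    """Mean time to repair across incidents that have been remediated."""
    durations = [incident.time_to_repair() for incident in incidents]
    return sum(durations, timedelta()) / len(durations)
```

Recording a timestamp at every stage is what makes the lifecycle both followable while the incident is open and measurable afterwards.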
Multi-channel Communication
The platform needs to let stakeholders communicate over many different channels (email, phone, chat tools such as Slack or Teams, and video calls such as Zoom or Teams), and potentially within the platform itself. With most, if not all, employees now working remotely, it’s imperative not only to provide many channels of communication, but also to spin them up automatically and include the correct people. These channels need to be available on both web and mobile. Communication should also go out immediately to the affected end users so they know that an incident exists and is being addressed.
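As a rough sketch of fanning one message out across several channels, the function below takes a mapping of channel names to sender callables. Everything here is an assumption for illustration: `notify_all` and the `Sender` signature are hypothetical, and real email, chat, or phone integrations would sit behind each callable.

```python
from typing import Callable, Dict, List

# A sender takes (recipient, message) and returns True on success.
# Real channel integrations are assumed to be wrapped in this shape.
Sender = Callable[[str, str], bool]


def notify_all(
    channels: Dict[str, Sender], recipients: List[str], message: str
) -> Dict[str, List[str]]:
    """Send the message to every recipient on every channel.

    Returns the recipients that could not be reached, keyed by channel,
    so that a failure on one channel does not block the others.
    """
    failures: Dict[str, List[str]] = {}
    for name, send in channels.items():
        for person in recipients:
            try:
                delivered = send(person, message)
            except Exception:
                delivered = False
            if not delivered:
                failures.setdefault(name, []).append(person)
    return failures
```

Returning the failures rather than raising lets the caller retry a single channel or fall back to another one.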
Automation
Automation should be used throughout the entire incident management lifecycle. It starts with automating ticket assignment, spinning up a Zoom room and Slack channel, and consolidating multiple alerts into one incident, and then moves on to remediation. Once a critical decision point is reached (such as destroying an instance or failing over traffic), the process (runbook) should pause and present all relevant information to the SRE so they can make the correct decision.
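Two of the ideas above, consolidating alerts and pausing a runbook at a decision point, can be sketched in a few lines. This is an illustrative outline under stated assumptions: alerts are plain dictionaries with `service` and `message` keys, and `Step`, `run`, and the `approve` callback are hypothetical names, not part of any particular product.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List


def consolidate(alerts: List[dict]) -> Dict[str, List[str]]:
    """Group alerts that share a service into one incident per service."""
    incidents = defaultdict(list)
    for alert in alerts:
        incidents[alert["service"]].append(alert["message"])
    return dict(incidents)


@dataclass
class Step:
    name: str
    action: Callable[[], str]
    needs_approval: bool = False  # True for destructive or high-impact steps


def run(steps: List[Step], approve: Callable[[Step, List[str]], bool]) -> List[str]:
    """Run steps in order, pausing at steps that need human approval.

    The approver receives the step and a copy of the log gathered so far,
    which stands in for "all relevant information" in this sketch.
    """
    log: List[str] = []
    for step in steps:
        if step.needs_approval and not approve(step, list(log)):
            log.append(f"{step.name}: skipped (not approved)")
            break
        log.append(f"{step.name}: {step.action()}")
    return log
```

In practice `approve` would prompt the on-call SRE. Keeping it as a callback means the automated steps and the human decision stay clearly separated.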
Prioritization & Escalation
No organization wants simple issues distracting its top talent, nor does it want major issues mishandled by people ill-equipped to deal with them. Tickets should be assigned a priority status (critical issues are handled first) and should automatically escalate beyond the level 1 support desk as required.
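A priority queue with a simple escalation rule is one way to picture this. The sketch below makes several assumptions: the four priority labels, the 15-minute acknowledgement limit, and the three support levels are placeholders chosen for illustration, and `TicketQueue` and `escalate_if_stale` are hypothetical names.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count

# Lower rank is served first; the labels are illustrative.
PRIORITY = {"critical": 0, "high": 1, "medium": 2, "low": 3}


@dataclass(order=True)
class Ticket:
    rank: int
    seq: int  # arrival order, breaks ties between equal priorities
    title: str = field(compare=False)
    level: int = field(default=1, compare=False)  # support tier, starts at level 1


class TicketQueue:
    def __init__(self):
        self._heap = []
        self._seq = count()

    def add(self, title: str, priority: str) -> Ticket:
        ticket = Ticket(PRIORITY[priority], next(self._seq), title)
        heapq.heappush(self._heap, ticket)
        return ticket

    def next_ticket(self) -> Ticket:
        """Pop the highest-priority ticket, oldest first within a priority."""
        return heapq.heappop(self._heap)


def escalate_if_stale(ticket: Ticket, minutes_unacknowledged: int,
                      limit_minutes: int = 15, max_level: int = 3) -> int:
    """Move the ticket up one support level if it has gone unacknowledged too long."""
    if minutes_unacknowledged >= limit_minutes and ticket.level < max_level:
        ticket.level += 1
    return ticket.level
```

Separating priority (what gets handled first) from level (who handles it) keeps critical tickets at the front of the queue while still letting any ticket move past level 1 when it stalls.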
Collaboration & Reporting
The incident response team needs to be able to collaborate within the incident management platform. All data and executions should be visible in easily accessible dashboards and available for analysis as needed. Post-mortem reporting should be easy and data-rich, so that teams can continuously improve their processes and explain the cause and status to leadership.
Fylamynt has created the world’s first enterprise-ready low-code platform for building, running, and analyzing SRE cloud workflows. With Fylamynt, an SRE can automate the most time-consuming parts of a runbook, leaving them free to make decisions where their expertise is needed. Fylamynt can show you a runbook example and guide you through creating runbooks.