Automated Outage Alert Monitoring and Remediation

Did you just receive a critical outage alert? Can you respond to it in just a few minutes?

Prasen Shelar
September 7, 2021

How long does it take you to not just respond, but also troubleshoot, run diagnostics and remediate the problem? If that takes longer than you would like, you might want to consider automating the process.

Fylamynt’s integration with AWS Health helps you automate the monitoring and remediation of AWS Health outage alerts so you can respond rapidly to minimize service disruption. AWS Health provides ongoing visibility into your resource performance and the availability of your AWS services and accounts. AWS Health events are ingested into Fylamynt, triggering automated workflows specific to the event type.

Let’s focus on automating the response to AWS Health issues. These can be API, network connectivity, operational or run-instance issues. In this scenario, an AWS Health AWS_EC2_API_ISSUE alert triggers a Fylamynt workflow, which the rest of this post walks through.
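To make the trigger step concrete, here is a minimal sketch of routing an AWS Health event to a workflow by its event type code. It assumes the event arrives in the shape AWS Health uses when it publishes to EventBridge (`source`, `detail-type`, `detail.eventTypeCode`). How Fylamynt ingests events internally is not described in this post, and the workflow name and the `select_workflow` helper are hypothetical.

```python
from typing import Dict, Optional

# Hypothetical mapping from AWS Health event type codes to workflow names.
WORKFLOWS: Dict[str, str] = {
    "AWS_EC2_API_ISSUE": "ec2_api_issue_remediation",
}


def select_workflow(event: dict) -> Optional[str]:
    """Return the workflow name for an AWS Health event, or None if unhandled."""
    if event.get("source") != "aws.health":
        return None
    detail = event.get("detail", {})
    return WORKFLOWS.get(detail.get("eventTypeCode"))


# A trimmed-down event in the EventBridge shape (only the fields used here).
sample_event = {
    "source": "aws.health",
    "detail-type": "AWS Health Event",
    "detail": {
        "service": "EC2",
        "eventTypeCode": "AWS_EC2_API_ISSUE",
        "eventTypeCategory": "issue",
    },
}

workflow = select_workflow(sample_event)
```

Events with an event type code that has no entry in the mapping return `None`, so they can be ignored or sent to a default handler.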

In this example, the EC2 instance is running Jenkins. Fylamynt obtains the IP address of the instance from Route53, then uses the Datadog_Get_API_Test_Results action to fetch the result of the Datadog Synthetic test configured for the Jenkins service running on that instance.

The results returned by Datadog_Get_API_Test_Results are filtered for that EC2 instance, and actions are taken based on them.

If the service is running fine, a Slack message is sent to the relevant team.

If the results indicate that there is a problem, another Datadog action (Datadog_Search_Monitors) is triggered to search for any issues within the EC2 instance.
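The filter-and-branch logic above can be sketched as follows. This is an illustration only: the result records are plain dictionaries with hypothetical `host` and `passed` fields, not the actual Datadog response format, and the step names returned by `next_step` are made up for this example.

```python
from typing import Iterable, List


def results_for_host(results: Iterable[dict], ip: str) -> List[dict]:
    """Keep only the test results that belong to the given instance IP."""
    return [r for r in results if r.get("host") == ip]


def next_step(results: Iterable[dict], ip: str) -> str:
    """Choose the next workflow step from the filtered test results."""
    matching = results_for_host(results, ip)
    # Assumption: having no results for the host is treated as a problem,
    # so the workflow investigates rather than reporting a healthy service.
    if matching and all(r.get("passed") for r in matching):
        return "notify_slack_healthy"
    return "search_datadog_monitors"


sample_results = [
    {"host": "203.0.113.10", "passed": True},
    {"host": "203.0.113.10", "passed": False},
    {"host": "198.51.100.7", "passed": True},
]
```

With these sample records, the instance at `203.0.113.10` has one failing result, so the workflow continues to the monitor search.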

If the EC2 instance is impacted, a Jira ticket is created and details of the EC2 instance are extracted.

Next, an AMI (Amazon Machine Image) of the EC2 instance is created.

Once the AMI is created, we launch a new EC2 instance from it, effectively cloning the impacted instance. When the new EC2 instance is up and running, we stop the old impacted instance. We then fetch the new EC2 instance ID and the public IP of that instance.
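For readers who want to see what these steps look like outside the workflow builder, here is a rough sketch using the boto3 EC2 client. It is not Fylamynt's implementation. The `replace_instance` helper, the AMI naming scheme and the choice of `NoReboot=True` are assumptions, and a real clone would also need to carry over the subnet, security groups and key pair.

```python
import time


def replace_instance(ec2, instance_id: str, instance_type: str) -> dict:
    """Image an instance, launch a copy, stop the original, return the copy's details.

    `ec2` is expected to be a boto3 EC2 client, e.g. boto3.client("ec2").
    """
    # AMI names must be unique per account and region, so add a timestamp.
    image_id = ec2.create_image(
        InstanceId=instance_id,
        Name=f"remediation-{instance_id}-{int(time.time())}",
        NoReboot=True,  # assumption: avoid rebooting the impacted instance
    )["ImageId"]
    ec2.get_waiter("image_available").wait(ImageIds=[image_id])

    # Launch a single new instance from the AMI and wait until it is running.
    new_id = ec2.run_instances(
        ImageId=image_id,
        InstanceType=instance_type,
        MinCount=1,
        MaxCount=1,
    )["Instances"][0]["InstanceId"]
    ec2.get_waiter("instance_running").wait(InstanceIds=[new_id])

    # Only stop the old instance once the replacement is up.
    ec2.stop_instances(InstanceIds=[instance_id])

    # Fetch the new instance's public IP (absent if none is assigned).
    described = ec2.describe_instances(InstanceIds=[new_id])
    instance = described["Reservations"][0]["Instances"][0]
    return {"instance_id": new_id, "public_ip": instance.get("PublicIpAddress")}
```

Passing the client in as a parameter keeps the function easy to exercise against a stub before pointing it at a real account.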

We then replace the Route53 record with the new IP address.
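As an illustration of the record swap, the sketch below builds a Route53 `UPSERT` change batch and submits it through a boto3 Route53 client. The helper names, the 60-second TTL and the assumption that the record is an IPv4 `A` record are ours. The post does not say how the Fylamynt action is configured.

```python
import ipaddress


def build_upsert_batch(record_name: str, new_ip: str, ttl: int = 60) -> dict:
    """Build a change batch that points an A record at new_ip."""
    # Raises ValueError if new_ip is not a valid IPv4 address.
    ipaddress.IPv4Address(new_ip)
    return {
        "Comment": "Point record at replacement EC2 instance",
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "A",
                    "TTL": ttl,
                    "ResourceRecords": [{"Value": new_ip}],
                },
            }
        ],
    }


def update_record(route53, hosted_zone_id: str, record_name: str, new_ip: str) -> bool:
    """Apply the change; return True on success, False if the API call fails.

    `route53` is expected to be a boto3 Route53 client.
    """
    batch = build_upsert_batch(record_name, new_ip)
    try:
        route53.change_resource_record_sets(
            HostedZoneId=hosted_zone_id, ChangeBatch=batch
        )
    except Exception:
        # Broad catch for the sketch; with boto3 you would typically
        # catch botocore.exceptions.ClientError here.
        return False
    return True
```

The boolean result is what a later conditional step can branch on to decide which notification to send.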

The final conditional node checks whether the Route53 update was successful. If it was, the Slack channel is updated with a note that the failed EC2 instance has been replaced by a new instance.

Conversely, if the update action on Route53 failed, a Slack message is sent to notify the team.
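The final branch can be pictured as a small function that picks the message from the update result. The sketch below prepares a request in the JSON format that Slack incoming webhooks accept (`{"text": ...}`). Whether Fylamynt's Slack action uses a webhook is not stated in this post, and the message wording and `build_slack_request` helper are hypothetical.

```python
import json
import urllib.request


def build_slack_request(
    webhook_url: str, update_succeeded: bool, old_id: str, new_id: str
) -> urllib.request.Request:
    """Prepare (but do not send) the Slack notification for the final branch."""
    if update_succeeded:
        text = (
            f"Failed EC2 instance {old_id} has been replaced by {new_id}; "
            "Route53 now points to the new instance."
        )
    else:
        text = (
            f"Route53 update failed while replacing {old_id} with {new_id}; "
            "manual follow-up is needed."
        )
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending the prepared request is then a single call to `urllib.request.urlopen(request)` with your own webhook URL.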

As you can see, Fylamynt can help you automate the response to and remediation of operational issues identified by AWS Health. Although we focused on EC2 instances, you can also extend automation to S3 bucket and EKS cluster issues, as well as other operational issues related to API and network connectivity. So perhaps it’s time to “outsource” these mundane tasks to Fylamynt. With the Fylamynt platform, you can now automate the remediation process, be more efficient and maybe even shorten your work week!