Incident Remediation With Jenkins and Terraform
Experienced DevOps personnel are very familiar with using tools like Jenkins to create workflows and Terraform to automate orchestration. But are these the best tools to use when firefighting production cloud incidents?
What is Jenkins?
Jenkins is an open source automation server for DevOps. Jenkins has ~1800 plugins that support many of the tools used in build and deployment scenarios. The plugins cover build management, source code management, administration, platforms and UI. Jenkins was designed specifically for CI/CD (continuous integration / continuous delivery) environments as well as automating other routine development tasks.
Jenkins still requires you to write scripts for the individual steps, but it gives you a framework for integrating the entire chain of build / test / deploy. These “pipeline scripts” live in a file called Jenkinsfile, which is checked into your repo.
What is Terraform?
Terraform is an open source infrastructure as code (IaC) software tool. Terraform allows you to write code in a higher level language to manage operations in the cloud. Terraform supports ~100 cloud providers, and gives you the ability to create new resources, manage existing ones and destroy those that are no longer needed.
Terraform has a concept called modules. Terraform modules are like functions in programming languages. They provide a standard interface (input/output) for creating resources. Essentially, modules give you consistent (and debugged) common actions, just as you’d write a function that encapsulates many smaller actions to perform a higher level one.
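To make the function analogy concrete, here is a minimal sketch in Python rather than Terraform's own configuration language. It is only an analogy: the name web_cluster, its parameters and the resource fields are hypothetical and do not correspond to any real Terraform module.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ModuleResult:
    """What a module hands back: the resources it declares plus named outputs."""
    resources: List[Dict[str, str]]
    outputs: Dict[str, str]


def web_cluster(name: str, instance_count: int,
                instance_type: str = "t3.micro") -> ModuleResult:
    """Hypothetical 'module': inputs in, a consistent set of resources out."""
    if instance_count < 1:
        raise ValueError("instance_count must be at least 1")

    # Every caller gets the same, already-debugged resource layout.
    resources = [
        {"type": "instance", "name": f"{name}-{i}", "instance_type": instance_type}
        for i in range(instance_count)
    ]
    resources.append({"type": "load_balancer", "name": f"{name}-lb"})

    # Outputs are the values other code is allowed to depend on.
    return ModuleResult(resources=resources, outputs={"lb_name": f"{name}-lb"})


staging = web_cluster("staging", instance_count=2)
production = web_cluster("production", instance_count=4, instance_type="m5.large")
```

Both calls reuse the same definition with different inputs, which is the property the module concept is meant to provide.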
Are Jenkins and Terraform suitable for incident remediation?
To answer this question, we can look at the tools used to respond to and resolve cloud incidents. First, a monitoring tool needs to detect the issue. Popular products in this space include Datadog and New Relic. When inspecting the Datadog provider for Terraform, you quickly learn that Terraform is simply configuring and deploying Datadog resources. When you get to the next step in resolution, you typically use an incident management tool like PagerDuty or Opsgenie. Inspecting the Terraform providers for those tools reveals the same situation. Because of its declarative nature, Terraform is designed primarily for the creation, configuration and destruction of cloud resources, not for reacting to events as they happen.
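The declarative model can be illustrated with a small Python sketch. This is not Terraform's actual planning algorithm, only a simplified stand-in under stated assumptions: the plan function and the resource names are hypothetical, and each resource is reduced to a flat dictionary of settings.

```python
from typing import Dict, List, Tuple

Resource = Dict[str, str]


def plan(desired: Dict[str, Resource],
         actual: Dict[str, Resource]) -> List[Tuple[str, str]]:
    """Compare desired configuration with what exists and list the changes."""
    actions: List[Tuple[str, str]] = []

    # Anything declared but missing is created; anything that differs is updated.
    for name in sorted(desired):
        if name not in actual:
            actions.append(("create", name))
        elif desired[name] != actual[name]:
            actions.append(("update", name))

    # Anything that exists but is no longer declared is destroyed.
    for name in sorted(actual):
        if name not in desired:
            actions.append(("destroy", name))

    return actions


desired_state = {
    "monitor.cpu": {"threshold": "90"},
    "monitor.disk": {"threshold": "80"},
}
actual_state = {
    "monitor.cpu": {"threshold": "75"},
    "monitor.legacy": {"threshold": "50"},
}

for action, name in plan(desired_state, actual_state):
    print(action, name)
```

The only verbs available are create, update and destroy. Nothing in this model says "when the CPU monitor fires, restart the service and page the on-call engineer", which is the kind of step an incident runbook is made of.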
Could Terraform be used to automate portions of a cloud incident runbook/workflow? Absolutely, but since this wasn’t the intended use case, a lot of custom code would need to be written to tie the tools together. That code not only requires ongoing maintenance but also opens the door to edge cases and bugs. Facebook’s outage in October 2021 is a classic example of this problem. They stated that they had written a tool to audit commands for mistakes, but that tool had a bug in it, which allowed a faulty command to go through and cut the entire Facebook/Instagram/WhatsApp footprint off from the Internet.
Now take a look at Jenkins. Again, incident response and remediation were never the intended use case for Jenkins. It excels at CI/CD automation, making the lives of developers and DevOps personnel much easier. However, using it for remediation is even more of a square peg in a round hole. The pipelines do operate like workflows, but have none of the logic or connections built in for the remediation steps required. You would essentially be writing most of the code required to make this work, and at that point you might as well just ditch Jenkins and wire everything together by hand.
Fylamynt has created the world’s first enterprise-ready low-code platform for building, running and analyzing SRE cloud workflows. With Fylamynt, an SRE can quickly and easily build runbook automation for the most time-consuming tasks, freeing them to make decisions where their expertise is needed.