Linux Service restart

Performing repetitive tasks typically produces boredom, which can result in errors and mistakes. So today, let us try something new and challenge ourselves to get the job done more efficiently and be able to focus on the things that matter.

Johann
November 22, 2021

Description


As SREs and IT administrators can attest, software applications can sometimes have a mind of their own and behave in ways that are hard to understand. The faults can stem from many factors, including hardware, resource allocation, the operating system, the network, DNS, or cloud provider outages.


Let’s be honest: at some point, and probably still today, your business has had to run and support an old legacy application that holds critical data. Unfortunately, it is not always easy to replace these systems, whether because of internal knowledge gaps, application EOL, or migrations that are costly and time-consuming.

Use case

The repetitive task we want to address is a distributed application that leaks memory, causing the Linux operating system to run out of memory and creating performance issues.

A simplified runbook to remediate such a scenario might look something like this:

1. Alert is received from APM tool for high server memory utilization

2. The server is identified by either the name or IP address

3. Connect and authenticate via SSH to the Linux server

          a. The server might be in an isolated or firewalled network, which requires a Bastion host or VPN server for connectivity

4. Restart the service of the application

5. Verify memory utilization

6. If memory utilization is still high, create a Jira ticket

7. If resolved, close the incident
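
The decision logic in steps 4 to 7 can be sketched in a few lines of Python. This is only an illustration of the runbook, not how Fylamynt implements it: the `Alert` class, the `remediate` function, and the two callables passed in are hypothetical names, and the 80% default is an assumption borrowed from the conditional rule discussed later in this post.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Alert:
    host: str              # server name or IP address from the alert (step 2)
    memory_percent: float  # utilization reported by the APM tool (step 1)


def remediate(
    alert: Alert,
    restart_service: Callable[[str], None],
    read_memory_percent: Callable[[str], float],
    threshold: float = 80.0,  # assumed threshold, in percent
) -> str:
    """Run steps 4 to 7 of the runbook and return the follow-up action."""
    restart_service(alert.host)                # step 4: restart the service
    current = read_memory_percent(alert.host)  # step 5: verify memory
    if current > threshold:
        return "create_ticket"                 # step 6: still high
    return "close_incident"                    # step 7: resolved


if __name__ == "__main__":
    # Stand-ins so the sketch runs without touching a real server.
    restarted = []
    action = remediate(
        Alert(host="app-01.example.internal", memory_percent=93.0),
        restart_service=restarted.append,
        read_memory_percent=lambda host: 41.5,
    )
    print(restarted, action)
```

Passing the restart and measurement steps in as callables keeps the decision separate from how the server is actually reached, which is the part the workflow below hands over to integrations.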


This might sound simple enough to tackle manually, but having to perform this task repeatedly, often at the dreaded 2 am, becomes a drag.


So let’s see how you can, using Fylamynt’s low-code workflow engine, automate the remediation of an alert received from a performance monitoring tool.


Integrations

Before building the workflow, you need to configure and authorize the required integrations.


1. Log in to Fylamynt

2. Select Settings

3. Select and configure the following integrations to be used in this workflow:

          - Trigger workflow execution with a selected New Relic Policy.

          - Alternatively, you can use Datadog, Sumo Logic, Humio, Instana, or Splunk On-Call.

          - Securely authenticate and access your SSH servers to execute commands.

          - Send messages and notifications to your teams.

          - Create or resolve incidents.

          - Alternatively, you can use Jira, Twilio, or ServiceNow.


Creating a workflow in Fylamynt

Now that we have our integrations connected, let’s create the workflow.


Step 1: Create a new trigger-based workflow

1. Log in to Fylamynt

2. On the workflow page, click “New Workflow”

3. Provide a workflow name

4. Select New Relic as the trigger type

You are now presented with the Workflow Editor, where you drag and drop Fylamynt’s action nodes as steps. It’s as simple as that.


Step 2: Add JSONPath node

The New Relic trigger is added by default and will provide the alert body in JSON output, which you can consume in any downstream node.

Since the data is in JSON format, you need to extract the relevant information: in this case, the hostname of the server that is experiencing high memory utilization, which the workflow will SSH into to restart the service.

To add and configure the JSONPath node, here are the steps:

1. From the left menu bar, drag and drop the JSONPath action node onto the canvas and connect it to the New Relic node. 

2. Select the new action node

3. On the right menu, select the JSON input.

          a. For demonstration purposes, I am going to pre-populate the JSON input with the New Relic alert first, just to show how the JSON Path expression delivers the output, and then will change back to retrieve the output of the New Relic trigger node as input for the JSONPath.

4. Change the JSON Input to “Trigger 1”

5. For Previous Step Output select “output_json”

6. Enter the JSON Path expression to extract only the relevant name

          a. “$.targets[0].labels.fullHostname”
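
To see what the expression in step 6 selects, here is a minimal Python sketch that evaluates the small subset of JSONPath used here (member names and array indexes). The `extract` helper and the sample alert body are assumptions made for illustration: a real New Relic payload contains many more fields, and this is not the JSONPath node's own implementation.

```python
import json
import re
from typing import Any

# Matches either a ".name" member step or a "[0]" index step.
_STEP = re.compile(r"\.([A-Za-z_][A-Za-z0-9_]*)|\[(\d+)\]")


def extract(document: Any, path: str) -> Any:
    """Evaluate a JSONPath limited to "$", ".name" and "[index]" steps."""
    if not path.startswith("$"):
        raise ValueError("path must start with '$'")
    position = 1
    current = document
    while position < len(path):
        match = _STEP.match(path, position)
        if match is None:
            raise ValueError(f"unsupported path syntax at offset {position}")
        name, index = match.groups()
        current = current[name] if name is not None else current[int(index)]
        position = match.end()
    return current


# Hypothetical, trimmed alert body used only for this example.
SAMPLE_ALERT = json.loads("""
{
  "targets": [
    {"labels": {"fullHostname": "App-01.Example.Internal"}}
  ]
}
""")

if __name__ == "__main__":
    print(extract(SAMPLE_ALERT, "$.targets[0].labels.fullHostname"))
```

The expression walks from the root (`$`) into the first element of `targets`, then into `labels`, and returns the `fullHostname` string.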



Step 3: Add Teleport SSH Execute node

For this example workflow, the Teleport integration is used to authenticate and access the Linux server that runs on an isolated network. Fylamynt also supports adding SSH Targets for servers that are publicly accessible, for instance your Bastion hosts; in conjunction with the SSH Execute action node, this lets you run commands and retrieve the results.


To add and configure the Teleport SSH Execute node, here are the steps:

1. From the left menu bar, drag and drop the Teleport SSH Execute action node onto the canvas and connect it to the previous JSONPath node.

2. Select the new action node

3. On the right menu, select the input tab.

4. Enter the SSH User

5. For the SSH Target Host, you will retrieve the host information from the JSONPath’s output.

6. Add the SSH Command you want to execute on the server

          a. “systemctl restart newrelic-infra.service && journalctl --unit=newrelic-infra.service -n 100 --no-pager”

7. Optionally, you can also add an S3 bucket where the execution logs will be stored.
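
If you want to reuse the same restart-then-show-logs pattern for other services, the command string can be generated. The Python sketch below is an assumption-laden helper, not part of Fylamynt: `build_restart_command` is a hypothetical name, and it simply reproduces the command above for any systemd unit, quoting the unit name so that unexpected characters cannot change the command.

```python
import shlex


def build_restart_command(unit: str, log_lines: int = 100) -> str:
    """Return a shell command that restarts a systemd unit and then
    prints its most recent journal entries."""
    if log_lines < 1:
        raise ValueError("log_lines must be at least 1")
    quoted = shlex.quote(unit)  # leaves simple unit names unchanged
    return (
        f"systemctl restart {quoted} && "
        f"journalctl --unit={quoted} -n {log_lines} --no-pager"
    )


if __name__ == "__main__":
    print(build_restart_command("newrelic-infra.service"))
```

Because the two commands are joined with `&&`, the journal is only printed when the restart itself succeeds.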


Step 4: Add String Transformation node

The next step is to transform the JSONPath output into a string that can be easily consumed in the Slack node.


To add and configure the String Transformation node, here are the steps:

1. From the left menu bar, drag and drop the String Transformation action node onto the canvas and connect it to the previous Teleport SSH Execute node.

2. Select the new action node

3. On the right menu, select the input tab.

4. For the JSON Input, you will retrieve the host information from the JSONPath’s output.

5. For the operation, select To Lowercase.




Step 5: Add Slack Send Message node

The Slack node is added to notify the users in the specified Slack channel that the service was restarted successfully.


To add and configure the Slack Send Message node, here are the steps:

1. From the left menu bar, drag and drop the Slack Send Message action node onto the canvas and connect it to the previous String Transformation node.

2. Select the new action node

3. On the right menu, select the input tab.

4. Select the Slack Channel you want to send the message to.

5. Click Add Slack Variables

6. Enter the variable name

7. For label select the String Transformation node, and as the previous step output select “string_output”

8. Click Save Variable

9. Now in the Message Text field you can consume the variable in the following way:

          a. “Service restart on the host {{hostname}} has been successfully carried out.”
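
Taken together, the String Transformation and Slack steps lowercase the hostname and substitute it into the message text. The Python sketch below imitates that behavior under stated assumptions: the `render` helper is hypothetical, the variable is assumed to be named `hostname`, and the real placeholder handling inside Fylamynt may differ.

```python
import re
from typing import Mapping

# A placeholder looks like {{name}}, with optional spaces inside the braces.
_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template: str, variables: Mapping[str, str]) -> str:
    """Replace each {{name}} placeholder with its value from variables."""

    def substitute(match):
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"no value for placeholder {name!r}")
        return variables[name]

    return _PLACEHOLDER.sub(substitute, template)


if __name__ == "__main__":
    # "To Lowercase" applied to an example hostname.
    hostname = "App-01.Example.Internal".lower()
    message = render(
        "Service restart on the host {{hostname}} has been successfully carried out.",
        {"hostname": hostname},
    )
    print(message)
```

Raising an error for an unknown placeholder is a design choice of this sketch; it makes a misspelled variable name visible instead of posting a message with a blank in it.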


Step 6: Save the workflow

1. Click the Save New Version button

Every change made to the workflow within the editor will be saved as a new version. You can also easily revert to previous versions:

2. Select the workflow name at the top of the editor, or click the Manage Versions button.




Optional action nodes

Fylamynt has over 100 actions across 38 services, with multiple integrations that you can use.

Here are some additional steps that you can add to enhance the workflow.

Approval

1. The approval node will send a message to a Slack channel where a user can approve or deny the restart of the service.

Slack Send Message

1. This action can also send a notification at the beginning of the workflow that the memory alert was received from New Relic and that the service of the application will be restarted on the affected server.

New Relic NRQL Query

1. This action node allows you to perform a query and retrieve data that can be used to verify the memory utilization metric on the host after the service was restarted.

Conditional

1. The conditional node can be used to review the new memory utilization metric after the service was restarted.

2. The rule would check whether the memory utilization is still above 80% and, if that is the case, create a ticket or incident in one of Fylamynt’s other integrations, such as Jira, ServiceNow, or PagerDuty.
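
As a rough illustration of the kind of query the New Relic NRQL Query node could run for that verification, the Python sketch below builds an NRQL string for a single host. It assumes the host reports to New Relic Infrastructure with the `SystemSample` event type and its `memoryUsedPercent` and `fullHostname` attributes; check these names against your own account. The `memory_query` helper and the five-minute window are hypothetical.

```python
import re

# Conservative allow-list so the hostname can be embedded in the query safely.
_HOSTNAME = re.compile(r"[A-Za-z0-9.-]+")


def memory_query(hostname: str, minutes: int = 5) -> str:
    """Build an NRQL query for the average memory utilization of one host."""
    if not _HOSTNAME.fullmatch(hostname):
        raise ValueError(f"unexpected characters in hostname: {hostname!r}")
    if minutes < 1:
        raise ValueError("minutes must be at least 1")
    return (
        "SELECT average(memoryUsedPercent) FROM SystemSample "
        f"WHERE fullHostname = '{hostname}' SINCE {minutes} minutes ago"
    )


if __name__ == "__main__":
    print(memory_query("app-01.example.internal"))
```

The value returned by such a query is what the conditional rule would compare against the 80% threshold.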



Incident Management - Automate the workflow

Incident Management is the business layer on top of workflow executions and is used to automatically execute the workflow that is associated with an incident type assignment.


Incident Management consists of three core pieces: Incident Types, Incident Type assignments, and the Incident itself.


To automatically execute the example workflow with the New Relic trigger, the incident type and its assignment need to be configured. Here are the steps:


Create an Incident Type:

1. Fylamynt -> Settings -> Incident Types

2. Select New Type

3. Enter the Name of the Incident type

4. Provide a description.

5. Select the name of the example workflow created 

6. Click Next

7. Leave the default AlertBody Runtime Parameter.

8. Click Create Incident Type


Policy Type Assignment:

1. Fylamynt -> Settings -> Integration -> New Relic -> Incident Type assignments

2. Click New Assignment

3. Select the New Relic Policy from the drop-down

4. Select the corresponding Incident type created in the previous step

5. Click Add Assignment

The New Relic Policy name and the associated Incident type are now visible under Incident Type Assignments. Multiple Incident type assignments can be created to associate specific integration incidents/alerts to incident types.


New Relic notification Channel:

After completing the configuration steps for the New Relic Integration, you need to create a webhook notification channel on the New Relic policy selected from the Policy Type assignment.

Steps to create a new notification channel:

1. Select Webhook as the channel type

2. Enter a channel name

3. Enter the Base URL, which is available from the New Relic integration page on the Fylamynt console.

4. Add custom headers

          a. Enter name “x-api-key”

          b. The value is available from the New Relic integration page on the Fylamynt console, by first selecting a Webhook API Key Name.

5. Click Create Channel

6. Add the channel to the Policy
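
In effect, the channel configured above makes New Relic send an HTTP POST with the alert body as JSON and the API key in the “x-api-key” header. The Python sketch below builds, but deliberately does not send, an equivalent request so you can see the shape of it. The URL and key are placeholders, `build_webhook_request` is a hypothetical helper, and whether the endpoint accepts a hand-made payload is an assumption this post does not confirm.

```python
import json
import urllib.request


def build_webhook_request(
    base_url: str, api_key: str, alert_body: dict
) -> urllib.request.Request:
    """Build (without sending) a POST that mirrors the webhook channel."""
    return urllib.request.Request(
        url=base_url,
        data=json.dumps(alert_body).encode("utf-8"),
        headers={
            "x-api-key": api_key,  # custom header from step 4
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    request = build_webhook_request(
        "https://fylamynt.example.invalid/webhook",  # placeholder Base URL
        "REPLACE_WITH_WEBHOOK_API_KEY",              # placeholder key value
        {"targets": [{"labels": {"fullHostname": "app-01.example.internal"}}]},
    )
    print(request.get_method(), request.full_url)
    print(sorted(request.header_items()))
```

Note that `urllib` normalizes header capitalization when storing them; HTTP header names are case-insensitive, so this does not change what the receiver sees.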




Automatically execute the workflow:

To trigger the workflow, you need to create some artificial memory load on the host. This can be achieved with a tool like stress-ng, for example by executing the following command, which runs two memory-stress workers for 240 seconds: “stress-ng --vm 2 --vm-bytes 1G --timeout 240s”


Wait for a memory alert to trigger on the New Relic Policy.

In Fylamynt, a new Incident is then created, where you can monitor the execution of each step.
