EKS Cluster Management with Fylamynt

Manage your EKS cluster with various Fylamynt Actions

Jal Jalali Ekram
December 15, 2020

Investigating logs is one of the crucial parts of troubleshooting in Kubernetes clusters. Container logs in pods help SREs identify and resolve incidents. As a result, the availability and the analysis of these logs are crucial to resolving incidents in a timely manner. Fylamynt workflows can automate backup and enable analysis of these logs long after the pod is gone. Fylamynt enables automation on Kubernetes clusters in Amazon Elastic Kubernetes Service (EKS) and enhances the troubleshooting experience.

To automate the process of capturing logs and storing them for later use, the SRE team would need to build and maintain a robust infrastructure. This infrastructure may include a fleet of servers equipped with sufficient credentials to call the Kubernetes and S3 services (or other storage services) in AWS. On these servers, a set of commands needs to be executed to fetch and copy logs. First, for each Kubernetes cluster, the kubeconfig needs to be fetched in order to make additional requests.

aws eks --region <region> update-kubeconfig --name <cluster_name>

To get the list of dead pods:

kubectl get pods --namespace=<namespace> --field-selector=status.phase=Failed
kubectl get pods --namespace=<namespace> --field-selector=status.phase=Succeeded

Once there is a full list of dead pods, the following command needs to be executed for each pod to fetch its logs:

kubectl logs <pod_name>

As each pod log is fetched, it needs to be (temporarily) written to a local file. Then, the file needs to be uploaded to the preferred storage. For S3, the command is:

aws s3 cp <path_to_local_file> s3://<path_in_s3>

At the end, all resources and files must be cleaned up.
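The manual steps above can be sketched as a single shell function. This is an illustrative sketch, not Fylamynt's implementation: the function name and its parameters (region, cluster, namespace, bucket) are placeholders chosen for this example.

```shell
# Sketch of the manual log-backup pipeline described above.
# Nothing runs until the function is invoked with real values.
copy_dead_pod_logs() {
  local region="$1" cluster="$2" namespace="$3" bucket="$4"

  # Fetch the kubeconfig for this cluster
  aws eks --region "$region" update-kubeconfig --name "$cluster"

  # Collect dead pods (Failed and Succeeded phases) as bare resource names
  local pods
  pods=$(kubectl get pods --namespace="$namespace" \
           --field-selector=status.phase=Failed -o name
         kubectl get pods --namespace="$namespace" \
           --field-selector=status.phase=Succeeded -o name)

  # Write each pod's logs to a temporary file, then upload it to S3
  local tmpdir
  tmpdir=$(mktemp -d)
  local pod name
  for pod in $pods; do
    name="${pod#pod/}"   # "pod/my-app-1234" -> "my-app-1234"
    kubectl logs --namespace="$namespace" "$pod" > "$tmpdir/$name.log"
    aws s3 cp "$tmpdir/$name.log" "s3://$bucket/pod-logs/$name.log"
  done

  # Clean up local temporary files
  rm -rf "$tmpdir"
}
```

A server running this would still need IAM permissions for `eks:DescribeCluster` and `s3:PutObject`, plus Kubernetes RBAC access to read pod logs, which is exactly the privileged infrastructure discussed next.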

Because of the privileges these servers hold, access to them should be restricted and every action must be logged. Action logs should also be analyzed for suspicious activity, and appropriate personnel must be notified.

Maintaining this log-copying infrastructure can become a significant time overhead for the team, and an expensive solution for the company.

Solution: Fylamynt’s EKS Actions

Fylamynt provides an integration with every AWS service, including EKS. Using a Fylamynt workflow, a series of actions can be performed directly on a Kubernetes cluster. For example, logs from a specific Kubernetes pod can be copied to an Amazon S3 bucket using the Copy Pod Logs action (under EKS Actions).

Other examples include listing all the dead pods, describing clusters, and so on. These actions can be combined with integrations like CloudWatch, Datadog, Slack, JIRA, and many others to build a workflow that meets business needs.

Copy Logs Workflow

EKS-Copy-Dead-Pod-Logs is a Fylamynt workflow that makes sure important pod logs are never lost. During execution, this workflow finds all the terminated pods and copies their logs to a given S3 bucket.

In the bucket, each pod's logs are stored separately, and all the keys share the prefix ‘fylamynt-pod-logs’.

This workflow can also be executed periodically by setting a schedule in Fylamynt. EKS-Copy-Dead-Pod-Logs can be duplicated and then modified to include a suitable trigger (e.g., Datadog) and/or alerts (e.g., PagerDuty).

See this workflow in action