What is SRE (Site Reliability Engineering)?
Site reliability engineering (SRE) is a software engineering (developer) approach to IT operations (ops). SRE teams manage systems, handle scale, firefight incidents/problems and automate some operational tasks.
The term SRE was coined by Google's engineering team when they realized that the duties and responsibilities required had diverged significantly from traditional IT/DevOps. One of the key differences is the use of code to solve problems within cloud-native systems and infrastructure.
Any system that requires high availability and/or scalability benefits from SRE as a dedicated practice.
SRE can also stand for site reliability engineer, the individual who practices site reliability engineering. SREs perform many tasks, all focused on the production cloud environment.
Common SRE tasks:
- Scaling the system
- Optimizing cloud spend
- Remediating incidents (when things break)
- Runbook automation
- Patching and upgrades
SREs often write custom code (software) to link systems together, and create workflows (often called runbooks) to automate some or all of a cloud system's operational needs.
What does an SRE do?
At a high level, an SRE is responsible for ensuring that systems run 24/7 and can handle scale as needed. Achieving this requires a lot of tools and expertise, not to mention often having to “carry the pager” and handle incidents at any time of the day or night.
Historically, SREs came from the software development or sysadmin worlds and became a hybrid of the two. SREs are responsible for several areas:
- How code is deployed into the production environment.
- Using systems to monitor proper operations.
- Using tools to alert the appropriate people when systems aren’t functioning properly (or are at risk of not functioning properly).
- Configuring systems appropriately for optimal performance or cost reduction.
- Keeping latency of systems within acceptable limits.
- Keeping track of changes in systems, both as a historical record and, in many cases, to comply with industry standards and certifications.
- Quickly reacting to and mitigating cloud incidents as they happen.
- Optimizing systems, often with automation, to reduce MTTR (Mean Time To Recovery/Repair/Resolution): when things break, fix them as quickly as possible.
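To make the MTTR metric concrete, here is a minimal sketch of how it could be computed from incident records. It assumes each incident is a pair of (detected, resolved) timestamps; the function name and the sample incidents are made up for illustration and do not come from any real monitoring tool.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Average time between detection and resolution across incidents."""
    if not incidents:
        raise ValueError("at least one incident is required")
    # Sum the duration of every incident, starting from a zero timedelta.
    total = sum((resolved - detected for detected, resolved in incidents), timedelta())
    return total / len(incidents)


# Hypothetical incident log: (detected, resolved) pairs.
incidents = [
    (datetime(2024, 1, 3, 9, 0), datetime(2024, 1, 3, 9, 45)),       # 45 minutes
    (datetime(2024, 1, 10, 22, 30), datetime(2024, 1, 10, 23, 45)),  # 75 minutes
    (datetime(2024, 1, 21, 14, 0), datetime(2024, 1, 21, 14, 30)),   # 30 minutes
]

print(mean_time_to_recovery(incidents))  # 0:50:00
```

Tracking this number over time is one way to see whether automation is actually shortening recovery.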
One of the primary outputs of an SRE is the runbook, also called a workflow. Many situations happen repeatedly, so it makes sense to create a repeatable process to handle them. Tying steps together in an automated way is how SREs optimize their processes. Common workflows deal with things like cost optimization or incident remediation.
For example, an SRE might create a workflow that runs on a daily basis for cost optimization (rightsizing). A simplified workflow for this could have the following steps:
- Check instance utilization
- If usage has remained under 50% for the last 24 hours, reduce instance size
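The two steps above can be sketched as a small decision function. This is only an illustration: the size ladder, the function names, and the use of hourly utilization samples are assumptions, and a real workflow would read metrics from a monitoring service and call the cloud provider to apply the resize.

```python
# Hypothetical instance sizes, ordered from smallest to largest.
SIZE_LADDER = ["small", "medium", "large", "xlarge"]
UTILIZATION_THRESHOLD = 50.0  # percent, taken from the workflow described above


def next_smaller_size(current_size):
    """Return the size one step down the ladder, or None if already the smallest."""
    index = SIZE_LADDER.index(current_size)
    return SIZE_LADDER[index - 1] if index > 0 else None


def plan_resize(current_size, hourly_utilization):
    """Return the size to shrink to, or None if the instance should be left alone."""
    if len(hourly_utilization) < 24:
        return None  # not enough data to cover the last 24 hours
    last_24_hours = hourly_utilization[-24:]
    # "Remained under 50%" means every sample is below the threshold.
    if max(last_24_hours) >= UTILIZATION_THRESHOLD:
        return None
    return next_smaller_size(current_size)


print(plan_resize("large", [30.0] * 24))           # medium
print(plan_resize("large", [30.0] * 23 + [80.0]))  # None
```

Checking the maximum rather than the average is a deliberately conservative choice here: a single busy hour is enough to keep the current size.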
Similarly, an SRE might create a workflow for replacing a bad EC2 instance (incident runbook):
- Alert from AWS Health
- Spin up new instance
- Reroute traffic
- Kill old instance
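As a rough sketch, the four steps above could be wired together as follows. To keep the example runnable without cloud credentials, it uses an in-memory stand-in: the `FakeCloud` class, its method names, and the shape of the alert dictionary are all hypothetical and are not AWS APIs.

```python
class FakeCloud:
    """In-memory stand-in for a cloud provider. All method names are hypothetical."""

    def __init__(self, instances, traffic_target):
        self.instances = set(instances)
        self.traffic_target = traffic_target
        self._launch_count = 0

    def launch_instance(self):
        self._launch_count += 1
        new_id = f"i-new-{self._launch_count}"
        self.instances.add(new_id)
        return new_id

    def route_traffic_to(self, instance_id):
        self.traffic_target = instance_id

    def terminate_instance(self, instance_id):
        self.instances.discard(instance_id)


def replace_bad_instance(cloud, alert):
    """Run the runbook for one alert and return the replacement instance ID."""
    bad_id = alert["instance_id"]            # step 1: alert names the bad instance
    if bad_id not in cloud.instances:
        return None                          # nothing to do; already gone
    new_id = cloud.launch_instance()         # step 2: spin up new instance
    cloud.route_traffic_to(new_id)           # step 3: reroute traffic
    cloud.terminate_instance(bad_id)         # step 4: kill old instance
    return new_id


cloud = FakeCloud(instances=["i-old"], traffic_target="i-old")
replacement = replace_bad_instance(cloud, {"instance_id": "i-old"})
print(replacement, cloud.traffic_target, sorted(cloud.instances))
```

The order matters: traffic moves to the new instance before the old one is terminated, so there is no window in which requests have nowhere to go.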
These examples are very simplified. Real runbooks have many more steps, with conditional branches, and can include what is called a “human in the loop”: a defined pause point in the runbook that allows a human to verify the situation and authorize appropriate actions.
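A human-in-the-loop pause point can be sketched as a step flag plus an approval callback. The `Step` type, the `run_runbook` function, and the step names below are illustrative assumptions, not the design of any particular runbook tool; in practice the callback might post to a chat channel or a paging system and wait for a reply.

```python
from typing import Callable, NamedTuple


class Step(NamedTuple):
    name: str
    action: Callable[[], None]
    needs_approval: bool = False  # marks a human-in-the-loop pause point


def run_runbook(steps, approve):
    """Run steps in order; stop at the first step a human declines to approve.

    Returns the names of the steps that actually ran.
    """
    completed = []
    for step in steps:
        if step.needs_approval and not approve(step.name):
            break  # the human rejected this step, so halt the runbook here
        step.action()
        completed.append(step.name)
    return completed


log = []
steps = [
    Step("spin up new instance", lambda: log.append("launched")),
    Step("reroute traffic", lambda: log.append("rerouted")),
    Step("kill old instance", lambda: log.append("terminated"), needs_approval=True),
]

# Simulate a human who declines the destructive final step.
done = run_runbook(steps, approve=lambda step_name: False)
print(done)  # ['spin up new instance', 'reroute traffic']
```

Placing the pause only on the destructive step keeps the fast, safe steps fully automated while leaving the irreversible one to a person.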
SREs look for repeatable processes and then automate as much of them as they can, both to simplify their job and to maintain the highest availability possible. No SRE team expects systems to have 100% uptime; instead, they plan for incidents and create processes to address them quickly.
Runbook vs Playbook
Many in the cloud engineering space use the terms runbook and playbook interchangeably. However, they are quite different. A playbook is a larger, overarching concept that can include several runbooks and the personnel involved. It is more of a planning concept than a specific execution plan.
Following runbook best practices is often hard to do manually. It’s important to automate your runbooks so that they run consistently and without errors, while keeping humans in the loop.
There are many categories of tools that SREs use to effectively maintain cloud operations. These include monitoring, logging, alerting, incident management, orchestration, and workflow automation and execution.
Fylamynt has created the world’s first enterprise-ready low-code platform for building, running, and analyzing SRE cloud workflows, and we welcome you to try it out. We provide runbook examples and guide you through creating runbooks.