Wow, hard to believe we are at the end of 2021! It’s been a year of change in more ways than one, for the world and for us at Fylamynt. As we approach the new year with renewed optimism, here’s a quick look back at how we’ve evolved into a modern cloud incident response platform to better serve SRE and DevOps engineers.
Errors need fixing, and incidents require troubleshooting that can consume a lot of time before normal workflows are restored. This has led companies to strengthen their incident response to these kinds of events by adopting advanced technology and innovations. One of these is the automation of incident response.
Here are the top 5 things you need to make a good runbook. Consistency: the same actions are taken every time. Accessibility: the runbook needs to be easy for the response team to find and use. Automation: the runbook should automate the mundane and time-consuming tasks. Observability: you need to record the inputs and outputs of every step for later analysis. Leveling Up: based on the results (what you observed), the runbook should be dynamic and constantly improving.
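To make the consistency and observability points concrete, here is a minimal sketch in Python. It is an illustration only, not Fylamynt's implementation: the `Runbook` and `StepRecord` classes, the step names, and the 5% error-rate threshold are all hypothetical. The steps always run in the same order, and the inputs and output of each step are recorded for later analysis.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class StepRecord:
    """What one step saw and what it produced (hypothetical structure)."""
    name: str
    inputs: dict[str, Any]
    output: Any


@dataclass
class Runbook:
    """A hypothetical runbook: an ordered list of named steps."""
    name: str
    steps: list[tuple[str, Callable[[dict[str, Any]], Any]]] = field(default_factory=list)

    def step(self, name: str):
        """Decorator that appends a function to the runbook as a named step."""
        def register(fn):
            self.steps.append((name, fn))
            return fn
        return register

    def run(self, context: dict[str, Any]) -> list[StepRecord]:
        """Run every step in order and record its inputs and output."""
        records = []
        for name, fn in self.steps:
            inputs = dict(context)   # snapshot of what the step could see
            output = fn(context)
            context[name] = output   # later steps can read earlier results
            records.append(StepRecord(name, inputs, output))
        return records


# Example runbook with two illustrative steps.
restart = Runbook("restart-service")


@restart.step("check_health")
def check_health(ctx):
    # Assumed threshold: treat an error rate above 5% as unhealthy.
    return ctx["error_rate"] > 0.05


@restart.step("decide_action")
def decide_action(ctx):
    return "restart" if ctx["check_health"] else "no-op"


log = restart.run({"error_rate": 0.2})
for record in log:
    print(record.name, record.inputs, record.output)
```

Because every run produces the same kind of record, the log can be reviewed after an incident to decide which steps to change, which is the "leveling up" part of the list.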
Today’s outage in Amazon Web Services' us-east-1 cloud region is impacting customers globally, resulting in lost revenue. Outages like this are not isolated incidents; they can happen at any time and on any public cloud service provider. How do you deal with them? Do you maintain application-level high availability, do you wait for services to return to normal, or do you take manual action by failing over or migrating your workloads to a healthy region to retain uptime?
When using cloud-native services, you will undoubtedly have cloud incidents that disrupt the normal operation of your systems. No SRE team believes it can achieve 100% uptime. Instead, teams plan ahead, trying to anticipate what could go wrong (or has gone wrong in the past), and create runbooks (sometimes called pipelines or workflows) to get things back to normal as quickly as possible.
Site Reliability Engineering: the term was coined by Google's engineering team when they realized that the duties and responsibilities required had deviated significantly from traditional IT/DevOps. One of the key differences is the use of code to help solve problems within cloud-native systems and infrastructure.
I’ll start in 1811 England. Mechanized looms and knitting frames were spreading, allowing lower-skilled laborers to operate them and produce lower-quality products that ruined the artisans’ reputation for quality. The name Luddites was coined, and this group of people went on to physically smash the machines, eventually causing Parliament to make frame-breaking a hanging offense.
Gone are the chill days when an on-call person could take hours to fix a problem and restore service for their customers. According to Statista, more than 50% of enterprises reported that they lost $500K or more per hour due to server downtime.
At a high level, an SRE is responsible for ensuring that systems run 24/7 and can scale as needed. Achieving this requires a lot of tools and expertise, not to mention often having to “carry the pager” and handle incidents at any time of the day or night.
Facebook going down isn’t surprising; we have seen mishaps in the past. Just last month we noticed a similar DNS issue with Slack. The power of BGP is well known, yet it is still surprising how long this outage lasted and how widespread it really was.
As cloud computing adoption grows, automation is taking center stage in the cloud-native space as a way to streamline and manage various IT-related tasks. In this article, we will briefly discuss cloud automation and several related aspects.
Today’s SaaS applications are constantly changing and require real-time observability. Instana is an automated Application Performance Management (APM) solution designed specifically for the challenges of managing microservices and cloud-native applications. Once you have observability, you want to act on the insights by triggering a specific workflow based on what you find. In this article, you will learn how to do this using Fylamynt’s triggers built for Instana.
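The general idea of triggering a workflow from an observed event can be sketched in a few lines of Python. This is a generic illustration and does not use the Fylamynt or Instana APIs: the event fields (`type`, `service`), the `on_event` decorator, and the `scale_out` workflow are hypothetical names chosen for the example.

```python
from typing import Callable, Dict, Optional

# A workflow takes an event (a plain dict here) and returns a short summary.
Workflow = Callable[[dict], str]

# Hypothetical registry that maps an event type to the workflow it triggers.
TRIGGERS: Dict[str, Workflow] = {}


def on_event(event_type: str):
    """Decorator that registers a workflow for one event type."""
    def register(fn: Workflow) -> Workflow:
        TRIGGERS[event_type] = fn
        return fn
    return register


@on_event("high_latency")
def scale_out(event: dict) -> str:
    # A real workflow would call cloud APIs; this one only describes the action.
    return f"scale out {event['service']}"


def dispatch(event: dict) -> Optional[str]:
    """Run the workflow registered for the event's type, if there is one."""
    workflow = TRIGGERS.get(event.get("type"))
    return workflow(event) if workflow else None


print(dispatch({"type": "high_latency", "service": "checkout"}))
print(dispatch({"type": "disk_full", "service": "checkout"}))
```

Events with no registered workflow are ignored rather than raising an error, so new alert types do not break the dispatcher.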