Payment Processor Uptime: Dramatically Increasing
Have you seen payment processor uptime dramatically increasing? An up-and-coming online payment processor is taking aim at a market dominated by a few very large players.
They already have some interesting angles, and more plans in the works, to differentiate themselves from the more established companies, which are also mired in tech debt. To succeed at a challenge this large, they need to execute their strategic plan and roadmap extremely well, and they can’t afford anything that damages their reputation or raises doubt in the minds of their prospects about their ability to perform.
If their platform fails to process payments, their growth plans and market success could fall apart very quickly. To this end, they have invested heavily in their cloud infrastructure team and spent considerable time building out runbooks and designing their systems to be as resilient as possible.
Running critical services in the cloud requires handling scale and performing well at all times. Most important of all, though, is ensuring that every customer payment is processed and that the platform experiences no downtime.
When alarms go off, the SRE team needs to jump on the situation and fix it as quickly as possible. As you can imagine, this requires significant monitoring and redundancy, but most importantly as much “safe” automation as possible. Fully automating scenarios where traffic is rerouted, calls are retried or instances are destroyed often presents too much risk for executive appetites.
Fylamynt allowed them to turn their runbooks into workflows that automate the time-consuming parts of remediation while allowing a human in the loop to pause execution before “scary” actions are taken. With Fylamynt, they were able to remove their hard-coded connections and ensure that the same correct actions were taken in the same order every time the same type of incident occurred. This consistency and predictability helped them reduce their MTTR by 42%.
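To make the human-in-the-loop idea concrete, here is a minimal sketch in plain Python. It is not Fylamynt's API or the customer's actual runbook. The `Step` and `Workflow` classes, the step names and the `approve` callback are all hypothetical, chosen only to show steps running in a fixed order with a pause before any step marked risky.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Step:
    """One runbook step (hypothetical structure, for illustration only)."""
    name: str
    action: Callable[[], str]
    risky: bool = False  # risky steps need human approval before running


@dataclass
class Workflow:
    steps: List[Step]
    log: List[str] = field(default_factory=list)

    def run(self, approve: Callable[[Step], bool]) -> bool:
        """Run steps in order. Return True only if every step completed."""
        for step in self.steps:
            # Pause before a "scary" action unless a human approves it.
            if step.risky and not approve(step):
                self.log.append(f"paused before: {step.name}")
                return False
            self.log.append(f"{step.name}: {step.action()}")
        return True


def build_restart_workflow() -> Workflow:
    """Assumed example incident: an unhealthy instance must be replaced."""
    return Workflow(steps=[
        Step("collect_metrics", lambda: "metrics collected"),
        Step("drain_traffic", lambda: "traffic drained", risky=True),
        Step("terminate_instance", lambda: "instance terminated", risky=True),
    ])
```

Because the steps live in one ordered list, the same actions run in the same order for every incident of this type, and the safe, time-consuming work (here, collecting metrics) finishes before anyone is asked to approve anything.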
They plan to keep using Fylamynt to automate every area of their operations where it makes sense. This includes more incident response and remediation, as well as scheduled workflows for things like cost optimization (orphaned disks, oversized instances, and others).
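A scheduled cost check of this kind might look something like the sketch below. The `Disk` record and its fields are assumptions made for illustration. A real workflow would pull the inventory from the cloud provider, which is not shown here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Disk:
    """Simplified inventory record (hypothetical fields)."""
    disk_id: str
    attached_to: Optional[str]  # instance ID, or None if unattached
    size_gb: int


def find_orphaned_disks(disks: List[Disk]) -> List[Disk]:
    """Return disks that are not attached to any instance."""
    return [d for d in disks if d.attached_to is None]


def reclaimable_gb(disks: List[Disk]) -> int:
    """Total size of orphaned disks, a rough proxy for wasted spend."""
    return sum(d.size_gb for d in find_orphaned_disks(disks))
```

A scheduled run would report the orphaned disks for review rather than delete them outright, in keeping with the human-in-the-loop approach to risky actions.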
Fylamynt has created the world’s first low-code incident response and remediation platform for building, running and analyzing SRE cloud workflows. With Fylamynt, an SRE can automate the most time-consuming parts of the runbook, freeing them to make decisions where their expertise is needed.