Global outage at Facebook, WhatsApp and Instagram

Prasen Shelar
October 4, 2021

Thirty minutes after the first reports of the outage, Facebook tweeted: "We're working to get things back to normal as quickly as possible, and we apologize for any inconvenience."


“To make error is human. To propagate error to all server in automatic way is #devops.” - DevOps Borat


As we all know, it takes a village to maintain high availability (say, 99.99%) and hundreds of late-night hours to fix problems in cloud infrastructure and applications, especially when we’re dealing with a beast like Facebook, where downtime costs an estimated $13.3 million per hour.


Facebook’s extended downtime, along with that of Instagram, WhatsApp, and Messenger, leads us all to the same question: “What’s really wrong with Facebook, and when will it be back?”


As a cloud workflow automation company (Fylamynt) built for DevOps and SREs, we started thinking about this incident and want to add one more relevant question to an already long list of headaches: “What could we have done to avoid this problem?”


The Real Problem


To the naked eye, all you see is the message: "This site can't be reached. Check if there is a typo in facebook.com. If spelling is correct, try running Windows Network Diagnostics."


But as they say, the devil is in the details. Facebook’s DNS names stopped resolving, and their infrastructure IPs were unreachable. Even without knowing more about Facebook’s internal infrastructure, we can still poke at it from the outside and gauge the situation.


It seems that the root cause was a configuration change that took down the BGP peering between Facebook and its service providers. As a result, all routes advertised by Facebook were withdrawn, including those to its DNS servers, so outsiders can no longer reach Facebook.


"Between 15:50 UTC and 15:52 UTC Facebook and related properties disappeared from the Internet in a flurry of BGP updates," explained John Graham-Cumming, CTO of Cloudflare.


"The BGP routes pointing traffic to Facebook's IP address space have been withdrawn. The Internet no longer knows where to find Facebook's IPs. One symptom is that DNS requests are failing," added Johannes B. Ullrich, Ph.D., Dean of Research at the SANS Technology Institute.


It was as if someone had "pulled the cables" from their data centers all at once and disconnected them from the Internet.


An Analysis

Facebook going down isn’t surprising in itself; we have seen mishaps in the past. Just last month we noticed a similar DNS issue with Slack. The power of BGP is well known, yet it is still surprising how long this outage lasted and how widespread it was.


“What it boils down to: running a LARGE, even by Internet standards, distributed system is very hard, even for the very best,” Bellovin tweeted.


In general terms, what could we do to help resolve this issue? 


DIGing into the problem a bit, we can see that Facebook’s DNS entries are unus.


Of course, most sites aren't the size of Facebook and don't need their own custom DNS infrastructure. Even if a site uses a managed DNS solution like Amazon Route 53, its operators should still be concerned about monitoring their DNS and being able to react if problems arise.
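As a rough illustration of what basic DNS monitoring could look like, here is a minimal sketch that tries to resolve a hostname through the system resolver and reports the result rather than raising. The function names and the shape of the returned dictionary are our own assumptions for this example; a real monitor would likely also query specific name servers, run on a schedule, and alert on repeated failures.

```python
import socket
from typing import Callable, List


def resolve_ipv4(hostname: str) -> List[str]:
    """Return the sorted, de-duplicated IPv4 addresses the system resolver gives."""
    infos = socket.getaddrinfo(
        hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def check_dns(hostname: str, resolver: Callable[[str], List[str]] = resolve_ipv4) -> dict:
    """Run one resolution attempt and describe the outcome (hypothetical report shape)."""
    try:
        addresses = resolver(hostname)
    except socket.gaierror as exc:
        # Resolution failed outright, which is what clients saw during the outage.
        return {"hostname": hostname, "ok": False, "addresses": [], "error": str(exc)}
    return {"hostname": hostname, "ok": bool(addresses), "addresses": addresses, "error": None}
```

Calling check_dns("www.example.com") from a scheduled job and alerting when "ok" is False for several consecutive runs would be one simple way to notice this class of problem early. The resolver parameter exists only so the check can be exercised without network access.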


There are any number of ways to approach remediation of a DNS problem, depending on how the records were created in the first place. If a tool like Terraform was used to generate the entries, it’s probably best to use it to restore them, assuming it’s not the source of the error. Another approach is to make periodic backups of the Route 53 hosted zone, and then use the JSON-format backup to restore the zone to a known good state.
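The periodic-backup approach could be sketched roughly as follows. This assumes a boto3 Route 53 client is passed in (for example, boto3.client("route53")); the function names and the layout of the backup document are our own choices for illustration, not a prescribed format.

```python
import json
from typing import Any, Dict, List


def fetch_record_sets(route53_client: Any, hosted_zone_id: str) -> List[Dict[str, Any]]:
    """Collect every record set in the zone, following pagination."""
    paginator = route53_client.get_paginator("list_resource_record_sets")
    records: List[Dict[str, Any]] = []
    for page in paginator.paginate(HostedZoneId=hosted_zone_id):
        records.extend(page["ResourceRecordSets"])
    return records


def zone_backup_json(route53_client: Any, hosted_zone_id: str) -> str:
    """Serialize the zone's record sets as JSON (hypothetical backup layout)."""
    backup = {
        "HostedZoneId": hosted_zone_id,
        "ResourceRecordSets": fetch_record_sets(route53_client, hosted_zone_id),
    }
    return json.dumps(backup, indent=2, sort_keys=True)
```

The returned string can be written to a file or an object store on a schedule, so that a known good copy of the zone is always available.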


As an example, a customer using Route 53 to direct traffic to an AWS EC2 instance might have a configuration that has an SOA record for their base domain name, a few name server (NS) entries, and a CNAME record pointing to a load balancer that serves their site.


In Fylamynt, they can see their hosted zone information with an AWS node that calls the Route 53 ListResourceRecordSets API with a Hosted Zone ID.


That will output a hosted zone record like the following:
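For the setup described above, the response would have roughly the shape sketched here, written as a Python dictionary. Every name and value is a hypothetical placeholder (example.com, made-up name servers and load balancer hostname), not output from a real zone; only the field names follow the ListResourceRecordSets response structure.

```python
from typing import Any, Dict, List

# Hypothetical response for an illustrative zone; all values are placeholders.
EXAMPLE_RESPONSE: Dict[str, Any] = {
    "ResourceRecordSets": [
        {
            "Name": "example.com.",
            "Type": "SOA",
            "TTL": 900,
            "ResourceRecords": [
                {"Value": "ns-1.example.net. hostmaster.example.com. 1 7200 900 1209600 86400"}
            ],
        },
        {
            "Name": "example.com.",
            "Type": "NS",
            "TTL": 172800,
            "ResourceRecords": [
                {"Value": "ns-1.example.net."},
                {"Value": "ns-2.example.net."},
            ],
        },
        {
            "Name": "www.example.com.",
            "Type": "CNAME",
            "TTL": 300,
            "ResourceRecords": [{"Value": "lb.example.net."}],
        },
    ],
    "IsTruncated": False,
    "MaxItems": "100",
}


def summarize(response: Dict[str, Any]) -> List[str]:
    """Render each record set as one 'name type ttl values' line."""
    lines = []
    for record_set in response["ResourceRecordSets"]:
        values = ", ".join(r["Value"] for r in record_set.get("ResourceRecords", []))
        lines.append(f'{record_set["Name"]} {record_set["Type"]} {record_set.get("TTL", "-")} {values}')
    return lines
```

A one-line-per-record summary like this is handy for eyeballing a zone or diffing two backups.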


In a DNS emergency, this can be restored to Route 53 fairly quickly (or quickly deployed to a new region, if Route 53 itself fails) using the Route 53 ChangeResourceRecordSets call, which could easily be automated in Fylamynt when a problem with DNS is detected.
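A restore along these lines might look like the sketch below. It assumes the backup is a JSON document with a "ResourceRecordSets" list in the shape that ListResourceRecordSets returns, and that a boto3 Route 53 client is passed in. The function names, the batch size of 100, and the decision to skip the apex NS and SOA records (which a newly created zone gets with its own values) are our assumptions for illustration.

```python
import json
from typing import Any, Dict, List


def build_restore_changes(
    record_sets: List[Dict[str, Any]], zone_name: str, skip_apex_ns_soa: bool = True
) -> List[Dict[str, Any]]:
    """Turn backed-up record sets into UPSERT changes.

    zone_name must carry the trailing dot, as Route 53 returns it.
    """
    changes = []
    for record_set in record_sets:
        is_apex_infra = record_set["Name"] == zone_name and record_set["Type"] in ("NS", "SOA")
        if skip_apex_ns_soa and is_apex_infra:
            continue  # leave the zone's own NS/SOA records untouched
        changes.append({"Action": "UPSERT", "ResourceRecordSet": record_set})
    return changes


def restore_zone(
    route53_client: Any,
    hosted_zone_id: str,
    backup_json: str,
    zone_name: str,
    batch_size: int = 100,
) -> List[Dict[str, Any]]:
    """Apply the backup in batches via ChangeResourceRecordSets; return the responses."""
    record_sets = json.loads(backup_json)["ResourceRecordSets"]
    changes = build_restore_changes(record_sets, zone_name)
    responses = []
    for start in range(0, len(changes), batch_size):
        batch = changes[start:start + batch_size]
        responses.append(
            route53_client.change_resource_record_sets(
                HostedZoneId=hosted_zone_id,
                ChangeBatch={"Comment": "Restore from backup", "Changes": batch},
            )
        )
    return responses
```

UPSERT is used so the same restore can be run against an existing zone (overwriting drifted records) or a fresh one (creating them). It will not delete records that exist in the zone but not in the backup; that would need a separate comparison step.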


A continuing problem?


Outages will grow in number, and their effects will be massive on today’s ever-growing Internet. What matters most in such cases is the initial diagnosis, a phase that tends to bring a lot of chaos and confusion.


If it can happen to Facebook, it can happen to you!


What is your disaster recovery and backup plan?


We are following this outage and post-mortem information closely and we will provide more detailed analysis later.