By now, almost everyone in the world knows about the total service and connectivity outage that afflicted Facebook, WhatsApp, Instagram and a few other services owned by Facebook.
The outage started around 11:40 a.m. ET; service started getting restored around 5:00 p.m. ET.
Rumors were flying left and right (isn’t that Facebook’s forte?) about the root cause of the outage, given the troubling revelations about Facebook’s policies that came out yesterday and in the days before.
Cloudflare’s CTO tweeted some info about how Facebook’s Internet Addresses (aka IP addresses) started getting deleted from Internet backbone routers this morning, indicating a possible misconfiguration, accidental or malicious, rather than equipment or fiber failure or a DDoS attack.
This is what the outage looked like to network engineers -
Marcus Hutchins, hacker extraordinaire, was quite confident that this was a (BGP) configuration error and not a cyber attack.
Around 5:00 p.m. ET, Facebook service started getting restored.
So, what happened?
According to blog.cloudflare.com/… and www.wired.com/…, it was a combination of DNS and BGP issues, possibly caused by a misconfiguration.
DNS, Packets and IP Addresses
The following is a very simplified description of DNS, BGP and Internet routing that will help you understand what went wrong. Those who are familiar with this technology can skip this and the next section.
DNS is the Domain Name System, which translates domain names such as www.facebook.com to their IP addresses. It is a set of servers around the Internet, each holding some subset of the millions of such mappings. It is like a global phone book. Large companies like Facebook host their own DNS servers. An IP (version 4) address is a 32-bit number, often displayed in a form such as 31.13.70.36: four numbers separated by periods, each in the range 0 to 255.
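To make the "32-bit number" point concrete, here is a small Python sketch that converts the dotted form to a single integer and back. The function names are my own, chosen for illustration; the only library used is Python's standard ipaddress module.

```python
import ipaddress


def dotted_to_int(dotted: str) -> int:
    """Return the 32-bit integer behind a dotted IPv4 address."""
    return int(ipaddress.IPv4Address(dotted))


def int_to_dotted(value: int) -> str:
    """Return the dotted form of a 32-bit integer."""
    return str(ipaddress.IPv4Address(value))


def octets(dotted: str) -> list[int]:
    """Split the 32-bit number into its four 8-bit pieces (each 0 to 255)."""
    value = dotted_to_int(dotted)
    return [(value >> shift) & 0xFF for shift in (24, 16, 8, 0)]


if __name__ == "__main__":
    example = "31.13.70.36"  # the address used as an example in this post
    number = dotted_to_int(example)
    print(example, "as one 32-bit number:", number)
    print("as four 8-bit numbers:", octets(example))
    print("and back again:", int_to_dotted(number))
```

The dotted form is only a convenience for humans; routers work with the 32-bit value.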
All data from your PC or laptop travels as “packets” of binary bits, hopping across routers to reach its intended destination. Such packets do not contain names like www.facebook.com; instead they contain two simple 32-bit numbers, aka IPv4 addresses, one for the destination and one for the source. Every router inspects these two numbers to steer the packet towards its final destination.
The DNS system's job is to translate the name www.facebook.com to its 32-bit IP address. When we type www.facebook.com in the browser, the browser sends a query to a DNS server (whose IP address it knows), either inside your organization or belonging to your Internet Service Provider. That DNS server may in turn query other servers, which may refer it onward to the servers responsible for the name. Eventually, the correct "authoritative" server is queried, and the correct IP address is passed back to your client PC. The browser or app can then send packets to Facebook using the received IP address. In addition, your browser, your PC and DNS servers generally save (cache) these discovered mappings for a short time, so that any future queries can be answered faster without querying multiple servers.
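The caching behavior described above can be sketched as a toy resolver. This is not how any real DNS software is written; the class name, the 60-second lifetime and the in-memory "authoritative" table are all assumptions made for illustration, and no network traffic is involved.

```python
import time

# Stand-in for the authoritative server's data (assumed for this sketch).
# The address is the example used earlier in this post.
AUTHORITATIVE = {"www.facebook.com": "31.13.70.36"}


class CachingResolver:
    """Toy resolver: answers from its cache while an entry is fresh,
    otherwise asks the 'authoritative' table and remembers the answer."""

    def __init__(self, authoritative, ttl_seconds=60.0, clock=time.monotonic):
        self.authoritative = authoritative
        self.ttl = ttl_seconds          # how long a cached answer stays valid
        self.clock = clock              # injectable so expiry can be tested
        self.cache = {}                 # name -> (address, expiry time)
        self.upstream_queries = 0       # how often we had to ask upstream

    def resolve(self, name):
        now = self.clock()
        cached = self.cache.get(name)
        if cached is not None and cached[1] > now:
            return cached[0]            # fresh cache hit, no upstream query
        self.upstream_queries += 1
        address = self.authoritative.get(name)
        if address is None:
            raise LookupError(f"no answer for {name}")
        self.cache[name] = (address, now + self.ttl)
        return address


if __name__ == "__main__":
    resolver = CachingResolver(AUTHORITATIVE)
    print(resolver.resolve("www.facebook.com"))
    print(resolver.resolve("www.facebook.com"))
    print("upstream queries:", resolver.upstream_queries)
```

Caching is also why a DNS failure is not felt by everyone at the same instant: clients keep working until their cached answers expire.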
BGP
So how does the Internet know which IP address or group of IP addresses is located where? That's where BGP, the Border Gateway Protocol, comes in. It is a set of procedures and packets exchanged between the various networks connected to the Internet and the routers in the Internet. Using BGP, networks such as those run by Facebook inform the rest of the Internet about the IP addresses they "own". Each router in the Internet exchanges this info with other routers and builds up a routing table, which indicates which sets of IP addresses are reachable via which next router connected to it. If a group of IP addresses is missing from the routers' tables, then packets destined for those IP addresses get dropped.
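Here is a minimal sketch of such a routing table in Python. It models only the table itself, not the BGP protocol. The class, the peer names and the prefixes are hypothetical (the prefixes come from ranges reserved for documentation), and the "most specific prefix wins" rule is standard router behavior that the description above glosses over.

```python
import ipaddress


class RoutingTable:
    """Toy routing table: maps announced prefixes to a next-hop router."""

    def __init__(self):
        self.routes = {}  # ip_network -> next hop name

    def announce(self, prefix, next_hop):
        """Record that 'prefix' is reachable via 'next_hop'."""
        self.routes[ipaddress.ip_network(prefix)] = next_hop

    def withdraw(self, prefix):
        """Forget a previously announced prefix."""
        self.routes.pop(ipaddress.ip_network(prefix), None)

    def lookup(self, address):
        """Return the next hop for 'address', or None if no prefix covers it
        (in which case a real router would drop the packet)."""
        addr = ipaddress.ip_address(address)
        matches = [net for net in self.routes if addr in net]
        if not matches:
            return None
        # The most specific (longest) matching prefix wins.
        best = max(matches, key=lambda net: net.prefixlen)
        return self.routes[best]


if __name__ == "__main__":
    table = RoutingTable()
    table.announce("203.0.113.0/24", "peer-A")     # hypothetical prefix and peer
    table.announce("203.0.113.128/25", "peer-B")   # a more specific prefix
    print(table.lookup("203.0.113.5"))
    print(table.lookup("203.0.113.200"))
    table.withdraw("203.0.113.0/24")
    table.withdraw("203.0.113.128/25")
    print(table.lookup("203.0.113.5"))  # nothing covers it any more
```

Once every prefix covering an address has been withdrawn, lookup has nothing to return, which is the "packets get dropped" case.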
Note that the DNS queries we mentioned earlier also get routed via these routers to reach the various DNS servers.
What happened with Facebook?
Experts believe that, for some reason, Facebook routers stopped advertising their network IP addresses to the Internet using BGP this morning. Hence, packets could not be routed to Facebook anymore, since Internet routers lost all record of Facebook's IP addresses. And packets could not be routed to Facebook's DNS servers either! Hence, every browser and app that needed to find Facebook's IP address using DNS started failing. With DNS failing, browsers and apps had nowhere to send their data packets. Ergo, total loss of connectivity and service. It was as if Facebook no longer existed on the Internet.
From this tweet, it appears that Facebook's BGP only stopped advertising 24 of its 116 IP networks. But the DNS servers (just 4 of them) were in these 24 networks, hence no DNS service.
Just 4 DNS servers? Looks like they got hit with the perfect storm.
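The point that withdrawing only some networks is enough, if the DNS servers live in them, can be illustrated with a toy model. Everything here is hypothetical: the prefixes come from ranges reserved for documentation, the server names and addresses are made up, and the numbers are far smaller than the 24 and 116 quoted above.

```python
import ipaddress

# Hypothetical announced prefixes (documentation ranges, not Facebook's).
ANNOUNCED = {
    ipaddress.ip_network("192.0.2.0/24"),     # holds the DNS servers
    ipaddress.ip_network("198.51.100.0/24"),
    ipaddress.ip_network("203.0.113.0/24"),   # holds the web server
}

# Hypothetical name servers and web server.
DNS_SERVERS = {"ns1.example": "192.0.2.53", "ns2.example": "192.0.2.54"}
WEB_SERVER = "203.0.113.10"


def reachable(address, announced):
    """True if some announced prefix covers the address."""
    addr = ipaddress.ip_address(address)
    return any(addr in net for net in announced)


def can_connect(name_servers, web_address, announced):
    """Model a client with an empty cache: DNS first, then the server."""
    if not any(reachable(a, announced) for a in name_servers.values()):
        return "DNS lookup fails"
    if not reachable(web_address, announced):
        return "no route to server"
    return "ok"


if __name__ == "__main__":
    print("before:", can_connect(DNS_SERVERS, WEB_SERVER, ANNOUNCED))
    # Withdraw only the prefix that holds the DNS servers.
    remaining = ANNOUNCED - {ipaddress.ip_network("192.0.2.0/24")}
    print("web server still routable:", reachable(WEB_SERVER, remaining))
    print("after:", can_connect(DNS_SERVERS, WEB_SERVER, remaining))
```

In this model the web server's own network is still announced after the withdrawal, yet a client with no cached answer cannot find it, because the name lookup fails first.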
The problem affected other companies too, especially those who host DNS servers, like Cloudflare. Failed DNS queries for Facebook resulted in these other DNS servers getting overloaded with queries redirected towards them.
Most likely, the problem originated with some software or configuration update that went awry.
There is also the possibility that they were hacked.
But one of the consequences of the problem was that engineers could no longer remotely access Facebook networking equipment.
Facebook engineers probably diagnosed the problem early, but there was the issue of reaching the correct equipment and reconfiguring it remotely. So, they probably had to physically access the equipment and connect to it locally, say using a laptop, and then fix the problem. Whatever happened to dial-up connections as backup?
What goes down must come up. So, it did.
Facebook issued a simple apology, with no explanation.
Lessons learned?
Leaving aside, for a moment, all the social problems caused by Facebook in the charged political situation we have today, and the fact there is no love lost on Facebook around these parts, Facebook is used by billions of users and businesses on a daily basis. It is a part of our society and our infrastructure fabric. Facebook is not just a gossip and news/disinformation site; most organizations, small and big, have a presence on Facebook — for marketing, for disseminating information, for customer support, for meetings, etc. I don’t use Facebook, I am sure many of us here don’t either, but billions do. We have to think beyond ourselves if we want to solve society’s problems. And not every one of the 7.9 billion people in the world is like you or me.
The problem today points to the fact that our systems are so fragile that a single misconfiguration, accidental or malicious, can bring an entire system down. The problem may resurface with other Internet companies. What can be done to improve resilience? How easy is it for a malicious hacker or a state actor to do something like this or worse? What role should governments, outside experts and the public play in a technology that affects billions of people? I am sure these questions keep many experts awake, trying to find solutions that do not require ripping out everything we have and replacing it with something entirely new.
As for Facebook, maybe the long-term solution is to break up its monopoly. Maybe it needs better rules and regulations. Perhaps it needs oversight by outside experts. Wishing for Facebook's demise is a false solution; if Facebook fails, some other company will take its place. The genie is out of the bottle; we have to manage it.
What do you think? Give us your informed comments on how we can improve things, given that we live in a complex world, under the control of technology that is often cobbled together from bits and pieces in an ad hoc manner and technology (esp. software) that few people fully understand.
Further Reading
- Understanding How Facebook Disappeared from the Internet — blog.cloudflare.com/…
- Why Facebook, Instagram, and WhatsApp All Went Down Today — www.wired.com/…
- What Happened to Facebook, Instagram, & WhatsApp? — krebsonsecurity.com/…
- 101: Why BGP Hijacking Just Won't Die — www.darkreading.com/...
- The Confessions of Marcus Hutchins, the Hacker Who Saved the Internet — www.wired.com/… (unrelated to this issue but a terrific read)
Update
Post-mortem from Facebook confirming the above and providing some more details -