On Monday evening, Facebook, along with its subsidiaries Instagram and WhatsApp, the world’s favourite messaging system, suddenly went down globally. Millions of users were affected and unable to access the services or send and receive messages. The outage lasted several hours across the world, including in India.
Meanwhile, Facebook tweeted, “We’re aware that some people are having trouble accessing our apps and products. We’re working to get things back to normal as quickly as possible, and we apologize for any inconvenience”.
WhatsApp also took to Twitter to convey its apologies: “Apologies to everyone who hasn’t been able to use WhatsApp today. We’re starting to slowly and carefully get WhatsApp working again.
Thank you so much for your patience. We will continue to keep you updated when we have more information to share”.
Instagram followed suit and wrote, “Instagram and friends are having a little bit of a hard time right now, and you may be having issues using them. Bear with us, we’re on it!”.
What caused the outage?
Cybersecurity experts and researchers have tried explaining the main cause of the global outage. According to Brian Krebs, a cybercrime journalist, it was a Border Gateway Protocol (BGP) issue.
Understanding Border Gateway Protocol (BGP)
Border Gateway Protocol is the underlying protocol of the internet’s global routing system. It enables autonomous systems (AS) to exchange routing and reachability information with one another. The internet is made up of a massive network of these autonomous systems, and the list of possible routes between them needs to be continuously updated. In simple terms, BGP is the gateway protocol that makes the internet work.
If, for any reason, BGP stops functioning, routers will not know what to do and the internet grinds to a halt. The Domain Name System (DNS) translates a website’s name into its address on the network, the IP address; BGP is like the roadmap that helps figure out the most efficient way to get to that address.
Every AS in the network has an assigned Autonomous System Number (ASN). An AS can originate prefixes, that is, control a cluster of IP addresses; it can also transit prefixes, that is, know how to reach specific clusters of IP addresses. Every single ASN has to announce its prefix routes to the internet, and BGP is what makes that possible.
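The idea of announcing and withdrawing prefix routes can be sketched in a few lines of Python. This is a toy model, not real BGP: the ASNs and prefixes below are invented for illustration, and real routers exchange far richer route attributes. It shows the two behaviours that matter for this story: the most specific (longest) matching prefix wins, and a withdrawn prefix simply stops matching.

```python
import ipaddress

class RoutingTable:
    """A toy BGP-style table: prefix -> origin ASN (illustrative only)."""

    def __init__(self):
        self.routes = {}

    def announce(self, prefix, asn):
        # An AS announces that it can reach this prefix.
        self.routes[ipaddress.ip_network(prefix)] = asn

    def withdraw(self, prefix):
        # The AS withdraws the route; the prefix becomes unreachable.
        self.routes.pop(ipaddress.ip_network(prefix), None)

    def lookup(self, address):
        # Longest-prefix match: the most specific route wins.
        addr = ipaddress.ip_address(address)
        matches = [p for p in self.routes if addr in p]
        if not matches:
            return None  # no route: traffic to this address is dropped
        return self.routes[max(matches, key=lambda p: p.prefixlen)]

table = RoutingTable()
table.announce("203.0.113.0/24", 64500)  # fictional AS announces a /24
table.announce("203.0.113.0/26", 64501)  # a more specific /26 from another AS

print(table.lookup("203.0.113.10"))  # 64501: the /26 is more specific
table.withdraw("203.0.113.0/26")
print(table.lookup("203.0.113.10"))  # 64500: falls back to the /24
```

Once no announced prefix covers an address at all, `lookup` returns `None` and traffic to that address has nowhere to go, which is exactly the situation Facebook’s withdrawal created.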
What happened with Facebook?
A BGP update shares information on any changes made to a prefix advertisement, or announces that the prefix has been withdrawn entirely. Facebook stopped announcing the routes to its DNS prefixes altogether. The routes were withdrawn and Facebook’s DNS servers went offline. As a result, Facebook and its subsidiaries Instagram and WhatsApp went down globally.
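Why did withdrawing routes take the DNS servers offline? Before a client can even ask a DNS server for an address, its packets need a route to that server. The sketch below, with invented prefixes and addresses, shows the chain of failure: clear the route and name resolution itself times out, so every service behind those names becomes unreachable.

```python
import ipaddress

# Route to the (fictional) prefix hosting the DNS servers.
routes = {ipaddress.ip_network("192.0.2.0/24")}
dns_server = ipaddress.ip_address("192.0.2.53")

def resolve(hostname):
    # Packets to the DNS server need a route before a query can be sent.
    if not any(dns_server in prefix for prefix in routes):
        raise TimeoutError(f"no route to DNS server; cannot resolve {hostname}")
    return "203.0.113.7"  # pretend DNS answer

print(resolve("example.site"))  # works while the route is announced
routes.clear()                  # the BGP withdrawal
try:
    resolve("example.site")
except TimeoutError as e:
    print(e)                    # resolution now fails for everyone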
What did Facebook say?
After several hours, Facebook gave more details on the outage. It said this was its worst outage in four years, offered its apologies, and shared the technical details and the lessons learned with its users. The outage was caused by the unfortunate mishandling of an error condition: an automated system for verifying configuration values failed, causing the damage.
This verification automation was supposed to check for invalid cached configuration values and replace them with updated values from the persistent store. That works fine when the problem is a transient issue with the cache; however, the system breaks down when the value in the persistent store is itself invalid.
Facebook said that it had changed the persistent copy of a configuration value to one the system identified as invalid. Hence, every single client saw the invalid value and tried to fix it, and the fix involved making a query to a cluster of databases. That cluster quickly became overburdened with hundreds of thousands of queries every second, leading to the outage.
It said, “To make matters worse, every time a client got an error attempting to query one of the databases it interpreted it as an invalid value and deleted the corresponding cache key. This meant that even after the original problem had been fixed, the stream of queries continued. As long as the databases failed to service some of the requests, they were causing even more requests to themselves. We had entered a feedback loop that didn’t allow the databases to recover.
The way to stop the feedback cycle was quite painful – we had to stop all traffic to this database cluster, which meant turning off the site. Once the databases had recovered and the root cause had been fixed, we slowly allowed more people back onto the site.
This got the site back up and running today, and for now we’ve turned off the system that attempts to correct configuration values. We’re exploring new designs for this configuration system following design patterns of other systems at Facebook that deal more gracefully with feedback loops and transient spikes.”
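The feedback loop Facebook describes can be simulated in a few lines. This is a minimal sketch, not Facebook’s actual code, with made-up numbers: clients that find no valid cached value query the database; when the overloaded database errors out, clients misread the error as another invalid value, delete the cache key, and query again, so the load never drops. Only when demand falls below the database’s capacity can one successful refill repair the cache.

```python
CAPACITY = 100            # queries per tick the database can serve (invented)
cache = {}                # shared cache; missing key means clients must query
db_queries_per_tick = []

def tick(clients):
    # Every client with no valid cached value sends a database query.
    queries = clients if "config" not in cache else 0
    served = min(queries, CAPACITY)
    errors = queries - served
    if errors:
        cache.pop("config", None)    # clients delete the key on error: loop continues
    elif queries:
        cache["config"] = "valid"    # a successful refill repairs the cache
    db_queries_per_tick.append(queries)

for _ in range(5):
    tick(clients=100_000)            # every client sees the bad value
print(db_queries_per_tick)           # stays pegged at 100000 every tick: no recovery

cache.clear(); db_queries_per_tick.clear()
for _ in range(5):
    tick(clients=50)                 # load below capacity, as after traffic was stopped
print(db_queries_per_tick)           # drops to zero after the first successful tick
```

This mirrors the fix Facebook describes: the loop cannot break itself, so the only way out was to cut traffic to the cluster until demand fell below what the databases could serve.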