The outage that took down Facebook, Instagram and WhatsApp for about six hours, affecting billions of people around the world, may have been due to a tiny human error.
Key points:
- The Facebook outage has been traced to the "navigation system" of the internet, known as BGP
- A minor update to BGP navigation directions appears to have been misconfigured
- As a result, the company's network fell off "the map of the internet"
The tech giant with billions of users and a mission statement of "bring the world closer together" suffered a mysterious technical breakdown about 2:45am AEDT.
The effect rippled around the globe as, one after another, its services went dark.
Messages would not send on Messenger; money would not flow on WhatsApp money transfer; pages that used Facebook for logins locked users out.
At Facebook itself, employees reportedly could not use their keycards to enter buildings or access standard office software for work and collaboration.
Facebook boss Mark Zuckerberg lost $9.6 billion as the company's share price plunged.
As chaos reigned, Facebook executives took to the rival platform Twitter to explain what was happening, and apologise to their own users.
By about 8:45am AEDT, Facebook, WhatsApp and Instagram users began to regain partial access.
So, what happened? And how could a company this big suffer such a glitch?
How Facebook fell off the 'map of the internet'
"We can see what happened, but we don't know exactly what was the cause," said Marek Kowalkiewicz, director of the Centre for the Digital Economy at QUT.
Moments before the outage began, Facebook stopped announcing the routes to its DNS addresses.
DNS stands for Domain Name System, and it functions like an address book for the internet.
"It's the system that translates what we type into a browser into an IP address," Professor Kowalkiewicz said.
Say you type "http://facebook.com" into a browser; DNS connects that name with the numerical address of one of the Facebook servers.
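The address-book idea can be sketched in a few lines of Python. This is a toy illustration only: the `ADDRESS_BOOK` table and the `resolve` function are made up for this example, and the addresses come from a range reserved for documentation, not from Facebook's real servers.

```python
# Toy "address book": hostnames mapped to IP addresses.
# Assumption: these addresses are from the 192.0.2.0/24 documentation
# range and are NOT Facebook's real addresses.
ADDRESS_BOOK = {
    "facebook.com": "192.0.2.10",
    "instagram.com": "192.0.2.20",
}


def resolve(hostname):
    """Return the IP address listed for a hostname, or None if it is unknown."""
    # Hostnames are case-insensitive, so normalise before looking up.
    return ADDRESS_BOOK.get(hostname.lower())


print(resolve("facebook.com"))  # 192.0.2.10
print(resolve("example.org"))   # None
```

A real lookup goes out over the network instead of reading a local table; in Python, `socket.gethostbyname("facebook.com")` asks the system's DNS resolver for the answer.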
But what good is an address without any idea of how to get there?
Border Gateway Protocol, or BGP, is that navigation system.
BGP uses tables that list the routes to particular network destinations; when you send a request to a local server to access Facebook, these tables show the server where to send your request so it eventually reaches Facebook.
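As a rough sketch of how such a table is consulted, the Python below picks the most specific route that covers a destination address. The prefixes and the neighbour names (`peer-a`, `peer-b`, `upstream-provider`) are hypothetical, and real BGP routers weigh many more attributes when choosing between routes; this only shows the basic lookup.

```python
import ipaddress

# Hypothetical routing table: destination prefix -> where to forward next.
# The prefixes come from ranges reserved for documentation.
ROUTES = {
    ipaddress.ip_network("0.0.0.0/0"): "upstream-provider",  # default route
    ipaddress.ip_network("192.0.2.0/24"): "peer-a",
    ipaddress.ip_network("192.0.2.128/25"): "peer-b",
}


def next_hop(destination, routes=ROUTES):
    """Return the next hop for the most specific prefix covering destination."""
    addr = ipaddress.ip_address(destination)
    matches = [net for net in routes if addr in net]
    if not matches:
        return None  # no route: the request cannot be delivered
    # The longest prefix is the most specific route, so it wins.
    best = max(matches, key=lambda net: net.prefixlen)
    return routes[best]


print(next_hop("192.0.2.200"))   # peer-b
print(next_hop("192.0.2.5"))     # peer-a
print(next_hop("198.51.100.7"))  # upstream-provider
```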
Who publishes the tables that show where to find Facebook? Facebook does.
Shortly before the outage, Facebook updated its BGP routing tables, according to the internet infrastructure company Cloudflare.
The update withdrew the routes to Facebook's IP addresses, meaning servers had no idea where to send requests to access Facebook.
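What a withdrawal does to the rest of the internet can be modelled with a small Python sketch. It is a simplification under stated assumptions: the `RoutingTable` class, the `192.0.2.0/24` prefix and the `192.0.2.53` server address are invented for illustration, and real BGP involves many routers exchanging updates rather than one shared table.

```python
import ipaddress


class RoutingTable:
    """Toy model of the routes one router has learned from its neighbours."""

    def __init__(self):
        self._routes = {}

    def announce(self, prefix, origin):
        # The origin network tells the world: "this prefix can be reached via me".
        self._routes[ipaddress.ip_network(prefix)] = origin

    def withdraw(self, prefix):
        # The origin takes the announcement back; the route disappears.
        self._routes.pop(ipaddress.ip_network(prefix), None)

    def lookup(self, address):
        addr = ipaddress.ip_address(address)
        matches = [net for net in self._routes if addr in net]
        if not matches:
            return None
        return self._routes[max(matches, key=lambda net: net.prefixlen)]


table = RoutingTable()
table.announce("192.0.2.0/24", "facebook")  # hypothetical prefix
DNS_SERVER = "192.0.2.53"                   # hypothetical DNS server address

before = table.lookup(DNS_SERVER)  # a route exists
table.withdraw("192.0.2.0/24")
after = table.lookup(DNS_SERVER)   # no route is left

print(before, after)  # facebook None
```

Once the lookup returns nothing, a request has nowhere to go, even though the servers at the other end may be running normally.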
"Routing tables are a bit like a map of the internet: they explain how to get from this intersection to that intersection," Professor Kowalkiewicz said.
Facebook and its sites had effectively disconnected themselves from the internet, said Daniel Angus, a professor of digital communications at QUT.
"The routing tables are like the front doors to Facebook's various services," he said.
"Somewhere along the way, someone made an update and when it was deployed, those doors have disappeared."
Right. So who stuffed up?
This is the part that's not clear yet, though rumours abound.
Professor Angus speculated that the ultimate cause may have been human error: a very senior engineer at Facebook may have made a basic transposition error, which was then propagated through the internet by self-replicating protocols.
"A small error propagates out to the entire network," he said.
Professor Kowalkiewicz agreed.
"Right now we're in a world of speculation, but large organisations can make very basic mistakes.
"My suspicions were there was some change of configuration at Facebook, something went wrong and it propagated like a wave coming through the internet."
If it wasn't human error, it could have been a malicious act, perhaps by a disgruntled employee, Professor Kowalkiewicz said.
Facebook is under mounting public and political pressure to act on misinformation.
The outage came shortly after a US current affairs program aired an interview with a whistleblower who claimed the company was aware of how its platforms were used to spread hate, violence and misinformation, and that it had tried to hide that evidence.
Have outages like this happened before?
The outage is the worst since a bug knocked Facebook's services offline for about a day in 2008, affecting about 80 million users.
The company now boasts 3 billion users.
Other big tech companies have seen major outages recently.
In late 2020, both Amazon and Google had separate outages that were a major inconvenience for millions of users around the world.
They even created havoc in internet-connected homes, where turning on the lights or regulating the thermostat was controlled by a Google or Amazon app that no longer worked.
In June 2021, websites around the world went dark due to a software bug at a company that manages a crucial piece of internet infrastructure.
These incidents remind us that the internet is both "incredibly robust and incredibly fragile," Professor Angus said.
Its decentralised structure of distributed servers means it's hard to take down, but the centralisation of services, through tech giants such as Facebook, Amazon and Google, means that the impact of any outage is greater than ever before.
In its blog post, Cloudflare wrote: "Today's events are a gentle reminder that the internet is a very complex and interdependent system of millions of systems and protocols working together."
A Facebook spokesperson said: "To everyone who was affected by the outages on our platforms today: we're sorry.
"We know billions of people and businesses around the world depend on our products and services to stay connected.
"We appreciate your patience as we come back online."