“The storm saved us” .. Facebook reveals the details of the precise operation in “A .”


09:28 PM

Tuesday 05 October 2021

Report – Muhammad Safwat:

Yesterday, Monday, the world experienced a real-life rehearsal for an Internet shutdown lasting seven continuous hours, an unprecedented event. A mistake by engineers at the social networking giant “Facebook” brought down its core network systems, and they could not easily reach the company’s servers to correct the error.

In this report, we explore together the details of what happened and the real reasons behind the great disruption that hit the world:

Hackers were not behind the Facebook outage

As usual, rumors played their part in spreading panic and promoting conspiracy theories. While some spoke of a Chinese child bringing down Facebook’s systems, an internal company statement revealed what actually happened: “We found that changes to the settings of the basic routers that coordinate network traffic between data centers caused issues that interrupted this communication.” In an explicit denial that the failure was due to malicious activity, the company added that it had no evidence that user data had been compromised as a result of the failure of its platforms.

According to Facebook, the outage resulted from a change to the system that manages the capacity of its global backbone, the network that connects all of its computing facilities. The backbone is made up of tens of thousands of miles of fiber-optic cable that cross the globe and link all of its data centers.

What are Facebook data centers?

The social media giant explained that its data centers come in different forms. Some are huge buildings that house millions of devices, which store data and run the heavy computing loads that keep Facebook’s platforms running. Others are smaller facilities that connect its core network to the wider Internet and to the people who use the company’s various platforms.

How Facebook data centers work

When you open a Facebook app and load your feed or messages, the app’s request for data travels from your device to the nearest facility, which then communicates directly across Facebook’s core network with a larger data center. That is where the information your app needs is retrieved and processed, then sent back over the network to your phone so that the app works as usual.
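The path described above can be pictured with a small Python sketch. It is only an illustration: the class names, the user name and the feed text are invented for this example and do not come from Facebook’s statement.

```python
from dataclasses import dataclass


@dataclass
class Backbone:
    """Hypothetical stand-in for the core network linking facilities."""
    up: bool = True


@dataclass
class CoreDataCenter:
    """Large facility that stores data and does the heavy processing."""
    feeds: dict

    def handle(self, user: str) -> str:
        return self.feeds.get(user, "(empty feed)")


@dataclass
class EdgeFacility:
    """Smaller facility nearest the user; it relays requests inward."""
    backbone: Backbone
    core: CoreDataCenter

    def request_feed(self, user: str) -> str:
        # The edge cannot answer on its own: it needs the backbone
        # to reach the data center that holds the user's data.
        if not self.backbone.up:
            raise ConnectionError("backbone unreachable")
        return self.core.handle(user)


backbone = Backbone()
edge = EdgeFacility(backbone, CoreDataCenter({"amira": "3 new posts"}))
print(edge.request_feed("amira"))  # phone -> edge -> core -> phone
```

In this toy model, setting `backbone.up = False` makes every request fail at the edge facility, even though the data center itself is untouched.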

Maintenance process causes disconnection

According to Facebook, the routers that manage data traffic between its various facilities, and that determine where all incoming and outgoing data is sent, need regular maintenance. The company explained that engineers carrying out routine maintenance on its massive infrastructure inadvertently cut off all connections in its core network.

The statement pointed out that audit systems are designed to catch such errors, but that a fault in one of those audit systems failed to stop the disconnection, leading to the global outage that occurred yesterday: “The outage made things worse.”

The statement continued that the loss of the backbone caused a second problem in the “DNS”, the system that tells the browser the “IP” address of the site it is looking for. The company’s “DNS” servers withdrew their route announcements over the BGP protocol, which made them unreachable from the rest of the Internet even though they were still working.
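A toy Python model can show why a working DNS server may still be unreachable. Everything here is an assumption made for illustration: the class names, the `health_check` rule, and the addresses, which are taken from ranges reserved for documentation. It is not Facebook’s actual configuration.

```python
class RoutingTable:
    """Toy model of the routes the wider Internet has learned via BGP."""

    def __init__(self):
        self.routes = set()

    def announce(self, prefix: str) -> None:
        self.routes.add(prefix)

    def withdraw(self, prefix: str) -> None:
        self.routes.discard(prefix)

    def reachable(self, prefix: str) -> bool:
        return prefix in self.routes


class DnsServer:
    """Hypothetical DNS server that announces its own address range."""

    def __init__(self, prefix: str, records: dict, internet: RoutingTable):
        self.prefix = prefix
        self.records = records
        self.internet = internet
        internet.announce(prefix)

    def health_check(self, backbone_up: bool) -> None:
        # Assumed rule: stop announcing when the data centers are out of reach.
        if backbone_up:
            self.internet.announce(self.prefix)
        else:
            self.internet.withdraw(self.prefix)

    def resolve(self, name: str) -> str:
        # The server can always answer; the question is whether anyone can ask.
        return self.records[name]


def lookup(internet: RoutingTable, server: DnsServer, name: str):
    """Return the IP address, or None if no route leads to the server."""
    if not internet.reachable(server.prefix):
        return None
    return server.resolve(name)


internet = RoutingTable()
server = DnsServer("192.0.2.0/24", {"example.com": "203.0.113.10"}, internet)
before = lookup(internet, server, "example.com")
server.health_check(backbone_up=False)
after = lookup(internet, server, "example.com")
print(before, after)
```

After the simulated backbone failure, `lookup` returns `None` while `server.resolve` still answers, which mirrors the situation of servers that were inaccessible despite still working.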

Facebook’s big crash fix hurdles

Facebook revealed that two major obstacles delayed the fix. The first was that it was not possible to reach the company’s data centers “through our normal means” because its networks were down. The second was that the loss of DNS broke many of the internal tools normally used to investigate and resolve outages.

The solution: sending engineers to the data centers

Facebook sent its engineers to the data centers to solve the problem, which took time because of the facilities’ high-security design: the routers are difficult to modify even with physical access to them. It therefore took extra time to activate the secure access protocols needed to get people on site and able to work on the servers. Only then could they confirm the problem and bring the backbone back online.

Once the company’s primary network connectivity was restored across its data center regions, everything came back with it. The statement added that the problem did not end there: the company’s engineers had to bring the services back, and restarting them all at once threatened to cause new problems and malfunctions.

“Storm” drills: how the company was saved

The statement pointed out that the company was prepared for such cases thanks to the “storm” exercises it has been conducting for a long time, which helped return things to normal.

Storm drills simulate a major system failure by taking a service, data center, or entire region offline, stress testing all infrastructure and software involved. The company considered that each such problem is an opportunity to learn, improve its capabilities and conduct a comprehensive review to better understand its systems.
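The idea of such a drill can be sketched in a few lines of Python. This is a simplified illustration, not the company’s method: the region names, the capacity figures and the demand figure are all made up, and a real drill tests far more than raw capacity.

```python
def run_drill(capacity_by_region: dict, demand: int, offline: str) -> bool:
    """Take one region offline and check whether the rest can carry the load."""
    remaining = sum(
        capacity
        for region, capacity in capacity_by_region.items()
        if region != offline
    )
    return remaining >= demand


# Hypothetical capacities, in arbitrary units of traffic.
regions = {"region-a": 60, "region-b": 50, "region-c": 40}
demand = 100

# Try each region in turn as the one that goes dark.
results = {region: run_drill(regions, demand, region) for region in regions}
for region, survived in results.items():
    print(region, "offline:", "ok" if survived else "overloaded")
```

With these invented numbers, losing the largest region leaves too little capacity, which is the kind of weakness a drill is meant to expose before a real failure does.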

In addition to affecting the people, companies and others who rely on the company’s tools, the outage dealt a financial blow to the group’s CEO, Mark Zuckerberg. The “Fortune” tracking site for billionaires reported on Monday evening that Zuckerberg’s personal fortune had fallen by about six billion dollars from the previous day, to just under 117 billion.
