Facebook Reveals the Timeline of Its Mass Service Outage

 




Facebook revealed the chronology behind the outage that took down its WhatsApp, Instagram, and Messenger services on Monday (October 4). The disruption apparently stemmed from routine maintenance.


Facebook's Vice President of Engineering and Infrastructure, Santosh Janardhan, had previously explained the cause of the Facebook, Instagram, and WhatsApp outage. In his latest blog post, he provides a more detailed explanation.



According to Janardhan, during routine maintenance a command was issued to check the availability of the backbone network that connects all of Facebook's computing facilities. Instead, the command severed those connections, and a bug in Facebook's internal audit system meant the command was not blocked before it ran.
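The kind of safeguard described above can be illustrated with a minimal sketch. It is not Facebook's audit tool: the function name `audit_allows`, the link names, and the "keep at least one link up" rule are all assumptions made for illustration.

```python
def audit_allows(links_up: set, links_to_disable: set, min_links_remaining: int = 1) -> bool:
    """Hypothetical pre-execution check for a maintenance command.

    Returns False when the command would leave fewer than
    `min_links_remaining` backbone links up.
    """
    remaining = links_up - links_to_disable
    return len(remaining) >= min_links_remaining


# Hypothetical backbone with three links.
backbone = {"link-a", "link-b", "link-c"}

# Taking one link down for maintenance passes the audit.
print(audit_allows(backbone, {"link-a"}))

# A command that would disable every link is rejected.
print(audit_allows(backbone, set(backbone)))
```

In this sketch, a bug like the one described would amount to the check returning True for the second command, so the command runs anyway.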



This problem was serious enough on its own, but then it got worse. When Facebook's DNS servers cannot connect to the company's main data centers, they stop advertising the BGP routing information that helps computers across the internet reach them.
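The behavior described above, where a DNS server withdraws its own route when it loses contact with the data centers, can be sketched roughly as follows. This is a simplified model, not Facebook's implementation: the class `DnsNode`, the function `resolvable`, and the prefixes (taken from the reserved documentation ranges) are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class DnsNode:
    """Hypothetical DNS server that announces its prefix via BGP only while healthy."""
    prefix: str
    announced: bool = True
    log: list = field(default_factory=list)

    def health_check(self, datacenter_reachable: bool) -> None:
        # If the data centers cannot be reached, the node treats its own
        # connection as unhealthy and withdraws its route.
        if not datacenter_reachable and self.announced:
            self.announced = False
            self.log.append(f"withdraw {self.prefix}")
        elif datacenter_reachable and not self.announced:
            self.announced = True
            self.log.append(f"announce {self.prefix}")


def resolvable(nodes) -> bool:
    # The rest of the internet can reach the DNS service only if at least
    # one node is still announcing a route to itself.
    return any(node.announced for node in nodes)


nodes = [DnsNode("192.0.2.0/24"), DnsNode("198.51.100.0/24")]

# The backbone goes down: every node fails its health check at once.
for node in nodes:
    node.health_check(datacenter_reachable=False)

print(resolvable(nodes))  # False: the servers are running but unreachable
```

A health check like this makes sense when a single site fails, because traffic shifts to the healthy sites. When every node loses the backbone at the same moment, all routes are withdrawn together.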


"The end result is that our DNS servers cannot be contacted even though they are still operational. This makes it difficult for the entire internet to find our servers," Janardhan said, as quoted by Engadget on Wednesday (October 6, 2021).


This issue also made it difficult for Facebook's engineering team to fix the problem. With no network connection and no DNS, the internal systems Facebook normally uses for repairs and communication were down as well.


Facebook ended up sending engineers directly to one of its data centers. That was not easy either, because of the security systems and protocols in place in the server area.


Once the technicians got into the server area, they brought the backbone back online and restored services gradually. This is why the recovery took so long: restoring everything at once could have caused an even more severe crash.
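The idea of a gradual recovery can be sketched as a simple ramp that never raises load by more than a fixed step at a time. The function name `staged_restore` and the numbers are assumptions for illustration, not details from Facebook's account.

```python
def staged_restore(total_load: float, max_step: float):
    """Yield cumulative load levels, raising load by at most max_step per stage."""
    if max_step <= 0:
        raise ValueError("max_step must be positive")
    total = float(total_load)
    restored = 0.0
    while restored < total:
        # Bring back one more slice of traffic, capped at the full load.
        restored = min(total, restored + max_step)
        yield restored


# Restore 100 units of traffic in stages of at most 40 units each.
for level in staged_restore(100, 40):
    print(f"load restored to {level}")
```

In practice, each stage would be followed by a check that systems are stable before moving to the next one. The sketch leaves that check out.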


"Every failure like this is an opportunity to learn and get better, and there's a lot we can learn from this event," Janardhan said.


"After each issue, small and large, we undertook an extensive review process to understand how we could build a more robust system. The process has already begun," he concluded.
