On December 12, 2022, at approximately 00:05 UTC, monitoring systems alerted on reachability issues in the Dallas data center. At 00:10 UTC an incident was raised, the incident response team was deployed, and investigation began. A status page was published at 00:27 UTC once the scope of the issue was fully understood. By 00:55 UTC it was identified that a large amount of broadcast traffic was impacting hypervisors and causing intermittent latency and reachability issues for a subset of Linodes, approximately 80-90% of the Dallas data center. Continued investigation revealed that most of the broadcast traffic came from unanswered ARP requests, resulting in a cascade and amplification of those broadcasts. Having identified the broadcast storms causing the network instability, our teams still needed to identify the underlying cause and a way to mitigate it.
After extensive investigation, at approximately 04:20 UTC, a filter was deployed to the fleet to stem these broadcasts, allowing hosts to regain CPU cycles and answer ARP requests consistently. The network then recovered and stabilized, and alerts began to resolve en masse by 04:32 UTC. After additional monitoring of this fix, the incident was marked resolved at 06:45 UTC.
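The feedback loop described above, in which unanswered ARP requests are retried and add to the broadcast volume, can be illustrated with a toy model. This is only a minimal sketch under stated assumptions, not a representation of the actual network or of the filter that was deployed: the function name `simulate_storm`, every parameter, and all numeric values are hypothetical and chosen purely for illustration.

```python
def simulate_storm(steps, new_per_step=100.0, capacity=150.0,
                   retries_per_miss=2.0, initial_burst=200.0,
                   filter_from=None, filter_pass=0.1):
    """Toy model: return the broadcast load a host sees at each step.

    All names and values are hypothetical assumptions for illustration.
    """
    history = []
    retries = initial_burst  # assumed trigger that starts the storm
    for step in range(steps):
        filtered = filter_from is not None and step >= filter_from
        # The hypothetical filter lets only a fraction of retry broadcasts through.
        passed_retries = retries * filter_pass if filtered else retries
        load = new_per_step + passed_retries
        history.append(load)
        # A host can answer at most `capacity` requests per step.
        unanswered = max(0.0, load - capacity)
        # Each unanswered request is re-broadcast, amplifying the next step.
        retries = unanswered * retries_per_miss
    return history


unfiltered = simulate_storm(steps=8)
with_filter = simulate_storm(steps=8, filter_from=3)
print("unfiltered: ", unfiltered)
print("with filter:", with_filter)
```

In this sketch, the unfiltered run grows at every step because each missed answer produces more broadcasts than the host can absorb, while the run with the filter drops below the host's capacity and settles back to the baseline rate of new requests.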
The immediate fix was initially a temporary measure. After further testing, it was implemented permanently. This was finalized on December 13, 2022, at 18:35 UTC.
Further investigation is ongoing to determine the root cause of this issue, with full awareness of the other recent incidents in our Dallas data center. As upgrades are applied to our active network, we are putting processes in place to ensure the continued stability of our infrastructure. We anticipate that these efforts will prevent incidents such as this one from occurring in the future.