On November 22, 2022, at approximately 4:45 UTC, monitoring systems alerted us to reachability issues in the Dallas data center. The incident affected the private network in Dallas, and would also have affected public services that depend on private network communication, such as NodeBalancers. At 5:28 UTC an incident was raised, the incident response team was deployed, and investigation into the root cause began. After initial investigation, a status page was created at 5:43 UTC. Through troubleshooting, several members of the Network Operations team determined that a network loop was occurring in a portion of the Dallas network, resulting in increased resource usage on several of the network devices in that segment. The team further determined that the loop was the result of a software bug and took steps to isolate the faulty software. Due to the nature of the bug, several systems needed to be restarted one after another to properly isolate the network. Our Network Operations team began this process at 11:00 UTC and continued to roll out the fix over the next 90 minutes. Once all of the systems had been restarted, connectivity was restored at approximately 12:30 UTC.
We are still investigating the recent networking issues in our Dallas data center. As we release updates to our global infrastructure, we are continuing to implement processes that strengthen the reliability and stability of our network. These upgrades have been placed on an expedited timeline, which will help prevent incidents such as these from recurring.