Between 13:22 UTC and 21:10 UTC on May 31st, 2023, customers within the Frankfurt data center (EU-Central Region) experienced network instability due to a hardware failure on one of our redundant border routers. The nature of the hardware failure triggered a bug on the border router that allowed traffic to continue forwarding to an interface that was physically down, blackholing any flows transiting that interface.
Initial alerting indicated a loss of edge capacity. The network team began troubleshooting and identified a failed line card in a border router. Attempts to bring the line card back online by restarting it, and then resetting the line, were unsuccessful. Having determined that the initial network instability was the result of the failed hardware, the team followed standard procedures to take the line card offline and prepare it for RMA, and applied preventive measures to the affected interfaces.
However, while most reports indicated recovery, some reports of packet loss remained, and the team resumed troubleshooting to locate the source. A discrepancy was discovered between the FIB (forwarding information base) and the RIB (routing information base) on the impacted border router, with next hops still shown for an interface that was physically down and belonged to the failed card. That interface is one of many links between the border routers and the core or spine routers. Because traffic to the core is load balanced across these links, only a portion of flows crossed the failed interface; the overall impact was therefore minimal and difficult to track down, which delayed full mitigation.
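The kind of consistency check that surfaces this discrepancy can be sketched as follows. This is a minimal illustration, not the tooling used during the incident: the `NextHop` type, both function names, the link names, and the sample prefix are all hypothetical. It also assumes that the stale next hops were in the FIB and that traffic is hashed evenly across equal-cost links. The check flags any FIB entry that has no matching RIB entry or that points at a down interface, then estimates what share of paths for a prefix would blackhole.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NextHop:
    """One next hop for a prefix (hypothetical, simplified model)."""
    prefix: str
    interface: str


def find_stale_fib_entries(rib, fib, interface_up):
    """Return FIB entries missing from the RIB or pointing at a down interface."""
    rib_set = set(rib)
    return [
        entry
        for entry in fib
        if entry not in rib_set or not interface_up.get(entry.interface, False)
    ]


def blackholed_share(fib, stale, prefix):
    """Fraction of a prefix's FIB paths that are stale, assuming even hashing."""
    paths = [e for e in fib if e.prefix == prefix]
    if not paths:
        return 0.0
    bad = [e for e in stale if e.prefix == prefix]
    return len(bad) / len(paths)


# Hypothetical example: four equal-cost links, one on the failed line card.
PREFIX = "203.0.113.0/24"  # documentation prefix, not a real route
interface_up = {"link-1": True, "link-2": True, "link-3": True, "link-4": False}
rib = [NextHop(PREFIX, name) for name in ("link-1", "link-2", "link-3")]
fib = rib + [NextHop(PREFIX, "link-4")]  # stale entry left behind in the FIB

stale = find_stale_fib_entries(rib, fib, interface_up)
print(stale)
print(blackholed_share(fib, stale, PREFIX))  # 0.25 with one of four paths stale
```

With many parallel links the stale share per prefix is small, which is consistent with loss that affects only some flows and is hard to localize.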
At 21:09 UTC, the network team drained all traffic between the affected border router and the core spine router. This action succeeded in removing the stale destinations, and further testing confirmed that the loss was no longer present. The team then re-added traffic to all links that were not impacted by the initial hardware failure, restoring all redundancy and connectivity.
The root cause of the bug itself is still under investigation, and we are currently working with the switch vendor on a permanent resolution. In the interim, we have adjusted our processes so that future incidents of this nature are mitigated in a more fault-resilient manner.