At 16:00 UTC on 1/21/2021, the Linode Network Operations team began a routine scheduled network maintenance to replace a failing core switch (“Switch-B”). This switch is part of an active/active redundant pair, so the replacement should not have affected customers. The shutdown, removal, and installation of Switch-B proceeded as planned. While Switch-B was being brought online and returned to active service, something triggered the primary switch (“Switch-A”) to enter a failed state. This caused an outage, followed by latency and packet loss, for customers downstream from this switch pair.
Working quickly, the NetOps team began to triage the problem but was unable to determine the immediate cause of the failure. The team opted to roll back and shut down Switch-B, which brought Switch-A back online and restored service to affected customers. However, this left the network in a degraded state with only one switch in service. An emergency maintenance was scheduled for 1/22/2021 at 03:00 UTC to bring Switch-B back online. During this maintenance we experienced the same type of outage.
The NetOps team continued working to bring Switch-B into full service, thus restoring connectivity and redundancy. A full investigation is underway; however, initial findings suggest that a mechanism designed to protect the pair's redundancy was triggered and caused Switch-A to enter a failed state.