A pair of switches configured for redundancy connects two pods in the Newark Data Center, referred here as SwitchA and SwitchB.
This switch pair utilizes Multi chassis LAG, often referred to as CLAG or MLAG. At approximately 23:25 EST on 08/25/2021 links on a connected core switch(CoreSwitchA) flapped multiple times in succession.
This either caused or resulted from the MLAG process on SwitchA failing and restarting into a broken state. This failure caused brief loops to be formed in the Newark network. The NetOps team isolated SwitchA and worked to resolve the issues caused by these loops. On 8/27 the NetOps team performed an emergency maintenance to fully isolate and repair this switch.
It’s worth noting that because of the multiple failures of this switch pair, the NetOps team has moved off the primary traffic and functions of these switches, and they will be decommissioned in the near future. In addition the connectivity between these two PODs has been re-engineered to utilize a different network architecture, which should improve overall network stability.