On Tuesday evening at 8:40pm EST (local time), a portion of the hosts in our Atlanta datacenter experienced a total network outage for approximately two hours due to the failure of a redundant pair of distribution-layer switches.
Outage Start: 8:40pm EST
Outage End: 10:30pm EST
Total duration: 1h 50m
At 2pm on February 21st, 2017, our network monitoring alerted us to a partial control plane failure on a single distribution-layer switch. Each of these switches runs in an active/active configuration with a peer switch. We determined at the time that this partial control plane failure was not causing any data plane forwarding issues, meaning that we had time to work with the switch vendor and schedule an appropriate maintenance window before taking action.
Later that day at around 7pm, our systems administrators were made aware of several incidents of unreachable Linodes within the Atlanta datacenter. These Linodes were all determined to be under the affected switch pair identified earlier, and the symptoms pointed toward intermittent, widespread switching failure.
Several members of the Network Operations team and Systems team conferred for approximately one hour. As we discovered the scope of the issue, a status page was posted at 8:15pm indicating that there was a major network hardware failure in progress.
After attempting several unsuccessful fixes, it was agreed that the simplest and most straightforward solution to the failure would be to take the degraded switch offline. Under nominal operating conditions, the switch's peer is designed to seamlessly handle this kind of failure without traffic interruption. Another status update was posted at 8:30pm noting that this action was imminent, and power was cut to the degraded switch at 8:40pm.
Unfortunately, due to circumstances that we still do not fully understand, the peer switch did not handle the failure gracefully, causing an extended network outage for hosts that were homed under this switch pair.
Some time was spent unsuccessfully attempting to coerce the remaining switch into a working state, but we were eventually forced to take both of the switches offline entirely, then bring them back online and rejoin them into an active/active pair. As of 10:30pm, the switches had both been brought online and were confirmed to be operating normally.
I would like to sincerely apologize to the customers who were affected by this extended network outage. We will be asking our switching vendor for a technical investigation into the cascading failures we saw during this incident, and we will apply any necessary remedial fixes in the near future.