Linode Status

Emergency Network Maintenance - Atlanta

Incident Report for Linode

Postmortem

Atlanta partial network outage on February 21st, 2017

Summary

On Tuesday evening at 8:40pm EST (local time), a subset of hosts in Atlanta lost all network connectivity for approximately two hours due to the failure of a redundant pair of distribution-layer switches.

Date: 2017-02-21
Outage Start: 8:40pm EST
Outage End: 10:30pm EST
Total duration: 1h 50m
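
The figures above can be cross-checked with a short script. This is only an illustrative sketch, not part of our tooling: the helper names `format_duration` and `utc_to_est` are hypothetical, and it assumes a fixed UTC-5 offset for EST, which holds in February. It recomputes the total duration from the start and end times and converts a status-page timestamp from UTC to local time.

```python
from datetime import datetime, timedelta, timezone

# Assumption: EST is modeled as a fixed UTC-5 offset (no daylight saving in February).
EST = timezone(timedelta(hours=-5), "EST")

# Outage start and end as listed in the summary above.
outage_start = datetime(2017, 2, 21, 20, 40, tzinfo=EST)
outage_end = datetime(2017, 2, 21, 22, 30, tzinfo=EST)


def format_duration(delta):
    """Render a timedelta as '<hours>h <minutes>m' (hypothetical helper)."""
    total_minutes = int(delta.total_seconds()) // 60
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}h {minutes}m"


def utc_to_est(stamp):
    """Convert a status-page stamp like 'Feb 22, 2017 - 01:16 UTC' to EST."""
    parsed = datetime.strptime(stamp, "%b %d, %Y - %H:%M UTC")
    return parsed.replace(tzinfo=timezone.utc).astimezone(EST)


print(format_duration(outage_end - outage_start))
print(utc_to_est("Feb 22, 2017 - 01:16 UTC").strftime("%Y-%m-%d %I:%M%p %Z"))
```

The same conversion applies to every "Posted" timestamp below, which are given in UTC while the timeline is in EST.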

Timeline

At 2pm on February 21st, 2017, our network monitoring alerted us to a partial control plane failure on a single distribution layer switch. Each of these switches runs in an active/active configuration with another peer switch. We determined at the time that this partial control plane failure was not causing any data plane forwarding issues, meaning that we had time to work with the switch vendor and schedule an appropriate maintenance window before taking action.

Later that day at around 7pm, our systems administrators were made aware of several incidents of unreachable Linodes within the Atlanta datacenter. These Linodes were all determined to be under the affected switch pair identified earlier, and the symptoms pointed toward intermittent, widespread switching failure.

Several members of the Network Operations team and Systems team conferred for approximately one hour. As we discovered the scope of the issue, a status page was posted at 8:15pm indicating that there was a major network hardware failure in progress.

After several unsuccessful attempts at a fix, it was agreed that the simplest and most straightforward solution to the failure would be to take the degraded switch offline. Under nominal operating conditions, the switch's peer was designed to seamlessly handle this kind of failure without traffic interruption. Another status update was posted at 8:30pm noting that this action was imminent, and power was cut to the degraded switch at 8:40pm.

Unfortunately, due to circumstances that we still do not fully understand, the peer switch did not handle the failure gracefully, causing an extended network outage for hosts that were homed under this switch pair.

Some time was spent unsuccessfully attempting to coerce the remaining switch into a working state, but we were eventually forced to take both of the switches offline entirely, then bring them back online and rejoin them into an active/active pair. As of 10:30pm, the switches had both been brought online and were confirmed to be operating normally.

I would like to sincerely apologize to the customers who were affected by this extended network outage. We will be asking our switching vendor for a technical investigation into the cascading failures we saw during this incident, and we will apply any remedial fixes that are necessary in the near future.

Posted Feb 22, 2017 - 17:06 UTC

Resolved

Because we have not experienced any additional connectivity issues affecting our Atlanta data center, this matter is now resolved.
If you are still experiencing connectivity issues, please reach out to our Customer Support Team for assistance.

Posted Feb 22, 2017 - 04:52 UTC

Monitoring

Normal connectivity in Atlanta has been restored at this time. We will continue to monitor in case any additional issues arise.

Posted Feb 22, 2017 - 03:25 UTC

Investigating

We are aware of connectivity issues affecting Linodes in our Atlanta data center and are currently investigating. We will update this post with any additional information as it becomes available.

Posted Feb 22, 2017 - 02:02 UTC

Update

We will be taking the affected piece of network equipment offline shortly. Because of our redundant switching infrastructure, we do not expect there to be any significant impact to customer traffic. However, customers may observe brief periods of increased latency or packet loss.

Posted Feb 22, 2017 - 01:31 UTC

Identified

We have identified a traffic routing issue with a distribution layer switch that serves a subset of physical hosts in Atlanta.

Posted Feb 22, 2017 - 01:16 UTC

This incident affected: Regions (US-Southeast (Atlanta)).
