At approximately 12:00:00 EST on February 21st, 2019, our monitoring systems detected widespread networking issues in our Newark, NJ data center. We have determined that a power feed to one of the redundant data center routers (Router1) was interrupted. Router1 has full chassis power redundancy and runs in a redundant configuration with Router2. When Router1 came back online and re-formed its adjacency with Router2, instability in some traffic flows was detected. Engineers immediately began troubleshooting the impacted router and isolated the problem to a corrupted neighbor table on Router1. The table was flushed and service was restored.
12:00:00 EST - A-side power on Router1 interrupted
12:05:00 EST - Linode Network Operations alerted to widespread network-related data center outage
12:06:00 EST - Incident response plan activated
12:15:00 EST - Router1 back online, reachability issues in the DC still apparent
12:25:00 EST - vPC consistency verified on router pair
12:35:00 EST - FIB consistency verified on router pair
12:50:00 EST - Router1 isolated from WAN routing, no change to impacted connectivity
13:00:00 EST - Router2 isolated from WAN routing, no change to impacted connectivity
13:25:00 EST - Router2 adjacency table flushed, no change to impacted connectivity
13:30:00 EST - Router1 adjacency table flushed
13:35:00 EST - Service restored
It is still not clear why we experienced a prolonged outage when a router was removed from the redundant pair. These routers have sustained many reboots during upgrades and are designed to maintain functionality when one is dropped from the pair. It is also not clear why it was necessary to flush Router1's adjacency table to restore connectivity after it came back up. Linode Network Operations plans to replicate the Newark environment in our lab and work with Cisco to find the root cause of these failures.