Friday, April 26, 2013
2:20pm (EDT): We have been alerted to a network connectivity issue affecting our Fremont facility. We are investigating and will provide more information as it becomes available.
3:20pm (EDT): We are still working to resolve the connectivity issues at our Fremont facility. There is no ETA for full resolution at this time. We will post an update here once more information is available.
3:45pm (EDT): The networking issue should now be resolved. If you continue to experience any problems, please open a support ticket from within the Linode Manager.
6:05pm (EDT):
At approximately 2:00 PM EDT on April 26, 2013, the network in our Fremont datacenter began to experience major packet loss. We have fixed the source of the packet loss and are releasing this RFO to detail the events which led to this outage.
We have recently been working with our upstream provider to add redundant links to a new segment of our network, which tunnels through part of our upstream provider's network. Shortly before the outage, our provider was bringing these new links up. Under normal conditions, bringing up new links has no effect on our network. However, for reasons we are still investigating, a network loop was created when these redundant links came online.
Network loops of this nature cause two main problems: 1) broadcast traffic circulates around the loop and storms the network, and 2) MAC address tables can be populated with incorrect information about where MAC addresses are located in the network. In this state, traffic can be mistakenly passed to a switch that does not actually know where a particular MAC address is. When this happens, the switch floods that traffic to all of its ports, which can cause significant packet loss in a large network.
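To illustrate both effects, here is a minimal Python sketch of three learning switches. It is a toy model under stated assumptions, not a description of our actual Fremont topology or equipment: the switch names, port numbers, host MAC address, and the `run` helper are all hypothetical, and no loop-prevention protocol is modeled.

```python
BROADCAST = "ff:ff:ff:ff:ff:ff"
HOST = "00:00:00:00:00:aa"  # hypothetical host MAC, attached to switch A port 0


class Switch:
    """A toy learning switch: learn the source MAC, then forward or flood."""

    def __init__(self, name):
        self.name = name
        self.ports = set()
        self.links = {}      # port -> (peer switch, peer port); host ports are absent
        self.mac_table = {}  # MAC -> port the MAC was last seen on

    def receive(self, src, dst, in_port):
        """Learn where src was seen and return the ports to send the frame out of."""
        self.mac_table[src] = in_port
        out = None if dst == BROADCAST else self.mac_table.get(dst)
        if out is None:
            # Broadcast or unknown destination: flood to every port but the ingress.
            return [p for p in sorted(self.ports) if p != in_port]
        return [] if out == in_port else [out]


def connect(a, port_a, b, port_b):
    """Cable port_a on switch a to port_b on switch b."""
    a.ports.add(port_a)
    b.ports.add(port_b)
    a.links[port_a] = (b, port_b)
    b.links[port_b] = (a, port_a)


def run(close_loop, rounds=30):
    """Inject one broadcast from the host; return (frames in flight, A's port for HOST)."""
    a, b, c = Switch("A"), Switch("B"), Switch("C")
    a.ports.add(0)           # the host really lives on A, port 0
    connect(a, 1, b, 1)
    connect(b, 2, c, 1)
    if close_loop:
        connect(c, 2, a, 2)  # the extra link that closes the loop
    frames = [(a, 0)]        # (switch, ingress port) for each frame in flight
    for _ in range(rounds):
        next_frames = []
        for switch, in_port in frames:
            for out in switch.receive(HOST, BROADCAST, in_port):
                if out in switch.links:  # frames sent to host ports stop there
                    next_frames.append(switch.links[out])
        frames = next_frames
    return len(frames), a.mac_table[HOST]
```

In this model, `run(close_loop=False)` returns `(0, 0)`: the broadcast dies out at the end of the chain and switch A still maps the host to port 0. With `run(close_loop=True)`, frames are still circulating after 30 rounds, because Ethernet frames carry no hop limit, and switch A now maps the host to an inter-switch port instead of port 0. A real storm is far worse, since every broadcast from every host keeps circulating in the same way.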
Our network engineers swiftly identified the loop caused by these new links and shut the links down, cutting the loop and stopping the broadcast storm. However, the MAC address tables in the switches and routers still contained some incorrect information. Our network engineers then removed the erroneous table data, which restored the network to a stable state.
We are working with our upstream provider to determine the root cause of the network loop and to find a way to safely bring these redundant links online. These links will remain disabled until a plan of action is formulated. That plan will be carried out during a future scheduled network maintenance window.
We sincerely apologize for the inconvenience this has caused. If you have any further questions regarding this incident, please do not hesitate to open a support ticket via the Linode Manager or give us a call at (855) 4-LINODE.