Linode Status

Connectivity Issue - London
Incident Report for Linode
Postmortem

On August 25, 2022 at 22:22 UTC, the London Data Center experienced a disruption in service, with most connectivity restored by 23:35 UTC.

This disruption occurred after the deployment of a route policy change triggered a bug in the London Data Center. The change had already been planned for release, but its deployment was expedited to address reports of IP Sharing issues across all Linode locations.

These route policy changes were deployed successfully to both the Mumbai and Newark data centers. When the same change was deployed to London, it triggered a bug that caused BGP sessions in London’s core network to go down and not automatically recover. Despite the phased deployment procedure we follow for updates of this nature, our monitoring did not surface the bug until the change had been deployed to all of the devices in London.

Following the deployment of this change to London, Network Engineering detected a full disruption of connectivity. Linode immediately invoked its Incident Response Process as the Linode Network Engineering team continued troubleshooting the root cause. Network Engineering resolved the issue by restarting the routing daemons across all top-of-rack switches within London.

The root cause of the bug itself is still under investigation, and we are currently working with the switch vendor for more information. However, we have neutralized the possibility of this particular bug causing further disruption to London by implementing additional configuration changes. A permanent adjustment to our deployment processes and monitoring will also ensure that future configuration changes of this nature are performed in a more fault-resilient manner.
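As an illustration only, a phased deployment that stops at the first sign of trouble can be sketched as below. This is not Linode's tooling or its actual procedure: the function `phased_rollout` and the callbacks `apply_change` and `sessions_healthy` are hypothetical names introduced for this sketch, and how device health is actually checked (for example, whether BGP sessions are still established) is left to the caller.

```python
from typing import Callable, Iterable, List


def phased_rollout(
    devices: Iterable[str],
    apply_change: Callable[[str], None],       # hypothetical: pushes the change to one device
    sessions_healthy: Callable[[str], bool],   # hypothetical: True if the device looks healthy
    batch_size: int = 1,
) -> List[str]:
    """Apply a change batch by batch, halting at the first unhealthy device.

    Returns the devices that were changed and passed the health check.
    Raises RuntimeError as soon as a changed device fails the check, so
    later batches are never touched.
    """
    pending = list(devices)
    verified: List[str] = []
    for start in range(0, len(pending), batch_size):
        batch = pending[start:start + batch_size]
        for device in batch:
            apply_change(device)
        # Gate: every device in this batch must be healthy before continuing.
        for device in batch:
            if not sessions_healthy(device):
                raise RuntimeError(
                    f"halting rollout: {device} unhealthy after change; "
                    f"{len(verified)} device(s) verified before halt"
                )
            verified.append(device)
    return verified
```

With `batch_size=1`, a failure on one device leaves every later device untouched. The sketch assumes the health check is reliable and runs only after the change has had time to take effect; a check that runs too early would pass devices that fail later.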

The frequency of events within this location has initiated a larger strategic review of our change control policies and overall reliability. Your trust is our most important resource, and we must work even harder to earn and preserve it. We sincerely apologize for the disruption that this event has caused.

Timeline (UTC):

  • Aug 25 19:01 Fix for IP Sharing issue rolled out to Mumbai Data Center
  • Aug 25 21:50 Rollout of fix to Newark Data Center begins
  • Aug 25 22:03 Rollout of fix to Newark Data Center completed
  • Aug 25 22:04 Rollout of fix to London Data Center begins
  • Aug 25 22:22 Traffic to London Data Center falls
  • Aug 25 22:25 London compute hosts begin alerting for downtime
  • Aug 25 22:29 Network Engineering messaged about alerts
  • Aug 25 22:35 Status page for London data center goes live in Investigating state
  • Aug 25 23:06 Network Engineering restarts routing daemons on top-of-rack switches in London one-by-one to restore service
  • Aug 25 23:11 Status page set to Identified
  • Aug 25 23:35 Network Engineering has at least one uplink on every switch back online
  • Aug 26 00:14 Status page set to Monitoring
  • Aug 26 02:30 Status page set to Resolved
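
The intervals implied by the timeline above can be worked out with a short script. The timestamps are copied from the timeline; the event labels and the helper `minutes_between` are names chosen for this sketch only.

```python
from datetime import datetime

# Timestamps copied from the timeline (all UTC); the keys are illustrative labels.
FMT = "%Y-%m-%d %H:%M"
events = {
    "london_rollout_begins": "2022-08-25 22:04",
    "traffic_falls": "2022-08-25 22:22",
    "status_page_investigating": "2022-08-25 22:35",
    "daemon_restarts_begin": "2022-08-25 23:06",
    "uplinks_restored": "2022-08-25 23:35",
}
times = {name: datetime.strptime(ts, FMT) for name, ts in events.items()}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes elapsed between two named timeline events."""
    return int((times[end] - times[start]).total_seconds() // 60)


print(minutes_between("london_rollout_begins", "traffic_falls"))      # 18
print(minutes_between("traffic_falls", "status_page_investigating"))  # 13
print(minutes_between("traffic_falls", "daemon_restarts_begin"))      # 44
print(minutes_between("traffic_falls", "uplinks_restored"))           # 73
```

In other words, traffic fell 18 minutes after the London rollout began, and at least one uplink per switch was back 73 minutes after traffic fell.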
Posted Aug 30, 2022 - 22:41 UTC

Resolved
This incident has been resolved.
Posted Aug 26, 2022 - 02:30 UTC
Monitoring
At this time we have been able to correct the issues affecting connectivity in our London data center. While we're still conducting a full root-cause analysis into this incident, our preliminary investigation suggests this was the unexpected result of a network configuration change in London. We've set this incident to monitoring while we ensure connectivity in London remains stable and we'll provide a post-mortem here as soon as possible. If you are still experiencing issues, please open a Support ticket for assistance.
Posted Aug 26, 2022 - 00:14 UTC
Update
Our team has traced the connectivity issue in our London data center to a problem with the data center's fabric switching. We are working quickly to implement a fix, and we will provide an update as soon as the solution is in place.
Posted Aug 25, 2022 - 23:11 UTC
Investigating
Our team is investigating a connectivity issue in our London data center. During this time, users may experience connection timeouts and errors for all services deployed in this data center. We will share additional updates as we have more information.
Posted Aug 25, 2022 - 22:35 UTC
This incident affected: Regions (EU-West (London)).