On August 25, 2022 at 22:22 UTC, the London Data Center experienced a disruption in service. Most connectivity was restored by 23:35 UTC.
This disruption occurred after a deployment of a route policy change triggered a bug in the London Data Center. This change was previously planned for release, but its deployment was expedited in order to address reports of IP Sharing issues across all Linode locations.
These route policy changes were deployed successfully to both the Mumbai and Newark data centers. When the same change was deployed to London, it triggered a bug that caused BGP sessions in London’s core network to go down without automatically recovering. Despite the phased deployment procedure we follow for updates of this nature, our monitoring did not surface the bug until the change had been deployed to all of the devices in London.
Following the deployment of this change to London, Network Engineering detected a full disruption of connectivity. Linode immediately invoked its Incident Response Process as the Linode Network Engineering team continued troubleshooting the root cause. Network Engineering resolved the issue by restarting the routing daemons across all top-of-rack switches within London.
The root cause of the bug itself is still under investigation, and we are currently working with the switch vendor for more information. However, we have neutralized the possibility of this particular bug causing further disruption to London by implementing additional configuration changes. A permanent adjustment to our deployment processes and monitoring will also ensure that future configuration changes of this nature are performed in a more fault-resilient manner.
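For illustration only, a phased deployment that stops as soon as a health check fails can be sketched as follows. This is a generic sketch and not the tooling involved in this incident: `phased_rollout`, `RolloutHalted`, `apply_change`, and `bgp_healthy` are hypothetical names, and the health check is assumed to be a caller-supplied function that reports whether a device's BGP sessions are established after the change.

```python
from typing import Callable, Iterable, List, Sequence


class RolloutHalted(Exception):
    """Raised when a batch fails its health check (hypothetical helper)."""

    def __init__(self, unhealthy: List[str], changed: List[str]) -> None:
        super().__init__(f"rollout halted; unhealthy devices: {unhealthy}")
        self.unhealthy = unhealthy  # devices that failed the check
        self.changed = changed      # every device that received the change


def batches(devices: Sequence[str], size: int) -> Iterable[List[str]]:
    """Yield consecutive groups of at most `size` devices."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(devices), size):
        yield list(devices[start:start + size])


def phased_rollout(
    devices: Sequence[str],
    apply_change: Callable[[str], None],
    bgp_healthy: Callable[[str], bool],
    batch_size: int = 1,
) -> List[str]:
    """Apply a change batch by batch, halting at the first unhealthy batch.

    `apply_change` and `bgp_healthy` are assumptions supplied by the caller;
    a real check would also need to wait for sessions to settle.
    Returns the devices that received the change if every batch passed.
    """
    changed: List[str] = []
    for group in batches(devices, batch_size):
        for device in group:
            apply_change(device)
            changed.append(device)
        # Gate: verify the whole batch before touching the next one.
        unhealthy = [d for d in group if not bgp_healthy(d)]
        if unhealthy:
            raise RolloutHalted(unhealthy, changed)
    return changed
```

With a batch size of one, a failure on the second device halts the rollout before the third is touched, so the affected set stays limited to the devices already changed.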
The frequency of events within this location has prompted a broader strategic review of our change control policies and overall reliability. Your trust is our most important resource, and we must work even harder to earn and preserve it. We sincerely apologize for the disruption that this event has caused.