Linode Status

Connectivity Issues - London and Frankfurt
Incident Report for Linode
Postmortem

Postmortem Summary for August 1-2 Connectivity Issues - London and Frankfurt

From August 1, 2022 at 21:00 UTC to August 2, 2022 at 05:00 UTC, our colocation provider performed planned power maintenance in our EU-West (London) data center. This unexpectedly led to 100% loss of connectivity in EU-West and a partial reduction in network reachability in EU-Central (Frankfurt). Our engineers were alerted to EU-West’s degraded state at 21:40 UTC and began investigating immediately. They found that the power connections serving our edge routers were not configured to handle what should have been a non-impacting loss of power during this routine maintenance.

All of Linode’s edge routers are configured with multiple redundant power supplies connected to redundant power sources across independent power feeds. While these routers can tolerate the loss of 50% input power without interrupting service, their power feeds must be connected in specific configurations to allow such tolerances. During the investigation, it was found that our feeds to both routers were not configured in this manner. As a result, the routers lost power to such an extent that they could no longer forward traffic in or out of the data center.
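The cabling requirement described above can be illustrated with a minimal sketch. It assumes a simplified model in which each power supply is wired to exactly one named feed and a router keeps forwarding as long as at least half of its supplies stay powered. The function name, the feed labels, and the example layouts are hypothetical and do not describe Linode's actual hardware or cabling.

```python
from collections import Counter
from math import ceil


def survives_single_feed_loss(psu_feeds, min_fraction=0.5):
    """Return True if losing any one feed still leaves enough powered PSUs.

    psu_feeds lists the feed each power supply is cabled to (hypothetical labels).
    min_fraction is the share of PSUs assumed necessary to keep forwarding traffic.
    """
    total = len(psu_feeds)
    if total == 0:
        return False  # no power supplies means no redundancy to speak of
    required = ceil(total * min_fraction)
    per_feed = Counter(psu_feeds)
    # Take each feed away in turn and count the supplies left on the other feeds.
    return all(total - count >= required for count in per_feed.values())


balanced = ["A", "A", "B", "B"]  # supplies split evenly across two feeds
skewed = ["A", "A", "A", "B"]    # three of four supplies depend on feed A

print(survives_single_feed_loss(balanced))  # True
print(survives_single_feed_loss(skewed))    # False
```

In this model the redundant hardware only helps when the supplies are spread across feeds: the skewed layout has the same number of supplies and feeds as the balanced one, yet a loss of feed A drops it below the assumed threshold.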

Our engineers used our out-of-band administrative access to the routers to verify that the power distribution to the edge routers was no longer redundant. At 22:10 UTC, our engineers engaged the local data center staff to physically move power cables to restore power. At approximately 23:10 UTC, Linode confirmed sufficient power was restored to each edge router.

Despite power being sufficiently restored to the edge routers, connection instability persisted. This required a full reboot of each router to restore full connectivity. By 23:35 UTC, both edge routers were once again engaging in full traffic exchange, fully restoring service to EU-West.

Although the immediate issue was resolved at this time, our engineers were aware that EU-West was not operating with full power redundancy. Our engineers drafted an action plan to implement a fix, but elected to wait until our colocation provider was no longer undergoing power maintenance before making any further changes. This was intended to prevent any further disruption later in the evening. An emergency maintenance was scheduled, with work to commence on August 2, 2022 at 23:00 UTC.

On August 2, 2022 at 21:30 UTC, before we could implement our planned changes, Linode was alerted to a near-total loss of connectivity in London and another partial reduction in network reachability in EU-Central (Frankfurt), and investigation began immediately. Our engineers discovered that, once again, planned power maintenance in our data center, combined with the partially redundant power cabling that had carried over from the previous day, had led to both routers falling below the power thresholds necessary to forward traffic. At 21:35 UTC, our engineers engaged the data center to implement our planned changes immediately. By 22:35 UTC, the changes had been implemented and power was restored to both edge routers. At this point, full connectivity was restored and the routers’ power feeds were cabled in a fully redundant state.

Following the completion of this maintenance, all edge routers in EU-West are using correctly configured, fully redundant power feeds. We do not anticipate any further issues related to misconfigured power feeds to these systems.

We have kicked off a fleet-wide audit of our power designs, configurations, and implementations, to ensure that all our sites are capable of weathering similar power events in the future without impact to our customers. Additionally, we will be adding periodic checks, either manually or through automation, to ensure that our critical power infrastructure remains fault tolerant.

What we learned and what we need to do better 

The outages that our customers experienced on August 1 and again on August 2 were the result of two crucial oversights on our part.

The first and main issue was the arrangement of power redundancy connections on our edge routers. This episode showed that the arrangement was not sufficient to withstand the loss of power during the maintenance event. As mentioned earlier, we will be conducting a full audit of the power redundancy arrangements in our data centers to ensure that they will be able to handle situations like this.

The second issue was that we missed important notification emails from our colocation provider regarding maintenance. Our colocation provider announced both power maintenance events with sufficient notice, but these announcements were not integrated with the automation systems we use to keep track of such notifications. Had we been aware of the August 2 power maintenance, it would have been possible to fast-track our peer review process and make the necessary changes in time to avoid the outage on August 2. Our failure to process notices from our colocation partners was a blind spot that these disruptions exposed.

We have automation in place for other notifications of this kind from other providers. Based on the events that occurred over the past two days, we will be extending this automation to encompass maintenance notifications from our colocation provider as well. Future notices of this nature will initiate an additional audit of power redundancy before maintenance occurs.
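As a rough sketch of how a maintenance notice could trigger a power redundancy audit ahead of time, the example below turns a parsed notice into an audit deadline. The `MaintenanceNotice` fields, the `audit_deadline` helper, and the 24-hour lead time are all assumptions made for illustration; they are not a description of Linode's automation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class MaintenanceNotice:
    """Hypothetical parsed form of a colocation provider's maintenance email."""
    site: str
    kind: str            # e.g. "power" or "network" (assumed categories)
    starts_at: datetime  # maintenance window start, in UTC


def audit_deadline(notice, lead_time=timedelta(hours=24)):
    """Return when a power redundancy audit should be finished, or None.

    Only power maintenance triggers an audit in this sketch; the lead time
    is an assumed policy value, not one taken from the postmortem.
    """
    if notice.kind != "power":
        return None
    return notice.starts_at - lead_time


# The August 1 maintenance window began at 21:00 UTC in EU-West (London).
notice = MaintenanceNotice(
    site="EU-West",
    kind="power",
    starts_at=datetime(2022, 8, 1, 21, 0, tzinfo=timezone.utc),
)
print(audit_deadline(notice))  # 2022-07-31 21:00:00+00:00
```

Deriving the deadline from the notice itself means the audit is scheduled as soon as the notice is ingested, rather than depending on someone reading the email.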

Timeline of events, August 1, 2022

21:40 UTC: Monitoring systems alert to connectivity problems in the London data center

22:03 UTC: Internally identified the issue as power failures to both edge routers

22:10 UTC: Colocation facility engaged to move power cables

23:11 UTC: Sufficient power restored to both edge routers

23:35 UTC: Full service restored to the London data center

23:40 UTC: Hands off 

Timeline of events, August 2, 2022

13:30 UTC: Power maintenance change submitted for review 

~13:50 UTC: Peer review completed and approved

14:17 UTC: Emergency maintenance to reconfigure power posted, scheduled for 23:00

21:30 UTC: Monitoring systems alert to trouble in the London data center

21:35 UTC: Data center contacted to implement scheduled maintenance early

21:48 UTC: Partial outage detected for customers in Frankfurt

22:05 UTC: Impact to Frankfurt mitigated

22:26 UTC: Maintenance complete, first edge router in London rebooted

22:30 UTC: Second edge router in London rebooted

22:35 UTC: Full connectivity restored to the London data center

22:40 UTC: Hands off
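For reference, the time from the first monitoring alert to full restoration in London on each day can be worked out from the two timelines above. This is a small arithmetic sketch; the helper name is hypothetical and the timestamps are copied from the timeline entries.

```python
from datetime import datetime


def minutes_between(start, end):
    """Return the whole minutes elapsed between two 'YYYY-MM-DD HH:MM' UTC timestamps."""
    fmt = "%Y-%m-%d %H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# First alert to full service restored, taken from the timelines above.
aug1 = minutes_between("2022-08-01 21:40", "2022-08-01 23:35")
aug2 = minutes_between("2022-08-02 21:30", "2022-08-02 22:35")

print(aug1, aug2)  # 115 65
```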

Posted Aug 03, 2022 - 15:21 UTC

Resolved
Connectivity for all services in London and Frankfurt has remained stable, and we will now consider this incident resolved. If you continue to experience problems, please open a Support ticket for assistance.
Posted Aug 03, 2022 - 00:36 UTC
Monitoring
We have corrected the issues affecting connectivity in our London and Frankfurt data centers. We will be monitoring this to ensure that services remain stable. If you are still experiencing issues, please open a Support ticket for assistance.
Posted Aug 02, 2022 - 22:54 UTC
Identified
We’ve identified the cause of this incident as a power issue and are working with our data center partner in London to restore normal connectivity as soon as possible.
Posted Aug 02, 2022 - 21:57 UTC
Update
We are continuing to investigate this issue.
Posted Aug 02, 2022 - 21:46 UTC
Investigating
Our team is investigating a connectivity issue in our London data center. During this time, users may experience connection timeouts and errors for all services deployed in this data center. Additionally, users may experience connection timeouts and errors for some services in our Frankfurt data center. We will share additional updates as we have more information.
Posted Aug 02, 2022 - 21:36 UTC
This incident affected: Regions (EU-West (London)).