From August 1, 2022 at 21:00 UTC to August 2, 2022 at 05:00 UTC, our colocation provider performed planned power maintenance in our EU-West (London) data center. This unexpectedly led to 100% loss of connectivity in EU-West and a partial reduction in network reachability in EU-Central (Frankfurt). Our engineers were alerted to EU-West’s degraded state at 21:40 UTC and began investigating immediately. They found that the power connections serving our edge routers were not configured to handle what should have been a non-impacting loss of power during this routine maintenance.
All of Linode’s edge routers are configured with multiple redundant power supplies connected to redundant power sources across independent power feeds. While these routers can tolerate the loss of 50% input power without interrupting service, their power feeds must be connected in specific configurations to allow such tolerances. During the investigation, it was found that our feeds to both routers were not configured in this manner. As a result, the routers lost power to such an extent that they could no longer forward traffic in or out of the data center.
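The dependence on cabling can be illustrated with a minimal sketch. It assumes a hypothetical router with four power supplies that needs at least two of them powered to keep forwarding traffic, cabled to two feeds named "A" and "B". These names and counts are illustrative assumptions, not Linode's actual hardware or configuration.

```python
from collections import Counter


def feeds_that_break_router(psu_to_feed, min_psus_required):
    """Return the feeds whose loss alone would leave too few powered supplies.

    psu_to_feed maps a power supply name to the feed it is cabled to.
    All names and counts here are hypothetical, for illustration only.
    """
    psus_per_feed = Counter(psu_to_feed.values())
    total = len(psu_to_feed)
    return sorted(
        feed
        for feed, count in psus_per_feed.items()
        if total - count < min_psus_required
    )


# Supplies split evenly across two feeds: losing either feed leaves 2 of 4.
balanced = {"psu0": "A", "psu1": "A", "psu2": "B", "psu3": "B"}
# Three supplies on feed A: losing feed A leaves only 1 of 4.
skewed = {"psu0": "A", "psu1": "A", "psu2": "A", "psu3": "B"}

print(feeds_that_break_router(balanced, min_psus_required=2))  # []
print(feeds_that_break_router(skewed, min_psus_required=2))    # ['A']
```

In the skewed case the router has the same number of supplies and feeds as in the balanced case, yet a single feed outage drops it below the assumed threshold. This is the kind of miscabling that a 50% input power tolerance cannot absorb.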
Our engineers used our out-of-band administrative access to the routers to verify that the power distribution to the edge routers was no longer redundant. At 22:10 UTC, our engineers engaged the local data center staff to physically move power cables to restore power. At approximately 23:10 UTC, Linode confirmed sufficient power was restored to each edge router.
Despite power being sufficiently restored to the edge routers, connection instability persisted. This required a full reboot of each router to restore full connectivity. By 23:35 UTC, both edge routers were once again engaging in full traffic exchange, fully restoring service to EU-West.
Although the immediate issue was resolved at this time, our engineers were aware that EU-West was not operating with full power redundancy. Our engineers drafted an action plan to implement a fix, but elected to wait until our colocation provider was no longer undergoing power maintenance before making any further changes. This was intended to prevent any further disruption later in the evening. An emergency maintenance was scheduled, with work to commence on August 2, 2022 at 23:00 UTC.
On August 2, 2022 at 21:30 UTC, before we could implement our planned changes, Linode was alerted to a near-total loss of connectivity in London and another partial reduction in network reachability in EU-Central (Frankfurt), and investigation began immediately. Our engineers discovered that, once again, planned power maintenance in our data center, combined with the partially redundant power cabling that had carried over from the previous day, had caused both routers to fall below the power thresholds necessary to forward traffic. At 21:35 UTC, our engineers engaged the data center staff to implement our planned changes immediately. By 22:35 UTC, the changes had been implemented and power was restored to both edge routers. At this point, full connectivity was restored and the routers’ power feeds were cabled in a fully redundant state.
Following the completion of this maintenance, all edge routers in EU-West are using correctly configured, fully redundant power feeds. We do not anticipate any further issues related to misconfigured power feeds on these systems.
We have kicked off a fleet-wide audit of our power designs, configurations, and implementations, to ensure that all our sites are capable of weathering similar power events in the future without impact to our customers. Additionally, we will be adding periodic checks, either manually or through automation, to ensure that our critical power infrastructure remains fault tolerant.
The outages that our customers experienced on August 1 and again on August 2 were the result of two crucial oversights on our part.
The first and main issue was the arrangement of power redundancy connections on our edge routers. This episode showed that the arrangement was not sufficient to withstand the loss of power during the maintenance event. As mentioned earlier, we will be conducting a full audit of the power redundancy arrangements in our data centers to ensure that they can handle situations like this.
The second issue was that we missed important notification emails from our colocation provider regarding maintenance. Both power maintenance events were announced with sufficient notice by our colocation provider, but those announcements were not integrated with the automation systems we use to keep track of such notifications. Had we been aware of the August 2 power maintenance, it would have been possible to fast-track our peer review process and make the necessary changes in time to avoid the outage on August 2. Our failure to process notices from our colocation partners was a blind spot that these disruptions exposed.
We have automation in place for other notifications of this kind from other providers. Based on the events that occurred over the past two days, we will be extending this automation to encompass maintenance notifications from our colocation provider as well. Future notices of this nature will initiate an additional audit of power redundancy before maintenance occurs.
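As a rough sketch of how a maintenance notice could initiate a power redundancy audit, consider the example below. The `MaintenanceNotice` fields, the `power_audit_due` helper, and the 24-hour lead time are all hypothetical. The source does not describe the real notification format or the tooling involved.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class MaintenanceNotice:
    # Hypothetical fields; a real provider notice would need its own parser.
    site: str
    kind: str
    start: datetime
    end: datetime


def power_audit_due(notice, lead_time=timedelta(hours=24)):
    """Return when a power redundancy audit should be finished, or None.

    Only power maintenance triggers an audit in this sketch, and the
    default lead time is an assumed value.
    """
    if notice.kind != "power":
        return None
    return notice.start - lead_time


# Window taken from the August 1 maintenance described in this report.
notice = MaintenanceNotice(
    site="EU-West",
    kind="power",
    start=datetime(2022, 8, 1, 21, 0, tzinfo=timezone.utc),
    end=datetime(2022, 8, 2, 5, 0, tzinfo=timezone.utc),
)
print(power_audit_due(notice))  # 2022-07-31 21:00:00+00:00
```

The point of returning a deadline instead of a simple yes or no is that the audit is only useful if it finishes early enough to recable before the maintenance window opens.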
Timeline for August 1, 2022:
21:40 UTC: Monitoring systems alert to connectivity problems in the London data center
22:03 UTC: Internally identified the issue as power failures to both edge routers
22:10 UTC: Colocation facility engaged to move power cables
23:11 UTC: Sufficient power restored to both edge routers
23:35 UTC: Full service restored to the London data center
23:40 UTC: Hands off
Timeline for August 2, 2022:
13:30 UTC: Power maintenance change submitted for review
~13:50 UTC: Peer review completed and approved
14:17 UTC: Emergency maintenance to reconfigure power posted, scheduled for 23:00 UTC
21:30 UTC: Monitoring systems alert to trouble in the London data center
21:35 UTC: Data center contacted to implement scheduled maintenance early
21:48 UTC: Partial outage detected for customers in Frankfurt
22:05 UTC: Impact to Frankfurt mitigated
22:26 UTC: Maintenance complete, first edge router in London rebooted
22:30 UTC: Second edge router in London rebooted
22:35 UTC: Full connectivity restored to the London data center
22:40 UTC: Hands off
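For reference, the time from first alert to restoration in each event can be worked out from the timestamps above. This small sketch does only that arithmetic. The dates attached to each entry are taken from the narrative, and the figures measure time from detection, not from the start of customer impact, which may have been earlier.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"


def minutes_between(start, end):
    """Whole minutes between two UTC timestamps given as strings."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)


# London, August 1: first alert to full service restored.
london_aug1 = minutes_between("2022-08-01 21:40", "2022-08-01 23:35")
# London, August 2: first alert to full connectivity restored.
london_aug2 = minutes_between("2022-08-02 21:30", "2022-08-02 22:35")
# Frankfurt, August 2: partial outage detected to impact mitigated.
frankfurt_aug2 = minutes_between("2022-08-02 21:48", "2022-08-02 22:05")

print(london_aug1, london_aug2, frankfurt_aug2)  # 115 65 17
```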