At approximately 01:15 UTC on October 1, 2022, our administrators were alerted by the Support team that customers were experiencing failures of cross-data center migrations of Linodes out of the Toronto data center, particularly to London and Frankfurt. Affected migrations would have shown as being stuck at 40% progress in Cloud Manager.
An investigation into the failures began, and the cause was ultimately identified as a network configuration inconsistency between Toronto’s compute hosts and the rest of Toronto’s internal network. Following configuration updates at 09:30 UTC on October 1, 2022, migrations of Linodes leaving Toronto for other data centers were successful.
Once the immediate impact was resolved, Linode proceeded with a deeper review of the circumstances leading to this problem. The review found that this configuration inconsistency had already existed within Toronto’s infrastructure without causing any negative effects. Only after an upgrade to the software on Linode’s compute hosts on September 28, 2022 did the inconsistency cause a meaningful disruption of service. Cross-data center migrations from other data centers were unaffected by this issue.
To prevent this issue from occurring again, Linode will be implementing a number of improvements to its software testing procedures and network configuration that will account for this failure mode.
Timeline:
Sep 29, 2022 01:20 UTC -- First migration failure for Toronto occurs
Oct 01, 2022 18:16 UTC -- Linode Support receives alert about migration failure, starts investigation
Oct 01, 2022 20:45 UTC -- Support escalates to Platform Operations
Oct 01, 2022 21:08 UTC -- Platform Operations acknowledges escalation and continues investigation
Oct 01, 2022 23:45 UTC -- Platform Operations pages to System Operations on-call
Oct 02, 2022 02:30 UTC -- Platform and System Operations page the on-call network administrator
Oct 02, 2022 04:00 UTC -- Network administrator identifies the issue and starts exploring potential fixes
Oct 02, 2022 05:30 UTC -- Network administrator fixes the problem
Oct 02, 2022 05:40 UTC -- Beginning of test migration from Toronto to verify fix
Oct 02, 2022 06:00 UTC -- Migration successful, status page moved to Monitoring
Oct 02, 2022 07:00 UTC -- Status page moved to Resolved