At around 23:00 UTC on March 22, 2023, Linode’s Support team noticed a large number of alerts regarding Managed Database clusters and informed the appropriate teams. An investigation commenced, and Linode began its incident response procedures to coordinate the effort.
By about 00:00 UTC on March 23, 2023, it was clear that a subsystem for Managed Databases had failed at 21:46 UTC on March 22, 2023, likely due to a general networking disruption shortly before that time. This subsystem has an automatic recovery process, but that process failed after 1 hour and 15 minutes, at around 23:01 UTC on March 22, 2023, matching the timing of the alert flood noticed by Linode Support.
While the investigation continued, a manual recovery process began at 00:01 UTC on March 23, 2023, initially without an ETA. By 00:20 UTC, an ETA of about 3 hours was available. The recovery was actively monitored throughout and was confirmed complete at 03:38 UTC on March 23, 2023.
Recovering the subsystem was a necessary prerequisite for service restoration, but the Managed Database clusters themselves also had to be refreshed. This refresh completed for all Managed Database clusters by 08:53 UTC on March 23, 2023, rendering them fully usable once again. After no further reports were observed, the status page incident was marked resolved at 12:42 UTC.
To prevent this issue from occurring again, we are exploring ways to improve the resiliency of the subsystems whose failure to automatically recover led to this incident.