Starting at about 22:34 UTC on July 17, 2023, Block Storage volumes for customers in Frankfurt (eu-central) became unusable for all operations. This issue came to the attention of the Linode Support team at about 22:52 UTC, with additional reports resulting in an escalation to Linode's Compute Operations team at 23:24 UTC.
Based on the severity of the impact, Linode started its incident response procedures at 23:26 UTC and paged the on-call Storage expert at 23:36 UTC. This expert began reviewing the circumstances at 23:37 UTC, then commented at 23:43 UTC that one of the Block Storage backends had encountered a memory exhaustion condition.
Further investigation by the Storage expert indicated by 23:52 UTC that the cluster had experienced a "slow ops" condition which was stalling all I/O to the cluster. By 23:56 UTC, it was also believed that the condition making the cluster unusable had occurred before the memory exhaustion condition.
To remediate this situation, the Storage expert began restarting the backend components experiencing slow ops, proceeding slowly to avoid making the situation worse. This process began at 23:52 UTC, and by 00:23 UTC on July 18, the "slow ops" condition had cleared, with test volumes becoming available by 00:26 UTC.
To address this issue on a long-term basis, the Frankfurt cluster is currently undergoing a series of improvements to bolster its resiliency. These improvements are being rolled out carefully to avoid further impact to the service.