On December 3, 2024, at around 09:00 UTC, block storage clusters in Toronto and Frankfurt became unavailable, causing customer-facing issues such as connection timeouts and errors. The problem was resolved by 10:30 UTC after the affected devices were force-restarted.
A similar issue affected the Newark cluster on December 5, 2024, between 06:40 UTC and 07:49 UTC. In this instance, the Subject Matter Experts (SMEs) mitigated the impact by manually restarting the affected machines one at a time. Because the issue had recurred, we expanded the scope of the incident to evaluate all potentially vulnerable sites and eliminate the risk of further outages.
Despite these efforts, on December 9, 2024, at 07:25 UTC, the same software malfunction caused an outage in our Sydney data center, which was mitigated by rebooting the nodes. To resolve the underlying issue, we expedited a kernel and Network Interface Card (NIC) firmware upgrade, which was completed on December 12, 2024, at approximately 22:25 UTC. This upgrade eliminated the risk of recurrence across all identified sites.
Work to determine the root cause and to identify additional long-term preventive actions is in progress.
This summary provides an overview of our current understanding of the incident given the information available. Our investigation is ongoing and any information herein is subject to change.