At approximately 12:00 UTC on August 13, 2024, our cloud computing infrastructure experienced an issue with migrations. During this time, customers were unable to schedule migrations, including live, warm, and cold migrations. The issue also affected Linode resizes, causing resize attempts to hang or fail entirely.
The root cause was identified as excessive load on an internal dependent database, traced back to a high volume of queries generated within a short span of time. This load overwhelmed the database, preventing the migration tooling from performing its functions. Further investigation revealed that the Cloud Manager was polling data too frequently, which contributed to the database strain. To resolve the issue, our teams implemented several corrective actions, including developing and deploying a script that automatically terminates long-running queries, which reduced the load on the database.
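For illustration only, a long-running-query reaper along these lines can be sketched as follows. This is not the actual remediation script; the database type, the 300-second cutoff, and the query states are all assumptions made for the example.

```python
# Illustrative sketch only -- not the actual production remediation script.
# Assumes query metadata shaped like (pid, seconds_running, state) tuples,
# e.g. as reported by a PostgreSQL pg_stat_activity-style view.

MAX_QUERY_SECONDS = 300  # hypothetical cutoff for "long-running"

def select_queries_to_terminate(active_queries, max_seconds=MAX_QUERY_SECONDS):
    """Return the pids of active queries that have exceeded the cutoff."""
    return [
        pid
        for pid, seconds_running, state in active_queries
        if state == "active" and seconds_running > max_seconds
    ]

# Against a live database, the selected pids would then be terminated
# (e.g. via pg_terminate_backend on PostgreSQL). That step is omitted so
# the sketch stays self-contained and runnable without a database.
if __name__ == "__main__":
    sample = [
        (101, 12, "active"),   # short-lived: keep
        (102, 900, "active"),  # long-running: terminate
        (103, 450, "idle"),    # long but idle: keep
    ]
    print(select_queries_to_terminate(sample))  # -> [102]
```

Keeping the selection logic as a pure function makes the termination policy easy to test independently of the database connection.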
By August 29, 2024, these corrective actions had been fully implemented, allowing us to restore normal operations and move to the next phase, which focuses on ongoing optimization and monitoring. We are committed to learning from this incident to ensure a better experience for our customers.
This summary provides an overview of our current understanding of the incident given the information available. Our investigation is ongoing and any information herein is subject to change.