Starting at around 13:39 UTC on August 7, 2023, a number of Akamai employees noticed latency on internal systems, both through direct interaction and through paging alerts for web servers. Akamai’s database administrators were also automatically paged at 13:44 UTC. While the database administrators investigated the alerts, the impact on Akamai employees overall appeared to be intermittent: internal resources were loading again, and the paging alerts had automatically resolved.
At around 15:18 UTC, the web service alerts paged again, followed at 15:29 UTC by a number of Akamai employees experiencing a recurrence of the latency in accessing internal systems. During their continuing investigation, Akamai’s database administrators noticed high CPU usage on the database systems, leading to the declaration of an incident at 15:32 UTC.
Akamai’s Compute Support team was also receiving customer reports of platform-level operations (create, reboot, etc.) failing to process, starting at 15:51 UTC.
By 16:12 UTC, the database administrators’ investigation was centering on slow queries against a particular table, which were building up a backlog that exhausted database resources. A change made to the database at 16:19 UTC appeared to mitigate the backlog and the resulting resource exhaustion, leading to the resolution of the incident.
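To make this failure mode concrete, below is a minimal monitoring sketch, assuming a MySQL-compatible database: it polls for statements that have been running against a single hot table for longer than a threshold, the pattern in which a backlog builds and exhausts resources. The host, credentials, table name, and threshold are all hypothetical and are not taken from Akamai’s systems.

```python
import mysql.connector  # pip install mysql-connector-python

# Hypothetical connection details; not Akamai's actual configuration.
conn = mysql.connector.connect(
    host="db.example.internal", user="monitor", password="..."
)

LONG_QUERY_SECONDS = 30   # assumed threshold for "slow"
SUSPECT_TABLE = "events"  # hypothetical hot table name

cursor = conn.cursor()
# information_schema.processlist lists every active session, including
# how long its current statement has been running.
cursor.execute(
    """
    SELECT id, user, time, info
    FROM information_schema.processlist
    WHERE command = 'Query'
      AND time > %s
      AND info LIKE %s
    ORDER BY time DESC
    """,
    (LONG_QUERY_SECONDS, f"%{SUSPECT_TABLE}%"),
)

for thread_id, user, seconds, statement in cursor.fetchall():
    # A growing list of long-running statements against one table is the
    # backlog pattern described above: each slow query holds resources,
    # new queries queue behind it, and the database eventually saturates.
    print(f"thread {thread_id} ({user}): {seconds}s -> {statement}")
```

In practice a check like this would feed an alerting pipeline rather than print to stdout, but the shape of the signal is the same.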
---
At 15:19 UTC on August 15, 2023, Akamai observed a recurrence of the issue. This appears to have been related to a query performed on the database that had unexpected side effects. A new status page was opened for the recurrence. Multiple teams investigated the issue over the next several hours, identifying a number of leads and making various modifications that yielded marginal improvements, but no overall resolution.
At 10:34 UTC on August 16, 2023, Akamai received customer reports of issues with resizing Block Storage volumes, which were determined to be related to the ongoing incident. The database administrators and other teams resumed their investigation, focusing on the parts of the system related to Block Storage because of the reported customer impact.
At 19:50 UTC, Akamai’s administrators began implementing changes based on their observations. System improvements were observed almost immediately after these changes, with no recurrence of the issue after a period of careful monitoring. The status page was updated to reflect the resolution of the incident at 01:48 UTC.
---
To prevent this issue from recurring, Akamai has identified the aspects of the database most affected during this incident and is making architectural improvements to them for better performance. In addition to these structural improvements, Akamai is also looking to identify periods of heavy load and explore ways to distribute that load more evenly over time.
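As one illustration of distributing load over time, offered as a generic sketch rather than Akamai’s stated approach, periodic jobs can be given randomized jitter so that many workers do not all hit the database at the same instant. The interval and jitter values below are assumptions.

```python
import random
import time

BASE_INTERVAL = 300  # run roughly every 5 minutes; an assumed cadence
MAX_JITTER = 60      # spread start times across an extra minute

def run_job():
    """Placeholder for periodic work that queries the database."""
    print("job ran at", time.strftime("%H:%M:%S"))

while True:
    # Sleeping for a randomized interval de-synchronizes workers that
    # would otherwise query the database simultaneously, flattening the
    # load peaks described above.
    time.sleep(BASE_INTERVAL + random.uniform(0, MAX_JITTER))
    run_job()
```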