At around 00:00 UTC on January 24, 2024, one of the clusters responsible for operating our Object Storage service in our Washington, DC (us-iad) data center began to exhibit elevated latency for all request types. The issue went unnoticed until 4:49 UTC, when we were alerted to problems with our logging system. By 4:51 UTC, an escalation had been created to bring in engineers to investigate further. Their investigation led to the first mitigation attempts being put in place at 5:58 UTC. Over the next several hours, our engineers attempted different fixes, but these either did not hold for long or the problem moved to other parts of the affected cluster. At 13:00 UTC, the engineers reached out to subject matter experts for assistance. Continued investigation identified misconfigured client requests as the root cause. At 15:26 UTC, we made internal configuration changes to address those requests. Following this change, latency began returning to normal levels at 15:30 UTC, with full recovery observed at 15:36 UTC.
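The report above does not specify how the client requests were misconfigured, but one common class of such problems is unbounded or overly aggressive retry behavior against an S3-compatible endpoint. Purely as a hypothetical illustration, the following minimal sketch shows a client configured with bounded, adaptive retries and explicit timeouts; the endpoint URL and every parameter value here are assumptions for the example, not details from the incident:

```python
import boto3
from botocore.config import Config

# Hypothetical settings; none of these values come from the incident report.
# Bounded retries with adaptive pacing and explicit timeouts keep a single
# misbehaving client from flooding a cluster with repeated requests.
client_config = Config(
    retries={"max_attempts": 5, "mode": "adaptive"},  # cap and pace retries
    connect_timeout=5,   # seconds allowed to establish a connection
    read_timeout=60,     # seconds allowed to read a response
)

s3 = boto3.client(
    "s3",
    endpoint_url="https://us-iad-1.linodeobjects.com",  # example endpoint
    config=client_config,
)

# Example request: failures surface after at most five paced attempts
# instead of being retried indefinitely.
response = s3.list_buckets()
```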
Alongside the efforts to mitigate the increased latency, our engineers documented the steps taken both before and after subject matter experts were engaged. From this record, a runbook is being developed to reduce response time should this issue recur. Additionally, our engineers are looking to change how our Object Storage service handles the request behavior that was observed and recorded in the logs, as sketched below.
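The exact service-side change is not described, but a common way to protect an object storage gateway from the kind of request behavior described above is per-client throttling. The following is a generic token-bucket sketch that assumes nothing about the actual implementation; the client identifier, rates, and status codes are all illustrative:

```python
import time
from collections import defaultdict

class TokenBucket:
    """Generic per-client token bucket; not the service's actual mechanism."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate          # tokens replenished per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens based on elapsed time, up to the bucket capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller should reject or delay the request

# One bucket per client identifier (e.g., access key); values are illustrative.
buckets: dict[str, TokenBucket] = defaultdict(
    lambda: TokenBucket(rate=100, capacity=200))

def handle_request(client_id: str) -> int:
    # Return an HTTP-style status: 200 if admitted, 429 if throttled.
    return 200 if buckets[client_id].allow() else 429
```

Throttling misbehaving clients at the gateway limits the blast radius of a single source of bad traffic, so elevated latency from one client's requests does not degrade the cluster for everyone else.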