At approximately 16:00 UTC on October 27th, a manual query run for an internal project overloaded a database. This caused issues with the API and Cloud Manager and affected jobs issued by customers. We formally launched our incident response procedures at 16:07 UTC. After an initial investigation, the team returned the database to normal levels.
The database's recovery caused additional networking issues between the affected database and select host machines. After further action by our administrators, we recovered from the incident, and jobs issued by customers resumed and worked as expected around 17:40 UTC.
We are taking action to upgrade our existing infrastructure to prevent this type of issue from recurring. We are also continuing to improve our technical procedures to further reinforce the reliability of this database.