At approximately 18:00 UTC on July 27th, 2023, we began to observe that new database cluster deployments were failing and initiated our Incident Response Procedure. At ~21:00 UTC, the reason for these failures was identified as a networking issue causing connections to time out to our service deployment infrastructure. By 23:30 UTC, a fix had been implemented and was confirmed to have resolved the issue after a brief monitoring period.
On July 28th at 11:30 UTC, we began to observe additional instances of clusters failing to provision. Our team reinitiated Incident Response Procedure, and published a new status page here. We determined that connection timeouts had resurfaced and further infrastructure adjustments were made to address the timeouts. A solution was deployed at 23:49 UTC, and we began to monitor cluster provisioning to ensure that the issue was resolved.
On July 30th, we confirmed that the provisioning of clusters had remained in a stable, functioning state, and the incident was considered fully resolved at 19:57 UTC.
In order to prevent this issue from happening in the future, we took steps to improve our service deployment infrastructure to make it more resilient to transient network issues.