Between 20:35 UTC and 21:26 UTC on Tuesday, 8/23, our DNS service experienced an interruption that caused name resolution to fail for all DNS zones hosted on our infrastructure. The root cause was the deployment of an invalid change to our DNS infrastructure.
Between 19:42 UTC and 20:20 UTC, a change was deployed to each server within our DNS infrastructure to resolve an issue affecting a small number of customers. Most of our DNS server deployment procedures were automated when we re-architected our DNS infrastructure earlier this year, but this change required deploying to one of the components that is still managed manually.
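As an illustration only, and not our actual tooling, a manual per-server deployment of this kind boils down to a loop like the sketch below. The hostnames, file paths, and reload command are assumptions chosen to show the shape of the procedure and why a hand-run step leaves room for error.

```python
#!/usr/bin/env python3
"""Illustrative sketch of a manual, per-server config deployment.

All hostnames, paths, and commands are hypothetical stand-ins for the
hand-run steps a manually managed component requires.
"""
import subprocess

# Hypothetical list of DNS servers in the manually managed component.
DNS_SERVERS = ["ns1.example.net", "ns2.example.net", "ns3.example.net"]

NEW_CONFIG = "patched-resolver.conf"      # the approved change (assumed filename)
REMOTE_PATH = "/etc/dns/resolver.conf"    # assumed destination on each server


def deploy(server: str) -> None:
    """Copy the new config to one server and reload its DNS service."""
    # Push the file; a mistake at this step (wrong file, wrong path,
    # partial copy) only surfaces when the service next reloads.
    subprocess.run(["scp", NEW_CONFIG, f"{server}:{REMOTE_PATH}"], check=True)
    # Reload the DNS service so it picks up the change (assumed command).
    subprocess.run(["ssh", server, "sudo", "systemctl", "reload", "dns"], check=True)


if __name__ == "__main__":
    for server in DNS_SERVERS:
        deploy(server)
        print(f"deployed to {server}")
```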
The change was tested and approved within the operations team. Our investigation revealed that the change reviewed by our team was not itself the cause of the issue; the issue was ultimately caused by an error made while deploying the approved change. That deployment error caused the DNS servers to run a faulty configuration and zone file convergence process, which left the service unavailable.
At 20:35 UTC, service alerts triggered an immediate escalation. The incident response team quickly began investigating and identified the root cause by 20:41 UTC.
Like the deployment procedure, the rollback procedure had to be performed manually on a server-by-server basis. After reverting the change on each server, the incident response engineers then had to manually re-trigger the convergence process, which normally runs gradually over time, in order to speed up recovery. This convergence process updates each DNS server's configuration and every zone file we host to ensure they are at their most recent version. As the convergence process completed on each server, that server's service was restored. By 21:26 UTC, the convergence process had completed on the last server, and the final DNS server recovered.
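For readers curious what "reverting the change and re-triggering convergence" looks like mechanically, the sketch below mirrors the shape of that procedure. The server names, paths, convergence command, and verification query are assumptions rather than our actual tooling.

```python
#!/usr/bin/env python3
"""Illustrative sketch of the server-by-server rollback and forced convergence.

Hostnames, paths, and commands are hypothetical; the point is the shape of
the procedure: revert the config, then force each server to refresh its
configuration and zone files instead of waiting for the normal gradual
convergence cycle.
"""
import subprocess

DNS_SERVERS = ["ns1.example.net", "ns2.example.net", "ns3.example.net"]  # assumed

KNOWN_GOOD_CONFIG = "resolver.conf.known-good"  # assumed backup of the prior config
REMOTE_PATH = "/etc/dns/resolver.conf"          # assumed path


def rollback_and_converge(server: str) -> None:
    """Revert one server's config, then force it to reconverge immediately."""
    # 1. Restore the last known-good configuration.
    subprocess.run(["scp", KNOWN_GOOD_CONFIG, f"{server}:{REMOTE_PATH}"], check=True)
    # 2. Manually trigger the convergence job that refreshes the server's
    #    configuration and every hosted zone file (assumed command name).
    subprocess.run(
        ["ssh", server, "sudo", "/usr/local/bin/dns-converge", "--now"],
        check=True,
    )
    # 3. Spot-check that the server answers queries again before moving on.
    subprocess.run(["dig", f"@{server}", "example.com", "SOA", "+short"], check=True)


if __name__ == "__main__":
    for server in DNS_SERVERS:
        rollback_and_converge(server)
        print(f"{server}: reverted and reconverged")
```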