Linode Status

DNS Resolution Issues
Incident Report for Linode
Postmortem

Issue Summary

Between 20:35 UTC and 21:26 UTC on Tuesday, August 23, 2016, our DNS service experienced an interruption that caused name resolution to fail for all DNS zones hosted on our infrastructure. The root cause was an error made while deploying a change to the DNS infrastructure.

Timeline (UTC)

  • 19:42: Deployment of change begins across infrastructure
  • 20:20: Change deployed to final DNS server
  • 20:32: Convergence begins after change
  • 20:35: Service alerts and interruptions begin
  • 20:41: Incident response team responds to escalation; investigation begins
  • 20:54: Root cause identified; rollbacks begin
  • 21:22: Service recovery begins
  • 21:26: Final DNS server recovers
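
For readers who want the intervals rather than the raw timestamps, the arithmetic can be sketched as follows. This is only an illustration: the dictionary keys and the `minutes_between` helper are names chosen here, not part of any Linode tooling, and the sketch assumes every timestamp falls on the same UTC day.

```python
from datetime import datetime

# Timestamps (UTC) copied from the timeline above.
# The key names are illustrative labels, not official terminology.
TIMELINE = {
    "alerts_begin": "20:35",
    "investigation_begins": "20:41",
    "root_cause_identified": "20:54",
    "final_recovery": "21:26",
}


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes between two HH:MM times on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


outage = minutes_between(TIMELINE["alerts_begin"], TIMELINE["final_recovery"])
to_response = minutes_between(TIMELINE["alerts_begin"], TIMELINE["investigation_begins"])
to_root_cause = minutes_between(TIMELINE["alerts_begin"], TIMELINE["root_cause_identified"])
rollback = minutes_between(TIMELINE["root_cause_identified"], TIMELINE["final_recovery"])

print(f"Total interruption:        {outage} min")
print(f"Alert to response:         {to_response} min")
print(f"Alert to root cause:       {to_root_cause} min")
print(f"Rollback to full recovery: {rollback} min")
```

Under these assumptions, the interruption lasted 51 minutes, of which 32 minutes were spent rolling back and reconverging after the root cause was identified.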

Root Cause

Between 19:42 UTC and 20:20 UTC, a change was deployed to each server within our DNS infrastructure to resolve an issue affecting a small number of customers. Most of our DNS server deployment procedures were automated when we re-architected our DNS infrastructure earlier this year, but this change required deploying to one of the pieces that is still manually managed.

The change to be deployed was tested and approved within the operations team. Our investigation revealed that the change our team reviewed was not itself the cause of the issue; the issue was ultimately caused by an error in the process of deploying the approved change. The deployment error caused the DNS servers to run a faulty convergence of their DNS configuration and zone files, which interrupted service.

Resolution and Recovery

At 20:35 UTC, service alerts triggered an immediate escalation. The incident response team began investigating at 20:41 UTC and identified the root cause by 20:54 UTC.

Like the deployment, the rollback had to be performed manually, server by server. After reverting the change on each server, the incident response engineers manually re-triggered the convergence process, which normally runs gradually over time, in order to hasten service recovery. The convergence process updates each DNS server's configuration and every zone file we host to the most recent version. As convergence completed on each server, service on that server recovered. By 21:26 UTC, the final convergence had completed and the last DNS server recovered.
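
The per-server recovery sequence described above (revert, re-trigger convergence, confirm resolution) can be sketched in a few lines. This is a hypothetical illustration, not the tooling used during the incident: the function name, the three callables, and the `ns*.example.test` hostnames are all assumptions made for the example.

```python
from typing import Callable, Iterable, List


def rollback_and_converge(
    servers: Iterable[str],
    revert_change: Callable[[str], None],
    trigger_convergence: Callable[[str], None],
    resolves_correctly: Callable[[str], bool],
) -> List[str]:
    """Revert the change and re-run convergence on each server in turn.

    Returns the servers that still fail the resolution check afterwards.
    All three callables are hypothetical stand-ins for manual steps.
    """
    still_failing = []
    for server in servers:
        revert_change(server)        # undo the faulty deployment
        trigger_convergence(server)  # force config and zone files up to date
        if not resolves_correctly(server):
            still_failing.append(server)
    return still_failing


# Example run with stand-in functions that only record what was called.
calls = []
healthy = {"ns1.example.test": True, "ns2.example.test": False}
failing = rollback_and_converge(
    servers=list(healthy),
    revert_change=lambda s: calls.append(("revert", s)),
    trigger_convergence=lambda s: calls.append(("converge", s)),
    resolves_correctly=lambda s: healthy[s],
)
print(failing)
```

Passing the steps in as callables keeps the sketch independent of any particular DNS software; in practice each would wrap whatever commands an operator runs by hand.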

Posted Aug 24, 2016 - 19:09 UTC

Resolved
At this time the issues affecting our Hosted DNS Service have been resolved. If you experience any additional issues with DNS Resolution, please reach out to our Customer Support Team for assistance.
Posted Aug 24, 2016 - 02:10 UTC
Monitoring
We've resolved the issue with our Hosted DNS Service. DNS resolution should again be operating normally. We'll provide a post-mortem on this incident as soon as one is available.
Posted Aug 23, 2016 - 21:51 UTC
Identified
We've identified the cause of this issue, and we're working to restore normal DNS services as quickly as possible.
Posted Aug 23, 2016 - 21:25 UTC
Investigating
We're aware of domain resolution issues with our Hosted DNS Service and are investigating.
Posted Aug 23, 2016 - 20:48 UTC
This incident affected: Hosted DNS Service.