Between 20:35 UTC and 21:26 UTC on Tuesday, 8/23, our DNS service experienced an interruption that caused name resolution to fail for all DNS zones hosted on our infrastructure. The root cause was the deployment of an invalid change to our DNS infrastructure.
Between 19:42 UTC and 20:20 UTC, a change was deployed to each server within our DNS infrastructure to resolve an issue affecting a small number of customers. Most of our DNS server deployment procedures were automated when we re-architected our DNS infrastructure earlier this year, but this change required deploying to one of the components that is still managed manually.
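As an illustration only, and not our actual tooling, a manual per-server deployment of this kind boils down to a loop like the sketch below. The hostnames, file paths, and reload command are assumptions chosen to show the shape of the procedure and why a hand-run step leaves room for error.

```python
#!/usr/bin/env python3
"""Illustrative sketch of a manual, per-server config deployment.

All hostnames, paths, and commands are hypothetical stand-ins for the
hand-run steps a manually managed component requires.
"""
import subprocess

# Hypothetical list of DNS servers in the manually managed component.
DNS_SERVERS = ["ns1.example.net", "ns2.example.net", "ns3.example.net"]

NEW_CONFIG = "patched-resolver.conf"      # the approved change (assumed filename)
REMOTE_PATH = "/etc/dns/resolver.conf"    # assumed destination on each server


def deploy(server: str) -> None:
    """Copy the new config to one server and reload its DNS service."""
    # Push the file; a mistake at this step (wrong file, wrong path,
    # partial copy) only surfaces when the service next reloads.
    subprocess.run(["scp", NEW_CONFIG, f"{server}:{REMOTE_PATH}"], check=True)
    # Reload the DNS service so it picks up the change (assumed command).
    subprocess.run(["ssh", server, "sudo", "systemctl", "reload", "dns"], check=True)


if __name__ == "__main__":
    for server in DNS_SERVERS:
        deploy(server)
        print(f"deployed to {server}")
```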
The change was tested and approved within the operations team. Our investigation revealed that the change reviewed by our team was not itself the cause of the issue; the issue was ultimately caused by an error made while deploying the approved change. That deployment error caused the DNS servers to run a faulty configuration and zone file convergence process, which left the service unavailable.
At 20:35 UTC, service alerts triggered an immediate escalation. The incident response team quickly began investigating and identified the root cause by 20:41 UTC.
Like the deployment procedure, the rollback procedure had to be performed manually on a server-by-server basis. After reverting the change on each server, the incident response engineers then had to manually re-trigger the convergence process, which normally runs gradually over time, in order to speed up recovery. This convergence process updates each DNS server's configuration and every zone file we host to ensure they are at their most recent version. As the convergence process completed on each server, that server's service was restored. By 21:26 UTC, the convergence process had completed on the last server, and the final DNS server recovered.
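For readers curious what "reverting the change and re-triggering convergence" looks like mechanically, the sketch below mirrors the shape of that procedure. The server names, paths, convergence command, and verification query are assumptions rather than our actual tooling.

```python
#!/usr/bin/env python3
"""Illustrative sketch of the server-by-server rollback and forced convergence.

Hostnames, paths, and commands are hypothetical; the point is the shape of
the procedure: revert the config, then force each server to refresh its
configuration and zone files instead of waiting for the normal gradual
convergence cycle.
"""
import subprocess

DNS_SERVERS = ["ns1.example.net", "ns2.example.net", "ns3.example.net"]  # assumed

KNOWN_GOOD_CONFIG = "resolver.conf.known-good"  # assumed backup of the prior config
REMOTE_PATH = "/etc/dns/resolver.conf"          # assumed path


def rollback_and_converge(server: str) -> None:
    """Revert one server's config, then force it to reconverge immediately."""
    # 1. Restore the last known-good configuration.
    subprocess.run(["scp", KNOWN_GOOD_CONFIG, f"{server}:{REMOTE_PATH}"], check=True)
    # 2. Manually trigger the convergence job that refreshes the server's
    #    configuration and every hosted zone file (assumed command name).
    subprocess.run(
        ["ssh", server, "sudo", "/usr/local/bin/dns-converge", "--now"],
        check=True,
    )
    # 3. Spot-check that the server answers queries again before moving on.
    subprocess.run(["dig", f"@{server}", "example.com", "SOA", "+short"], check=True)


if __name__ == "__main__":
    for server in DNS_SERVERS:
        rollback_and_converge(server)
        print(f"{server}: reverted and reconverged")
```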