Starting at around 17:26 UTC on June 25, 2023, Linode’s operations team was alerted to slow resolution times for the DNS resolvers in the Newark data center. These alerts occurred throughout the day, but they appeared to be intermittent and were quickly resolved by restarting a service on the resolvers.
At around 14:34 UTC on June 26, 2023, these alerts for slow response times on Newark’s DNS resolvers recurred, at a higher frequency than on the previous day. Linode’s operations team began an initial investigation, and by 14:37 UTC, it was clear that a service on the DNS resolvers was regularly crashing. Restarting the service would recover it, but only for a brief period, leading to flapping alerts.
These alerts were not initially thought to indicate a customer-impacting issue, but Linode’s Support team started noticing a trend of tickets at around 14:52 UTC on June 26, 2023, with a particularly strong uptick at 15:19 UTC. The combination of these customer reports and recurrent alerts for service crashes prompted the official start of the incident process and a deeper investigation at 15:31 UTC.
By 16:18 UTC, a potential fix for the recurrent crashes was identified, and work commenced to implement the fix on the resolvers. This potential fix was fully implemented on all resolvers by 17:18 UTC, after which Linode’s incident response team observed the performance of the resolvers. With no signs of further crashes by 18:31 UTC, the problem was believed to be fixed, and the status page was updated accordingly.
However, at 18:48 UTC, an internal report of failed DNS queries emerged, followed by additional customer reports at 18:55 UTC. This prompted investigation into additional aspects of the DNS resolvers, and the status page was moved back to an investigating state at 19:46 UTC.
At 20:24 UTC, an additional problem involving certain erroneous DNS queries was identified, prompting an exploration of potential fixes involving these queries. By 20:55 UTC, a tentative fix was implemented, and response times from all DNS resolvers improved immediately. After the team monitored this fix, the status page was set to a monitoring status at 23:14 UTC and resolved at 01:10 UTC on June 27, 2023.
To help prevent this issue from recurring, Linode will be exploring means to improve the resiliency and monitoring of its DNS resolver systems. Additionally, Linode will be pursuing improvements to its documentation and procedures to detect potential customer impacts sooner and begin the incident process as quickly as possible.
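As an illustration of the kind of monitoring improvement mentioned above, the following is a minimal sketch of how flapping alerts (a service that repeatedly crashes and recovers after a restart) might be detected from a series of health-check results. It is not Linode's actual tooling. The function names, the 30-minute window, the threshold of four state changes, and the sample timestamps are all assumptions made up for this example.

```python
from datetime import datetime, timedelta, timezone


def count_transitions(samples, window_end, window):
    """Count healthy/unhealthy state changes inside the trailing window.

    samples: list of (timestamp, is_healthy) tuples, sorted by timestamp.
    """
    window_start = window_end - window
    in_window = [ok for ts, ok in samples if window_start <= ts <= window_end]
    # Compare each sample with the next one; a difference is one state change.
    return sum(1 for prev, cur in zip(in_window, in_window[1:]) if prev != cur)


def is_flapping(samples, window_end,
                window=timedelta(minutes=30), max_transitions=4):
    """Treat a check as flapping when it changes state too often.

    The default window and threshold are hypothetical values.
    """
    return count_transitions(samples, window_end, window) >= max_transitions


# Made-up sample data: one health check every five minutes.
start = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
times = [start + timedelta(minutes=5 * i) for i in range(6)]
end = times[-1]

# A service that alternates between healthy and crashed on every check.
flappy = list(zip(times, [True, False, True, False, True, False]))
# A service that stays healthy for the whole period.
stable = list(zip(times, [True] * 6))

flappy_is_flapping = is_flapping(flappy, end)
stable_is_flapping = is_flapping(stable, end)
```

A check like this could escalate a series of individually short-lived alerts into a single, higher-priority signal, rather than relying on each alert clearing after a restart.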