On May 4th at 18:38 UTC, customers alerted Compute Support to networking and DNS issues within the Washington, DC (US-IAD) data center. The reported issues appeared to affect customers' ability to connect to newly created nodes in their Linode Kubernetes Engine (LKE) clusters. The initial investigation focused specifically on the LKE service, with mitigation steps taken on individual cases. After receiving additional reports, we expanded the scope of our investigation and brought in subject matter experts to gain a more holistic view of the situation. At this point, engineers found routing inconsistencies in the underlying infrastructure of the US-IAD data center.
At 03:10 UTC on May 6th, network engineers began restarting the internal services that serve routing table information within the US-IAD data center to mitigate the immediate impact. During this process, we also added underlying compute resources to prevent potential recurrences, as the root cause was still being investigated.
In response to this incident, we have made changes to our underlying network services to prevent further problems of this nature. We have also implemented automated alerting that will promptly notify us should this issue recur.
This summary provides an overview of our current understanding of the incident given the information available. Our investigation is ongoing and any information herein is subject to change.