While troubleshooting a customer issue on the LKE-E side, we became aware of an intermittent issue causing pod-to-pod connection timeouts on LKE clusters across all data centers. The initial investigation pointed to "noisy network neighbors" on the hosts as the cause of the timeouts. Further investigation indicated that the issue had been present since approximately January 20, 2025.
Our LKE engineering team began testing on standard-tier LKE server sets and was able to replicate the issue for 3 hours in the Los Angeles data center.
Akamai ultimately identified two distinct issues that led to the observed behavior. For server sets running Dedicated Linode plans, we traced most occurrences to problems with the underlying host; in most cases, the host was under memory pressure, which affected networking for all guests running on it. We correlated the customer's reports with their decision to change all premium nodepools to dedicated nodepools at the beginning of the year.
The networking problems we noticed on premium nodepools were in fact being drowned out by the noisy dedicated server sets. Once we isolated the premium nodepools, we were able to correlate the customer's reports with a known issue in our Envoy proxy configuration.
To mitigate the issue, we released a patch containing a fix.
Akamai will schedule a meeting to outline lessons learned and next steps to ensure similar incidents do not happen in the future.
This summary provides an overview of our current understanding of the incident given the information available. Our investigation is ongoing and any information herein is subject to change.