Starting at 22:21 UTC on November 19, 2023, Akamai received a large number of alerts for infrastructure in the Newark data center. In response, Akamai’s Compute Operations team declared an incident by 22:26 UTC, identifying an issue with a router in Newark. Compute Operations escalated this router issue to the Network Operations team at 22:27 UTC for review.
The Network Operations team joined the incident discussion at 22:34 UTC and began investigating. By 22:39 UTC, they identified that a networking component in Newark had experienced a software crash. The nature of this crash was causing the usual automatic failover mechanisms to not take effect, leading to internal network disruption. This disruption prevented Linodes in Newark from processing platform-level changes – existing services would continue to run normally, but service creation, power state changes, and other platform-level tasks were impacted.
To resolve this issue, the Network Operations team performed a reboot of the impacted component at 22:45 UTC. This mitigated the proximal cause of the impact, but still required cleanup of latent networking issues induced by the software crash. The Network Operations team completed this cleanup at 22:51 UTC, which fully restored baseline network performance in Newark.
To help prevent this issue from occurring in the future, Akamai will be replacing the failed network component and performing additional debugging to understand how it failed. Akamai will also be reviewing its alerts associated with the affected networking systems to ensure a timely response and resolution for any recurrences.