At 14:57 UTC on October 18, 2023, Akamai detected a network disruption involving the Chennai data center. After an initial investigation by Akamai's network engineering teams, an incident was declared at 15:29 UTC. The impact was limited to IPv6; IPv4 was unaffected.
By 16:06 UTC, it was apparent that the networking problems were related to interactions with two egress providers, referred to here as Provider 1 and Provider 2.
Akamai engaged Provider 1 at 16:30 UTC and Provider 2 at 16:42 UTC for troubleshooting these problems further. Provider 1 acknowledged Akamai's outreach at 16:43 UTC, but Provider 2 did not provide a timely response to Akamai's outreach. As a result, Akamai engaged its internal contract managers for Provider 2 at 17:09 UTC.
Regarding the issue with Provider 1, the continuing investigation from Akamai's network specialists indicated this provider was advertising a bad route (::/1), so Akamai disabled this uplink at 18:02 UTC.
After this internal escalation, Akamai received a reply from Provider 2 stating that the provider was advertising Akamai's prefixes to the Internet. However, Akamai's tests at 18:02 UTC showed that packets for certain prefixes were still being dropped within Provider 2's network. Akamai wrote to Provider 2 at 18:35 UTC asking them to check the routing status of these prefixes.
Provider 2 responded at 18:36 UTC informing Akamai that two of the four prefixes were operational, and that they were continuing to investigate the other two prefixes. Akamai confirmed by 19:20 UTC that the two prefixes were successfully routing to the Internet, asking Provider 2 to continue troubleshooting the other two prefixes. Akamai also asked Provider 2 about a traceroute to a destination that was failing within Provider 2's network.
At 19:54 UTC, Provider 2 acknowledged that they were investigating the two non-functional prefixes and also showed a successful traceroute for the requested destination. The traceroute had been failing due to an IPv6 session that Akamai had temporarily disabled; re-enabling the session restored connectivity to the destination featured in this traceroute.
Akamai messaged Provider 2 again at 20:49 UTC after seeing that the two non-functional prefixes were still not responsive. By 22:07 UTC, Provider 2 informed Akamai that these prefixes were now up on their end, sending over a trace that Akamai proceeded to review. However, Akamai's testing was not showing that these prefixes were routing to the Internet.
Despite these extensive efforts, the networking problems had not been resolved by 00:31 UTC on October 19, 2023, by which time Akamai's APAC (Asia Pacific) account leadership was engaged. At 00:46 UTC, Akamai's network specialists noticed issues with SDN (Software Defined Networking) on Akamai's border routers and announced plans to reboot the affected routers at an appropriate time.
At 02:17 UTC, Provider 2 replied to Akamai that they would provide an update as soon as possible during working hours in India. This provider updated Akamai at 08:32 UTC that they were advertising Akamai's prefixes. However, Akamai observed that these prefixes were not appearing on the Internet, indicating that the provider was not in fact advertising them, and notified Provider 2 accordingly at 09:36 UTC.
At 13:46 UTC, Akamai asked Provider 2 again to advertise these prefixes to the Internet. In accordance with the previous day's plans, Akamai proceeded to reboot one of its Chennai edge routers at 16:48 UTC, failing over its traffic beforehand. This did not improve the issue. For completeness, the secondary Chennai edge router was also rebooted in the same fashion at 17:48 UTC. Akamai confirmed optimal route advertisement to Provider 2 at 18:08 UTC.
At 19:55 UTC, Akamai discovered a bad route (::/1) on the routing tables, prompting Akamai to reboot the routers containing this route. Once the routers were rebooted accordingly, the bad routes were dropped from the routing tables and the problem was fixed.
Post-mitigation analysis suggests that Akamai inherited the ::/1 route from Provider 1, which had inherited the route from one of its own providers but later dropped it. Despite this withdrawal, Akamai's systems retained the route until they were rebooted.
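To illustrate why a stray ::/1 route is so disruptive, the following is a minimal Python sketch using only the standard ipaddress module. It is not Akamai's tooling. The function names (suspicious_routes, longest_match), the sample routing table, and the /16 threshold are all assumptions chosen for illustration. The sketch shows that ::/1 contains the entire 2000::/3 global unicast range, that it is preferred over the ::/0 default route under longest-prefix matching, and how a simple check might flag such an overly broad prefix.

```python
import ipaddress

# Assumed threshold for illustration only: any non-default IPv6 route
# shorter than /16 is treated as suspicious.
MIN_PREFIX_LEN = 16


def suspicious_routes(prefixes, min_len=MIN_PREFIX_LEN):
    """Return non-default IPv6 prefixes whose length is shorter than min_len."""
    flagged = []
    for text in prefixes:
        net = ipaddress.ip_network(text)
        if net.version != 6 or net.prefixlen == 0:
            continue  # skip IPv4 entries and the expected default route
        if net.prefixlen < min_len:
            flagged.append(net)
    return flagged


def longest_match(destination, prefixes):
    """Pick the most specific prefix containing destination, or None."""
    addr = ipaddress.ip_address(destination)
    nets = [ipaddress.ip_network(p) for p in prefixes]
    matching = [n for n in nets if n.version == addr.version and addr in n]
    return max(matching, key=lambda n: n.prefixlen, default=None)


bad = ipaddress.ip_network("::/1")
global_unicast = ipaddress.ip_network("2000::/3")

# Hypothetical routing table: a default route, the bad route, one specific route.
table = ["::/0", "::/1", "2001:db8::/32"]

print(global_unicast.subnet_of(bad))      # True: ::/1 covers all of 2000::/3
print(longest_match("2400::1", table))    # ::/1 wins over the ::/0 default
print(suspicious_routes(table))           # [IPv6Network('::/1')]
```

Because ::/1 is more specific than the default route, a router holding it would send most global IPv6 traffic toward whichever next hop the stale route points to, which is consistent with the IPv6-only impact described above. A periodic check of this kind is one possible way to surface such a route before it requires a reboot to clear.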
To help prevent a recurrence, Akamai will work to improve network visibility across the organization so that problems such as this can be detected more readily. As a long-term strategy, Akamai is planning to connect the Chennai data center and other sites in India to the existing Akamai backbone network.