At 17:09 UTC on 24 January 2024, a new networking template was released to our Ashburn backbone routers. Twelve minutes later, at 17:21 UTC, we began to receive alerts pointing to an issue with network routes that was impacting connectivity within our Washington, DC (us-iad) region. These alerts included messages for high forward error rates. We began investigating, and at 17:31 UTC we decided to revert the networking template to the previous version. Connectivity issues continued while routing rules repaired themselves, with normal traffic resuming by 18:00 UTC.
After reviewing the error messages and logs, we identified two changes that had an unintended impact on how the template release handled firewall rules and routing advertisements. We determined that the issue with the template update stemmed largely from testing that did not adequately reproduce conditions unique to the Ashburn site, and so failed to identify the unintended impact prior to deployment.
Based on what we've learned, we've been reviewing the proposed changes in an improved test environment. This involves checking for the error messages recorded in our logs and those encountered during the incident. Moving forward, we will improve our testing by running more accurate simulations in our labs.