At 19:27 UTC on July 27, 2023, changes to routing policies in our Chicago data center began rolling out. These changes were intended to update the networking configuration in that location and had previously been tested successfully at other sites. Due to considerations specific to Chicago, the changes unintentionally caused a drop in network traffic upon their completion at 19:27 UTC, triggering a flood of alerts in our monitoring system as well as multiple reports of connectivity issues from affected customers. Upon seeing the sudden impact, our network administrators immediately began investigating the source of the issue.
At 20:47 UTC, the implemented changes were determined to be the source of the incident, and a rollback was quickly initiated. The rollback resulted in a rapid recovery of networking performance, and customers confirmed that the situation had improved. Our incident response team was formally organized at this same time and began investigating why these changes caused the event. Although the impact of the incident appeared to have been quickly mitigated, a status page was created at 20:22 UTC to inform customers of the issue.
At 20:50 UTC, traffic levels were continuing to improve, and the issue was placed in a monitoring status to confirm that no further impact occurred. The incident was considered resolved at 21:02 UTC, once no additional alerts had been seen for an extended period and our administrators had determined that no other configurations had been negatively impacted by the rollback.
Our administrators have identified the root cause of this incident and the location-specific adjustments to the intended changes that are required to prevent further disruptions during their required implementation.