Incident Summary
At approximately 15:56 UTC on August 20th, 2018, our monitoring systems alerted us to multiple network ports going down in one of our Newark, NJ block storage clusters. Due to redundancy within the cluster, there was no impact to block storage at this time. After troubleshooting, engineers from the network and storage teams determined the problem was linked to a bad line card on one of the redundant switches in the block storage cluster. Customers utilizing block storage would not have seen any impact to their storage services during this period.

At 17:15 UTC, a ticket was opened with our data center partner to move the bad line card from the impacted switch into a spare switch chassis for further troubleshooting. At 17:59 UTC, our monitoring systems reported many block storage nodes going completely offline. Our engineers determined that our data center partner had removed a line card from the healthy, redundant block storage switch rather than the impacted one, which in turn severed network connectivity entirely for several block storage nodes. With those nodes offline, the cluster began rebalancing data to restore redundancy, and client IO is not a priority during a rebalance, so customers would have seen severe impacts to the performance of their block storage services.

A call was immediately placed to our data center partner to have them reinsert the line card into the redundant chassis. Once that was completed, the block storage servers regained network connectivity at 18:42 UTC and client IO started to recover. Customers would have seen performance for their storage services begin to improve at this time. The cluster reported a full recovery at 19:56 UTC, and no customer data was lost during this event.
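The rebalance behavior above is the key to the customer impact: when nodes drop out, the cluster re-replicates data to restore redundancy, and recovery operations are scheduled ahead of client requests. The sketch below illustrates that scheduling idea with a simple priority queue. It is a hypothetical model, not our storage stack; the names (`Op`, `OpQueue`) and priority values are invented for illustration.

```python
import heapq
import itertools
from dataclasses import dataclass, field

# Hypothetical priorities (not our production values): a lower number is
# served first, so recovery work outranks client IO during a rebalance.
RECOVERY_PRIORITY = 1
CLIENT_PRIORITY = 10

@dataclass(order=True)
class Op:
    priority: int
    seq: int                      # FIFO tiebreaker within a priority level
    name: str = field(compare=False)

class OpQueue:
    """Sketch of an IO scheduler that always pops the highest-priority op."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()

    def submit_client(self, name):
        heapq.heappush(self._heap, Op(CLIENT_PRIORITY, next(self._seq), name))

    def submit_recovery(self, name):
        heapq.heappush(self._heap, Op(RECOVERY_PRIORITY, next(self._seq), name))

    def pop(self):
        return heapq.heappop(self._heap)

    def __bool__(self):
        return bool(self._heap)

q = OpQueue()
q.submit_client("client read, volume A")
q.submit_recovery("re-replicate chunk 42")   # arrives after the client read
q.submit_client("client write, volume B")

while q:
    print(q.pop().name)
# re-replicate chunk 42   <- recovery drains first; client IO waits behind it
# client read, volume A
# client write, volume B
```

Under a scheduler like this, a large enough backlog of recovery operations starves client IO until the rebalance completes, which matches the severe but temporary performance impact customers experienced.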
Summary Timeline
15:56 UTC - Monitoring alerts on multiple network ports down in a Newark, NJ block storage cluster; cluster redundancy prevents any customer impact.
17:15 UTC - Ticket opened with our data center partner to move the bad line card from the impacted switch into a spare chassis for further troubleshooting.
17:59 UTC - Data center partner removes a line card from the healthy redundant switch; many block storage nodes go completely offline and client IO is severely impacted during the resulting rebalance.
18:42 UTC - Line card reinserted into the redundant chassis; block storage servers regain network connectivity and client IO begins to recover.
19:56 UTC - Cluster reports full recovery; no customer data lost.
Next Steps
After a thorough review of our processes for communicating with data center partners, we have added another step for work on production network equipment. When a DC partner is tasked with any work on production equipment that could impact customers, we will now require the partner to be live on the phone with us. The expected result is that if a mistake is made, the rollback will be implemented in a much shorter period of time.