On 12-16-2020, at approximately 11:45 UTC, our monitoring systems alerted us to a connectivity issue in the Singapore data center. They showed a large drop in traffic to and from the data center, but no clear cause was apparent, so the Network Operations team continued to troubleshoot. In our Singapore data center, we have a core switch pair in an active/active redundant configuration. The team was able to isolate the issue to the B-side switch, which had stopped forwarding traffic but had not triggered a failover.
The team then forced a failover to the A-side switch, which immediately restored full connectivity to the data center. The Network Operations team then proactively rebooted the B-side switch.
When the B-side switch completed its power cycle, its diagnostics checked out, so it was returned to service. The Network Operations team engaged the vendor to investigate the cause of the issue, but nothing appeared out of the ordinary other than the fact that the switch had been discarding traffic.
On 12-18-2020 at 16:32 UTC, our monitoring systems again alerted us to connectivity issues in the Singapore data center. This time, though, the A-side switch had stopped forwarding traffic.
A forced failover and power cycle of the affected switch brought the data center back to full service. We continued to work with the vendor, but we were still unable to determine why these switches had stopped forwarding traffic.
We experienced the same issue two more times, on 12-20-2020 at 11:01 UTC, and again on 12-21-2020 at 16:38 UTC.
Because our investigation with the vendor had not produced a clear solution, we began operating under the premise that this was an unidentified bug in the software version the switches were running. Our team decided to schedule an emergency maintenance and proactively upgrade the software of each switch to another recommended version.
On 12-21-2020 at 21:00 UTC, an emergency maintenance was performed to upgrade the software of each switch. This maintenance was completed without downtime.
We have not experienced any technical issues since the completion of this maintenance, and we believe this issue to be fully resolved. We are still working with the vendor to identify the underlying bug, but a final cause has yet to be determined.