Between 18:14 and 18:20 UTC, two switches in our Newark data center powered off automatically. Investigation determined that the shutdowns were caused by high temperatures affecting those switches.
The data center provided temporary replacement cooling for these switches before they were power cycled. However, at around 20:55 UTC, the data center's field technicians accidentally powered off the wrong switch, which handled connectivity for Linode's management network. Because of a redundancy failure in the network architecture, this prevented Linodes in Newark from processing new jobs (power changes, Cloud Firewall updates, etc.).
This management switch was powered back on at 21:20 UTC, but it came back without a configuration in place, and its previous console connection was no longer available. Linode's network technicians coordinated with remote hands to establish a new console connection so the switch could be supplied with a working configuration. The configuration was applied successfully at 23:07 UTC, restoring management network connectivity.
We have fast-tracked a project to remediate the network design and prevent this redundancy failure from recurring. Additionally, we will implement monitoring to alert us when network configuration changes have not been made permanent.
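One way such monitoring can work (a minimal sketch, not a description of Linode's actual implementation): periodically fetch each switch's running configuration and its saved startup configuration, and alert if they differ. The hostname and sample configurations below are hypothetical; in practice the configurations would be pulled over the device's management CLI or API.

```python
import difflib

def unsaved_changes(running_config: str, startup_config: str) -> list[str]:
    """Return the diff between startup and running configuration.

    An empty list means all changes have been made permanent; any
    output indicates unsaved changes that would be lost on a power
    cycle, which should trigger an alert.
    """
    return list(
        difflib.unified_diff(
            startup_config.splitlines(),
            running_config.splitlines(),
            fromfile="startup-config",
            tofile="running-config",
            lineterm="",
        )
    )

# Hypothetical example: a VLAN was added to the running config
# but never written to the startup config.
startup = "hostname sw-mgmt-1\nvlan 10\n"
running = "hostname sw-mgmt-1\nvlan 10\nvlan 20\n"

if unsaved_changes(running, startup):
    print("ALERT: unsaved configuration changes on sw-mgmt-1")
```

Running this check on a schedule would have flagged the management switch's unsaved configuration well before a power cycle erased it.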