Starting at approximately 21:00 UTC on March 4, 2026, customers using NVIDIA RTX 4000 Ada GPU-backed Linodes began experiencing lockups. The issue was initially believed to be isolated to worker nodes on the Linode Kubernetes Engine (LKE) platform, but was later confirmed to impact all Linodes using this hardware. Standard Compute instances and GPU instances that do not use the RTX 4000 Ada were unaffected.
After ruling out recent software releases, our subject matter experts isolated the root cause to a recently deployed telemetry script. During a routine system improvement initiative, our teams found that a legacy GPU monitoring script had stopped reporting correct metrics, which left a metric missing from our internal observability dashboards, and updated the script to restore it to a working state. The script was originally written for an earlier GPU generation and issued a firmware inspection query, a behavior that was not apparent from the scope of the fix. On the RTX 4000 Ada architecture, this class of query against an active GPU triggers a race condition in the GPU System Processor (GSP), causing the GPU to enter a protective lockup state and become unavailable to running workloads.
To mitigate the impact, we disabled the monitoring script across the GPU fleet and rebooted the nodes. The issue was fully mitigated at approximately 17:16 UTC on March 5, 2026.
We sincerely apologize for the disruption this caused to your GPU-accelerated applications and services. We will take appropriate measures to prevent a recurrence.
This summary provides an overview of our current understanding of the incident given the information available. Our investigation is ongoing, and any information herein is subject to change.