Linode Status

Service Issue: RTX 4000 Ada GPU Errors Across Multiple Regions

Incident Report for Linode

Postmortem

Starting at approximately 21:00 UTC on March 4, 2026, customers using NVIDIA RTX 4000 Ada GPU-backed Linodes began experiencing lockups. At first, the issue was believed to be isolated to worker nodes on the Linode Kubernetes Engine (LKE) platform, but it was later confirmed to impact all Linodes using this hardware. Standard Compute instances and GPU instances on other hardware were unaffected.
After ruling out recent software releases, our subject matter experts isolated the root cause to a recently updated telemetry script. During a routine system improvement initiative, our teams had repaired a broken legacy monitoring script to restore a missing metric on our internal observability dashboards. The script, originally written for an earlier GPU generation, issued a firmware inspection query whose side effects were not apparent from the scope of the fix. On the RTX 4000 Ada architecture, this class of query against an active GPU triggers a race condition in the GPU System Processor (GSP), causing the GPU to enter a protective lockup state and become unavailable to running workloads.
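For illustration only, the failure mode resembles the sketch below: a minimal NVML-based telemetry probe, written here against the pynvml Python bindings. The metric selection and the InfoROM firmware query are assumptions made for this example; this is not the actual script involved in the incident.

    # Hypothetical sketch of a GPU telemetry probe (illustrative only).
    # Requires the pynvml package (NVML bindings).
    import pynvml

    def collect_metrics() -> None:
        pynvml.nvmlInit()
        try:
            for i in range(pynvml.nvmlDeviceGetCount()):
                handle = pynvml.nvmlDeviceGetHandleByIndex(i)

                # Routine runtime metrics: generally safe to poll on a busy GPU.
                util = pynvml.nvmlDeviceGetUtilizationRates(handle)
                temp = pynvml.nvmlDeviceGetTemperature(
                    handle, pynvml.NVML_TEMPERATURE_GPU)
                print(f"gpu{i} util={util.gpu}% temp={temp}C")

                # A firmware/InfoROM inspection query of the class described in
                # the postmortem. Harmless on older generations, but per the
                # incident, this kind of query against an active RTX 4000 Ada
                # GPU can race with the GPU System Processor (GSP).
                inforom = pynvml.nvmlDeviceGetInforomImageVersion(handle)
                print(f"gpu{i} inforom={inforom}")
        finally:
            pynvml.nvmlShutdown()

    if __name__ == "__main__":
        collect_metrics()
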
We disabled the monitoring script across the GPU fleet and rebooted affected nodes to restore service. The issue was fully mitigated by approximately 17:16 UTC on March 5, 2026.
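As a rough sketch of the shape of that mitigation, assuming the probe ran as a systemd unit (the unit name below is hypothetical):

    # Hypothetical per-host mitigation sketch (illustrative only).
    import subprocess

    TELEMETRY_UNIT = "gpu-telemetry.timer"  # hypothetical unit name

    def mitigate_host() -> None:
        # Stop the probe immediately and keep it from starting on boot.
        subprocess.run(
            ["systemctl", "disable", "--now", TELEMETRY_UNIT], check=True)
        # Rebooting clears the GSP lockup state and returns the GPU to service.
        subprocess.run(["systemctl", "reboot"], check=True)
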
We sincerely apologize for the disruption this caused to your GPU-accelerated applications and services. We will take appropriate corrective and preventive actions to guard against recurrence.
This summary provides an overview of our current understanding of the incident given the information available. Our investigation is ongoing, and any information herein is subject to change.

Posted Mar 09, 2026 - 01:32 UTC

Resolved

We haven’t observed any additional issues with the service, and will now consider this incident resolved. If you continue to experience problems, please open a Support ticket for assistance.
Posted Mar 05, 2026 - 18:11 UTC

Monitoring

At this time we have been able to correct the issues affecting the service. We will be monitoring this to ensure that it remains stable. If you continue to experience problems, please open a Support ticket for assistance.
Posted Mar 05, 2026 - 17:01 UTC

Identified

Our team has identified the issue affecting the service. We are working quickly to implement a fix, and we will provide an update as soon as the solution is in place.
Posted Mar 05, 2026 - 16:23 UTC

Update

We are continuing to investigate and will provide the next update as progress is made.
Posted Mar 05, 2026 - 14:48 UTC

Investigating

We are aware of a recurrence of this issue across multiple regions. We are continuing to investigate and will provide the next update as progress is made.
Posted Mar 05, 2026 - 10:50 UTC

Monitoring

Our team has identified the issue affecting the service and implemented a fix. We will be monitoring this to ensure that it remains stable. If you continue to experience problems, please open a Support ticket for assistance.
Posted Mar 05, 2026 - 07:34 UTC

Update

We are continuing to investigate the issue. We will provide the next update as progress is made.
Posted Mar 05, 2026 - 06:55 UTC

Update

Our subject matter experts are actively investigating the issue. We will provide the next update as progress is made.
Posted Mar 05, 2026 - 05:48 UTC

Investigating

We are investigating a critical service issue affecting NVIDIA RTX 4000 Ada GPU nodes across multiple regions, including Osaka (osa1), Seattle (sea1), and Chicago (ord1).
Affected GPU nodes may report an unrecoverable error state, leading to failures in Vulkan initialization and GPU-accelerated workloads. Additionally, some LKE clusters in the Osaka region are currently experiencing control plane connectivity issues, resulting in timed-out API requests and errors.
Our engineering teams are currently investigating the root cause, focusing on a potential regression in the underlying host hypervisor or GPU firmware. We will provide more information as it becomes available.
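For customers assessing whether a node is affected, a check along the following lines can help distinguish a locked-up GPU from an application-level fault. The use of pynvml and the vulkaninfo utility here is an assumption for illustration, not an officially supported diagnostic.

    # Hypothetical node health check (illustrative only).
    import subprocess
    import pynvml

    def gpu_is_lost(index: int = 0) -> bool:
        """Return True if NVML reports the device as lost (locked up)."""
        pynvml.nvmlInit()
        try:
            handle = pynvml.nvmlDeviceGetHandleByIndex(index)
            pynvml.nvmlDeviceGetUtilizationRates(handle)
            return False
        except pynvml.NVMLError as err:
            # A locked-up device surfaces as NVML_ERROR_GPU_IS_LOST.
            return err.value == pynvml.NVML_ERROR_GPU_IS_LOST
        finally:
            pynvml.nvmlShutdown()

    def vulkan_initializes() -> bool:
        # vulkaninfo exits non-zero when Vulkan instance creation fails.
        result = subprocess.run(
            ["vulkaninfo", "--summary"], capture_output=True)
        return result.returncode == 0

    if __name__ == "__main__":
        print(f"gpu lost: {gpu_is_lost()}  vulkan ok: {vulkan_initializes()}")
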
Posted Mar 05, 2026 - 02:07 UTC
This incident affected: Linode Kubernetes Engine in US-ORD (Chicago), US-SEA (Seattle), JP-OSA (Osaka), and SG-SIN-2 (Singapore 2).