Device Information
System Model or SKU
iGPU 1002:1586 (rev c1) at 0000:c1:00.0
BIOS VERSION
03.04 (2025-11-19, Insyde)
DIY Edition information
- Memory: 128 GB on-package LPDDR5-8000 — Micron MT62F4G32D8DV-023 WT
- Storage: WD_BLACK SN850X 2 TB — firmware 620361WD
Port/Peripheral information
Not relevant to this bug — reproduces with no expansion cards beyond the network/display required to drive the workload.
Describe the bug
Under sustained GPU compute (local LLM inference), the iGPU wedges into a state the kernel driver never recovers from. Journal cuts off mid-line, the machine requires a hard power cycle, and on reboot EXT4 reports orphan cleanup on readonly fs + system.journal … corrupted or uncleanly shut down. /var/crash, /sys/fs/pstore, and /proc/sys/kernel/tainted are all empty/zero — the hang happens below the kernel logging layer and amdgpu's own hangcheck never fires.
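The absence of crash evidence can be confirmed in one pass after reboot; a minimal triage sketch using only standard Linux paths (nothing here is specific to this machine):

```shell
# Post-reboot triage: confirm the kernel left no crash evidence anywhere.
tainted=$(cat /proc/sys/kernel/tainted 2>/dev/null || echo "?")
echo "tainted=$tainted"                      # 0 = untainted kernel
for d in /var/crash /sys/fs/pstore; do
    [ -d "$d" ] || continue
    echo "$d: $(ls -A "$d" 2>/dev/null | wc -l) entries"
done
```

On the affected machine all three come back empty/zero, which is what places the hang below the kernel logging layer.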
Reproduces on two independent userspace GPU stacks:
- ROCm + vLLM (qwen3.6-35B-A3B-AWQ): wedges in 20–90 s under back-to-back prefill.
- Mesa RADV + llama.cpp (Vulkan, same GGUF model): wedges in 1–12 h of agentic load.
Both end in the same silent hang. In one llama.cpp variant (-b 4096 -ub 4096) the failure becomes amdgpu-visible: compute-ring timeout (comp_1.1.0 / comp_1.2.0), ring reset succeeds, [drm] device wedged, but recovered through reset, and a devcoredump is generated. I have a preserved devcoredump-20260421-141528.bin (6.1 MB) available on request.
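For reference, a devcoredump like the one above can be preserved via the kernel's standard sysfs interface while the recovered device is still up; a sketch (the output file name is illustrative, and note that the kernel deletes pending dumps after a timeout):

```shell
# Save any pending device coredumps before they age out.
# Reading /sys/class/devcoredump/devcd*/data drains the dump.
for d in /sys/class/devcoredump/devcd*; do
    [ -e "$d/data" ] || continue
    out="devcoredump-$(date +%Y%m%d-%H%M%S).bin"
    cat "$d/data" > "$out" && echo "saved $out"
done
```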
Consistent precursor in the silent-hang class: pcieport 0000:00:08.1: PME: Spurious native interrupt! — 2–3 events per crashing boot on ROCm, 0–3 on Vulkan. This is the same PME workaround path flagged in the forum report linked below.
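The per-boot precursor counts above come from grepping the previous boot's kernel log; a sketch, with a here-string standing in for `journalctl -k -b -1` output on the affected machine:

```shell
# Count PME precursor events per crashing boot. The sample log text below
# is a stand-in for real `journalctl -k -b -1` output.
log='pcieport 0000:00:08.1: PME: Spurious native interrupt!
amdgpu 0000:c1:00.0: unrelated line
pcieport 0000:00:08.1: PME: Spurious native interrupt!'
printf '%s\n' "$log" | grep -c 'PME: Spurious native interrupt'
```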
Steps To Reproduce
- Boot stock Ubuntu 26.04 LTS with amdgpu.cwsr_enable=0 (community-recommended mitigation for gfx1151; does not prevent the hang).
- Run any sustained GPU compute workload, e.g. vLLM serving a 30B+ AWQ/GGUF model with back-to-back requests, or llama.cpp llama-server under RADV with a long-context agentic client.
- Within seconds (ROCm) to hours (Vulkan), the GPU wedges and the machine has to be power-cycled.
Occasional conversational chats don't trigger it — the workload has to pin the GPU at 100 % for a sustained window.
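For reproducibility, the mitigation parameter was applied the usual Ubuntu way; a sketch of /etc/default/grub (the rest of the line is whatever the system already had):

```shell
# /etc/default/grub: append the community-suggested gfx1151 mitigation
# (kept in place for all runs above; it does not prevent the hang)
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amdgpu.cwsr_enable=0"
# then: sudo update-grub && sudo reboot
# verify after boot: grep -o 'amdgpu.cwsr_enable=0' /proc/cmdline
```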
Expected behavior
The GPU either recovers via a reset path visible to the kernel, or doesn't wedge at all.
Operating System
- Distribution: Ubuntu 26.04 LTS (codename resolute) — stock, no mainline PPAs
- Kernel: 7.0.0-14-generic (the shipping Ubuntu 26.04 LTS kernel)
- Host Mesa: 26.0.3-1ubuntu1 (mesa-vulkan-drivers)
- ROCm / amdgpu userspace for inference runs: kyuz0 strix-halo-vllm-toolboxes and amd-strix-halo-toolboxes:vulkan-radv containers (multiple tags tested)
Important: this is not an exotic setup. It's the stock 26.04 LTS kernel on current BIOS, with standard Ubuntu Mesa. The only non-stock piece is the GPU-firmware override described below, added in an attempt to work around the bug.
Additional context
Userspace ruled out
Both ROCm and Vulkan stacks wedge with the same end-state on the same kernel + firmware. The shared layer is: amdgpu kernel driver + MES firmware + SMU/PMFW + PCIe root complex 00:08.1 → c1:00.0 + SoC silicon. Userspace GPU stack is not the trigger.
Currently loaded GPU firmware
From /sys/kernel/debug/dri/0000:c1:00.0/amdgpu_firmware_info:
- MES: 0x00000086 (upstream linux-firmware 20260410 override in /lib/firmware/updates/amdgpu/)
- MES_KIQ: 0x0000006f
- SMC/PMFW: 0x0a640600 (100.6.0)
- DMCUB: 0x09003f00
- VCN: 0x09118010
- IMU: 0x0b352300
- RLC/RLCP: 0x11530506
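The SMC/PMFW word decodes to the dotted version shown in parentheses; the byte layout below is inferred from the single sample in this report (0x0a640600 → 100.6.0, with the top byte apparently a program/prefix field), so treat it as a guess rather than a documented format:

```shell
# Decode the PMFW version word into major.minor.patch.
# Byte layout inferred from this report's one sample; not authoritative.
v=$(( 0x0a640600 ))
printf '%d.%d.%d\n' $(( (v >> 16) & 255 )) $(( (v >> 8) & 255 )) $(( v & 255 ))
# prints: 100.6.0
```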
Firmware tried
- Distro linux-firmware 20250901 (ships with 26.04 LTS) — hangs.
- AMD amdgpu-dkms-firmware 30.30.1 (MES 0x83) — adds a distinct NCCL-init page fault (ROCm#5991 signature), sustained-compute hang still present.
- Upstream linux-firmware 20260410 override (the values above) — hang still present.
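For anyone reproducing the override: files under /lib/firmware/updates/ shadow the same paths under /lib/firmware/, so the stock firmware stays untouched. A sketch of the layout, runnable unprivileged with $root standing in for / and an illustrative gfx1151 MES file name:

```shell
# Firmware override layout sketch. $root stands in for / so this runs
# unprivileged; the MES file name is illustrative, not authoritative.
root=$(mktemp -d)
mkdir -p "$root/lib/firmware/updates/amdgpu"
: > "$root/lib/firmware/updates/amdgpu/gc_11_5_1_mes_2.bin"  # placeholder
ls "$root/lib/firmware/updates/amdgpu"
# on the real system: sudo update-initramfs -u && reboot
```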
Firmware version bumps alone have not changed the outcome, so the fix does not appear to live at the kernel-firmware layer.
Related upstream reports
Possibly the same bug from a different trigger path
Framework community forum report, Fedora 43 + Strix Halo: SMU deadlock in dcn35_smu_enable_pme_wa triggered by GPU-accelerated browser workloads → ring timeouts → MES failures → full reset.
https://community.frame.work/t/smu-deadlock-system-freeze-on-fedora-43/81795
That report's trigger is display/teardown and the dcn35_smu_enable_pme_wa code path; mine is sustained compute. But both reach the same end-state on the same silicon + BIOS, and the PME WA is the same path my pcieport … PME: Spurious precursor points at. Plausibly one underlying SMU/PMFW bug reachable from two driver paths.