
Framework Desktop (Ryzen AI MAX+ 395 / gfx1151): silent GPU hard hang under sustained compute on stock Ubuntu 26.04 LTS — suspected SMU/MES/PMFW firmware bug #206

@Lafunamor

Description


Device Information

System Model or SKU

  • Framework Desktop (AMD Ryzen™ AI 300 PRO Series) — Ryzen AI MAX+ 395, Radeon 8060S (gfx1151, PCI 1002:1586 rev c1 at 0000:c1:00.0)

BIOS VERSION

03.04 (2025-11-19, Insyde)

DIY Edition information

  • Memory: 128 GB on-package LPDDR5-8000 — Micron MT62F4G32D8DV-023 WT
  • Storage: WD_BLACK SN850X 2 TB — firmware 620361WD

Port/Peripheral information

Not relevant to this bug — reproduces with no expansion cards beyond the network/display required to drive the workload.

Describe the bug

Under sustained GPU compute (local LLM inference), the iGPU wedges into a state the kernel driver never recovers from. The journal cuts off mid-line, the machine requires a hard power cycle, and on reboot EXT4 reports orphan cleanup on a read-only fs plus system.journal … corrupted or uncleanly shut down. /var/crash and /sys/fs/pstore are empty and /proc/sys/kernel/tainted reads zero — the hang happens below the kernel logging layer, and amdgpu's own hang detection never fires.
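The absence-of-evidence checks above can be rerun after any hard reboot with standard paths; nothing here is specific to this machine (a sketch — adjust the journal boot offset as needed):

```shell
# Post-reboot triage: confirm the hang left no trace in the usual crash channels.
ls -A /var/crash 2>/dev/null || true       # apport/kdump artifacts (empty in this report)
ls -A /sys/fs/pstore 2>/dev/null || true   # persistent-store oops/panic records (empty here)
tainted=$(cat /proc/sys/kernel/tainted 2>/dev/null || echo "n/a")
echo "kernel taint: $tainted"              # 0 = kernel never flagged anything
journalctl -b -1 -n 20 --no-pager 2>/dev/null || true   # tail of the previous boot's log
```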

Reproduces on two independent userspace GPU stacks:

  1. ROCm + vLLM (qwen3.6-35B-A3B-AWQ): wedges in 20–90 s under back-to-back prefill.
  2. Mesa RADV + llama.cpp (Vulkan, same GGUF model): wedges in 1–12 h of agentic load.

Both end in the same silent hang. In one llama.cpp variant (-b 4096 -ub 4096) the failure becomes amdgpu-visible: compute-ring timeout (comp_1.1.0 / comp_1.2.0), ring reset succeeds, [drm] device wedged, but recovered through reset, and a devcoredump is generated. I have a preserved devcoredump-20260421-141528.bin (6.1 MB) available on request.
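For the amdgpu-visible variant, the dump was pulled through the kernel's standard devcoredump sysfs interface (a sketch; the devcd node name varies and the dump is auto-released after a timeout, so it has to be caught promptly):

```shell
# Save a pending GPU devcoredump, if any (the class device appears after a GPU reset).
d=$(ls -d /sys/class/devcoredump/devcd* 2>/dev/null | head -n 1)
if [ -n "$d" ]; then
  out="devcoredump-$(date +%Y%m%d-%H%M%S).bin"
  cat "$d/data" > "$out"
  echo 1 > "$d/data"     # writing to the node releases the dump
  echo "saved $out"
else
  echo "no pending devcoredump"
fi
```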

Consistent precursor in the silent-hang class: pcieport 0000:00:08.1: PME: Spurious native interrupt! — 2–3 events per crashing boot on ROCm, 0–3 on Vulkan. This is the same PME workaround path flagged in the forum report linked below.
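The precursor counts come from grepping the kernel log of each crashing boot (a sketch assuming systemd-journald; boot offsets count backwards from the current boot):

```shell
# Count spurious-PME precursor events in the current and two previous boots.
for b in 0 -1 -2; do
  n=$(journalctl -k -b "$b" --no-pager 2>/dev/null | grep -c 'PME: Spurious native interrupt')
  echo "boot $b: $n spurious PME event(s)"
done
```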

Steps To Reproduce

  1. Boot stock Ubuntu 26.04 LTS with amdgpu.cwsr_enable=0 (community-recommended mitigation for gfx1151; does not prevent the hang).
  2. Run any sustained GPU compute workload, e.g. vLLM serving a 30B+ AWQ/GGUF model with back-to-back requests, or llama.cpp llama-server under RADV with a long-context agentic client.
  3. Within seconds (ROCm) to hours (Vulkan), the GPU wedges and the machine has to be power-cycled.
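For completeness, the mitigation in step 1 was applied as a kernel command-line parameter (a config sketch; verify the parameter actually took effect after reboot):

```shell
# /etc/default/grub (then: sudo update-grub && reboot)
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amdgpu.cwsr_enable=0"

# verify after reboot:
#   cat /sys/module/amdgpu/parameters/cwsr_enable    # expect 0
```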

Occasional conversational chats don't trigger it — the workload has to pin the GPU at 100 % for a sustained window.

Expected behavior

The GPU either recovers via a reset path visible to the kernel, or doesn't wedge at all.

Operating System

  • Distribution: Ubuntu 26.04 LTS (codename resolute) — stock, no mainline PPAs
  • Kernel: 7.0.0-14-generic (the shipping Ubuntu 26.04 LTS kernel)
  • Host Mesa: 26.0.3-1ubuntu1 (mesa-vulkan-drivers)
  • ROCm / amdgpu userspace for inference runs: kyuz0 strix-halo-vllm-toolboxes and amd-strix-halo-toolboxes:vulkan-radv containers (multiple tags tested)

Important: this is not an exotic setup. It's the stock 26.04 LTS kernel on current BIOS, with standard Ubuntu Mesa. The only non-stock piece is the GPU-firmware override described below, added in an attempt to work around the bug.

Additional context

Userspace ruled out

Both ROCm and Vulkan stacks wedge with the same end-state on the same kernel + firmware. The shared layer is: amdgpu kernel driver + MES firmware + SMU/PMFW + PCIe root complex 00:08.1 → c1:00.0 + SoC silicon. Userspace GPU stack is not the trigger.

Currently loaded GPU firmware

From /sys/kernel/debug/dri/0000:c1:00.0/amdgpu_firmware_info:

  • MES: 0x00000086 (upstream linux-firmware 20260410 override in /lib/firmware/updates/amdgpu/)
  • MES_KIQ: 0x0000006f
  • SMC/PMFW: 0x0a640600 (100.6.0)
  • DMCUB: 0x09003f00
  • VCN: 0x09118010
  • IMU: 0x0b352300
  • RLC/RLCP: 0x11530506
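The raw words come straight from that debugfs file. The (100.6.0) reading of the SMC/PMFW word matches a one-byte-per-field program/major/minor/patch packing — an assumption about the layout, sketched here:

```shell
# Decode 0x0a640600 assuming one byte each: program, major, minor, patch.
v=0x0a640600
printf 'program %d, version %d.%d.%d\n' \
  $(( (v >> 24) & 0xff )) $(( (v >> 16) & 0xff )) $(( (v >> 8) & 0xff )) $(( v & 0xff ))
# -> program 10, version 100.6.0
```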

Firmware tried

  • Distro linux-firmware 20250901 (ships with 26.04 LTS) — hangs.
  • AMD amdgpu-dkms-firmware 30.30.1 (MES 0x83) — adds a distinct NCCL-init page fault (ROCm#5991 signature), sustained-compute hang still present.
  • Upstream linux-firmware 20260410 override (the values above) — hang still present.

Bumping the firmware at the kernel-firmware layer has not changed the outcome; the hang reproduces across every revision tried.
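The 20260410 override relies on the loader's standard search order: the kernel's firmware loader consults /lib/firmware/updates/ before /lib/firmware/, so a copy there wins without touching the distro package. A dry-runnable sketch (blob names assumed for gfx1151 / GC 11.5.1; on the real machine, run as root with DESTDIR empty and then rebuild the initramfs):

```shell
# Stage override blobs; DESTDIR lets the steps be rehearsed without root.
dst="${DESTDIR:-/tmp/fw-override-demo}/lib/firmware/updates/amdgpu"
mkdir -p "$dst"
cp -v linux-firmware/amdgpu/gc_11_5_1_*.bin "$dst"/ 2>/dev/null \
  || echo "no gc_11_5_1_*.bin blobs in ./linux-firmware/amdgpu (dry run)"
# On the target: update-initramfs -u   # so early-loaded firmware is replaced too
```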

Related upstream reports

Possibly the same bug from a different trigger path

Framework community forum report, Fedora 43 + Strix Halo: SMU deadlock in dcn35_smu_enable_pme_wa triggered by GPU-accelerated browser workloads → ring timeouts → MES failures → full reset.
https://community.frame.work/t/smu-deadlock-system-freeze-on-fedora-43/81795

That report's trigger is display/teardown and the dcn35_smu_enable_pme_wa code path; mine is sustained compute. But both reach the same end-state on the same silicon + BIOS, and the PME WA is the same path my pcieport … PME: Spurious precursor points at. Plausibly one underlying SMU/PMFW bug reachable from two driver paths.
