
Framework Desktop (Ryzen AI MAX+ 395 / gfx1151): silent GPU hard hang under sustained compute on stock Ubuntu 26.04 LTS — suspected SMU/MES/PMFW firmware bug #206

@Lafunamor

Description


Device Information

System Model or SKU

  • Framework Desktop (AMD Ryzen™ AI 300 PRO Series) — Ryzen AI MAX+ 395, Radeon 8060S (gfx1151, PCI 1002:1586 rev c1 at 0000:c1:00.0)

BIOS VERSION

03.04 (2025-11-19, Insyde)

DIY Edition information

  • Memory: 128 GB on-package LPDDR5-8000 — Micron MT62F4G32D8DV-023 WT
  • Storage: WD_BLACK SN850X 2 TB — firmware 620361WD

Port/Peripheral information

Not relevant to this bug — reproduces with no expansion cards beyond the network/display required to drive the workload.

Describe the bug

Under sustained GPU compute (local LLM inference), the iGPU wedges into a state the kernel driver never recovers from. The journal cuts off mid-line, the machine requires a hard power cycle, and on reboot EXT4 reports orphan cleanup on a read-only fs plus system.journal … corrupted or uncleanly shut down. /var/crash and /sys/fs/pstore are empty and /proc/sys/kernel/tainted reads zero — the hang happens below the kernel logging layer, and amdgpu's own hang detection never fires.
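The absence-of-evidence checks above can be rerun after any hard reboot with standard paths; nothing here is specific to this machine (a sketch — adjust the journal boot offset as needed):

```shell
# Post-reboot triage: confirm the hang left no trace in the usual crash channels.
ls -A /var/crash 2>/dev/null || true       # apport/kdump artifacts (empty in this report)
ls -A /sys/fs/pstore 2>/dev/null || true   # persistent-store oops/panic records (empty here)
tainted=$(cat /proc/sys/kernel/tainted 2>/dev/null || echo "n/a")
echo "kernel taint: $tainted"              # 0 = kernel never flagged anything
journalctl -b -1 -n 20 --no-pager 2>/dev/null || true   # tail of the previous boot's log
```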

Reproduces on two independent userspace GPU stacks:

  1. ROCm + vLLM (qwen3.6-35B-A3B-AWQ): wedges in 20–90 s under back-to-back prefill.
  2. Mesa RADV + llama.cpp (Vulkan, same GGUF model): wedges in 1–12 h of agentic load.

Both end in the same silent hang. In one llama.cpp variant (-b 4096 -ub 4096) the failure becomes amdgpu-visible: compute-ring timeout (comp_1.1.0 / comp_1.2.0), ring reset succeeds, [drm] device wedged, but recovered through reset, and a devcoredump is generated. I have a preserved devcoredump-20260421-141528.bin (6.1 MB) available on request.
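For the amdgpu-visible variant, the dump was pulled through the kernel's standard devcoredump sysfs interface (a sketch; the devcd node name varies and the dump is auto-released after a timeout, so it has to be caught promptly):

```shell
# Save a pending GPU devcoredump, if any (the class device appears after a GPU reset).
d=$(ls -d /sys/class/devcoredump/devcd* 2>/dev/null | head -n 1)
if [ -n "$d" ]; then
  out="devcoredump-$(date +%Y%m%d-%H%M%S).bin"
  cat "$d/data" > "$out"
  echo 1 > "$d/data"     # writing to the node releases the dump
  echo "saved $out"
else
  echo "no pending devcoredump"
fi
```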

Consistent precursor in the silent-hang class: pcieport 0000:00:08.1: PME: Spurious native interrupt! — 2–3 events per crashing boot on ROCm, 0–3 on Vulkan. This is the same PME workaround path flagged in the forum report linked below.
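The precursor counts come from grepping the kernel log of each crashing boot (a sketch assuming systemd-journald; boot offsets count backwards from the current boot):

```shell
# Count spurious-PME precursor events in the current and two previous boots.
for b in 0 -1 -2; do
  n=$(journalctl -k -b "$b" --no-pager 2>/dev/null | grep -c 'PME: Spurious native interrupt')
  echo "boot $b: $n spurious PME event(s)"
done
```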

Steps To Reproduce

  1. Boot stock Ubuntu 26.04 LTS with amdgpu.cwsr_enable=0 (community-recommended mitigation for gfx1151; does not prevent the hang).
  2. Run any sustained GPU compute workload, e.g. vLLM serving a 30B+ AWQ/GGUF model with back-to-back requests, or llama.cpp llama-server under RADV with a long-context agentic client.
  3. Within seconds (ROCm) to hours (Vulkan), the GPU wedges and the machine has to be power-cycled.
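For completeness, the mitigation in step 1 was applied as a kernel command-line parameter (a config sketch; verify the parameter actually took effect after reboot):

```shell
# /etc/default/grub (then: sudo update-grub && reboot)
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amdgpu.cwsr_enable=0"

# verify after reboot:
#   cat /sys/module/amdgpu/parameters/cwsr_enable    # expect 0
```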

Occasional conversational chats don't trigger it — the workload has to pin the GPU at 100 % for a sustained window.

Expected behavior

The GPU either recovers via a reset path visible to the kernel, or doesn't wedge at all.

Operating System

  • Distribution: Ubuntu 26.04 LTS (codename resolute) — stock, no mainline PPAs
  • Kernel: 7.0.0-14-generic (the shipping Ubuntu 26.04 LTS kernel)
  • Host Mesa: 26.0.3-1ubuntu1 (mesa-vulkan-drivers)
  • ROCm / amdgpu userspace for inference runs: kyuz0 strix-halo-vllm-toolboxes and amd-strix-halo-toolboxes:vulkan-radv containers (multiple tags tested)

Important: this is not an exotic setup. It's the stock 26.04 LTS kernel on current BIOS, with standard Ubuntu Mesa. The only non-stock piece is the GPU-firmware override described below, added in an attempt to work around the bug.

Additional context

Userspace ruled out

Both ROCm and Vulkan stacks wedge with the same end-state on the same kernel + firmware. The shared layer is: amdgpu kernel driver + MES firmware + SMU/PMFW + PCIe root complex 00:08.1 → c1:00.0 + SoC silicon. Userspace GPU stack is not the trigger.

Currently loaded GPU firmware

From /sys/kernel/debug/dri/0000:c1:00.0/amdgpu_firmware_info:

  • MES: 0x00000086 (upstream linux-firmware 20260410 override in /lib/firmware/updates/amdgpu/)
  • MES_KIQ: 0x0000006f
  • SMC/PMFW: 0x0a640600 (100.6.0)
  • DMCUB: 0x09003f00
  • VCN: 0x09118010
  • IMU: 0x0b352300
  • RLC/RLCP: 0x11530506
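The raw words come straight from that debugfs file. The (100.6.0) reading of the SMC/PMFW word matches a one-byte-per-field program/major/minor/patch packing — an assumption about the layout, sketched here:

```shell
# Decode 0x0a640600 assuming one byte each: program, major, minor, patch.
v=0x0a640600
printf 'program %d, version %d.%d.%d\n' \
  $(( (v >> 24) & 0xff )) $(( (v >> 16) & 0xff )) $(( (v >> 8) & 0xff )) $(( v & 0xff ))
# -> program 10, version 100.6.0
```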

Firmware tried

  • Distro linux-firmware 20250901 (ships with 26.04 LTS) — hangs.
  • AMD amdgpu-dkms-firmware 30.30.1 (MES 0x83) — adds a distinct NCCL-init page fault (ROCm#5991 signature), sustained-compute hang still present.
  • Upstream linux-firmware 20260410 override (the values above) — hang still present.

Bumping the firmware at the kernel-firmware layer has not changed the outcome; the hang reproduces across every revision tried.
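The 20260410 override relies on the loader's standard search order: the kernel's firmware loader consults /lib/firmware/updates/ before /lib/firmware/, so a copy there wins without touching the distro package. A dry-runnable sketch (blob names assumed for gfx1151 / GC 11.5.1; on the real machine, run as root with DESTDIR empty and then rebuild the initramfs):

```shell
# Stage override blobs; DESTDIR lets the steps be rehearsed without root.
dst="${DESTDIR:-/tmp/fw-override-demo}/lib/firmware/updates/amdgpu"
mkdir -p "$dst"
cp -v linux-firmware/amdgpu/gc_11_5_1_*.bin "$dst"/ 2>/dev/null \
  || echo "no gc_11_5_1_*.bin blobs in ./linux-firmware/amdgpu (dry run)"
# On the target: update-initramfs -u   # so early-loaded firmware is replaced too
```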

Related upstream reports

Possibly the same bug from a different trigger path

Framework community forum report, Fedora 43 + Strix Halo: SMU deadlock in dcn35_smu_enable_pme_wa triggered by GPU-accelerated browser workloads → ring timeouts → MES failures → full reset.
https://community.frame.work/t/smu-deadlock-system-freeze-on-fedora-43/81795

That report's trigger is display/teardown and the dcn35_smu_enable_pme_wa code path; mine is sustained compute. But both reach the same end-state on the same silicon + BIOS, and the PME WA is the same path my pcieport … PME: Spurious precursor points at. Plausibly one underlying SMU/PMFW bug reachable from two driver paths.
