feat: AAPCS regalloc fix + i64 stack frame + --relocatable + M7 hardening by avrabe · Pull Request #86 · pulseengine/synth

avrabe · 2026-04-29T06:34:26Z

Consolidated PR — supersedes #83, #85, #86 (formerly stacked with non-main bases). Bundles the four in-flight Track A items into one CI-runnable change.

What's in this PR

1. fix(opt): regalloc clobbers parameter registers — AAPCS violation in i64 ops

The optimized lowering path (optimizer_bridge::ir_to_arm) hardcoded every i64 opcode to use R0:R1 / R2:R3 — regardless of whether those AAPCS arg slots still held live params. Replaced with alloc_i64_pair, which picks a free callee-saved pair from [(R4,R5), (R6,R7), (R8,R9), (R10,R11)] and skips anything in vreg_to_arm.values(), local_to_reg.values(), or the reserved AAPCS slots R0..R(min(num_params,4)). Source operands now resolve through vreg_to_arm instead of being assumed pinned.

2. fix(no-optimize): allocate stack frame + i64 local storage in select_with_stack

Two AAPCS-violating bugs surfaced by running gale (verified Rust kernel primitives) through synth on a real Cortex-M Zephyr engine bench:

i64 local storage dropped the upper half — LocalSet/LocalTee/LocalGet for spilled locals always used 4-byte Str/Ldr. For i64 locals, only the lower 32 bits round-tripped, corrupting any function returning a u64-packed FFI struct.
Locals aliased the callee-saved spill area — legacy offset formula (local_idx - 4) * 4 was hardcoded for num_params == 4 and produced negative offsets that the encoder silently clamped to [sp, #0] — exactly where stmdb had just saved r4/r5. After ldmia, callers saw r4/r5 corrupted.

Fix: compute_local_layout(wasm_ops, num_params) walks the wasm ops once to determine each non-param local's width and assign 8-byte-aligned i64 slots. Prologue emits sub sp, sp, #frame_size; epilogue emits the matching add sp, sp, #frame_size.

3. feat(cli): add --relocatable flag to force ET_REL output

Adds a --relocatable flag that forces ET_REL output regardless of whether the wasm has imports — for linking into a host build system that has its own startup code. Default behaviour is unchanged.

4. feat(m7): high-end Cortex-M7 hardware profiles + 16-MPU-region support

Wires up production-grade Cortex-M7 targets across the toolchain:

HardwareCapabilities::imxrt1062() — Cortex-M7 r1p1 with single-precision FPU, 16 MPU regions, 8MB external QSPI flash, 1MB OCRAM. Representative high-end M7 configuration for safety-grade lockstep platforms.
HardwareCapabilities::stm32h743() — Cortex-M7 with double-precision FPU, 16 MPU regions, 2MB Flash, 1MB RAM.
CLI: --hardware {imxrt1062,stm32h743} and target-info wired up.
Renode: tests/renode/synth_cortex_m7.repl models a 600 MHz M7 with ITCM/DTCM/OCRAM and an 8MB XIP flash window. cortex_m7_test.robot exercises --target cortex-m7.
MPU allocator: tests prove the existing hw_caps.mpu_regions plumbing scales to 16 regions on M7 and that 8-region M4-class parts still reject the 9th allocation.
Integration: tests/integration/m7_codegen_smoke.sh — offline smoke for i32, f32 and f64 codegen on cortex-m7 / cortex-m7dp.

5. Clippy fixes (Rust 1.95 lint refresh)

8 errors fixed: unnecessary_sort_by, collapsible_match, collapsible_if, manual_checked_division. Required for CI to clear -D warnings.

Test plan

cargo build --release --bin synth — clean
cargo test --workspace — all 600+ tests pass
cargo clippy --workspace --all-targets -- -D warnings — clean
cargo fmt --check — clean
M7 smoke: bash tests/integration/m7_codegen_smoke.sh — 5/5 pass (i32/f32 on M7, i32/f32/f64 on M7DP)
MPU allocator tests: 8 pass (incl. 3 new for 16-region M7)
Renode CI re-run on the engine_control bench (depends on environment — recommend triggering after merge)
Renode cortex_m7_add_test on M7 platform (CI-gated)

Why one large PR

The previous stack (#83 → #85 → #86) had non-main bases that prevented CI from firing. Bundling avoids dispatching three separate runner-heavy CI runs and gives a single mergeable artifact. After merge, #83 and #85 will be closed as superseded.

🤖 Generated with Claude Code

Currently synth produces a relocatable object (.o, ET_REL) only when the input wasm has imports — the relocations they generate trigger the relocatable code path. Self-contained wasm modules with no imports produce a complete ET_EXEC firmware with vector table, Reset_Handler, linear_memory section, etc. For linking into a host build system (e.g. integrating verified Rust kernel primitives compiled to wasm into a Zephyr build), the host expects a relocatable .o it can pull into its existing link step. Add a --relocatable CLI flag that forces ET_REL output regardless of whether the wasm has imports. The flag is additive — default behaviour is unchanged. Tested with gale-ffi.wasm (200 functions, 0 imports): output is now a 26645-byte ET_REL ARM EABI5 object with all gale_* symbols defined and no vector-table machinery. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two bugs in select_with_stack's spilled-local handling, both surfaced running gale (formally-verified Zephyr kernel primitives) through synth on real Cortex-M code: 1. **i64 local storage only writes one half.** LocalSet/LocalTee/LocalGet for spilled locals always emitted plain Str/Ldr (4 bytes), even for i64 locals. The upper half was never stored, so local.get returned garbage for 32 high bits. For gale's u64-packed FFI decision structs this corrupted the action field, breaking every ISR-driven semaphore operation in the engine bench (count=0 / drain_timeout at every step). 2. **Locals area aliased the callee-saved spill area.** The legacy offset formula `(local_idx - 4) * 4` was hardcoded for num_params==4 and produced negative offsets for other configurations, which the I64Ldr/I64Str encoder silently clamped to 0. With offset 0, locals landed exactly where stmdb had just saved r4/r5 — corrupting the callee-saved register spill and propagating wrong values back to the caller after ldmia. Pure AAPCS violation. Fix: - Add `compute_local_layout(wasm_ops, num_params) -> LocalLayout` that walks the wasm op stream once to determine each non-param local's width (i32/i64) and assigns proper stack offsets from a base of 0. Uses `infer_i64_locals` (also new) which simulates a width vstack to classify locals based on what gets stored into them. - Prologue emits `sub sp, sp, #frame_size` after the callee-saved push, allocating the locals area below the saved-register spills. - Epilogue emits `add sp, sp, #frame_size` before the pop, restoring SP to the callee-saved spills before they get popped. Also applied to the explicit Return opcode handler. - LocalGet/LocalSet/LocalTee dispatch on the layout entry: i64 locals use I64Ldr/I64Str (which already emit two 32-bit memory ops at offset N and N+4); i32 locals use plain Ldr/Str. Offsets come from layout, not from the legacy formula. Frame size is rounded up to 8 bytes for AAPCS SP alignment at call sites. Verified locally: /tmp/repro_i64.wat (1 i32 param + 1 i64 local round-trip): Before: str.w r0, [sp]; (no upper store) — upper half lost. After: sub.w sp, sp, #8; str.w r0, [sp]; str.w r1, [sp, #4]; (both halves stored) ldr.w r2, [sp]; ldr.w r3, [sp, #4]; (both halves loaded) add.w sp, sp, #8; ldmia. gale_k_sem_give_decide (3 i32 params + 1 i64 local, the function whose runtime miscompilation hung the engine bench in CI): Before: str.w r3, [sp]; str.w r5, [sp, #4]; — clobbered the saved r4/r5 from stmdb, AAPCS-violating. After: sub.w sp, sp, #8 → str into the locals frame, not the spill; add.w sp, sp, #8 before ldmia — proper AAPCS. Engine-bench build with GALE_USE_SYNTH=ON now produces a 22768 B zephyr.elf (was 22692 B with the buggy synth). Renode validation pending in CI. Out of scope: - The optimized regalloc bug in optimizer_bridge.rs (clobbers r0..r3 parameter registers — see /tmp/match_gale.wat repro). This fix lets --no-optimize work; the optimized path needs a separate pass. - Implicit-return-to-R0 elision in select_with_stack: small functions whose result lands in a non-R0 temp don't get a final mov to R0 in --no-optimize mode. Pre-existing, unrelated to this fix; affects i32-returning functions with a single intermediate value. Doesn't affect i64 returns (which use the natural R0/R1 pair). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

PR #85 fixed i64 local STORAGE (both halves stored to consecutive stack slots) but introduced a follow-on bug: when the i64 register pair gets allocated via two separate alloc_temp_safe calls, the resulting pair can be NON-CONSECUTIVE in ALLOCATABLE_REGS if a register in between is live on the wasm stack. Subsequent i64 ops downstream call i64_pair_hi(rdlo) to recover the high register, which assumes the pair is consecutive. With a non-consecutive pair from earlier, i64_pair_hi returns the WRONG register and the op reads garbage. Witnessed on gale_k_sem_give_decide: ldr.w r5, [sp] ; LocalGet i64 lo - r5 picked ldr.w r6, [sp, #4] ; LocalGet i64 hi - r6 picked (consecutive ✓) ...later... orr.w r0, r0, r2 ; i64.or expected (r5,r6) but used (r2,r3) Fix: add `alloc_consecutive_pair` helper that ensures two consecutive free entries in ALLOCATABLE_REGS. Use it everywhere a pair is allocated for an i64 result: I64Const (line 4567), I64Add/Sub/Mul result allocs (lines 4914+, 4958+, 4996+), and the new i64-LocalGet path from PR #85. Verified locally: /tmp/repro_i64.wat round-trips correctly with consecutive register pair (lo→r2, hi→r3). gale_k_sem_give_decide's LocalGet 3 now loads to consecutive r5/r6. Note: the engine bench in Renode still hangs after this fix. Further diagnostic shows i64.or's ARM lowering uses register pairs (r0,r1) and (r2,r3) regardless of what's on the wasm-tracked stack — a separate bug in synth's wildcard fallthrough for I64Or in select_with_stack. That fix is out of scope for this PR; tracked separately. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Three deeper bugs surfaced when running gale_k_sem_give_decide on Renode after PR #85 + the consecutive-pair fix: 1. **i64 ops fall through to select_default** in select_with_stack, which assumes inputs are in R0:R1 / R2:R3. Wasm-stack-tracked pairs from earlier ops never make it. Symptoms: i64.or used register pairs from previous shift ops instead of the just-loaded LocalGet result, producing a corrupted return value. Fix: explicit handlers for I64Or / I64And / I64Xor / I64ExtendI32U / I64ExtendI32S / I64Shl / I64ShrU / I64ShrS in select_with_stack, each popping the wasm-tracked pair, deriving its hi via i64_pair_hi, and emitting the right ARM instructions / pseudo-ops with arbitrary registers (the existing I64Shl/etc. ArmOp pseudo-ops accept arbitrary register operands). 2. **alloc_consecutive_pair didn't reserve implicit hi registers**. The wasm stack only tracks the lo register of each i64; the hi is conventional but invisible to a naive scan. A fresh allocation could overwrite an earlier i64's implicit hi, breaking subsequent i64_pair_hi lookups. Witnessed: I64Const 32 allocated (r1, r2), clobbering the hi of a previously-extended i64 in (r0, r1). Fix: alloc_consecutive_pair now scans every stack entry and reserves both lo AND its conventional pair_hi (over-reserves for i32 stack entries — safe). 3. **alloc_consecutive_pair didn't reserve just-popped operands**. When an i64 op popped operands and then allocated a result pair, the popped values were temporarily off the stack. The allocation could pick a register that's about to be read by the op's own second instruction (e.g. dst_lo == a_hi means the lo Or write clobbers a_hi before the hi Or reads it). Fix: extra_avoid parameter on alloc_consecutive_pair. I64Add / I64Sub / I64Or / I64Shl / I64Load / extends pass their popped operand registers via extra_avoid. Verified locally: gale_k_sem_give_decide now produces: orr r0, r6, r8 ; lo = shift_result_lo OR local3_lo orr r1, r7, ip ; hi = shift_result_hi OR local3_hi matching the wasm semantics for i64.or. Engine bench builds clean with 22644 B FLASH. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…i64 ops In the optimized lowering path (`optimizer_bridge::ir_to_arm`), every i64 opcode hardcoded its register pair to R0:R1 / R2:R3 and ignored the actual operand vregs from `vreg_to_arm`. For any function that emitted an i64 op before all its i32 params had been read (e.g. `i64.const 1` issued before the first `local.get 0`), the very first emitted instruction would clobber the AAPCS arg register holding the param. Concretely, `match_gale.wat` (3 i32 params + 1 i64 local + an `i64.const` prologue) compiled to: movs r0, #1 ; <-- destroys param 0 movs r1, #0 ; <-- destroys param 1 ... cmp r0, r1 ; reads garbage Same shape on the engine_control bench wasm: `gale_k_sem_give_decide` started with `movs r0, #1; movs r1, #0` before any `local.get 0`. Fix: - Compute `param_reserved_regs: Vec<Reg>` = R0..R(min(num_params, 4)) at function entry. Use Vec because `Reg` doesn't derive Hash; matches `instruction_selector::alloc_consecutive_pair` style. - Add an `alloc_i64_pair` helper that searches `[(R4,R5), (R6,R7), (R8,R9), (R10,R11)]` for a free callee-saved pair, skipping any reg that's already in `vreg_to_arm.values()`, `local_to_reg.values()`, or `param_reserved_regs`. - Rewrite every i64 handler — I64Const, I64Load (non-param case), I64Add/Sub/And/Or/Xor/Mul/Shl/ShrS/ShrU/Rotl/Rotr/DivS/DivU/RemS/RemU, the comparisons (Eq/Ne/Lt/Gt/Le/Ge {S,U}, Eqz), Clz/Ctz/Popcnt, and Extend{8,16,32}S — to (a) read source regs from `vreg_to_arm` instead of assuming R0:R1/R2:R3, and (b) place the destination on a fresh pair via `alloc_i64_pair`. - Update the function epilogue: previously `is_i64_result` short-circuited the move-to-R0 because the result was already pinned at R0:R1. With the fix, the result lives wherever regalloc placed it, so emit explicit moves into R0 and R1, ordered to handle the rare case where lo lives in R1 (route via R12). - For Clz/Ctz/Popcnt, whose ArmOp encoder zeros `rnhi` in-place to produce the i64 result hi, copy the source hi into a fresh callee-saved hi slot before the op so we don't smash the original input or any unrelated AAPCS reg. Track the chosen physical hi reg via `last_result_vreg_hi_reg` for the epilogue. Verification: /tmp/match_gale.wat post-fix: 0: movs r4, #1 ; was: movs r0, #1 2: movs r5, #0 ; was: movs r1, #0 ... c: cmp r0, r1 ; params intact gale_k_sem_give_decide post-fix: e8: movs r4, #1 ; was: movs r0, #1 ea: movs r5, #0 ; was: movs r1, #0 `cargo test --workspace` is green. The no-optimize path (`instruction_selector.rs`) is untouched. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Wires up production-grade Cortex-M7 targets across the toolchain. * Add HardwareCapabilities::imxrt1062() — Cortex-M7 r1p1 with single-precision FPU, 16 MPU regions, 8MB external QSPI flash, 1MB OCRAM. Representative high-end M7 configuration for safety-grade lockstep platforms. * Add HardwareCapabilities::stm32h743() — Cortex-M7 with double-precision FPU, 16 MPU regions, 2MB Flash, 1MB RAM. * CLI: --hardware {imxrt1062,stm32h743} and target-info wired up. * Renode: tests/renode/synth_cortex_m7.repl models a 600 MHz M7 with ITCM/DTCM/ OCRAM and an 8MB XIP flash window. cortex_m7_test.robot exercises Synth's --target cortex-m7 codegen path through the platform. * MPU allocator: add tests proving the existing hw_caps.mpu_regions plumbing scales to 16 regions on M7 and that 8-region M4-class parts still reject the 9th allocation. * Integration: tests/integration/m7_codegen_smoke.sh — offline smoke test for i32, f32 and f64 codegen on cortex-m7 / cortex-m7dp. Also fixes 8 clippy errors (unnecessary_sort_by, collapsible_match, collapsible_if, manual_checked_division) introduced by a clippy lint refresh on the parent branches — required for CI to clear -D warnings. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

avrabe and others added 6 commits April 27, 2026 22:24

avrabe changed the base branch from fix/synth-i64-locals-and-frame to main May 3, 2026 13:16

avrabe changed the title ~~fix(opt): regalloc clobbers parameter registers — AAPCS violation in i64 ops~~ feat: AAPCS regalloc fix + i64 stack frame + --relocatable + M7 hardening May 3, 2026

This was referenced May 3, 2026

feat(cli): add --relocatable flag to force ET_REL output #83

Closed

fix(no-optimize): allocate stack frame + i64 local storage in select_with_stack #85

Closed

feat(riscv): RV32IMAC backend skeleton — encoder, ELF builder, PMP allocator #87

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: AAPCS regalloc fix + i64 stack frame + --relocatable + M7 hardening#86

feat: AAPCS regalloc fix + i64 stack frame + --relocatable + M7 hardening#86
avrabe wants to merge 6 commits intomainfrom
fix/synth-optimized-regalloc-params

avrabe commented Apr 29, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

avrabe commented Apr 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What's in this PR

1. fix(opt): regalloc clobbers parameter registers — AAPCS violation in i64 ops

2. fix(no-optimize): allocate stack frame + i64 local storage in select_with_stack

3. feat(cli): add --relocatable flag to force ET_REL output

4. feat(m7): high-end Cortex-M7 hardware profiles + 16-MPU-region support

5. Clippy fixes (Rust 1.95 lint refresh)

Test plan

Why one large PR

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

avrabe commented Apr 29, 2026 •

edited

Loading