Skip to content

feat: AAPCS regalloc fix + i64 stack frame + --relocatable + M7 hardening#86

Open
avrabe wants to merge 6 commits intomainfrom
fix/synth-optimized-regalloc-params
Open

feat: AAPCS regalloc fix + i64 stack frame + --relocatable + M7 hardening#86
avrabe wants to merge 6 commits intomainfrom
fix/synth-optimized-regalloc-params

Conversation

@avrabe
Copy link
Copy Markdown
Contributor

@avrabe avrabe commented Apr 29, 2026

Consolidated PR — supersedes #83, #85, #86 (formerly stacked with non-main bases). Bundles the four in-flight Track A items into one CI-runnable change.

What's in this PR

1. fix(opt): regalloc clobbers parameter registers — AAPCS violation in i64 ops

The optimized lowering path (optimizer_bridge::ir_to_arm) hardcoded every i64 opcode to use R0:R1 / R2:R3 — regardless of whether those AAPCS arg slots still held live params. Replaced with alloc_i64_pair, which picks a free callee-saved pair from [(R4,R5), (R6,R7), (R8,R9), (R10,R11)] and skips anything in vreg_to_arm.values(), local_to_reg.values(), or the reserved AAPCS slots R0..R(min(num_params,4)). Source operands now resolve through vreg_to_arm instead of being assumed pinned.

2. fix(no-optimize): allocate stack frame + i64 local storage in select_with_stack

Two AAPCS-violating bugs surfaced by running gale (verified Rust kernel primitives) through synth on a real Cortex-M Zephyr engine bench:

  • i64 local storage dropped the upper halfLocalSet/LocalTee/LocalGet for spilled locals always used 4-byte Str/Ldr. For i64 locals, only the lower 32 bits round-tripped, corrupting any function returning a u64-packed FFI struct.
  • Locals aliased the callee-saved spill area — legacy offset formula (local_idx - 4) * 4 was hardcoded for num_params == 4 and produced negative offsets that the encoder silently clamped to [sp, #0] — exactly where stmdb had just saved r4/r5. After ldmia, callers saw r4/r5 corrupted.

Fix: compute_local_layout(wasm_ops, num_params) walks the wasm ops once to determine each non-param local's width and assign 8-byte-aligned i64 slots. Prologue emits sub sp, sp, #frame_size; epilogue emits the matching add sp, sp, #frame_size.

3. feat(cli): add --relocatable flag to force ET_REL output

Adds a --relocatable flag that forces ET_REL output regardless of whether the wasm has imports — for linking into a host build system that has its own startup code. Default behaviour is unchanged.

4. feat(m7): high-end Cortex-M7 hardware profiles + 16-MPU-region support

Wires up production-grade Cortex-M7 targets across the toolchain:

  • HardwareCapabilities::imxrt1062() — Cortex-M7 r1p1 with single-precision FPU, 16 MPU regions, 8MB external QSPI flash, 1MB OCRAM. Representative high-end M7 configuration for safety-grade lockstep platforms.
  • HardwareCapabilities::stm32h743() — Cortex-M7 with double-precision FPU, 16 MPU regions, 2MB Flash, 1MB RAM.
  • CLI: --hardware {imxrt1062,stm32h743} and target-info wired up.
  • Renode: tests/renode/synth_cortex_m7.repl models a 600 MHz M7 with ITCM/DTCM/OCRAM and an 8MB XIP flash window. cortex_m7_test.robot exercises --target cortex-m7.
  • MPU allocator: tests prove the existing hw_caps.mpu_regions plumbing scales to 16 regions on M7 and that 8-region M4-class parts still reject the 9th allocation.
  • Integration: tests/integration/m7_codegen_smoke.sh — offline smoke for i32, f32 and f64 codegen on cortex-m7 / cortex-m7dp.

5. Clippy fixes (Rust 1.95 lint refresh)

8 errors fixed: unnecessary_sort_by, collapsible_match, collapsible_if, manual_checked_division. Required for CI to clear -D warnings.

Test plan

  • cargo build --release --bin synth — clean
  • cargo test --workspace — all 600+ tests pass
  • cargo clippy --workspace --all-targets -- -D warnings — clean
  • cargo fmt --check — clean
  • M7 smoke: bash tests/integration/m7_codegen_smoke.sh — 5/5 pass (i32/f32 on M7, i32/f32/f64 on M7DP)
  • MPU allocator tests: 8 pass (incl. 3 new for 16-region M7)
  • Renode CI re-run on the engine_control bench (depends on environment — recommend triggering after merge)
  • Renode cortex_m7_add_test on M7 platform (CI-gated)

Why one large PR

The previous stack (#83#85#86) had non-main bases that prevented CI from firing. Bundling avoids dispatching three separate runner-heavy CI runs and gives a single mergeable artifact. After merge, #83 and #85 will be closed as superseded.

🤖 Generated with Claude Code

avrabe and others added 6 commits April 27, 2026 22:24
Currently synth produces a relocatable object (.o, ET_REL) only when the
input wasm has imports — the relocations they generate trigger the
relocatable code path. Self-contained wasm modules with no imports
produce a complete ET_EXEC firmware with vector table, Reset_Handler,
linear_memory section, etc.

For linking into a host build system (e.g. integrating verified Rust
kernel primitives compiled to wasm into a Zephyr build), the host
expects a relocatable .o it can pull into its existing link step. Add
a --relocatable CLI flag that forces ET_REL output regardless of
whether the wasm has imports.

The flag is additive — default behaviour is unchanged.

Tested with gale-ffi.wasm (200 functions, 0 imports): output is now
a 26645-byte ET_REL ARM EABI5 object with all gale_* symbols defined
and no vector-table machinery.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two bugs in select_with_stack's spilled-local handling, both surfaced
running gale (formally-verified Zephyr kernel primitives) through synth
on real Cortex-M code:

1. **i64 local storage only writes one half.** LocalSet/LocalTee/LocalGet
   for spilled locals always emitted plain Str/Ldr (4 bytes), even for
   i64 locals. The upper half was never stored, so local.get returned
   garbage for 32 high bits. For gale's u64-packed FFI decision structs
   this corrupted the action field, breaking every ISR-driven semaphore
   operation in the engine bench (count=0 / drain_timeout at every step).

2. **Locals area aliased the callee-saved spill area.** The legacy
   offset formula `(local_idx - 4) * 4` was hardcoded for num_params==4
   and produced negative offsets for other configurations, which the
   I64Ldr/I64Str encoder silently clamped to 0. With offset 0, locals
   landed exactly where stmdb had just saved r4/r5 — corrupting the
   callee-saved register spill and propagating wrong values back to the
   caller after ldmia. Pure AAPCS violation.

Fix:

- Add `compute_local_layout(wasm_ops, num_params) -> LocalLayout` that
  walks the wasm op stream once to determine each non-param local's
  width (i32/i64) and assigns proper stack offsets from a base of 0.
  Uses `infer_i64_locals` (also new) which simulates a width vstack
  to classify locals based on what gets stored into them.

- Prologue emits `sub sp, sp, #frame_size` after the callee-saved
  push, allocating the locals area below the saved-register spills.

- Epilogue emits `add sp, sp, #frame_size` before the pop, restoring
  SP to the callee-saved spills before they get popped. Also applied
  to the explicit Return opcode handler.

- LocalGet/LocalSet/LocalTee dispatch on the layout entry: i64 locals
  use I64Ldr/I64Str (which already emit two 32-bit memory ops at offset
  N and N+4); i32 locals use plain Ldr/Str. Offsets come from layout,
  not from the legacy formula. Frame size is rounded up to 8 bytes for
  AAPCS SP alignment at call sites.

Verified locally:

  /tmp/repro_i64.wat (1 i32 param + 1 i64 local round-trip):
    Before: str.w r0, [sp]; (no upper store) — upper half lost.
    After:  sub.w sp, sp, #8;
            str.w r0, [sp]; str.w r1, [sp, #4]; (both halves stored)
            ldr.w r2, [sp]; ldr.w r3, [sp, #4]; (both halves loaded)
            add.w sp, sp, #8; ldmia.

  gale_k_sem_give_decide (3 i32 params + 1 i64 local, the function
  whose runtime miscompilation hung the engine bench in CI):
    Before: str.w r3, [sp]; str.w r5, [sp, #4]; — clobbered the saved
            r4/r5 from stmdb, AAPCS-violating.
    After:  sub.w sp, sp, #8 → str into the locals frame, not the spill;
            add.w sp, sp, #8 before ldmia — proper AAPCS.

Engine-bench build with GALE_USE_SYNTH=ON now produces a 22768 B
zephyr.elf (was 22692 B with the buggy synth). Renode validation
pending in CI.

Out of scope:
- The optimized regalloc bug in optimizer_bridge.rs (clobbers r0..r3
  parameter registers — see /tmp/match_gale.wat repro). This fix lets
  --no-optimize work; the optimized path needs a separate pass.
- Implicit-return-to-R0 elision in select_with_stack: small functions
  whose result lands in a non-R0 temp don't get a final mov to R0 in
  --no-optimize mode. Pre-existing, unrelated to this fix; affects
  i32-returning functions with a single intermediate value. Doesn't
  affect i64 returns (which use the natural R0/R1 pair).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #85 fixed i64 local STORAGE (both halves stored to consecutive stack
slots) but introduced a follow-on bug: when the i64 register pair gets
allocated via two separate alloc_temp_safe calls, the resulting pair
can be NON-CONSECUTIVE in ALLOCATABLE_REGS if a register in between is
live on the wasm stack.

Subsequent i64 ops downstream call i64_pair_hi(rdlo) to recover the
high register, which assumes the pair is consecutive. With a
non-consecutive pair from earlier, i64_pair_hi returns the WRONG
register and the op reads garbage.

Witnessed on gale_k_sem_give_decide:

  ldr.w r5, [sp]      ; LocalGet i64 lo - r5 picked
  ldr.w r6, [sp, #4]  ; LocalGet i64 hi - r6 picked (consecutive ✓)
  ...later...
  orr.w r0, r0, r2    ; i64.or expected (r5,r6) but used (r2,r3)

Fix: add `alloc_consecutive_pair` helper that ensures two consecutive
free entries in ALLOCATABLE_REGS. Use it everywhere a pair is allocated
for an i64 result: I64Const (line 4567), I64Add/Sub/Mul result allocs
(lines 4914+, 4958+, 4996+), and the new i64-LocalGet path from PR #85.

Verified locally: /tmp/repro_i64.wat round-trips correctly with consecutive
register pair (lo→r2, hi→r3). gale_k_sem_give_decide's LocalGet 3 now
loads to consecutive r5/r6.

Note: the engine bench in Renode still hangs after this fix. Further
diagnostic shows i64.or's ARM lowering uses register pairs (r0,r1) and
(r2,r3) regardless of what's on the wasm-tracked stack — a separate
bug in synth's wildcard fallthrough for I64Or in select_with_stack.
That fix is out of scope for this PR; tracked separately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three deeper bugs surfaced when running gale_k_sem_give_decide on
Renode after PR #85 + the consecutive-pair fix:

1. **i64 ops fall through to select_default** in select_with_stack,
   which assumes inputs are in R0:R1 / R2:R3. Wasm-stack-tracked
   pairs from earlier ops never make it. Symptoms: i64.or used
   register pairs from previous shift ops instead of the just-loaded
   LocalGet result, producing a corrupted return value.

   Fix: explicit handlers for I64Or / I64And / I64Xor / I64ExtendI32U /
   I64ExtendI32S / I64Shl / I64ShrU / I64ShrS in select_with_stack,
   each popping the wasm-tracked pair, deriving its hi via
   i64_pair_hi, and emitting the right ARM instructions / pseudo-ops
   with arbitrary registers (the existing I64Shl/etc. ArmOp pseudo-ops
   accept arbitrary register operands).

2. **alloc_consecutive_pair didn't reserve implicit hi registers**.
   The wasm stack only tracks the lo register of each i64; the hi is
   conventional but invisible to a naive scan. A fresh allocation
   could overwrite an earlier i64's implicit hi, breaking subsequent
   i64_pair_hi lookups. Witnessed: I64Const 32 allocated (r1, r2),
   clobbering the hi of a previously-extended i64 in (r0, r1).

   Fix: alloc_consecutive_pair now scans every stack entry and
   reserves both lo AND its conventional pair_hi (over-reserves for
   i32 stack entries — safe).

3. **alloc_consecutive_pair didn't reserve just-popped operands**.
   When an i64 op popped operands and then allocated a result pair,
   the popped values were temporarily off the stack. The allocation
   could pick a register that's about to be read by the op's own
   second instruction (e.g. dst_lo == a_hi means the lo Or write
   clobbers a_hi before the hi Or reads it).

   Fix: extra_avoid parameter on alloc_consecutive_pair. I64Add /
   I64Sub / I64Or / I64Shl / I64Load / extends pass their popped
   operand registers via extra_avoid.

Verified locally: gale_k_sem_give_decide now produces:
  orr r0, r6, r8   ; lo = shift_result_lo OR local3_lo
  orr r1, r7, ip   ; hi = shift_result_hi OR local3_hi
matching the wasm semantics for i64.or. Engine bench builds clean
with 22644 B FLASH.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…i64 ops

In the optimized lowering path (`optimizer_bridge::ir_to_arm`), every i64
opcode hardcoded its register pair to R0:R1 / R2:R3 and ignored the actual
operand vregs from `vreg_to_arm`. For any function that emitted an i64 op
before all its i32 params had been read (e.g. `i64.const 1` issued before
the first `local.get 0`), the very first emitted instruction would clobber
the AAPCS arg register holding the param.

Concretely, `match_gale.wat` (3 i32 params + 1 i64 local + an `i64.const`
prologue) compiled to:

  movs r0, #1     ; <-- destroys param 0
  movs r1, #0     ; <-- destroys param 1
  ...
  cmp  r0, r1     ; reads garbage

Same shape on the engine_control bench wasm: `gale_k_sem_give_decide`
started with `movs r0, #1; movs r1, #0` before any `local.get 0`.

Fix:
- Compute `param_reserved_regs: Vec<Reg>` = R0..R(min(num_params, 4))
  at function entry. Use Vec because `Reg` doesn't derive Hash; matches
  `instruction_selector::alloc_consecutive_pair` style.
- Add an `alloc_i64_pair` helper that searches `[(R4,R5), (R6,R7),
  (R8,R9), (R10,R11)]` for a free callee-saved pair, skipping any reg
  that's already in `vreg_to_arm.values()`, `local_to_reg.values()`, or
  `param_reserved_regs`.
- Rewrite every i64 handler — I64Const, I64Load (non-param case),
  I64Add/Sub/And/Or/Xor/Mul/Shl/ShrS/ShrU/Rotl/Rotr/DivS/DivU/RemS/RemU,
  the comparisons (Eq/Ne/Lt/Gt/Le/Ge {S,U}, Eqz), Clz/Ctz/Popcnt, and
  Extend{8,16,32}S — to (a) read source regs from `vreg_to_arm` instead
  of assuming R0:R1/R2:R3, and (b) place the destination on a fresh
  pair via `alloc_i64_pair`.
- Update the function epilogue: previously `is_i64_result` short-circuited
  the move-to-R0 because the result was already pinned at R0:R1. With
  the fix, the result lives wherever regalloc placed it, so emit explicit
  moves into R0 and R1, ordered to handle the rare case where lo lives
  in R1 (route via R12).
- For Clz/Ctz/Popcnt, whose ArmOp encoder zeros `rnhi` in-place to
  produce the i64 result hi, copy the source hi into a fresh callee-saved
  hi slot before the op so we don't smash the original input or any
  unrelated AAPCS reg. Track the chosen physical hi reg via
  `last_result_vreg_hi_reg` for the epilogue.

Verification:

  /tmp/match_gale.wat post-fix:
    0:  movs  r4, #1   ; was: movs r0, #1
    2:  movs  r5, #0   ; was: movs r1, #0
    ...
    c:  cmp   r0, r1   ; params intact

  gale_k_sem_give_decide post-fix:
    e8:  movs  r4, #1  ; was: movs r0, #1
    ea:  movs  r5, #0  ; was: movs r1, #0

`cargo test --workspace` is green. The no-optimize path
(`instruction_selector.rs`) is untouched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wires up production-grade Cortex-M7 targets across the toolchain.

* Add HardwareCapabilities::imxrt1062() — Cortex-M7 r1p1 with single-precision FPU,
  16 MPU regions, 8MB external QSPI flash, 1MB OCRAM. Representative high-end M7
  configuration for safety-grade lockstep platforms.
* Add HardwareCapabilities::stm32h743() — Cortex-M7 with double-precision FPU,
  16 MPU regions, 2MB Flash, 1MB RAM.
* CLI: --hardware {imxrt1062,stm32h743} and target-info wired up.
* Renode: tests/renode/synth_cortex_m7.repl models a 600 MHz M7 with ITCM/DTCM/
  OCRAM and an 8MB XIP flash window. cortex_m7_test.robot exercises Synth's
  --target cortex-m7 codegen path through the platform.
* MPU allocator: add tests proving the existing hw_caps.mpu_regions plumbing
  scales to 16 regions on M7 and that 8-region M4-class parts still reject the
  9th allocation.
* Integration: tests/integration/m7_codegen_smoke.sh — offline smoke test for
  i32, f32 and f64 codegen on cortex-m7 / cortex-m7dp.

Also fixes 8 clippy errors (unnecessary_sort_by, collapsible_match,
collapsible_if, manual_checked_division) introduced by a clippy lint refresh
on the parent branches — required for CI to clear -D warnings.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@avrabe avrabe changed the base branch from fix/synth-i64-locals-and-frame to main May 3, 2026 13:16
@avrabe avrabe changed the title fix(opt): regalloc clobbers parameter registers — AAPCS violation in i64 ops feat: AAPCS regalloc fix + i64 stack frame + --relocatable + M7 hardening May 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant