feat: AAPCS regalloc fix + i64 stack frame + --relocatable + M7 hardening#86
Open
feat: AAPCS regalloc fix + i64 stack frame + --relocatable + M7 hardening#86
Conversation
Currently synth produces a relocatable object (.o, ET_REL) only when the input wasm has imports — the relocations they generate trigger the relocatable code path. Self-contained wasm modules with no imports produce a complete ET_EXEC firmware with vector table, Reset_Handler, linear_memory section, etc. For linking into a host build system (e.g. integrating verified Rust kernel primitives compiled to wasm into a Zephyr build), the host expects a relocatable .o it can pull into its existing link step. Add a --relocatable CLI flag that forces ET_REL output regardless of whether the wasm has imports. The flag is additive — default behaviour is unchanged. Tested with gale-ffi.wasm (200 functions, 0 imports): output is now a 26645-byte ET_REL ARM EABI5 object with all gale_* symbols defined and no vector-table machinery. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two bugs in select_with_stack's spilled-local handling, both surfaced
running gale (formally-verified Zephyr kernel primitives) through synth
on real Cortex-M code:
1. **i64 local storage only writes one half.** LocalSet/LocalTee/LocalGet
for spilled locals always emitted plain Str/Ldr (4 bytes), even for
i64 locals. The upper half was never stored, so local.get returned
garbage for 32 high bits. For gale's u64-packed FFI decision structs
this corrupted the action field, breaking every ISR-driven semaphore
operation in the engine bench (count=0 / drain_timeout at every step).
2. **Locals area aliased the callee-saved spill area.** The legacy
offset formula `(local_idx - 4) * 4` was hardcoded for num_params==4
and produced negative offsets for other configurations, which the
I64Ldr/I64Str encoder silently clamped to 0. With offset 0, locals
landed exactly where stmdb had just saved r4/r5 — corrupting the
callee-saved register spill and propagating wrong values back to the
caller after ldmia. Pure AAPCS violation.
Fix:
- Add `compute_local_layout(wasm_ops, num_params) -> LocalLayout` that
walks the wasm op stream once to determine each non-param local's
width (i32/i64) and assigns proper stack offsets from a base of 0.
Uses `infer_i64_locals` (also new) which simulates a width vstack
to classify locals based on what gets stored into them.
- Prologue emits `sub sp, sp, #frame_size` after the callee-saved
push, allocating the locals area below the saved-register spills.
- Epilogue emits `add sp, sp, #frame_size` before the pop, restoring
SP to the callee-saved spills before they get popped. Also applied
to the explicit Return opcode handler.
- LocalGet/LocalSet/LocalTee dispatch on the layout entry: i64 locals
use I64Ldr/I64Str (which already emit two 32-bit memory ops at offset
N and N+4); i32 locals use plain Ldr/Str. Offsets come from layout,
not from the legacy formula. Frame size is rounded up to 8 bytes for
AAPCS SP alignment at call sites.
Verified locally:
/tmp/repro_i64.wat (1 i32 param + 1 i64 local round-trip):
Before: str.w r0, [sp]; (no upper store) — upper half lost.
After: sub.w sp, sp, #8;
str.w r0, [sp]; str.w r1, [sp, #4]; (both halves stored)
ldr.w r2, [sp]; ldr.w r3, [sp, #4]; (both halves loaded)
add.w sp, sp, #8; ldmia.
gale_k_sem_give_decide (3 i32 params + 1 i64 local, the function
whose runtime miscompilation hung the engine bench in CI):
Before: str.w r3, [sp]; str.w r5, [sp, #4]; — clobbered the saved
r4/r5 from stmdb, AAPCS-violating.
After: sub.w sp, sp, #8 → str into the locals frame, not the spill;
add.w sp, sp, #8 before ldmia — proper AAPCS.
Engine-bench build with GALE_USE_SYNTH=ON now produces a 22768 B
zephyr.elf (was 22692 B with the buggy synth). Renode validation
pending in CI.
Out of scope:
- The optimized regalloc bug in optimizer_bridge.rs (clobbers r0..r3
parameter registers — see /tmp/match_gale.wat repro). This fix lets
--no-optimize work; the optimized path needs a separate pass.
- Implicit-return-to-R0 elision in select_with_stack: small functions
whose result lands in a non-R0 temp don't get a final mov to R0 in
--no-optimize mode. Pre-existing, unrelated to this fix; affects
i32-returning functions with a single intermediate value. Doesn't
affect i64 returns (which use the natural R0/R1 pair).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #85 fixed i64 local STORAGE (both halves stored to consecutive stack slots) but introduced a follow-on bug: when the i64 register pair gets allocated via two separate alloc_temp_safe calls, the resulting pair can be NON-CONSECUTIVE in ALLOCATABLE_REGS if a register in between is live on the wasm stack. Subsequent i64 ops downstream call i64_pair_hi(rdlo) to recover the high register, which assumes the pair is consecutive. With a non-consecutive pair from earlier, i64_pair_hi returns the WRONG register and the op reads garbage. Witnessed on gale_k_sem_give_decide: ldr.w r5, [sp] ; LocalGet i64 lo - r5 picked ldr.w r6, [sp, #4] ; LocalGet i64 hi - r6 picked (consecutive ✓) ...later... orr.w r0, r0, r2 ; i64.or expected (r5,r6) but used (r2,r3) Fix: add `alloc_consecutive_pair` helper that ensures two consecutive free entries in ALLOCATABLE_REGS. Use it everywhere a pair is allocated for an i64 result: I64Const (line 4567), I64Add/Sub/Mul result allocs (lines 4914+, 4958+, 4996+), and the new i64-LocalGet path from PR #85. Verified locally: /tmp/repro_i64.wat round-trips correctly with consecutive register pair (lo→r2, hi→r3). gale_k_sem_give_decide's LocalGet 3 now loads to consecutive r5/r6. Note: the engine bench in Renode still hangs after this fix. Further diagnostic shows i64.or's ARM lowering uses register pairs (r0,r1) and (r2,r3) regardless of what's on the wasm-tracked stack — a separate bug in synth's wildcard fallthrough for I64Or in select_with_stack. That fix is out of scope for this PR; tracked separately. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three deeper bugs surfaced when running gale_k_sem_give_decide on Renode after PR #85 + the consecutive-pair fix: 1. **i64 ops fall through to select_default** in select_with_stack, which assumes inputs are in R0:R1 / R2:R3. Wasm-stack-tracked pairs from earlier ops never make it. Symptoms: i64.or used register pairs from previous shift ops instead of the just-loaded LocalGet result, producing a corrupted return value. Fix: explicit handlers for I64Or / I64And / I64Xor / I64ExtendI32U / I64ExtendI32S / I64Shl / I64ShrU / I64ShrS in select_with_stack, each popping the wasm-tracked pair, deriving its hi via i64_pair_hi, and emitting the right ARM instructions / pseudo-ops with arbitrary registers (the existing I64Shl/etc. ArmOp pseudo-ops accept arbitrary register operands). 2. **alloc_consecutive_pair didn't reserve implicit hi registers**. The wasm stack only tracks the lo register of each i64; the hi is conventional but invisible to a naive scan. A fresh allocation could overwrite an earlier i64's implicit hi, breaking subsequent i64_pair_hi lookups. Witnessed: I64Const 32 allocated (r1, r2), clobbering the hi of a previously-extended i64 in (r0, r1). Fix: alloc_consecutive_pair now scans every stack entry and reserves both lo AND its conventional pair_hi (over-reserves for i32 stack entries — safe). 3. **alloc_consecutive_pair didn't reserve just-popped operands**. When an i64 op popped operands and then allocated a result pair, the popped values were temporarily off the stack. The allocation could pick a register that's about to be read by the op's own second instruction (e.g. dst_lo == a_hi means the lo Or write clobbers a_hi before the hi Or reads it). Fix: extra_avoid parameter on alloc_consecutive_pair. I64Add / I64Sub / I64Or / I64Shl / I64Load / extends pass their popped operand registers via extra_avoid. Verified locally: gale_k_sem_give_decide now produces: orr r0, r6, r8 ; lo = shift_result_lo OR local3_lo orr r1, r7, ip ; hi = shift_result_hi OR local3_hi matching the wasm semantics for i64.or. Engine bench builds clean with 22644 B FLASH. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…i64 ops In the optimized lowering path (`optimizer_bridge::ir_to_arm`), every i64 opcode hardcoded its register pair to R0:R1 / R2:R3 and ignored the actual operand vregs from `vreg_to_arm`. For any function that emitted an i64 op before all its i32 params had been read (e.g. `i64.const 1` issued before the first `local.get 0`), the very first emitted instruction would clobber the AAPCS arg register holding the param. Concretely, `match_gale.wat` (3 i32 params + 1 i64 local + an `i64.const` prologue) compiled to: movs r0, #1 ; <-- destroys param 0 movs r1, #0 ; <-- destroys param 1 ... cmp r0, r1 ; reads garbage Same shape on the engine_control bench wasm: `gale_k_sem_give_decide` started with `movs r0, #1; movs r1, #0` before any `local.get 0`. Fix: - Compute `param_reserved_regs: Vec<Reg>` = R0..R(min(num_params, 4)) at function entry. Use Vec because `Reg` doesn't derive Hash; matches `instruction_selector::alloc_consecutive_pair` style. - Add an `alloc_i64_pair` helper that searches `[(R4,R5), (R6,R7), (R8,R9), (R10,R11)]` for a free callee-saved pair, skipping any reg that's already in `vreg_to_arm.values()`, `local_to_reg.values()`, or `param_reserved_regs`. - Rewrite every i64 handler — I64Const, I64Load (non-param case), I64Add/Sub/And/Or/Xor/Mul/Shl/ShrS/ShrU/Rotl/Rotr/DivS/DivU/RemS/RemU, the comparisons (Eq/Ne/Lt/Gt/Le/Ge {S,U}, Eqz), Clz/Ctz/Popcnt, and Extend{8,16,32}S — to (a) read source regs from `vreg_to_arm` instead of assuming R0:R1/R2:R3, and (b) place the destination on a fresh pair via `alloc_i64_pair`. - Update the function epilogue: previously `is_i64_result` short-circuited the move-to-R0 because the result was already pinned at R0:R1. With the fix, the result lives wherever regalloc placed it, so emit explicit moves into R0 and R1, ordered to handle the rare case where lo lives in R1 (route via R12). - For Clz/Ctz/Popcnt, whose ArmOp encoder zeros `rnhi` in-place to produce the i64 result hi, copy the source hi into a fresh callee-saved hi slot before the op so we don't smash the original input or any unrelated AAPCS reg. Track the chosen physical hi reg via `last_result_vreg_hi_reg` for the epilogue. Verification: /tmp/match_gale.wat post-fix: 0: movs r4, #1 ; was: movs r0, #1 2: movs r5, #0 ; was: movs r1, #0 ... c: cmp r0, r1 ; params intact gale_k_sem_give_decide post-fix: e8: movs r4, #1 ; was: movs r0, #1 ea: movs r5, #0 ; was: movs r1, #0 `cargo test --workspace` is green. The no-optimize path (`instruction_selector.rs`) is untouched. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wires up production-grade Cortex-M7 targets across the toolchain.
* Add HardwareCapabilities::imxrt1062() — Cortex-M7 r1p1 with single-precision FPU,
16 MPU regions, 8MB external QSPI flash, 1MB OCRAM. Representative high-end M7
configuration for safety-grade lockstep platforms.
* Add HardwareCapabilities::stm32h743() — Cortex-M7 with double-precision FPU,
16 MPU regions, 2MB Flash, 1MB RAM.
* CLI: --hardware {imxrt1062,stm32h743} and target-info wired up.
* Renode: tests/renode/synth_cortex_m7.repl models a 600 MHz M7 with ITCM/DTCM/
OCRAM and an 8MB XIP flash window. cortex_m7_test.robot exercises Synth's
--target cortex-m7 codegen path through the platform.
* MPU allocator: add tests proving the existing hw_caps.mpu_regions plumbing
scales to 16 regions on M7 and that 8-region M4-class parts still reject the
9th allocation.
* Integration: tests/integration/m7_codegen_smoke.sh — offline smoke test for
i32, f32 and f64 codegen on cortex-m7 / cortex-m7dp.
Also fixes 8 clippy errors (unnecessary_sort_by, collapsible_match,
collapsible_if, manual_checked_division) introduced by a clippy lint refresh
on the parent branches — required for CI to clear -D warnings.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Consolidated PR — supersedes #83, #85, #86 (formerly stacked with non-main bases). Bundles the four in-flight Track A items into one CI-runnable change.
What's in this PR
1. fix(opt): regalloc clobbers parameter registers — AAPCS violation in i64 ops
The optimized lowering path (
optimizer_bridge::ir_to_arm) hardcoded every i64 opcode to use R0:R1 / R2:R3 — regardless of whether those AAPCS arg slots still held live params. Replaced withalloc_i64_pair, which picks a free callee-saved pair from[(R4,R5), (R6,R7), (R8,R9), (R10,R11)]and skips anything invreg_to_arm.values(),local_to_reg.values(), or the reserved AAPCS slotsR0..R(min(num_params,4)). Source operands now resolve throughvreg_to_arminstead of being assumed pinned.2. fix(no-optimize): allocate stack frame + i64 local storage in select_with_stack
Two AAPCS-violating bugs surfaced by running gale (verified Rust kernel primitives) through synth on a real Cortex-M Zephyr engine bench:
LocalSet/LocalTee/LocalGetfor spilled locals always used 4-byteStr/Ldr. For i64 locals, only the lower 32 bits round-tripped, corrupting any function returning a u64-packed FFI struct.(local_idx - 4) * 4was hardcoded fornum_params == 4and produced negative offsets that the encoder silently clamped to[sp, #0]— exactly wherestmdbhad just saved r4/r5. Afterldmia, callers saw r4/r5 corrupted.Fix:
compute_local_layout(wasm_ops, num_params)walks the wasm ops once to determine each non-param local's width and assign 8-byte-aligned i64 slots. Prologue emitssub sp, sp, #frame_size; epilogue emits the matchingadd sp, sp, #frame_size.3. feat(cli): add --relocatable flag to force ET_REL output
Adds a
--relocatableflag that forces ET_REL output regardless of whether the wasm has imports — for linking into a host build system that has its own startup code. Default behaviour is unchanged.4. feat(m7): high-end Cortex-M7 hardware profiles + 16-MPU-region support
Wires up production-grade Cortex-M7 targets across the toolchain:
HardwareCapabilities::imxrt1062()— Cortex-M7 r1p1 with single-precision FPU, 16 MPU regions, 8MB external QSPI flash, 1MB OCRAM. Representative high-end M7 configuration for safety-grade lockstep platforms.HardwareCapabilities::stm32h743()— Cortex-M7 with double-precision FPU, 16 MPU regions, 2MB Flash, 1MB RAM.--hardware {imxrt1062,stm32h743}and target-info wired up.tests/renode/synth_cortex_m7.replmodels a 600 MHz M7 with ITCM/DTCM/OCRAM and an 8MB XIP flash window.cortex_m7_test.robotexercises--target cortex-m7.hw_caps.mpu_regionsplumbing scales to 16 regions on M7 and that 8-region M4-class parts still reject the 9th allocation.tests/integration/m7_codegen_smoke.sh— offline smoke for i32, f32 and f64 codegen on cortex-m7 / cortex-m7dp.5. Clippy fixes (Rust 1.95 lint refresh)
8 errors fixed:
unnecessary_sort_by,collapsible_match,collapsible_if,manual_checked_division. Required for CI to clear-D warnings.Test plan
cargo build --release --bin synth— cleancargo test --workspace— all 600+ tests passcargo clippy --workspace --all-targets -- -D warnings— cleancargo fmt --check— cleanbash tests/integration/m7_codegen_smoke.sh— 5/5 pass (i32/f32 on M7, i32/f32/f64 on M7DP)cortex_m7_add_teston M7 platform (CI-gated)Why one large PR
The previous stack (#83 → #85 → #86) had non-main bases that prevented CI from firing. Bundling avoids dispatching three separate runner-heavy CI runs and gives a single mergeable artifact. After merge, #83 and #85 will be closed as superseded.
🤖 Generated with Claude Code