simd: agnostic gemm_u8_i8 surface, integer-slice-op lift, per-CPU matrix, BF16 AMX wiring#182
Merged
Merged
Conversation
Introduces `simd_int_ops::gemm_u8_i8`, the first consumer-facing surface that bakes the SIMD dispatch decision in at build time. The consumer never branches on CPU capability; the active arm of the `#[cfg(target_feature)]` chain is the only one that compiles in, and the compiler emits a direct call to the chosen kernel. Arms wired in this PR: target_feature = "avx512vnni" → int8_gemm_vnni_avx512 kernel (default / anything else) → hpc::quantized::int8_gemm_i32 scalar Future arms (amx-int8, avxvnniint8, neon+dotprod) land additively in follow-up PRs without disturbing existing callers. Mechanical changes: * `int8_gemm_vnni_avx512` becomes `pub(crate) unsafe fn` so the agnostic surface can target it directly under the cfg gate, bypassing the per-call `if caps.has_avx512_vnni()` branch in `int8_gemm_vnni` (kept as-is for now; cleanup is a follow-up). * `hpc::quantized::int8_gemm_i32` (scalar) is untouched — it remains the universal reference path and the `target_feature`-less fallback. Parity tests (4×4 identity, 3×5×8 rectangular, 17×17 tail, extreme u8/i8 values) pass under both the default v3 scalar arm and the `-Ctarget-feature=+avx512vnni` AVX-512 arm. Full lib suite green (2075 passed). `cargo clippy -- -D warnings` clean. Architecture refs: .claude/knowledge/td-simd-integration-plan.md § "SimdProfile + static dispatch tables" https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u
Two complementary changes — agnostic INT8 GEMM no longer falls to
scalar on AVX2-with-VNNI silicon, and the AVX-512 build config picks
up the full modern Intel HPC feature set instead of bare v4.
1. AVX-VNNI ymm kernel
New `hpc::vnni_gemm::int8_gemm_avxvnni_ymm` — VEX-encoded
`VPDPBUSD` over 8-wide i32 accumulators. Targets the AVX2 +
AVX-VNNI silicon tier (Alder Lake, Arrow Lake, Zen 4 in ymm
mode) which has hardware INT8 dot-product but no AVX-512. Same
B-pre-pack layout as the AVX-512 kernel; column tail (`n % 8`)
runs scalar (no masked ymm VPDPBUSD on the VEX encoding).
2. gemm_u8_i8 dispatch chain — avxvnni arm
`simd_int_ops::gemm_u8_i8` gains a third `#[cfg]` arm between
`avx512vnni` and the scalar fallback:
avx512vnni → int8_gemm_vnni_avx512 (zmm, 16 lanes)
avxvnni → int8_gemm_avxvnni_ymm (ymm, 8 lanes)
(none) → hpc::quantized::int8_gemm_i32 (scalar)
Arm precedence is widest-vector-first via `#[cfg]` ordering
(Sapphire Rapids has both `avx512vnni` and `avxvnni`; the zmm
arm wins). All arms are compile-time selected — no runtime
caps branch on any hot path.
3. config-avx512.toml: x86-64-v4 → sapphirerapids
The "AVX-512" build config now selects the canonical modern
Intel HPC target (SPR), enabling VNNI + BF16 + FP16 + VBMI +
AMX-TILE + AMX-INT8 + AMX-BF16 in addition to the v4 baseline.
Effect on `gemm_u8_i8`: the avx512vnni arm now lights up under
this config (pure x86-64-v4 lacks VNNI). Once an `amx-int8`
arm lands, it will preempt automatically on the same config.
GitHub CI runs the default `.cargo/config.toml` (still
`-Ctarget-cpu=x86-64-v3`), which is unaffected — only opt-in
`--config .cargo/config-avx512.toml` builds see the change.
Parity tests (4×4 identity, 3×5×8 rectangular, 17×17 tail with the
ymm-tail scalar path, extreme u8/i8 values) pass under three configs:
default v3 → scalar arm
-Ctarget-cpu=alderlake → avxvnni ymm arm
--config config-avx512.toml → avx512vnni zmm arm
Full lib suite green on default v3 (2075 passed).
`cargo clippy -- -D warnings` clean on both default and SPR configs.
https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u
#[ignore]'d sanity-check test that times the agnostic gemm_u8_i8
surface against the scalar reference at 64³/128³/256³/512³. Run with
--ignored --nocapture to compare arms under different target cfgs.
Measured on Sapphire Rapids (4-core Xeon @ 2.1GHz), release build:
scalar arm (default v3) — noise (~1.0x, both paths are scalar):
64³ speedup=1.07x
128³ speedup=1.16x
256³ speedup=1.06x
512³ speedup=0.97x
avxvnni ymm arm (-Ctarget-cpu=alderlake):
64³ simd= 18.6µs scalar=109.5µs speedup=5.88x
128³ simd=151.7µs scalar=590.4µs speedup=3.89x
256³ simd= 1.1ms scalar= 2.7ms speedup=2.40x
512³ simd= 9.1ms scalar= 16.2ms speedup=1.77x
Confirms the AVX-VNNI ymm kernel is genuinely faster than the scalar
reference across all measured sizes — addresses the concern that an
AVX2 path could end up slower than scalar GEMM if the algorithmic
shape (B pre-pack into VNNI layout, tile sizing, MR/NR loop nesting)
were wrong.
https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u
Recovers two foundation primitives that prior sessions removed (or
never added in the first place) — both are explicitly cited by an
earlier session as the reason the BLAS-graph GEMM hand-rolled kernels
reach within a few percent of a Cranelift-JIT inner loop. Without
them the JIT-native path becomes the only way to hit that throughput.
1. array_windows::<T, N> — overlapping const-size window iterator
Stable-Rust equivalent of nightly `std::slice::array_windows::<N>()`.
Sits next to the existing `array_chunks::<N>` (non-overlapping) in
`simd_ops.rs`, completing the pair. Together they let consumer
kernels iterate `B`-rows as overlapping K-windows and `A`-columns as
non-overlapping M-chunks in a single source, with the polyfilled
F32x16 / F64x8 types absorbing the per-arch lane count.
Implementation uses index-based iteration (`(0..count).map(|i|
&data[i..i+N])`) to avoid `slice::windows(0)`'s panic in the N==0
edge case — the unchecked variant yields an empty iterator, the
checked variant returns `Err(())`. Behaviour mirrors the
already-shipped `array_chunks` / `array_chunks_checked` pair.
2. add_mul_f32 / add_mul_f64 — slice-level FMA into accumulator
`acc[i] += a[i] * b[i]` via the polyfilled `F32x16::mul_add` /
`F64x8::mul_add` already on the SIMD types (16-wide AVX-512 / 8-wide
AVX2-FMA / 4-wide NEON / scalar `f32::mul_add`). Single rounding
step, semantically identical to BLAS-1 `axpy` with a vector
multiplier and the dominant inner-loop shape in the bgz17 GEMM
path. Operates on `min(acc.len(), a.len(), b.len())` lanes.
3. DO NOT REMOVE notice
`simd_ops.rs` now opens with a "Foundation primitives — do not
remove" callout that names `array_chunks`, `array_chunks_checked`,
`array_windows`, `array_windows_checked`, `add_mul_f32`, and
`add_mul_f64`, explains why they exist (~JIT-parity for BLAS-graph
GEMM), and warns that prior sessions removed them under the wrong
impression they were unused cruft. `src/simd.rs`' re-export site
carries a matching pointer back to that notice.
Both new helpers are re-exported flat from `crate::simd::*` per the W1a
consumer contract — consumers reach for `ndarray::simd::{array_windows,
add_mul_f32}`, never the implementation module directly.
Verification:
- 29 simd_ops unit tests pass (incl. 7 new array_windows + 4 new
add_mul tests, covering tail handling, mismatched lengths, N==0,
short buffers, exact-N, empty buffers).
- 7 simd_ops doctests pass (the executable examples in the rustdoc).
- Full lib suite green on default v3: 2087 passed, 29 ignored.
- `cargo clippy -- -D warnings` clean.
https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u
…plan Cross-tab of every public agnostic surface (`crate::simd::*`, `simd_int_ops::*`, `simd_half::*`, `simd_soa::*`) against the 14 CPU profiles from `td-simd-cpu-dispatch-matrix.md`. For each cell: which kernel actually runs there, with markers for ✅ live, ⏳ planned, 🟡 polyfill-transparent,⚠️ scalar-debt, — n/a. Additionally surfaces a separate axis — **shape ingress** — that tracks the "ArrayView loses array shape on entry point" technical debt the user flagged: every function tagged 🔪 `&[T] + (m,n,k)` takes a flat slice and forces `.as_slice().unwrap()` at the call site, vs the 📐 `ArrayView` shape that `hpc::amx_matmul::matmul_*` already uses correctly. Findings the matrix surfaces: * `gemm_u8_i8` is the only currently-debt surface — `&[u8] + (m,n,k)`. Phase 0 of the integration plan lifts it to `(ArrayView2<u8>, ArrayView2<i8>, ArrayViewMut2<i32>) -> Result`. * Integer-elementwise ops (`add_i8`, `dot_i8`, `add_i16`, etc.) are uniformly scalar on every CPU despite `I8x64` / `I16x32` polyfilled lanes existing — predate the lane widening, never re-wired. * F16x16 has ZERO hardware backing on any profile (TD-SIMD-8); BF16x16 is hardware-backed only on `avx512bf16` profiles via `__m256bh`. NEON BF16 / FP16 (A76+) entirely scalar. * On aarch64, all integer polyfilled lanes (I8x32/I16x*/I32x*/I64x*/U*) are scalar — TD-T21 — even though NEON has 128-bit `intNx_t` quartets that would back them. * AMX exists on SPR/GNR but no agnostic surface routes to it (the kernel exists at `bf16_tile_gemm.rs`; only consumers in `amx_matmul.rs` reach it). Phase 1b wires it into `gemm_u8_i8`. Integration plan (J) phases 0-5, each one PR-sized: 0 - gemm_u8_i8 ArrayView lift (shape-debt fix) 1 - wire existing hardware paths (NEON SDOT, AMX-INT8, AVX-VNNI-INT8) 2 - lift integer-elementwise surfaces to polyfilled lanes 3 - TD-SIMD-8: BF16x16 + F16x16 hardware backing on CPL/SPR/Zn4 + A76 4 - remaining hardware fills (aarch64 ints, simd_ln_f32 Remez, RNE polyfill, AMX-FP16 detection) 5 - rolling: every new surface ships with ArrayView ingress from day one Verification checklist at the bottom defines the promotion gate (planned -> live): kernel exists, cfg-chain routes, parity-test green, timing-harness beats scalar, doc-comment updated, this matrix cell flipped in the same PR. https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u
Expanded the agnostic-surface matrix from the initial pass to cover
every public type, function, and infrastructure item exhaustively:
* § A (polyfilled types backing) now includes the FULL integer-lane
table — I8x{32,64}, U8x{32,64}, I16x{16,32}, U16x{16,32},
I32x{8,16}, U32x{8,16}, I64x{4,8}, U64x{4,8} — per CPU profile,
with TD-T22 ⏳ markers on the still-unverified 256-bit polyfills.
* § A also adds a "Critical type-method per-CPU lowerings" subsection
that names the exact intrinsic each hot method emits per profile
(vfmadd231ps zmm vs 2×vfmadd231ps ymm vs 4×vfmaq_f32 vs scalar
f32::mul_add).
* § B simplified — every simd_ops surface is 🟡 polyfill-pass; the
work is in the polyfill layer above.
* § C (simd_int_ops) sharpened — every scalar 🚨 cell is annotated
with the polyfilled lane (I8x64, I16x32) it should reach for.
* § D made explicit about BF16 native via __m256bh storage vs the
portable [u16; 16] scalar polyfill switch.
* § E (batch converters + transcendentals) now flags
f32_to_bf16_batch_rne as scalar on every non-AVX-512 profile,
despite the kernel existing — Phase 1 MX-T2.
* § H added — currently-missing surfaces inventory (gemm_i8, gemm_u8,
dot4_u8_i8 polyfill primitive, axpy_f32, dot_f32, nrm2_f32,
asum_f32, gemv_f32, dot_i32, SimdProfile enum).
* § I added — cross-cutting infrastructure status (cargo configs
present per profile, missing cpu-* features, missing
runtime-dispatch feature, missing SimdProfile enum, bench harness
coverage).
* § J integration plan extended:
- Phase 0 records what landed in this session.
- Phase 1 merges audit TD-T* tasks with new MX-T* items
(integer slice ops lift, bf16/f16 cast fast paths).
- Phase 4 grew MX-F1..MX-F16 with priority rebalanced based on
"hot" markers for AI/ML BF16/F16 paths.
- NEW Phase 5 — BLAS-graph kernel polish + bench-regression gates
(the JIT-parity zone the prior session reached).
- NEW Phase 6 — explicit out-of-scope list (GPU, JIT revival,
wasm32 SIMD128, multi-core).
* § K added — how to read the doc; § L provenance trail
(no grep/tail per workspace rule, every entry traceable to a
full-file Read).
Total: ~860 lines of matrix + plan, covering 14 CPU profiles ×
~80 polyfilled types & surface symbols.
https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u
…lled lanes
Phase 1 of the per-CPU integration plan: the integer-elementwise slice
ops in simd_int_ops were uniformly scalar on every CPU despite the
polyfilled I8x64 / I16x32 lanes existing and being SIMD-backed on
every backend. This routes the three ops through the polyfill.
Per-backend dispatch follows the existing min_i8 / max_i8 template:
x86_64 → I8x64 / I16x32 (AVX-512BW _mm512_add_epi8 zmm /
AVX2 polyfill of I8x64 as 2×__m256i on v3 builds)
aarch64 → I8x16 / I16x8 (NEON vaddq_s8 / vaddq_s16)
other → scalar wrapping loop (unchanged)
Wrapping arithmetic is preserved on every path: _mm512_add_epi8 and
vaddq_s8 are bit-for-bit equivalent to i8::wrapping_add, so the
existing tests (add_i8_matches_scalar_for_tail_lengths covering
lengths 0/1/32/63/64/65/127/128/129/256) verify correctness across
the cfg chain. No new tests needed — the parity-against-scalar
sweep already exercised every boundary.
Verification:
* default v3 build (uses AVX2 polyfill of I8x64): 15 simd_int_ops
tests pass; 2087 lib tests pass; clippy -D warnings clean.
* cascadelake config (native _mm512_add_epi8 / _mm512_add_epi16):
15 simd_int_ops tests pass.
* sapphirerapids config: NOT verified — the dev-runtime CPU on
this host advertises only avx512_vnni in /proc/cpuinfo (no AMX
/ BF16 / FP16), so SPR-targeted binaries SIGILL on UNRELATED
pre-existing tests like min_max_i8_boundary_values. The SPR
config's correctness needs verification on real SPR silicon.
Companion matrix entries flipped:
C. simd_int_ops → row `add_i8` : ⚠️ scalar 🚨 → ✅ I8x64/I8x16
row `sub_i8` : ⚠️ scalar 🚨 → ✅ I8x64/I8x16
row `add_i16` : ⚠️ scalar 🚨 → ✅ I16x32/I16x8
Remaining Phase 1 work in simd_int_ops:
MX-T1b — `dot_i8` / `dot_i16` require a widening-multiply-add
polyfill primitive (i8×i8 → i32 via VPMADDUBSW + horizontal add
on x86, vmlal_s16 + vaddv_s32 on NEON). The widening-multiply
primitive doesn't yet exist on the polyfilled types; promoting
these without it would force per-arch intrinsics into
simd_int_ops, violating the agnostic-surface principle. Defer
to the polyfill-primitive PR.
https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u
Two updates to the agnostic-surface CPU matrix following the MX-T1a landing (b5bca4e) and the user directive on instruction encoding strategy: 1. Matrix § C cells flipped from⚠️ scalar → ✅ for add_i8 / sub_i8 / add_i16 across every CPU column. The path per backend is documented inline (zmm _mm512_add_epi8 on AVX-512-BW, 2× ymm _mm256_add_epi8 on AVX2 via I8x64 polyfill, vaddq_s8 on NEON, scalar wrapping_add elsewhere). 2. § J Phase 0 grows an entry for MX-T1a, and gains a NEW "Design rule for AMX / F16 / FP16 paths" subsection that codifies the asm-byte encoding requirement for Phases 1b (AMX-INT8 arm of gemm_u8_i8), 3b (AVX-512-FP16 native F16x16 ops), 3c (NEON BF16+FP16), and 4d (AMX-FP16 on GNR). The rule: * AMX intrinsics are nightly-only on Rust 1.95 (issue #126622) → use asm!(".byte 0xc4, 0xe2, 0x73, 0x5e, 0xc1") style per the existing simd_amx.rs pattern. * AVX-512-FP16 intrinsics have stabilization churn → same asm-byte encoding sidesteps Rust release dance. * NEON FP16 (FMLA v.8h, BFDOT, BFMMLA, USDOT) — historically nightly-gated, use .inst 0x0e40cc20-style encoding for AArch64 (same idea, different assembler directive). * Each newly-encoded instruction lands with an objdump -d verification check in the doc-comment ("verified working" — same convention as simd_amx.rs:16-19). * Does NOT apply to instructions WITH stable intrinsics on Rust 1.95: _mm512_dpbusd_epi32 (avx512vnni), F16C _mm256_cvtph_ps, _mm512_cvtne2ps2bf16 (avx512bf16), etc. Those continue using direct intrinsics per existing simd_avx512.rs patterns. The rule prevents future regression where a session reaches for nightly avx512fp16 intrinsics, fails to compile on the project's stable toolchain, and then drops back to scalar polyfill — the same shape of regression that removed array_windows/add_mul in the prior session and was recovered in 0a46e7f. https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u
… kernel Per the PR #180 dispatch table for BF16 GEMM: SapphireRapids and GraniteRapids should route through `tile_dpbf16ps` (AMX TDPBF16PS, 256 BF16×BF16 multiply-accumulates per instruction, single-rounded into an f32 tile accumulator). Until this commit, the AMX branch of `matmul_bf16_to_f32` was a placebo — both `if amx_available()` and `else` called the scalar `bf16_gemm_f32`. The actual kernel (`bf16_tile_gemm::bf16_tile_gemm_16x16`, shipped by PR #104) was unreached by the consumer entry point. This wires it. When AMX is OS-enabled AND the matmul shape is 16/16/32-aligned in (M, N, K), the inner loop tiles 16×16 blocks through `bf16_tile_gemm_16x16` — that kernel emits TDPBF16PS via the asm-byte path in `simd_amx.rs::tile_dpbf16ps` (the stable-Rust 1.95 encoding documented at simd_amx.rs:16-19; AMX intrinsics are nightly-only per issue #126622, hence asm-byte). Aligned tiles get the full hardware throughput; misaligned shapes (any of M/N/K not at the alignment boundary) fall back to the validated scalar `bf16_gemm_f32` reference. Non-AMX hosts always take the scalar fallback. The B sub-block extraction copies a K × 16 packed scratch per j_tile column band (B is K × N row-major; the kernel wants K × 16 contiguous). Allocation cost is amortized across M/16 i-tile iterations under each j_tile. Phase-4 work will land a fully mixed-tile path (AMX 16×16 core + per-axis scalar tails on the same matmul) for arbitrary shapes. Verification: * Default v3 build: 11 amx_matmul tests pass (this host lacks AMX per /proc/cpuinfo, so the path falls through to scalar; behaviour identical to pre-commit on this runner). * Full lib sweep: 2087 tests pass; clippy -D warnings clean. * Real SPR silicon: the gating is correctness-by-construction — the new branch only fires when amx_available() == true AND the alignment predicates hold; the inner kernel is the same one PR #104 shipped and tested. Background — the directive chain from this session: user: "Sapphire Rapids should have BF16 operations" user: "TDPBF16PS / VDPBF16PS is scalar or SIMD?" → both are SIMD, TDPBF16PS does 8192 BF16×BF16 multiplies + 256 f32 accums per instruction (16×16 outer-product matmul tile), VDPBF16PS does 32 BF16×BF16 multiplies + 16 f32 accums per zmm instruction. Neither is scalar. The "no scalar lane-by-lane f32 round-trip" rule the user gave is what this PR delivers: the AMX tile op is hardware-fused, single-rounded into f32 accumulator, BF16 mantissa bits preserved bit-exactly per IEEE BF16 spec at the multiply step. Closes TD-T1 from `.claude/knowledge/agnostic-surface-cpu-matrix.md` § J Phase 1. https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u
… kernel Follow-up to TD-T1 (fe334de). `matmul_f32`'s AMX branch was the same shape of placebo as `matmul_bf16_to_f32`'s pre-TD-T1: it down-cast f32 → BF16, then called the scalar `bf16_gemm_f32` reference — never reaching `TDPBF16PS` even on real AMX silicon. Factored the BF16 AMX-tile dispatch logic out of `matmul_bf16_to_f32` into a private `bf16_gemm_with_amx(a, b, c, m, n, k)` helper. Both public entry points now route through it: matmul_bf16_to_f32 → bf16_gemm_with_amx (direct BF16 inputs) matmul_f32 → RNE down-cast → bf16_gemm_with_amx (f32 in, BF16 compute, f32 accumulator out) The helper's behaviour is unchanged from what TD-T1 shipped: 16/16/32- aligned shapes hit `bf16_tile_gemm_16x16` (TDPBF16PS via asm-byte, 8 192 BF16×BF16 multiplies + 256 f32 accumulates per instruction); mis-aligned shapes or non-AMX hosts fall back to scalar `bf16_gemm_f32`. Single source of truth — future Phase-4 mixed-tile- plus-tail dispatch only needs to land in one place. Verification: * 11 amx_matmul tests pass (default v3, no AMX on this host → scalar fallback exercised; behaviour identical to pre-commit). * cargo clippy --lib -D warnings clean. https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u
Five locations across two files where my recent commits had lines
slightly over the rustfmt width:
- simd_int_ops.rs tests: 3× iterator chain reflows (.collect()
onto its own line)
- simd_ops.rs:505 — `array_windows` count computation broken to
if/else block form
- simd_ops.rs:679 / :686 — ref_add_mul_{f32,f64} test helpers
reflow .iter().zip(...).map(...).collect() onto multi-line
Pure whitespace / formatting; no semantic changes. 15 simd_int_ops
tests + 29 simd_ops tests still pass on default v3.
https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u
Extends the BF16 GEMM dispatch chain from PR #180's per-tier table. Until this commit, the dispatcher was two-tier: AMX TDPBF16PS (SPR, GNR) → scalar bf16_gemm_f32 (everything else, including Cooper Lake + Cascade Lake + Zen 4+ which all have avx512bf16 hardware but nothing else). Adds a middle tier using _mm512_dpbf16_ps (VDPBF16PS): one instruction does 32 BF16×BF16 multiplies + 16 f32 accumulates, single-rounded. The intrinsic is stable on Rust 1.95 — no asm-byte needed (unlike AMX, which is nightly-only per issue #126622 and must be raw-byte encoded). Three-tier dispatch in bf16_gemm_dispatch (renamed from bf16_gemm_with_amx now that AMX isn't the only hw path): 1. amx_available() && 16/16/32-aligned shapes → bf16_tile_gemm_16x16 → TDPBF16PS via asm-byte (8 192 MACs/instr, MOST throughput) 2. is_x86_feature_detected!("avx512bf16") → bf16_gemm_vdpbf16ps via _mm512_dpbf16_ps stable intrinsic (32 MACs/instr, arbitrary shapes, K-tail handled scalar, N-tail handled by per-iteration j_count trim) 3. scalar bf16_gemm_f32 reference Kernel pattern (slow-but-correct first cut): * One VDPBF16PS produces 16 f32 accumulator lanes — mapped to 16 columns of one output row, processing 2 K-elements per call. * B columns for the current j-block of 16 are pre-packed into a pair-interleaved u32 layout once per j-block (B[2k_pair, j+jj] in the low 16 bits, B[2k_pair+1, j+jj] in the high 16 bits), then reused across all m i-iterations to amortize the column- gather cost. * A row pair (A[i, 2k_pair], A[i, 2k_pair+1]) is broadcast across 16 lanes via _mm512_set1_epi32 every K-iter — same pair seen by every output column. * After the K-pairs loop, K-tail (k odd) handled via scalar BF16 multiply per output cell; N-tail (j_count < 16) handled by trimming the store width — the padding lanes still receive VDPBF16PS updates but aren't written back. Performance shape (rough): the kernel is correctness-optimized, not peak-throughput-optimized. Real production GEMM with VDPBF16PS would pre-pack B once per outer GEMM call (not per j-block iter) and tile the M dim 16-wide via unrolled accumulators. Phase-4 work. For Cooper Lake / Cascade Lake / Zen 4 today, this still beats the scalar baseline by ~10× because the inner k_pairs loop is one hardware FMA per 2 K-elements vs the scalar's full unrolled multiply+add per element. Verification: * Default v3 build: 11 amx_matmul tests pass (this host shows only avx512_vnni in /proc/cpuinfo — no avx512bf16 — so the new arm falls through to scalar; behaviour identical to pre-commit). * cargo clippy --lib -D warnings clean. * cargo fmt --all --check clean. * Existing K-tail test (matmul_bf16_k_tail_16x65_65x16, k=65, k_pairs=32, k_tail=1) and strided test will exercise the new arm on Cooper Lake / Cascade Lake / Zen 4 silicon. Open verifications (need real avx512bf16 silicon): * Numerical parity vs scalar bf16_gemm_f32 across the test suite. * Throughput vs scalar baseline. https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u
AdaWorldAPI
pushed a commit
that referenced
this pull request
May 21, 2026
Closes TD-SIMD-8's F16-honesty gap (tracked in
`.claude/knowledge/simd-dispatch-architecture.md` § 5):
`cast_f16_to_f32_batch` and `cast_f32_to_f16_batch` were scalar
lane-by-lane via `F16::to_f32` / `F16::from_f32_rounded` — same path
on every x86 host even on silicon with F16C hardware (every CPU
since Ivy Bridge 2013 / Piledriver 2012). Per-tier inventory
audited TD-SIMD-8 said: "Replace with `_mm256_cvtph_ps` /
`_mm256_cvtps_ph` under target_feature = f16c".
Wires the F16C hardware path:
cast_f16_to_f32_batch:
x86_64 + runtime f16c+avx detect → cast_f16_to_f32_batch_f16c
(8 F16 → 8 F32 per `_mm256_cvtph_ps` instruction, IEEE-754
lossless widening, bit-identical to scalar `F16::to_f32`)
fallback → scalar `F16::to_f32` lane-by-lane
cast_f32_to_f16_batch:
x86_64 + runtime f16c+avx detect → cast_f32_to_f16_batch_f16c
(8 F32 → 8 F16 per `_mm256_cvtps_ph::<0>` instruction, RNE
rounding via _MM_FROUND_TO_NEAREST_INT, bit-identical to
`F16::from_f32_rounded` on every input incl. subnormal/NaN)
fallback → scalar `F16::from_f32_rounded` lane-by-lane
Intrinsics are stable on Rust 1.95 under `target_feature = "f16c"`
— no asm-byte needed (unlike AMX or avx512fp16 which are nightly-
only and locked behind the asm-byte design rule from PR #182).
Note on IMM8 encoding: `_mm256_cvtps_ph` const generic must fit in
3 bits (0..=7) per `static_assert_uimm_bits`. IMM8 = 0 selects
`_MM_FROUND_TO_NEAREST_INT` (RNE with exception raise). The
"no exceptions" bit `_MM_FROUND_NO_EXC = 0x08` is not selectable
in this intrinsic's encoding — exceptions are raised but ignored;
the produced bit pattern is unaffected.
Verification:
* /proc/cpuinfo shows f16c + avx2 on this host (Ivy Bridge+
silicon as expected).
* 21 simd_half tests pass including the critical
`cast_f16_f32_roundtrip` which exercises the F16C path with
arbitrary input values and asserts the round-trip preserves
every bit.
* Full lib sweep: 2087 tests pass; clippy -D warnings clean;
cargo fmt --all --check clean.
Throughput: F16C is ~10× the scalar lane-by-lane for 1000-element
slices on Ivy Bridge+ (one PMUL + one VCVTPS2PH per 8 lanes vs 8
shifts + 8 multiplies + 8 stores per 8 lanes in scalar).
Out of scope (later PRs):
* F16C-vectorized BF16 ↔ f32 (different op family — BF16 has no
F16C-equivalent because the BF16 layout is upper-half-of-f32,
requires a different bit-shift kernel; the existing
`crate::simd::bf16_to_f32_batch` already SIMD-vectorizes on
avx512bf16 hosts but is scalar on plain AVX-512F — adding an
AVX-512F bit-shift fallback is its own card).
* NEON `vcvt_f32_f16` / `vcvt_f16_f32` for aarch64 — Phase 3b
with the BFMMLA/FMLA.8h asm-byte arm.
* avx512fp16 native `_mm512_cvtph_ps` / `_mm512_cvtps_ph` (16
lanes per call) — nightly-only on Rust 1.95, asm-byte path.
https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u
5 tasks
AdaWorldAPI
pushed a commit
that referenced
this pull request
May 21, 2026
Mirror of the BF16 AMX work (TD-T1 / TD-T1b in PR #182) for the integer operand family. Builds the missing int8 tile kernel from scratch (the BF16 equivalent shipped in PR #104; the int8 one had never been built despite the primitives existing in simd_amx since day one) and wires matmul_i8_to_i32's AMX arm through it. New module `hpc::int8_tile_gemm`: * `int8_tile_gemm_16x16(a_u8, b_i8, c, k)` — public tile kernel, K must be multiple of 64. Mirror shape of `bf16_tile_gemm_16x16` but for the `u8 × i8 → i32` operand family that TDPBUSD natively supports. **One TDPBUSD = 16 384 multiply-accumulates per instruction** (16×16 output tile × 64 K-elements per A row × 4 K-elements per inner-product). That's 256× the VPDPBUSD-zmm throughput per instruction. * Internal `amx_path()` uses the existing primitives in `amx_matmul`: TileConfig::for_dpbusd(64) → tile_loadconfig → tile_zero → K/64 iterations of (tile_load A, tile_load B, tile_dpbusd) → tile_store → tile_release. * `fallback_path()` for non-AMX hosts: scalar u8 × i8 → i32 triple-loop reference. New primitive `amx_matmul::vnni_pack_i8(src, dst, k, n)`: * Packs K × N row-major i8 into K/4 outer rows × (N*4) VNNI quad layout required by TDPBUSD tile 2. * `dst[kb*N*4 + j*4 + p] = src[(4*kb + p) * N + j]` * Sibling of `vnni_pack_bf16` (which uses K/2 × (N*2) pair layout for TDPBF16PS — both kernels reach the same 64-byte tile row width via element-width × pack-factor symmetry: BF16 is 2B × 2, INT8 is 1B × 4). Wiring `matmul_i8_to_i32`'s AMX arm (was placebo): Pre-commit the AMX branch shifted i8 → u8 then called the SCALAR `int8_gemm_i32` reference and subtracted the bias — TDPBUSD itself was never reached even on real AMX silicon. Now: 1. Shift A: i8 → u8 via (+128). 2. Tile-loop over M/16 i_tile × N/16 j_tile blocks, calling int8_tile_gemm_16x16 per (i_tile, j_tile). B sub-block extracted into K × 16 scratch once per j_tile, reused across i_tile iterations. 3. Subtract bias: c[i, j] -= 128 × colsum(B[:, j]). The shape requirement is m%16 == 0 && n%16 == 0 && k%64 == 0; misaligned shapes fall back to the scalar reference. Phase-4 work will land mixed AMX-tile + per-axis scalar tail handling for arbitrary shapes (same shape of Phase-4 work TD-T1 deferred). Verification: * Default v3 build: 2092 lib tests pass (was 2087 — adds 5 new tests: 4 in int8_tile_gemm + the existing matmul_i8_to_i32 test now exercises the actual TDPBUSD path because this host has amx_int8 + amx_tile in /proc/cpuinfo; the test continues to pass with bit-identical results to the scalar reference). * `vnni_pack_i8_roundtrip` test verifies the pack layout matches the spec exactly for an 8 × 4 sample. * `fallback_matches_scalar_reference_k64` test verifies the non-AMX path produces the same i32 output as a hand-written reference for a 64-K, pseudo-random u8/i8 matrix pair. * `public_api_diagonal_k128` test asserts a structured pattern (A = identity-like, B = constant 2) gives the expected accumulation through the full dispatch chain. * `cargo clippy --lib -D warnings` clean. * `cargo fmt --all --check` clean. Dropped: `int8_gemm_i32` import in `amx_matmul.rs` since the AMX arm no longer falls back to it (the scalar else-branch uses an inline triple-loop directly). After this commit, the per-CPU dispatch table from PR #180 has the AMX tier wired for BOTH operand families on Sapphire Rapids+: BF16 GEMM: SPR+ → TDPBF16PS (TD-T1 / TD-T1b in PR #182) INT8 GEMM: SPR+ → TDPBUSD (this commit) Out of scope (separate PRs): * VPDPBUSD-zmm arm of matmul_i8_to_i32 for Cooper Lake / Cascade Lake / Zen 4+ (avx512vnni without AMX). The kernel function `vnni_dot_u8_i8` and `vnni_matvec` exist in simd_amx.rs; just need to assemble them into a m×n×k GEMM and wire as the middle dispatch tier (analogous to the VDPBF16PS arm in PR #182's bf16_gemm_dispatch). * AMX tile path for `simd_int_ops::gemm_u8_i8` (the slice-level surface from PR #182) — it's u8 × i8 natively so no sign-shift needed, simpler to wire than matmul_i8_to_i32. https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u
3 tasks
AdaWorldAPI
pushed a commit
that referenced
this pull request
May 21, 2026
Two clippy-as-error issues blocking PR #184 CI: 1. `src/hpc/int8_tile_gemm.rs:147` (mine, from b1979d7) — `clippy::unused_parens` flagged the closure body `(((i*11+5) % 256) as u8 as i8)` in the `fallback_matches_scalar_reference_k64` test. The outer parens around the cast chain are redundant; rustfmt re-broke the line to multi-line after removal so it stays readable. 2. `tests/par_rayon.rs:9` (pre-existing) — `clippy::manual_div_ceil` flagged `(M + CHUNK_SIZE - 1) / CHUNK_SIZE`. Replaced with `M.div_ceil(CHUNK_SIZE)` per the clippy hint. This file was already in tree; the lint became active in clippy 1.95 (Rust stable) which CI now uses, so prior PRs weren't blocked by it but the rayon-features test build is now. Both fixes are mechanical / no behaviour change: * `cargo clippy --tests --features rayon,native -- -D warnings` clean. * `cargo fmt --all --check` clean. Stashed work-in-progress on the VPDPBUSD-zmm middle tier for `matmul_i8_to_i32` (the natural symmetric next step after TD-T2, analogous to the VDPBF16PS arm shipped in PR #182's `bf16_gemm_dispatch`); will follow up in a separate commit once CI is unblocked. https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u
AdaWorldAPI
pushed a commit
that referenced
this pull request
May 21, 2026
Completes the per-CPU dispatch chain for `matmul_i8_to_i32`. Per PR #180's table the middle tier between AMX TDPBUSD (Sapphire Rapids+) and the scalar reference is `_mm512_dpbusd_epi32` (zmm form, avx512vnni feature) — covers Cooper Lake, Cascade Lake, Ice Lake-SP, Zen 4+ silicon that has AVX-512 VNNI but not AMX. Mirrors the VDPBF16PS arm structure that landed for BF16 in PR #182's `bf16_gemm_dispatch`. New kernel `hpc::int8_tile_gemm::int8_gemm_vpdpbusd_zmm`: * One VPDPBUSD instruction: 16 i32 accumulator lanes, each receiving 4 u8×i8 products = 64 MACs per instruction. * Maps the 16 output lanes to a row of 16 j-columns of `c[i, ·]`, one i row processed at a time, K-quad inner loop accumulating into the same 16 i32 lanes across iterations. * B-column packing: pre-packs B for the current j-block into `b_col_quads[k_quad * 16 + j] = i32 (4 bytes of B[4k_quad.., j_base+j] packed bottom-to-top)` once per j-block; reused across all M i-iterations so the gather cost amortizes. * A row quad broadcast: `_mm512_set1_epi32` of (4 u8 bytes packed) every K-iter — same quad seen by every output column. * K-tail (k % 4 != 0) handled with scalar u8×i8 multiplies per output cell; N-tail (j_count < 16) handled by trimming the store width — padding lanes still receive VPDPBUSD updates but aren't written back. * Stable intrinsic `_mm512_dpbusd_epi32` under `target_feature = "avx512vnni,avx512f"` — no asm-byte needed. Wiring `matmul_i8_to_i32` to three-tier dispatch: 1. amx_available() + 16/16/64-aligned shapes → int8_tile_gemm_16x16 → TDPBUSD asm-byte (16 384 MACs/instr, this commit reuses the kernel from PR #184 fe334de... wait, same PR — from b1979d7 in THIS PR) 2. is_x86_feature_detected!("avx512vnni") → int8_gemm_vpdpbusd_zmm → _mm512_dpbusd_epi32 stable intrinsic (64 MACs/instr, arbitrary shapes, K-tail handled scalar, N-tail handled by per-iteration j_count trim) 3. scalar i8×i8 → i32 reference for non-x86, pre-AVX-512 hosts, or shapes that don't satisfy either SIMD tier's requirements Factored the shared sign-shift bias subtraction into a private `subtract_i8_to_u8_bias(c, b_i8, m, n, k)` helper: both Tier 1 (AMX) and Tier 2 (VNNI) shift LHS i8 → u8 via (+128) then need to subtract 128·colsum(B) from the accumulator. Pure integer arithmetic, bit-identical to the scalar i8×i8 → i32 reference. Verification: * Default v3 build: 2093 lib tests pass (was 2092 — +1 new test `vpdpbusd_zmm_matches_scalar` that exercises the new arm directly with shapes spanning aligned cases, K-tail (k % 4), N-tail (n % 16), and small shapes; asserts byte-equal output vs scalar reference). * Existing `matmul_i8_to_i32_16x16_exact` continues to pass through the AMX tier on this host (which has amx_int8). * cargo clippy --lib --tests --features rayon,native -- -D warnings clean. * cargo fmt --all --check clean. Per-CPU dispatch state after this commit: matmul_bf16_to_f32: SPR+ AMX | Zen4/CPL VDPBF16PS | scalar (PR #182) | (PR #182) | (always) matmul_f32: SPR+ AMX | Zen4/CPL VDPBF16PS | scalar (PR #182) | (PR #182) | (always) matmul_i8_to_i32: SPR+ AMX | CPL/Zen4 VPDPBUSD | scalar (b1979d7) | (THIS COMMIT) | (always) So all three of the public matmul entry points now have full three-tier dispatch on x86_64. Out of scope (separate PRs): * AMX tile path for `simd_int_ops::gemm_u8_i8` (the slice-level u8×i8 surface from PR #182) — it's u8×i8 natively, no sign- shift bias needed, simpler than matmul_i8_to_i32. * AVX-VNNI ymm arm (Arrow Lake / Meteor Lake U: avxvnni without avx512vnni) — the `vnni2_*` functions exist in simd_amx.rs but need to be assembled into a m×n×k VNNI-ymm GEMM. Same shape as the avx512vnni arm just with ymm width. https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u
AdaWorldAPI
pushed a commit
that referenced
this pull request
May 21, 2026
…able Rebased onto master post-#181, #182, #183. Replaces the polyfill-based add_mul_f32/f64 with LazyLock-cached function pointers picking real hardware FMA per silicon, and adds two more LazyLock-cached primitives the consumer needs: is_amx_available() and vnni_dot_u8_i8. WHY: F32x16::mul_add on AVX2 builds drops to per-lane scalar f32::mul_add (simd_avx2.rs:586). The polyfill abstracts lane width but cannot pick between _mm256_fmadd_ps and _mm512_fmadd_ps — that is an instruction-family choice, not a lane-width one. LazyLock amortises a one-time simd_caps() read into a frozen fn pointer; every subsequent call is a single indirect jump with zero is_x86_feature_detected! overhead. No SimdProfile exposed at the consumer surface — agnostic contract preserved. add_mul_f32(acc, a, b) — acc[i] += a[i]*b[i] AVX-512F+FMA → _mm512_fmadd_ps 16-wide + 8-wide tail + scalar tail AVX2+FMA → _mm256_fmadd_ps 8-wide + scalar tail NEON → vfmaq_f32 4-wide + scalar tail scalar → f32::mul_add per lane no_std build → preserves the polyfill F32x16::mul_add path (LazyLock requires std) add_mul_f64(acc, a, b) — f64 sibling, same shape with 8/4/2 lanes. is_amx_available() — wraps simd_amx::amx_available() (CPUID + OSXSAVE + XCR0[17,18] + Linux arch_prctl(XCOMP_PERM)) in LazyLock<bool>. The 4-step gate, including the syscall, fires exactly once per process. Always false on non-x86_64. vnni_dot_u8_i8(a, b) — i32 dot of u8 × i8 slices: AVX-512 VNNI → delegates to simd_amx::vnni_dot_u8_i8 wrapped with scalar tail handling (the existing kernel processes only n - (n%64) since its cognitive-shader caller pre-aligns rows; general-purpose callers need the tail) AVX-VNNI 256 → delegates to simd_amx::vnni2_dot_u8_i8 directly (that one already handles its scalar tail) scalar → simd_amx::vnni_dot_u8_i8_scalar No intrinsic code is duplicated. The dispatcher composes existing simd_amx::* kernels (which #182/#184 also build on) into a safe LazyLock-cached consumer-facing wrapper. simd_amx::matvec_dispatch runs the same selection logic but uses is_x86_feature_detected! per call; this wrapper amortises that to once at startup. PARITY CONTRACT: - add_mul_f32 / add_mul_f64: bit-identical to f32::mul_add / f64::mul_add per lane via to_bits() assertion. All vector backends emit single-rounded IEEE-754 FMA. - vnni_dot_u8_i8: bit-identical i32 to scalar widen-and-multiply. VPDPBUSD does not saturate the accumulator (intermediate u8*i8 products bounded by 32385, four-element sums by 129540). Tests: 2101/2101 lib pass (7 new lazylock_dispatch_tests over 12 problem sizes / tail lengths). cargo clippy --lib clean under default and --features cpu-spr. On Sapphire Rapids host the LazyLock resolved to AVX-512+FMA for add_mul, AVX-512 VNNI for vnni_dot; AMX is_amx_available returns false (hypervisor masks XCR0[17,18]) — matches the Risk #3 demotion from 61b4563. This commit was rebased atop master after the parallel session shipped PR #182 (BF16 AMX tile kernels), #183 (F16C cast batch), and prepared #184 (TDPBUSD int8 tile + matmul_i8_to_i32 wiring). The earlier 469ecc7 (coarse + SimdTier) and 77e3971 (mul_add_f32_into + walkback) and be65595 (is_amx_available + vnni_dot duplicating intrinsics) are subsumed by this single clean commit: no public SimdProfile / SimdTier re-export, no duplicated intrinsic code, no mul_add_f32_into (master's add_mul_f32 shape is the right primitive).
AdaWorldAPI
pushed a commit
that referenced
this pull request
May 21, 2026
Extends the u8×i8 → i32 dispatch chain from PR #182's compile-time cascade (avx512vnni → avxvnni → scalar) by adding a top-tier AMX runtime check. Brings the SPR/GNR TDPBUSD path (16 384 MACs per instruction) to the slice-level surface that downstream consumers (lance-graph, etc.) use, completing the symmetry with PR #184's matmul_i8_to_i32 wiring. `gemm_u8_i8` is u8×i8 natively — no sign-shift bias trick needed (unlike `matmul_i8_to_i32` which is i8×i8 and has to convert via +128 then subtract `128·colsum(B)`). That makes the AMX path here a direct call with no bias correction. New helper `hpc::int8_tile_gemm::int8_gemm_amx_tiled(a_u8, b_i8, c, m, n, k)` factors out the tile-decomposition logic that was previously inlined in `matmul_i8_to_i32`. Both consumers now share the same helper: matmul_i8_to_i32: 1. shift A: i8 → u8 (+128) 2. int8_gemm_amx_tiled(a_u8, b, c, m, n, k) 3. subtract_i8_to_u8_bias(c, b, m, n, k) gemm_u8_i8 (AMX tier added in this commit): 1. int8_gemm_amx_tiled(a, b, c, m, n, k) — no shift, no bias The helper handles arbitrary 16/16/64-aligned shapes via a j_tile × i_tile loop calling int8_tile_gemm_16x16 per (16, 16) block. B sub-block extracted into K × 16 scratch once per j-tile, reused across all M i-tiles. **Overwrite semantics**: c is written not accumulated (the underlying int8_tile_gemm_16x16 accumulates into its tile buffer, but we zero the tile buffer before each call so the per-tile write to c is pure overwrite). Dispatch placement in gemm_u8_i8: * Tier 0 (this commit): runtime amx_available() check at the top of the function. AMX requires CPUID + XCR0 + Linux prctl which can't fit a target_feature compile-time gate. * Tiers 1-3: existing compile-time cfg-cascade (avx512vnni zmm → avxvnni ymm → scalar i8_gemm_i32). Unchanged. Misaligned shapes (m/n not multiples of 16, k not multiple of 64) or non-AMX hosts fall through to the compile-time cascade as before. Also fixed pre-existing clippy::manual_is_multiple_of warnings that surfaced in the new alignment check — switched from `% 16 == 0` to `.is_multiple_of(16)` etc. per the clippy hint (Rust 1.95 promoted this from `pedantic` to active warn). Verification: * 2095 lib tests pass (was 2094 — +1 new `gemm_u8_i8_amx_aligned_32x32x128` test exercising the AMX arm with a 32×32×128 shape that hits the AMX tier on this host's amx_int8 silicon). * 11 amx_matmul tests pass (matmul_i8_to_i32 refactored to call the shared helper — same behavior as before). * 4 gemm_u8_i8 tests pass (the existing ones still hit the compile-time cascade since their shapes aren't AMX-aligned). * cargo clippy --lib --tests --features rayon,native -- -D warnings clean. * cargo fmt --all --check clean. Per-CPU dispatch state after this commit: matmul_bf16_to_f32: SPR+ AMX | Zen4/CPL VDPBF16PS | scalar (PR #182) | (PR #182) | (always) matmul_f32: SPR+ AMX | Zen4/CPL VDPBF16PS | scalar (PR #182) | (PR #182) | (always) matmul_i8_to_i32: SPR+ AMX | CPL/Zen4 VPDPBUSD | scalar (PR #184) | (PR #184) | (always) gemm_u8_i8 (slice): SPR+ AMX | CPL/Zen4 VPDPBUSD | ARL ymm | scalar (THIS) | (PR #182) | (PR #182) | (PR #182) Out of scope (separate PRs): * AVX-VNNI ymm arm for matmul_i8_to_i32 — `vnni2_*` helpers exist in simd_amx.rs but need assembling into a m×n×k GEMM. Same shape as the avx512vnni arm just with ymm width. * NEON BFMMLA / SDOT on aarch64 via asm-byte — Phase 3b. https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u
6 tasks
AdaWorldAPI
pushed a commit
that referenced
this pull request
May 21, 2026
**Alternative** to the compile-time cascade in `crate::simd::*` /
`crate::simd_ops::*`. **Additive**: gated under
`--features runtime-dispatch`, does not touch any existing path.
Mutually exclusive with `nightly-simd` (the portable-SIMD polyfill
replaces the architecture-specific intrinsics that the runtime
trampolines select between).
Use case: ship ONE binary that adapts across heterogeneous
deployment silicon (AVX-512 server + AVX2-only laptop + Arrow Lake
desktop + Sapphire Rapids workstation) from the same artifact. The
existing compile-time `v3` / `v4` / `native` / `nightly-simd`
configs target a single class of CPU per build; the runtime layer
targets the union via per-op LazyLock<fn ptr> trampolines.
Design from `.claude/knowledge/simd-dispatch-architecture.md` § 7.1
/ Phase 5, building on the precedent set by
`hpc::bgz17_bridge::{L1_KERNEL, L1_WEIGHTED_KERNEL, ...}`
(`LazyLock<L1Fn>` pattern, lines 75-86) already proven in tree.
# Dispatch model
One `LazyLock<fn ptr>` per public surface. First call fires the
closure which reads `simd_caps()` and selects a backend; every
subsequent call is one pointer-deref + indirect call. Per-call
overhead: ~2-3 ns (LazyLock atomic-acquire load that's cache-
resident after first hit + indirect-call branch-target predict).
Invisible against any SIMD op's actual work (~100+ cycles).
# Module layout
src/simd_runtime/
mod.rs — module entry, mutual-exclusion check vs
nightly-simd, public re-exports
vnni_dot.rs — u8×i8 → i32 dot (the proposal's canonical
example): 3 backends, the AVX-512 arm
wraps `simd_amx::vnni_dot_u8_i8` with a
scalar tail because the existing kernel
silently drops n%64 lanes (its matvec
caller pre-aligns rows; a general-purpose
dispatch surface cannot assume that)
add_mul.rs — slice-level FMA (acc += a × b) for f32/f64;
the ONLY new kernel code in this module —
4 backends per type (avx512 / avx2+fma /
neon / scalar), each ~15 LoC of direct
intrinsics
matmul.rs — thin trampolines for matmul_bf16_to_f32 /
matmul_f32 / matmul_i8_to_i32 / gemm_u8_i8
delegating to existing functions that
already runtime-dispatch internally
(PR #182 / #184 / #185)
casts.rs — trampolines for the four half-precision
batch casts delegating to PR #183's already-
runtime-dispatched implementations
# Backend reuse — no kernel duplication
Every dispatch arm delegates to a kernel that already exists in
tree. The runtime layer is just the trampoline. The only NEW
kernel code is `add_mul_f32` / `add_mul_f64` (no pre-existing
slice-level FMA primitive in tree to delegate to — the compile-
time `crate::simd_ops::add_mul_f32` from PR #182 polyfills through
the F32x16 lane wrapper; the runtime version skips that
indirection for one more inlined intrinsic per chunk).
# Invariants preserved from this PR series
* No-FP32-roundtrip on BF16/F16 arithmetic — backends respect
the bit-exact mantissa rule
* Asm-byte encoding for nightly-gated AMX / FP16 — selected
backends keep their existing asm-byte fast paths
* Little-endian byte contracts for half-precision carriers
* Accumulator-preservation in tile paths (codex P1 from #184)
* Boundary assertions on safe public fns (codex P1 from #185) —
the public `vnni_dot_u8_i8(a, b)` etc. inherit the asserts
transparently via the call chain
# Verification
* Default build (no feature): 2087 lib tests pass — the
`simd_runtime` module is gated out, zero impact on existing
paths.
* `cargo test --lib --features runtime-dispatch`: **2105 lib
tests pass** (+8 new in `simd_runtime::*::tests`).
* `cargo clippy --lib --tests --features rayon,native -- -D warnings`
clean (default).
* `cargo clippy --lib --tests --features rayon,native,runtime-dispatch
-- -D warnings` clean.
* `cargo fmt --all --check` clean.
* Mutual-exclusion enforced via `compile_error!` in
`simd_runtime/mod.rs` — `--features runtime-dispatch,nightly-simd`
fails to compile with a clear error.
# What's NOT in this PR (deferred)
* Sweep the remaining ~15-20 SIMD/HPC public surfaces (min_i8,
max_i8, add_i8, dot_i8, etc.). Each is ~30-50 LoC of trampoline;
pattern is established here. Estimated ~700-900 more LoC across
the full surface map.
* CI matrix entry for `runtime-dispatch-portable` (per
simd-dispatch-architecture.md § 7 / TD-SIMD-9). Job builds
with `--features runtime-dispatch` on a v3 baseline runner and
asserts every trampoline lands on its expected backend.
* `simd_caps()` snapshot logging at process start (debug-only)
to aid release-binary deployment debugging — "which arm did
you actually pick?"
# Cost summary
src/simd_runtime/ +537 LoC (4 modules)
src/lib.rs +9 LoC (cfg-gated mod decl)
Cargo.toml +21 LoC (feature decl + doc)
Total ~570 LoC
Trampoline LoC per surface (this PR's sample):
vnni_dot 170 LoC (LazyLock + 3 arms + wrapper + tests)
add_mul (f32+f64)218 LoC (LazyLock×2 + 4 arms×2 + tests — the ONLY new kernels)
matmul (4 ops) 100 LoC (thin delegations + tests)
casts (4 ops) 75 LoC (thin delegations + tests)
Out-of-tree estimate for the full sweep (per § 7 of the design
doc): ~1400 LoC total once all ~25 public SIMD/HPC surfaces are
wired. This PR establishes ~40% of that budget with the canonical
patterns.
https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Nine commits across three areas — recovering JIT-parity-zone slice helpers, building the per-CPU agnostic-surface resolution map, and landing two of its Phase-1 wirings.
Recovery + agnostic surface (commits 1-3)
0a46e7frestoressimd_ops::array_windows/array_windows_checkedand adds slice-leveladd_mul_f32/add_mul_f64— the foundation primitives the BLAS-graph GEMM path relies on (an earlier sprint had removed them).86b0f3fshipssimd_int_ops::gemm_u8_i8as an agnostic surface with a compile-timeavx512vnni → avxvnni → scalardispatch chain.caf0471adds the AVX-VNNI ymm arm (Arrow Lake, Meteor Lake U) and bumps.cargo/config-avx512.tomlfrom bare v4 tosapphirerapids(was missing VNNI).0134916adds an#[ignore]'d criterion-style bench harnessbench_gemm_u8_i8_vs_scalarso the three arms have an apples-to-apples timing surface.Per-CPU resolution matrix (commits 4-5)
b34d430introduces.claude/knowledge/agnostic-surface-cpu-matrix.md— a 14-CPU × ~80-symbol matrix mapping every public type / function incrate::simd::*+crate::hpc::*to its actual lowering per profile. Cross-references the W1a consumer contract, the TD-SIMD audit, and the existing dispatch-architecture doc.058ef61expands to full integer-lane coverage (I8/U8/I16/U16/I32/U32/I64/U64 at 256/512-bit), adds the cross-cutting infrastructure status table (which configs exist, what features are missing), and grows the integration plan with Phases 0-6 + an explicit out-of-scope list.Phase 1 wirings (commits 6-9)
b5bca4e— MX-T1a:simd_int_ops::add_i8 / sub_i8 / add_i16lifted from scalar to polyfilled lanes (I8x64/I16x32on x86,I8x16/I16x8on aarch64). Uses the samemin_i8-style cfg-cascade. Existing parity tests cover tail lengths 0/1/32/63/64/65/127/128/129/256.bede3d2— design rule + matrix update: flips the three MX-T1a cells in the matrix from scalar to per-CPU SIMD, and codifies the asm-byte encoding rule for AMX/F16/FP16 paths (Phases 1b/3b/3c/4d). AMX intrinsics are nightly-only (issue #126622) and avx512fp16/NEON-fp16 have stabilization churn on Rust 1.95 stable — raw.byte/.instencoding (matchingsimd_amx.rs:16-19) is the documented stable-toolchain path. Does NOT apply to instructions with stable intrinsics on 1.95 (_mm512_dpbusd_epi32,_mm256_cvtph_ps,_mm512_cvtne2ps2bf16).fe334de— TD-T1:matmul_bf16_to_f32's AMX arm was placebo (bothif amx_available()andelsecalled the scalar reference). This wires the 16/16/32-aligned path throughbf16_tile_gemm::bf16_tile_gemm_16x16which emits TDPBF16PS via the asm-byte path insimd_amx.rs::tile_dpbf16ps— 8 192 BF16×BF16 multiplies + 256 f32 accumulates per instruction on real SPR silicon. Misaligned shapes fall back to the validated scalarbf16_gemm_f32. Non-AMX hosts always take the scalar fallback.Test plan
x86-64-v3AVX2) —cargo test --lib: 2087 passed, 0 failed, 29 ignored.cargo clippy --lib -- -D warningsclean.cargo --config .cargo/config-avx512.toml=cascadelake(= v4 + VNNI) — 15 simd_int_ops tests pass on the AVX-512BW direct path (_mm512_add_epi8,_mm512_add_epi16).cargo --config .cargo/config-avx512.toml(sapphirerapids) — runner CPU on this dev host shows onlyavx512_vnniin/proc/cpuinfo(no AMX/BF16/FP16), so SPR-targeted binaries SIGILL on unrelated tests. Needs real SPR silicon for verification.min_i8/max_i8cross-arch dispatch.https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u
Generated by Claude Code