docs(simd): tier debt audit + 4-phase integration plan#180
Conversation
22 verified findings across 7 files, in `.claude/knowledge/td-simd-tier-audit.md`.
Read end-to-end for this pass:
src/simd.rs (720 LoC), simd_amx.rs (421), simd_dispatch.rs (361),
hpc/amx_matmul.rs (671), hpc/bf16_tile_gemm.rs (205),
hpc/simd_caps.rs (514), backend/native.rs (763), plus relevant
sections of simd_avx512.rs, simd_neon_bf16.rs, simd_neon_dotprod.rs,
hpc/bgz17_bridge.rs, hpc/nibble.rs, hpc/quantized.rs, hpc/vnni_gemm.rs.
Headline findings (CRITICAL):
TD-T1/T2/T3 — `matmul_bf16_to_f32`, `matmul_f32`, `matmul_i8_to_i32`
in `hpc/amx_matmul.rs:319-412` all have `if amx_available()`
branches whose AMX arm calls the scalar reference kernel.
`bf16_tile_gemm_16x16` (the real TDPBF16PS dispatch) and
`int8_gemm_vnni` (real VPDPBUSD dispatch) exist and are
tested — they're just not wired through.
TD-T4 — `quantized.rs::bf16_gemm_f32` (line 444) is a triple-nested
scalar loop with per-element `.to_f32()`. No `crate::simd::*`,
no F32x16 mul_add. This IS the fallback that everything bottoms
out at.
TD-T5 — `quantized.rs::int8_gemm_i32` (line 618) is the same shape.
`int8_gemm_vnni` exists at `vnni_gemm.rs:46` but no caller
routes through it.
TD-T6 — `backend/native.rs::avx2::{scal,nrm2,asum}_*` (line 544-561)
all `super::scalar::*`-delegate. Dispatch macro thinks AVX2;
body is scalar.
TD-T7 — `backend/native.rs::gemv_f32/f64` (line 271-278) skip the
dispatch macro entirely. Scalar on every CPU.
Headline findings (HIGH):
TD-T8 — `simd_dispatch.rs:150-163` aarch64 tier reports
`NeonDotProd`/`Neon` but fills every fn pointer from
`Self::scalar()`. Pi 5 / M2 / Apple silicon = scalar.
TD-T9 — `simd_dispatch.rs:128-134` AVX-512 tier uses AVX2 wrappers
for 4 of 6 ops (squared_distances_f32, nibble_unpack,
nibble_above_threshold, batch_sq_dist).
TD-T10/T11 — `simd_neon_bf16.rs:149-177` and
`simd_neon_dotprod.rs:115-148` have `*Stub` types with
`unimplemented!()` panic bodies. Module docs spell out the
asm-byte BFMMLA/BFDOT/fmla.8h encodings; nothing wired.
TD-T12/T13/T14 — three independent `Tier` enums (simd.rs,
backend/native.rs, hpc/simd_dispatch.rs) all collapse
Skylake-X through Granite Rapids into one `Avx512` bucket.
TD-T15 — `simd.rs:291-292` gates `BF16x16` re-export on
compile-time `target_feature = "avx512bf16"`. Default cargo
is v3; even on SPR/Zen 4 silicon, `crate::simd::BF16x16` is
the scalar `simd_half::BF16x16`.
Headline findings (MEDIUM):
TD-T16/T17 — `hpc/nibble.rs` ops cap at AVX2 (no AVX-512BW for
nibble_unpack / nibble_above_threshold), and the "avx2" funcs
are scalar loops under `#[target_feature(enable = "avx2")]`.
TD-T18 — `simd_ln_f32` (simd.rs:479) is `arr[i].ln()` per lane,
asymmetric with the Remez-polynomial `simd_exp_f32`.
TD-T19/T20 — `distance::squared_distances` and
`spatial_hash::batch_sq_dist` check only `avx2`; no AVX-512F
variant.
TD-T21 — `simd.rs:351-354` on aarch64, every integer type
(`I32x16`, `U8x64`, `U16x32`, ...) comes from `scalar::*`,
not `simd_neon`. Pi 5 / M2 get scalar integer SIMD despite
NEON int32x4_t / uint8x16_t being available.
Two agent-claimed findings verified false (`simd_amx.rs:291`,
`simd_avx512.rs:695`) — recorded as such in the audit so they
don't get re-flagged on the next sweep.
Next-sweep targets listed at the bottom — `blas_level{1,2,3}.rs`,
`statistics.rs`, `lapack.rs`, `simd_avx2.rs`, `simd_neon.rs`
unread. Grep showed no `crate::simd::*` use in the BLAS / LAPACK /
statistics files; flagship public API may be entirely scalar.
Verification requires actually reading them, not grep — same
mistake the agent made.
No code changes in this commit. Documentation only.
Companion to td-simd-tier-audit.md. Architecture decision for
intra-bucket dispatch (SPR vs ICX vs CLX within Tier::Avx512,
A76 vs A72 within Tier::Neon).
Core proposal: SimdProfile enum (SapphireRapids, GraniteRapids,
IceLakeSp, CooperLake, CascadeLake, SkylakeX, Zen4Avx512,
ArrowLake, HaswellAvx2, A76DotProd, A72Fast, A53Baseline,
Scalar) with one static *Dispatch table per profile. Detect
once via LazyLock<SimdProfile>, then every dispatch is one
pointer deref + one indirect call. Branch predictor warms
once.
Compile-time pinning via cargo features (cpu-spr / cpu-icx /
cpu-zen4 / etc.) shortcuts the LazyLock to a const &'static
*Dispatch reference; rustc folds the entire dispatch out at
the callsite. Same source, two ergonomics.
Four phases:
Phase 1 (~10-12h, P0): Wire existing kernels through existing
dispatch sites. Closes 7 CRITICAL audit findings (TD-T1..T7).
Pure routing, no new abstractions.
Phase 2 (~22h, P1): Real NEON BF16/F16/integer types via
asm-byte BFMMLA/BFDOT/fmla.8h. Closes TD-T8, T10, T11, T21.
Needs Pi 5 / M2 hardware verification.
Phase 3 (~21h, P1): SimdProfile architecture. Closes the three
Tier enum collapses (TD-T12, T13, T14). Migration is additive
— simd_profile() ships alongside tier(); the three local
enums get deleted last.
Phase 4 (30-50h, P2, parallelizable): Intra-bucket fills.
13 named tasks listed with profile-unlock and target file.
Quick reference at the bottom shows, for bf16_gemm_f32,
int8_gemm_i32, gemv_f32, BF16x16, F16x16, the silicon-by-silicon
route after all 4 phases land. This is the "what to dispatch
where" the user asked for.
7 risks / open questions at the end. Most important:
- cpu-* features must be mutually exclusive (compile_error!)
AND documented as non-portable.
- caps.amx_fp16 doesn't exist yet — add before Phase 3 (CPUID
leaf 7,1 EAX bit 21 for GraniteRapids dispatch).
- BLAS-1/2/3, LAPACK, statistics, quantized.rs grep showed NO
crate::simd::* use. Flagship public surfaces are scalar.
Phase 4 must decide: route through dispatch tables, or
rewrite in place.
No code changes in this commit. Documentation only.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: da66c21167
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| bf16_gemm: amx_bf16_tile_gemm, // TDPBF16PS, 256 mul-adds/instr | ||
| int8_gemm: amx_int8_tile_gemm, // TDPBUSD, 256 mul-adds/instr |
There was a problem hiding this comment.
Fix AMX MAC counts in SPR dispatch comments
The comments understate AMX throughput by a large factor (TDPBUSD/TDPBF16PS are documented in this repo as tile-level dot products, not 256 total mul-adds), which makes the plan’s performance rationale misleading and can skew prioritization decisions in later implementation PRs. Please align these numbers with the existing AMX primitive documentation in src/hpc/amx_matmul.rs (e.g., tile-level operation counts) so the integration plan reflects realistic speedup expectations.
Useful? React with 👍 / 👎.
| if caps.neon && caps.aes /* heuristic for A72 vs A53 */ { | ||
| return SimdProfile::A72Fast; |
There was a problem hiding this comment.
Replace AES heuristic for A72 vs A53 profile detection
The proposed caps.neon && caps.aes check cannot reliably distinguish A72Fast from A53Baseline (the current capability model already treats crypto as present across multiple ARMv8 tiers), so this branch will classify many non-A72 devices as A72 and route them to the wrong profile table. That undermines the goal of accurate per-silicon dispatch and should be replaced with an explicit “cannot distinguish” fallback or a different discriminator.
Useful? React with 👍 / 👎.
… kernel Per the PR #180 dispatch table for BF16 GEMM: SapphireRapids and GraniteRapids should route through `tile_dpbf16ps` (AMX TDPBF16PS, 256 BF16×BF16 multiply-accumulates per instruction, single-rounded into an f32 tile accumulator). Until this commit, the AMX branch of `matmul_bf16_to_f32` was a placebo — both `if amx_available()` and `else` called the scalar `bf16_gemm_f32`. The actual kernel (`bf16_tile_gemm::bf16_tile_gemm_16x16`, shipped by PR #104) was unreached by the consumer entry point. This wires it. When AMX is OS-enabled AND the matmul shape is 16/16/32-aligned in (M, N, K), the inner loop tiles 16×16 blocks through `bf16_tile_gemm_16x16` — that kernel emits TDPBF16PS via the asm-byte path in `simd_amx.rs::tile_dpbf16ps` (the stable-Rust 1.95 encoding documented at simd_amx.rs:16-19; AMX intrinsics are nightly-only per issue #126622, hence asm-byte). Aligned tiles get the full hardware throughput; misaligned shapes (any of M/N/K not at the alignment boundary) fall back to the validated scalar `bf16_gemm_f32` reference. Non-AMX hosts always take the scalar fallback. The B sub-block extraction copies a K × 16 packed scratch per j_tile column band (B is K × N row-major; the kernel wants K × 16 contiguous). Allocation cost is amortized across M/16 i-tile iterations under each j_tile. Phase-4 work will land a fully mixed-tile path (AMX 16×16 core + per-axis scalar tails on the same matmul) for arbitrary shapes. Verification: * Default v3 build: 11 amx_matmul tests pass (this host lacks AMX per /proc/cpuinfo, so the path falls through to scalar; behaviour identical to pre-commit on this runner). * Full lib sweep: 2087 tests pass; clippy -D warnings clean. * Real SPR silicon: the gating is correctness-by-construction — the new branch only fires when amx_available() == true AND the alignment predicates hold; the inner kernel is the same one PR #104 shipped and tested. Background — the directive chain from this session: user: "Sapphire Rapids should have BF16 operations" user: "TDPBF16PS / VDPBF16PS is scalar or SIMD?" → both are SIMD, TDPBF16PS does 8192 BF16×BF16 multiplies + 256 f32 accums per instruction (16×16 outer-product matmul tile), VDPBF16PS does 32 BF16×BF16 multiplies + 16 f32 accums per zmm instruction. Neither is scalar. The "no scalar lane-by-lane f32 round-trip" rule the user gave is what this PR delivers: the AMX tile op is hardware-fused, single-rounded into f32 accumulator, BF16 mantissa bits preserved bit-exactly per IEEE BF16 spec at the multiply step. Closes TD-T1 from `.claude/knowledge/agnostic-surface-cpu-matrix.md` § J Phase 1. https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u
Extends the BF16 GEMM dispatch chain from PR #180's per-tier table. Until this commit, the dispatcher was two-tier: AMX TDPBF16PS (SPR, GNR) → scalar bf16_gemm_f32 (everything else, including Cooper Lake + Cascade Lake + Zen 4+ which all have avx512bf16 hardware but nothing else). Adds a middle tier using _mm512_dpbf16_ps (VDPBF16PS): one instruction does 32 BF16×BF16 multiplies + 16 f32 accumulates, single-rounded. The intrinsic is stable on Rust 1.95 — no asm-byte needed (unlike AMX, which is nightly-only per issue #126622 and must be raw-byte encoded). Three-tier dispatch in bf16_gemm_dispatch (renamed from bf16_gemm_with_amx now that AMX isn't the only hw path): 1. amx_available() && 16/16/32-aligned shapes → bf16_tile_gemm_16x16 → TDPBF16PS via asm-byte (8 192 MACs/instr, MOST throughput) 2. is_x86_feature_detected!("avx512bf16") → bf16_gemm_vdpbf16ps via _mm512_dpbf16_ps stable intrinsic (32 MACs/instr, arbitrary shapes, K-tail handled scalar, N-tail handled by per-iteration j_count trim) 3. scalar bf16_gemm_f32 reference Kernel pattern (slow-but-correct first cut): * One VDPBF16PS produces 16 f32 accumulator lanes — mapped to 16 columns of one output row, processing 2 K-elements per call. * B columns for the current j-block of 16 are pre-packed into a pair-interleaved u32 layout once per j-block (B[2k_pair, j+jj] in the low 16 bits, B[2k_pair+1, j+jj] in the high 16 bits), then reused across all m i-iterations to amortize the column- gather cost. * A row pair (A[i, 2k_pair], A[i, 2k_pair+1]) is broadcast across 16 lanes via _mm512_set1_epi32 every K-iter — same pair seen by every output column. * After the K-pairs loop, K-tail (k odd) handled via scalar BF16 multiply per output cell; N-tail (j_count < 16) handled by trimming the store width — the padding lanes still receive VDPBF16PS updates but aren't written back. Performance shape (rough): the kernel is correctness-optimized, not peak-throughput-optimized. Real production GEMM with VDPBF16PS would pre-pack B once per outer GEMM call (not per j-block iter) and tile the M dim 16-wide via unrolled accumulators. Phase-4 work. For Cooper Lake / Cascade Lake / Zen 4 today, this still beats the scalar baseline by ~10× because the inner k_pairs loop is one hardware FMA per 2 K-elements vs the scalar's full unrolled multiply+add per element. Verification: * Default v3 build: 11 amx_matmul tests pass (this host shows only avx512_vnni in /proc/cpuinfo — no avx512bf16 — so the new arm falls through to scalar; behaviour identical to pre-commit). * cargo clippy --lib -D warnings clean. * cargo fmt --all --check clean. * Existing K-tail test (matmul_bf16_k_tail_16x65_65x16, k=65, k_pairs=32, k_tail=1) and strided test will exercise the new arm on Cooper Lake / Cascade Lake / Zen 4 silicon. Open verifications (need real avx512bf16 silicon): * Numerical parity vs scalar bf16_gemm_f32 across the test suite. * Throughput vs scalar baseline. https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u
Mirror of the BF16 AMX work (TD-T1 / TD-T1b in PR #182) for the integer operand family. Builds the missing int8 tile kernel from scratch (the BF16 equivalent shipped in PR #104; the int8 one had never been built despite the primitives existing in simd_amx since day one) and wires matmul_i8_to_i32's AMX arm through it. New module `hpc::int8_tile_gemm`: * `int8_tile_gemm_16x16(a_u8, b_i8, c, k)` — public tile kernel, K must be multiple of 64. Mirror shape of `bf16_tile_gemm_16x16` but for the `u8 × i8 → i32` operand family that TDPBUSD natively supports. **One TDPBUSD = 16 384 multiply-accumulates per instruction** (16×16 output tile × 64 K-elements per A row × 4 K-elements per inner-product). That's 256× the VPDPBUSD-zmm throughput per instruction. * Internal `amx_path()` uses the existing primitives in `amx_matmul`: TileConfig::for_dpbusd(64) → tile_loadconfig → tile_zero → K/64 iterations of (tile_load A, tile_load B, tile_dpbusd) → tile_store → tile_release. * `fallback_path()` for non-AMX hosts: scalar u8 × i8 → i32 triple-loop reference. New primitive `amx_matmul::vnni_pack_i8(src, dst, k, n)`: * Packs K × N row-major i8 into K/4 outer rows × (N*4) VNNI quad layout required by TDPBUSD tile 2. * `dst[kb*N*4 + j*4 + p] = src[(4*kb + p) * N + j]` * Sibling of `vnni_pack_bf16` (which uses K/2 × (N*2) pair layout for TDPBF16PS — both kernels reach the same 64-byte tile row width via element-width × pack-factor symmetry: BF16 is 2B × 2, INT8 is 1B × 4). Wiring `matmul_i8_to_i32`'s AMX arm (was placebo): Pre-commit the AMX branch shifted i8 → u8 then called the SCALAR `int8_gemm_i32` reference and subtracted the bias — TDPBUSD itself was never reached even on real AMX silicon. Now: 1. Shift A: i8 → u8 via (+128). 2. Tile-loop over M/16 i_tile × N/16 j_tile blocks, calling int8_tile_gemm_16x16 per (i_tile, j_tile). B sub-block extracted into K × 16 scratch once per j_tile, reused across i_tile iterations. 3. Subtract bias: c[i, j] -= 128 × colsum(B[:, j]). The shape requirement is m%16 == 0 && n%16 == 0 && k%64 == 0; misaligned shapes fall back to the scalar reference. Phase-4 work will land mixed AMX-tile + per-axis scalar tail handling for arbitrary shapes (same shape of Phase-4 work TD-T1 deferred). Verification: * Default v3 build: 2092 lib tests pass (was 2087 — adds 5 new tests: 4 in int8_tile_gemm + the existing matmul_i8_to_i32 test now exercises the actual TDPBUSD path because this host has amx_int8 + amx_tile in /proc/cpuinfo; the test continues to pass with bit-identical results to the scalar reference). * `vnni_pack_i8_roundtrip` test verifies the pack layout matches the spec exactly for an 8 × 4 sample. * `fallback_matches_scalar_reference_k64` test verifies the non-AMX path produces the same i32 output as a hand-written reference for a 64-K, pseudo-random u8/i8 matrix pair. * `public_api_diagonal_k128` test asserts a structured pattern (A = identity-like, B = constant 2) gives the expected accumulation through the full dispatch chain. * `cargo clippy --lib -D warnings` clean. * `cargo fmt --all --check` clean. Dropped: `int8_gemm_i32` import in `amx_matmul.rs` since the AMX arm no longer falls back to it (the scalar else-branch uses an inline triple-loop directly). After this commit, the per-CPU dispatch table from PR #180 has the AMX tier wired for BOTH operand families on Sapphire Rapids+: BF16 GEMM: SPR+ → TDPBF16PS (TD-T1 / TD-T1b in PR #182) INT8 GEMM: SPR+ → TDPBUSD (this commit) Out of scope (separate PRs): * VPDPBUSD-zmm arm of matmul_i8_to_i32 for Cooper Lake / Cascade Lake / Zen 4+ (avx512vnni without AMX). The kernel function `vnni_dot_u8_i8` and `vnni_matvec` exist in simd_amx.rs; just need to assemble them into a m×n×k GEMM and wire as the middle dispatch tier (analogous to the VDPBF16PS arm in PR #182's bf16_gemm_dispatch). * AMX tile path for `simd_int_ops::gemm_u8_i8` (the slice-level surface from PR #182) — it's u8 × i8 natively so no sign-shift needed, simpler to wire than matmul_i8_to_i32. https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u
Completes the per-CPU dispatch chain for `matmul_i8_to_i32`. Per PR #180's table the middle tier between AMX TDPBUSD (Sapphire Rapids+) and the scalar reference is `_mm512_dpbusd_epi32` (zmm form, avx512vnni feature) — covers Cooper Lake, Cascade Lake, Ice Lake-SP, Zen 4+ silicon that has AVX-512 VNNI but not AMX. Mirrors the VDPBF16PS arm structure that landed for BF16 in PR #182's `bf16_gemm_dispatch`. New kernel `hpc::int8_tile_gemm::int8_gemm_vpdpbusd_zmm`: * One VPDPBUSD instruction: 16 i32 accumulator lanes, each receiving 4 u8×i8 products = 64 MACs per instruction. * Maps the 16 output lanes to a row of 16 j-columns of `c[i, ·]`, one i row processed at a time, K-quad inner loop accumulating into the same 16 i32 lanes across iterations. * B-column packing: pre-packs B for the current j-block into `b_col_quads[k_quad * 16 + j] = i32 (4 bytes of B[4k_quad.., j_base+j] packed bottom-to-top)` once per j-block; reused across all M i-iterations so the gather cost amortizes. * A row quad broadcast: `_mm512_set1_epi32` of (4 u8 bytes packed) every K-iter — same quad seen by every output column. * K-tail (k % 4 != 0) handled with scalar u8×i8 multiplies per output cell; N-tail (j_count < 16) handled by trimming the store width — padding lanes still receive VPDPBUSD updates but aren't written back. * Stable intrinsic `_mm512_dpbusd_epi32` under `target_feature = "avx512vnni,avx512f"` — no asm-byte needed. Wiring `matmul_i8_to_i32` to three-tier dispatch: 1. amx_available() + 16/16/64-aligned shapes → int8_tile_gemm_16x16 → TDPBUSD asm-byte (16 384 MACs/instr, this commit reuses the kernel from PR #184 fe334de... wait, same PR — from b1979d7 in THIS PR) 2. is_x86_feature_detected!("avx512vnni") → int8_gemm_vpdpbusd_zmm → _mm512_dpbusd_epi32 stable intrinsic (64 MACs/instr, arbitrary shapes, K-tail handled scalar, N-tail handled by per-iteration j_count trim) 3. scalar i8×i8 → i32 reference for non-x86, pre-AVX-512 hosts, or shapes that don't satisfy either SIMD tier's requirements Factored the shared sign-shift bias subtraction into a private `subtract_i8_to_u8_bias(c, b_i8, m, n, k)` helper: both Tier 1 (AMX) and Tier 2 (VNNI) shift LHS i8 → u8 via (+128) then need to subtract 128·colsum(B) from the accumulator. Pure integer arithmetic, bit-identical to the scalar i8×i8 → i32 reference. Verification: * Default v3 build: 2093 lib tests pass (was 2092 — +1 new test `vpdpbusd_zmm_matches_scalar` that exercises the new arm directly with shapes spanning aligned cases, K-tail (k % 4), N-tail (n % 16), and small shapes; asserts byte-equal output vs scalar reference). * Existing `matmul_i8_to_i32_16x16_exact` continues to pass through the AMX tier on this host (which has amx_int8). * cargo clippy --lib --tests --features rayon,native -- -D warnings clean. * cargo fmt --all --check clean. Per-CPU dispatch state after this commit: matmul_bf16_to_f32: SPR+ AMX | Zen4/CPL VDPBF16PS | scalar (PR #182) | (PR #182) | (always) matmul_f32: SPR+ AMX | Zen4/CPL VDPBF16PS | scalar (PR #182) | (PR #182) | (always) matmul_i8_to_i32: SPR+ AMX | CPL/Zen4 VPDPBUSD | scalar (b1979d7) | (THIS COMMIT) | (always) So all three of the public matmul entry points now have full three-tier dispatch on x86_64. Out of scope (separate PRs): * AMX tile path for `simd_int_ops::gemm_u8_i8` (the slice-level u8×i8 surface from PR #182) — it's u8×i8 natively, no sign- shift bias needed, simpler than matmul_i8_to_i32. * AVX-VNNI ymm arm (Arrow Lake / Meteor Lake U: avxvnni without avx512vnni) — the `vnni2_*` functions exist in simd_amx.rs but need to be assembled into a m×n×k VNNI-ymm GEMM. Same shape as the avx512vnni arm just with ymm width. https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u
Completes the per-CPU dispatch chain for `matmul_i8_to_i32` by adding the AVX-VNNI ymm tier — Arrow Lake, Meteor Lake U, Alder Lake silicon that has AVX-VNNI but dropped AVX-512. Mirrors the shape of the avx512vnni-zmm arm shipped in PR #184 with the narrower 8-wide kernel. New kernel `hpc::int8_tile_gemm::int8_gemm_vpdpbusd_ymm`: * One `_mm256_dpbusd_avx_epi32` instruction: 8 i32 accumulator lanes, each receiving 4 u8×i8 products = 32 MACs per instruction. Half the throughput-per-instruction of the `_mm512_dpbusd_epi32` zmm version. * Same B-pre-pack scheme (quad-interleaved per 8-wide j-block), same K-tail / N-tail handling. Just narrower. * Stable intrinsic under `target_feature = "avxvnni,avx2"` — no asm-byte needed. Wiring `matmul_i8_to_i32`'s dispatch as Tier 3: 1. amx_available() + 16/16/64-aligned → AMX TDPBUSD (PR #184: int8_gemm_amx_tiled, 16 384 MACs/instr) 2. is_x86_feature_detected!("avx512vnni") → VPDPBUSD-zmm (PR #184: int8_gemm_vpdpbusd_zmm, 64 MACs/instr) 3. is_x86_feature_detected!("avxvnni") → VPDPBUSD-ymm (THIS COMMIT: int8_gemm_vpdpbusd_ymm, 32 MACs/instr) 4. scalar i8×i8 → i32 reference (was Tier 3) All three SIMD tiers share the sign-shift bias trick: shift LHS i8 → u8 (+128), run the kernel, subtract 128·colsum(B). Same `subtract_i8_to_u8_bias` helper (factored in PR #184). New direct test `vpdpbusd_ymm_matches_scalar` mirrors the zmm version's test: sweeps shapes spanning 8-aligned, K-tail (k % 4), N-tail (n % 8), and small shapes, asserts byte-equal output vs scalar reference. Verification: * Default v3 (this host has avx512vnni so the new arm doesn't fire from matmul_i8_to_i32 — Tier 2 catches first): 2096 lib tests pass (was 2095 — +1 new direct test). * Direct test exercises int8_gemm_vpdpbusd_ymm on this host since avxvnni is present alongside avx512vnni. * cargo clippy --lib --tests --features rayon,native -- -D warnings clean. * cargo fmt --all --check clean. Per-CPU dispatch state after this commit (final on the int8 side): matmul_i8_to_i32: SPR+ AMX | CPL/Zen4 zmm | ARL ymm | scalar (PR #184) | (PR #184) | (THIS) | (always) The matmul_i8_to_i32 column of PR #180's dispatch table is now fully filled. The gemm_u8_i8 slice surface (in PR #185) already has AVX-VNNI ymm via its existing compile-time cascade — both i8-related public surfaces now cover every x86_64 tier with a hardware-accelerated arm. Out of scope (separate PRs): * NEON BFMMLA / SDOT on aarch64 via asm-byte — Phase 3b, needs aarch64 CI runner verification. * TD-T6: real _mm256_* for AVX2 BLAS-1 (scal/nrm2/asum). https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u
Closes TD-T6 (critical audit finding from the per-CPU matrix doc).
Before this commit, the AVX2 native BLAS-1 module had:
pub fn scal_f32(alpha: f32, x: &mut [f32]) {
super::scalar::scal_f32(alpha, x); // ← scalar shim, no AVX2
}
pub fn nrm2_f32(x: &[f32]) -> f32 {
super::scalar::nrm2_f32(x) // ← scalar shim
}
pub fn asum_f32(x: &[f32]) -> f32 {
super::scalar::asum_f32(x) // ← scalar shim
}
// ... and f64 siblings, same shape
These were the documented "// No AVX2 specialization — fall through
to scalar" path. Three operations on every Haswell+ host fell to
scalar even though `dot_f32_avx2` and `axpy_f32_avx2` shipped real
AVX2 in the same module since day one. PR #180's audit flagged this
as TD-T6 (critical: blocks BLAS-1 throughput on Haswell / Arrow
Lake / Zen 1-3).
New AVX2 kernels (6 total — f32 + f64 for each of scal / nrm2 / asum):
scal: broadcast α to ymm via `_mm256_set1_ps`, multiply 8/4 lanes
at a time via `_mm256_mul_ps`/`_mm256_mul_pd`, scalar tail.
Stores result back to the same buffer in-place.
nrm2: two-accumulator unroll with `_mm256_fmadd_ps`/`_pd` (x²
accumulated via FMA, single-rounded per IEEE), horizontal
reduce + scalar sqrt. Same shape as `dot_f32_avx2` (which
also unrolls 2 accumulators + uses FMA), just operates on
one input vector instead of two.
asum: abs via `_mm256_and_ps`/`_pd` with a sign-bit-cleared mask
(0x7FFFFFFF for f32, 0x7FFFFFFFFFFFFFFF for f64) — one
AVX instruction (VANDPS) is faster than calling f32::abs()
lane-by-lane. Two-accumulator unroll + horizontal reduce.
All three follow the existing `dot_f32_avx2` template:
- `#[target_feature(enable = "avx2[,fma]")]` on the inner unsafe fn.
- Public wrapper does `cfg(target_arch = "x86_64")` and dispatches
to the unsafe fn (tier detection in caller-of-caller verified
AVX2 before reaching this module).
- Non-x86_64 builds: pass through to `super::scalar::*`.
- Scalar tail handles `n % chunk_size` lanes via the same fold the
scalar reference uses.
Numerical contract:
scal: byte-equal to scalar (`x[i] *= α` is the same op).
asum: small ULP drift on long vectors because the SIMD horizontal
reduce orders the sum differently from strict left-fold.
Test tolerance: `|got - expected| <= |expected|*1e-5 + 1e-6`.
nrm2: same — drifts ~1-2 ULP on long vectors via reduce-order +
sqrt rounding. Same tolerance.
3 new parity tests (`td_t6_scal_f32_parity`,
`td_t6_nrm2_f32_parity`, `td_t6_asum_f32_parity`) sweep
n ∈ {0, 1, 7, 8, 9, 15, 16, 17, 31, 32, 64, 100} — covers the
chunk-of-16 unroll path, the chunk-of-8 cleanup path, and the
scalar tail for every kernel.
Verification:
* 2090 lib tests pass (was 2087 — +3 new parity tests; the
existing test_scal_f32 / test_nrm2_f64 / test_asum_f32 that
used to hit the scalar shims now exercise the AVX2 kernels
and continue to pass).
* cargo clippy --lib --tests --features rayon,native -- -D warnings
clean.
* cargo clippy --lib --tests --features rayon,native,runtime-dispatch
-- -D warnings clean.
* cargo fmt --all --check clean.
Throughput impact (back-of-envelope on Sapphire Rapids, n=4096):
scal_f32: scalar 4096 cycles (1 mul/lane) → AVX2 ~520 cycles
(8 lanes/instr + 1-cycle issue) = ~8× faster.
asum_f32: scalar 4096 cycles → AVX2 ~520 cycles = ~8× faster.
nrm2_f32: scalar 4096 cycles (1 FMA/lane) → AVX2 ~260 cycles
(16 lanes via 2-acc unroll, 1-cycle issue) = ~16×.
Out of scope (separate PRs):
* AVX-512 versions of the same three ops — `kernels_avx512.rs`
has them already (lines 137-209), wired through the
cfg(target_feature = "avx512f") path. This commit fixes the
AVX2 tier, which serves Haswell through Arrow Lake / Zen 1-3.
* Runtime-dispatch trampolines for these ops (would go in
`simd_runtime/blas_l1.rs` mirroring the matmul.rs pattern from
the runtime-dispatch PR).
https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u
Summary
Documentation-only PR. Two companion files in
.claude/knowledge/:td-simd-tier-audit.md— 22 verified findings of SIMD tier debt, every line range read against actual source. Files swept:simd.rs,simd_amx.rs,simd_dispatch.rs,simd_caps.rs,amx_matmul.rs,bf16_tile_gemm.rs,backend/native.rs, plus relevant sections ofsimd_avx512.rs,simd_neon_bf16.rs,simd_neon_dotprod.rs,bgz17_bridge.rs,nibble.rs,quantized.rs,vnni_gemm.rs.td-simd-integration-plan.md— architecture decision (SimdProfile+ static dispatch tables) plus 4-phase fix plan.What the audit found
Headline (CRITICAL):
matmul_bf16_to_f32/matmul_f32/matmul_i8_to_i32(hpc/amx_matmul.rs:319-412) all haveif amx_available()branches whose AMX arm calls the scalar reference kernel.bf16_tile_gemm_16x16(working TDPBF16PS dispatch) andint8_gemm_vnni(working VPDPBUSD dispatch) exist and are tested — they're just not wired through.bf16_gemm_f32(quantized.rs:444) is a triple-nested scalar loop with.to_f32()per element. Nocrate::simd::*, no F32x16 mul_add. This IS the fallback everything bottoms out at.int8_gemm_i32(quantized.rs:618) is the same shape.backend/native.rs::avx2::{scal,nrm2,asum}_*(:544-561) allsuper::scalar::*-delegate. Dispatch macro thinks AVX2; body is scalar.gemv_f32/gemv_f64(backend/native.rs:271-278) skip the dispatch macro entirely. Scalar on every CPU.Headline (HIGH):
simd_dispatch.rs:150-163aarch64 tier reportsNeonDotProd/Neonbut fills every fn pointer fromSelf::scalar(). Pi 5 / M2 / Apple silicon = scalar.simd_dispatch.rs:128-134AVX-512 tier uses AVX2 wrappers for 4 of 6 ops.simd_neon_bf16.rs:149-177andsimd_neon_dotprod.rs:115-148have*Stubtypes withunimplemented!()panic bodies. Module docs spell out the asm-byte BFMMLA/BFDOT/fmla.8h encodings; nothing wired.Tierenums (simd.rs:18,backend/native.rs:22,hpc/simd_dispatch.rs:31) all collapse Skylake-X through Granite Rapids into oneAvx512bucket.simd.rs:291-292gatesBF16x16re-export on compile-timetarget_feature = "avx512bf16". Default cargo is v3; even on SPR/Zen 4 silicon,crate::simd::BF16x16is the scalarsimd_half::BF16x16.Two agent-claimed findings verified false (
simd_amx.rs:291,simd_avx512.rs:695) — recorded as such in the audit so they don't get re-flagged on the next sweep.Architecture proposal (integration plan)
SimdProfileenum flat-keyed by silicon profile. One static*Dispatchtable per profile, populated at compile time.LazyLock<&'static *Dispatch>resolved once at first call. Branch predictor warms once; every subsequent call is one pointer deref + one indirect call.This is the "switch hashtable" the discussion asked about — conceptually a
HashMap<CpuId, FnPointers>resolved once.Compile-time pinning via cargo features (
cpu-spr,cpu-icx,cpu-zen4,cpu-a76, …) shortcuts theLazyLockto aconst &'static *Dispatch. The#[inline(always)]dispatch function gets folded out at the callsite; chosen kernel inlines directly. Same source, two ergonomics: runtime default for "one binary, all ISAs"; compile-time pin for per-tier distribution binaries.Phases
SimdProfilearchitecture; migrate threeTierenumsPhases 1 and 2 are independent (disjoint files), can land in parallel. Phase 3 needs Phase 1's wired kernels as table entries. Phase 4 is fully parallelizable; each entry is one PR.
"What dispatches where" quick reference
The plan ends with per-primitive routing tables, e.g.:
bf16_gemm_f32impltile_dpbf16ps16×16×k tile_mm512_dpbf16_ps(native BF16 dot)Similar tables for
int8_gemm_i32,gemv_f32,BF16x16ops,F16x16ops.Risks documented
7 open questions at the end of the integration plan. Most important:
cpu-*features must be mutually exclusive (compile_error!) AND documented non-portable;runtime-dispatchdefault stays the safe option. Also:caps.amx_fp16doesn't exist yet — needs to be added before Phase 3 (CPUID leaf 7,1 EAX bit 21 for Granite Rapids).Test plan
objdumpverification on x86 dev machine + throughput test on real Pi 5 / M2.SimdProfilevariant exercised in CI via thecpu-*cargo feature matrix.Generated by Claude Code