diff --git a/.claude/knowledge/td-simd-integration-plan.md b/.claude/knowledge/td-simd-integration-plan.md new file mode 100644 index 00000000..40e2e681 --- /dev/null +++ b/.claude/knowledge/td-simd-integration-plan.md @@ -0,0 +1,479 @@ +# SIMD Tier Integration Plan + +> **Companion to:** `.claude/knowledge/td-simd-tier-audit.md` — read it first. This document is the *fix plan* for the 22 verified findings recorded there, plus the architecture decision for **intra-bucket dispatch** (Sapphire Rapids vs Ice Lake-SP vs Cooper Lake within `Tier::Avx512`, A76 vs A72 within `Tier::Neon`, etc.). + +## Goals + +1. `crate::simd::*` reaches the silicon ceiling on every CPU. No scalar where SIMD exists; no AVX2 where AVX-512 exists; no AVX-512F-only where VNNI / BF16 / FP16 / AMX exists. +2. **One binary, all ISAs** (default): runtime feature detection at startup, frozen dispatch tables, zero per-call branch cost. +3. **Optional pinning** for users distributing per-tier binaries: cargo features that statically resolve dispatch to one profile, removing the runtime branch entirely. Same source, two ergonomics. +4. Wire what already exists before adding new abstractions. Phase 1 closes ~70% of audit debt by routing existing kernels through their existing dispatch sites. + +## Non-goals + +- Inventing new GEMM kernels. `matrixmultiply` covers AVX2 GEMM; AMX tile kernels already exist; VNNI GEMM already exists. The work is routing, not implementation, for Phase 1. +- Replacing the `simd_caps()` singleton or the three existing `Tier` enums in one shot. Phase 3 consolidates them; Phases 1–2 use them as-is. +- Nightly intrinsics (`x86_amx_intrinsics`, etc.). The codebase has already chosen inline asm for AMX on stable; we extend that pattern, not bypass it. + +--- + +## Architecture decision: `SimdProfile` + static dispatch tables + +### The core problem + +The audit's TD-T12 / TD-T13 / TD-T14 — three independent `Tier` enums, each collapsing Skylake-X through Granite Rapids into one `Avx512` bucket. The audit also showed `simd_caps()` (the 20-bit per-feature singleton) already exists and is correct; the gap is that consumers branching on `tier()` get coarse answers, and consumers wanting to specialize on `(BF16, FP16, AMX-INT8, AMX-BF16)` simultaneously have to write 4-deep conditional cascades. + +### The architecture + +Introduce `SimdProfile` — a flat enum that enumerates *silicon profiles*, each defined as the combination of sub-features that distinguishes it from its neighbors. One profile = one set of best primitives. + +```rust +// src/hpc/simd_profile.rs (new file) + +#[derive(Debug, Clone, Copy, PartialEq, Eq)] +pub enum SimdProfile { + // ── x86_64 profiles ── + // Ordered by capability superset for readability; detection is explicit. + + /// Sapphire Rapids / Emerald Rapids: + /// AVX-512F+BW+VL+CD+DQ+VNNI+VBMI+BF16+FP16 + AMX-TILE+INT8+BF16. + SapphireRapids, + /// Granite Rapids: SPR + AMX-FP16 + AMX-COMPLEX. + GraniteRapids, + /// Ice Lake-SP / Tiger Lake server: AVX-512F+VNNI+VBMI, no BF16/FP16/AMX. + IceLakeSp, + /// Cooper Lake: AVX-512F+VNNI+BF16, no VBMI/FP16/AMX. + /// (Rare in the wild; included for completeness.) + CooperLake, + /// Cascade Lake: AVX-512F+VNNI, no VBMI/BF16/FP16/AMX. + CascadeLake, + /// Skylake-X / SP / W: AVX-512F only, no VNNI/VBMI/BF16/FP16/AMX. + SkylakeX, + /// Zen 4 / Zen 5 (Genoa, Ryzen 7000+, EPYC 9004+): + /// AVX-512F+VNNI+VBMI+BF16+FP16, no AMX. + /// Same capability set on Zen 5; we don't distinguish microarch within AMD. + Zen4Avx512, + /// Arrow Lake / Lunar Lake / Meteor Lake-H consumer: + /// AVX2 + AVX-VNNI-INT8 + AVX-IFMA + AVX-NE-CONVERT, no AVX-512. + ArrowLake, + /// Haswell through Coffee Lake / Zen 1-3 desktop: + /// AVX2 + FMA only, no AVX-512. + HaswellAvx2, + + // ── aarch64 profiles ── + + /// ARMv8.2-A: A76 (Pi 5), Apple M-series, Snapdragon 8 Gen 2+. + /// NEON + dotprod + fp16 + bf16+ (BFMMLA/BFDOT). + A76DotProd, + /// ARMv8.0 with 2× NEON pipelines: A72 (Pi 4). + A72Fast, + /// ARMv8.0 single pipeline: A53 (Pi 3 / Pi Zero 2 W). + A53Baseline, + + // ── Fallback ── + /// Anything else: wasm32, riscv, x86 baseline, unknown aarch64. + Scalar, +} +``` + +### Detection + +```rust +impl SimdProfile { + fn detect() -> Self { + #[cfg(target_arch = "x86_64")] + { + let caps = simd_caps(); + // Order matters: most-specific superset first. + if caps.amx_tile && caps.amx_bf16 && caps.avx512fp16 + && /* GNR-specific bit: amx_fp16 if/when detected */ false + { + return SimdProfile::GraniteRapids; + } + if caps.amx_tile && caps.amx_bf16 { + return SimdProfile::SapphireRapids; + } + if caps.avx512f && caps.avx512vnni && caps.avx512vbmi && caps.avx512bf16 { + return SimdProfile::Zen4Avx512; // ICX has VBMI but no BF16 + } + if caps.avx512f && caps.avx512vnni && caps.avx512vbmi { + return SimdProfile::IceLakeSp; + } + if caps.avx512f && caps.avx512vnni && caps.avx512bf16 { + return SimdProfile::CooperLake; + } + if caps.avx512f && caps.avx512vnni { + return SimdProfile::CascadeLake; + } + if caps.avx512f { + return SimdProfile::SkylakeX; + } + if caps.avxvnniint8 { + return SimdProfile::ArrowLake; + } + if caps.avx2 && caps.fma { + return SimdProfile::HaswellAvx2; + } + } + #[cfg(target_arch = "aarch64")] + { + let caps = simd_caps(); + if caps.asimd_dotprod && caps.fp16 { + return SimdProfile::A76DotProd; + } + if caps.neon && caps.aes /* heuristic for A72 vs A53 */ { + return SimdProfile::A72Fast; + } + return SimdProfile::A53Baseline; + } + SimdProfile::Scalar + } +} + +static PROFILE: LazyLock = LazyLock::new(SimdProfile::detect); + +#[inline(always)] +pub fn simd_profile() -> SimdProfile { + *PROFILE +} +``` + +### The "switch hashtable" — static dispatch tables per profile + +For each primitive family, define a `*Dispatch` struct of function pointers and one `static` table per profile. Profiles share entries where the silicon is identical (e.g. CascadeLake and SkylakeX share the AVX-512F path for ops that don't use VNNI/BF16). + +```rust +// src/hpc/gemm_dispatch.rs (new) + +pub struct GemmDispatch { + pub bf16_gemm: fn(&[BF16], &[BF16], &mut [f32], usize, usize, usize, f32, f32), + pub int8_gemm: fn(&[u8], &[i8], &mut [i32], usize, usize, usize), + pub f32_gemv: fn(usize, usize, f32, &[f32], usize, &[f32], f32, &mut [f32]), + // ... etc +} + +// One table per silicon profile. Compile-time const, lives in .rodata. + +static SPR_GEMM: GemmDispatch = GemmDispatch { + bf16_gemm: amx_bf16_tile_gemm, // TDPBF16PS, 256 mul-adds/instr + int8_gemm: amx_int8_tile_gemm, // TDPBUSD, 256 mul-adds/instr + f32_gemv: avx512_f32x16_gemv, // shared with all AVX-512 profiles +}; +static ICX_GEMM: GemmDispatch = GemmDispatch { + bf16_gemm: avx512_f32x16_bf16_gemm, // no native BF16 dot, use F32x16 mul_add + int8_gemm: avx512vnni_gemm, // VPDPBUSD, 64 mul-adds/instr + f32_gemv: avx512_f32x16_gemv, +}; +static CPL_GEMM: GemmDispatch = GemmDispatch { + bf16_gemm: avx512bf16_vdpbf16ps_gemm, // native VDPBF16PS + int8_gemm: avx512vnni_gemm, + f32_gemv: avx512_f32x16_gemv, +}; +static CLX_GEMM: GemmDispatch = GemmDispatch { + bf16_gemm: avx512_f32x16_bf16_gemm, + int8_gemm: avx512vnni_gemm, + f32_gemv: avx512_f32x16_gemv, +}; +static SKX_GEMM: GemmDispatch = GemmDispatch { + bf16_gemm: avx512_f32x16_bf16_gemm, + int8_gemm: scalar_int8_gemm, // no VNNI on SKX + f32_gemv: avx512_f32x16_gemv, +}; +static ZEN4_GEMM: GemmDispatch = GemmDispatch { + bf16_gemm: avx512bf16_vdpbf16ps_gemm, + int8_gemm: avx512vnni_gemm, + f32_gemv: avx512_f32x16_gemv, +}; +static ARROW_GEMM: GemmDispatch = GemmDispatch { + bf16_gemm: avx2_f32x8_bf16_gemm, + int8_gemm: avxvnniint8_gemm, + f32_gemv: avx2_f32x8_gemv, +}; +static HSW_GEMM: GemmDispatch = GemmDispatch { + bf16_gemm: avx2_f32x8_bf16_gemm, + int8_gemm: scalar_int8_gemm, + f32_gemv: avx2_f32x8_gemv, +}; +static A76_GEMM: GemmDispatch = GemmDispatch { + bf16_gemm: neon_bfmmla_bf16_gemm, // ARMv8.2-A BFMMLA + int8_gemm: neon_dotprod_int8_gemm, // ARMv8.2-A SDOT + f32_gemv: neon_f32x4_gemv, +}; +static A72_GEMM: GemmDispatch = GemmDispatch { + bf16_gemm: neon_f32x4_bf16_gemm, + int8_gemm: neon_f32x4_int8_gemm, // no SDOT pre-A76 + f32_gemv: neon_f32x4_gemv, +}; +static A53_GEMM: GemmDispatch = GemmDispatch { /* same as A72, lower MR */ ..A72_GEMM }; +static SCALAR_GEMM: GemmDispatch = GemmDispatch { + bf16_gemm: scalar_bf16_gemm, + int8_gemm: scalar_int8_gemm, + f32_gemv: scalar_gemv, +}; + +#[inline(always)] +pub fn gemm_dispatch() -> &'static GemmDispatch { + static TABLE: LazyLock<&'static GemmDispatch> = LazyLock::new(|| { + match simd_profile() { + SimdProfile::SapphireRapids | SimdProfile::GraniteRapids => &SPR_GEMM, + SimdProfile::IceLakeSp => &ICX_GEMM, + SimdProfile::CooperLake => &CPL_GEMM, + SimdProfile::CascadeLake => &CLX_GEMM, + SimdProfile::SkylakeX => &SKX_GEMM, + SimdProfile::Zen4Avx512 => &ZEN4_GEMM, + SimdProfile::ArrowLake => &ARROW_GEMM, + SimdProfile::HaswellAvx2 => &HSW_GEMM, + SimdProfile::A76DotProd => &A76_GEMM, + SimdProfile::A72Fast => &A72_GEMM, + SimdProfile::A53Baseline => &A53_GEMM, + SimdProfile::Scalar => &SCALAR_GEMM, + } + }); + *TABLE +} +``` + +Call site: + +```rust +pub fn bf16_gemm_f32(a: &[BF16], b: &[BF16], c: &mut [f32], m: usize, n: usize, k: usize, alpha: f32, beta: f32) { + (gemm_dispatch().bf16_gemm)(a, b, c, m, n, k, alpha, beta); +} +``` + +One pointer deref, one indirect call. Branch predictor warms on the first call; every subsequent call is monomorphic to whatever silicon detected at startup. **This is the "switch hashtable" the user asked about.** Conceptually a `HashMap` resolved once. + +### Compile-time pinning (the user's "what if native can handle compile time dispatch") + +Add cargo features that pin the runtime selection at compile time: + +```toml +# Cargo.toml +[features] +# Default: runtime LazyLock dispatch (one binary, all ISAs). +default = ["runtime-dispatch"] +runtime-dispatch = [] + +# Compile-time pins: pick ONE. Set RUSTFLAGS and skip LazyLock detection. +cpu-spr = [] # implies -Ctarget-cpu=sapphirerapids +cpu-gnr = [] # implies -Ctarget-cpu=graniterapids-d +cpu-icx = [] # implies -Ctarget-cpu=icelake-server +cpu-cpl = [] # implies -Ctarget-cpu=cooperlake +cpu-clx = [] # implies -Ctarget-cpu=cascadelake +cpu-skx = [] # implies -Ctarget-cpu=skylake-avx512 +cpu-zen4 = [] # implies -Ctarget-cpu=znver4 +cpu-arrowlake = [] # implies -Ctarget-cpu=arrowlake +cpu-haswell = [] # implies -Ctarget-cpu=haswell +cpu-a76 = [] # implies -Ctarget-cpu=cortex-a76 (or -Ctarget-feature=+dotprod,+fp16,+bf16) +``` + +And in code: + +```rust +#[inline(always)] +pub fn gemm_dispatch() -> &'static GemmDispatch { + #[cfg(feature = "cpu-spr")] + return &SPR_GEMM; + #[cfg(feature = "cpu-icx")] + return &ICX_GEMM; + #[cfg(feature = "cpu-zen4")] + return &ZEN4_GEMM; + // ... etc + #[cfg(all(feature = "runtime-dispatch", not(any(feature = "cpu-spr", feature = "cpu-icx", feature = "cpu-zen4", /* ... */))))] + { + static TABLE: LazyLock<&'static GemmDispatch> = LazyLock::new(|| /* runtime switch */); + *TABLE + } +} +``` + +The `cpu-*` arm is a const reference returned from a `#[inline(always)]` function. After monomorphization the compiler folds the entire dispatch out and inlines the chosen kernel directly at the callsite. **Zero runtime cost. Same source as runtime dispatch.** + +The same primitive functions (`amx_bf16_tile_gemm`, `avx512vnni_gemm`, etc.) are referenced by both the static profile tables AND the inline `cpu-*` returns. No duplication. + +### Per-function `target_feature` annotations remain mandatory + +The dispatch table picks the function pointer; the function itself still needs `#[target_feature(enable = "...")]` for the compiler to emit the right instructions. Pattern continues unchanged: + +```rust +#[cfg(target_arch = "x86_64")] +#[target_feature(enable = "avx512f,avx512bf16,avx512vl")] +unsafe fn avx512bf16_vdpbf16ps_gemm(...) { + // _mm512_dpbf16_ps lives here +} + +fn avx512bf16_vdpbf16ps_gemm_wrapper(...) { + // SAFETY: dispatch verified profile == CooperLake | Zen4Avx512 | SPR | GNR + unsafe { avx512bf16_vdpbf16ps_gemm(...) } +} +``` + +The dispatch table holds the safe wrapper; the wrapper relies on profile detection as its safety invariant. + +--- + +## Phase 1 — Wire existing infrastructure (P0) + +No new abstractions. Each task routes existing kernels through existing dispatch primitives. + +| Task | File | Change | Effort | +|---|---|---|---| +| TD-T1 | `src/hpc/amx_matmul.rs:304-331` | `matmul_bf16_to_f32` AMX arm calls `bf16_tile_gemm_16x16` for 16×16 sub-tiles, `bf16_gemm_f32` for tails. | 1h | +| TD-T2 | `src/hpc/amx_matmul.rs:342-370` | `matmul_f32` AMX arm: convert to BF16, call `bf16_tile_gemm_16x16`. Drop the duplicate `bf16_gemm_f32` line. | 30m | +| TD-T3 | `src/hpc/amx_matmul.rs:386-434` | `matmul_i8_to_i32`: AMX arm wires `tile_dpbusd` (primitives at `amx_matmul.rs:146-150` already work). Non-AMX arm calls `int8_gemm_vnni` (at `vnni_gemm.rs:46`) instead of `int8_gemm_i32`. | 1.5h | +| TD-T5 | `src/hpc/quantized.rs:618-630` | Rename current `int8_gemm_i32` → `int8_gemm_i32_scalar`. Add a public `int8_gemm_i32` that calls `int8_gemm_vnni` (which has its own VNNI/scalar dispatch). | 30m | +| TD-T7 | `src/backend/native.rs:271-278` | Replace `gemv_f32`/`gemv_f64` bodies with `dispatch!` macro invocations. Add `kernels_avx512::gemv_f32/f64` (F32x16 / F64x8 over rows). Add `avx2::gemv_f32/f64` (F32x8 / F64x4). | 2h | +| TD-T6 | `src/backend/native.rs:544-561` | Implement `avx2::{scal,nrm2,asum}_f32/f64` with real `_mm256_*` intrinsics. Pattern matches the existing `dot_f32_avx2` at lines 567-602. | 2h | +| TD-T4 | `src/hpc/quantized.rs:444-481` | Rewrite `bf16_gemm_f32` using F32x16 mul_add over decoded BF16 rows (the fallback pattern already exists in `bf16_tile_gemm.rs::fallback_path` at lines 96-126). Reuse that, don't duplicate. | 3h | + +**Phase 1 total: ~10–12h.** Closes 7 of 22 audit findings, all CRITICAL. + +After Phase 1: `crate::simd::*` consumers on Sapphire Rapids actually run AMX tile kernels; consumers on Cascade Lake / Ice Lake-SP / Zen 4 actually run VNNI; consumers on AVX2 silicon actually run `_mm256_*` intrinsics for BLAS-1. + +--- + +## Phase 2 — aarch64 fill (P1) + +| Task | File | Change | Effort | +|---|---|---|---| +| TD-T10 | `src/simd_neon_bf16.rs:149-204` | Replace `BF16x8Stub` / `BF16x16Stub` with real wrappers backed by `bfloat16x8_t` pairs. Implement `splat`, `from_slice`, `to_array`, `dot_f32` (via BFDOT asm-byte `0x4e40_ec00 | ...`), `cvt_to_f32_lo/hi`, `fma` (via BFMLALB/T or BFMMLA). Mirror `simd_amx.rs` for the asm-byte pattern. Parity tests vs scalar. | 4h | +| TD-T11 | `src/simd_neon_dotprod.rs:115-148` | Replace `F16x16Stub` with `F16x16(pub [float16x8_t; 2])`. Implement the intrinsic map documented at lines 121-131 via asm-byte (`fmla v.8h` = `0x0e40_cc20 \| ...`). Parity tests. | 4h | +| TD-T21 | `src/simd.rs:351-354` | Replace the `scalar::*` integer re-exports on aarch64 with `simd_neon::aarch64_simd_int::*` (new module). Implement `I32x8` (= `int32x4x2_t`), `U8x64` (= `uint8x16x4_t`), `U16x32` (= `uint16x8x4_t`), etc., backed by 128-bit NEON quartets. | 8h | +| TD-T8 | `src/hpc/simd_dispatch.rs:150-163` | Replace `Self::scalar()` aarch64 dispatch with real NEON wrappers. Add `byte_find_all_neon`, `byte_count_neon`, `squared_distances_neon`, `nibble_unpack_neon`, etc. Match the x86_64 wrapper pattern at lines 207-288. | 6h | + +**Phase 2 total: ~22h.** Closes 4 findings, all HIGH. Pi 5 / M2 silicon ceiling. + +Verification: needs aarch64 hardware (Pi 5 or Apple M-series). The asm-byte encodings can be unit-tested via objdump on a build machine, but throughput / correctness verification needs real hardware. Land behind a CI gate that runs on aarch64 runners (already exists per the `nostd` job). + +--- + +## Phase 3 — `SimdProfile` architecture (P1, addresses the user's intra-bucket question) + +| Task | File | Change | Effort | +|---|---|---|---| +| T3.1 | `src/hpc/simd_profile.rs` (new) | Define `SimdProfile` enum + `detect()` per the architecture section above. ~150 LoC. | 3h | +| T3.2 | `src/hpc/simd_profile.rs` | Cargo features `cpu-spr` / `cpu-icx` / `cpu-cpl` / `cpu-clx` / `cpu-skx` / `cpu-zen4` / `cpu-arrowlake` / `cpu-haswell` / `cpu-a76` / `cpu-a72`. `.cargo/config-{profile}.toml` for each. Each pins `target-cpu` and shortcuts `simd_profile()` to a const. | 4h | +| T3.3 | `src/hpc/gemm_dispatch.rs` (new) | First `*Dispatch` struct: `GemmDispatch` with `bf16_gemm`, `int8_gemm`, `f32_gemv`. 12 static tables (one per profile). | 4h | +| T3.4 | `src/hpc/blas1_dispatch.rs` (new) | `Blas1Dispatch` for `dot_f32/f64`, `axpy_f32/f64`, `scal_f32/f64`, `nrm2_f32/f64`, `asum_f32/f64`. | 3h | +| T3.5 | `src/backend/native.rs` | Migrate `dispatch!` macro to consume `simd_profile()` instead of the local `Tier` enum. Delete the local enum. | 2h | +| T3.6 | `src/simd.rs` | Migrate `tier()` to alias `simd_profile().coarse()` (`fn coarse(self) -> Tier`). Preserve existing public callers; let the migration happen incrementally. | 2h | +| T3.7 | `src/hpc/simd_dispatch.rs` | Migrate `detect()` to consume `simd_profile()`. Add the missing dispatch tables (`Avx512f-only`, `AvxVnniInt8`, `IceLakeSp` for VBMI permute paths). | 3h | + +**Phase 3 total: ~21h.** Closes TD-T12, T13, T14 (the three Tier enum collapses) and provides the framework for Phase 4. + +Migration path is **additive**: `simd_profile()` ships alongside `tier()`, callers opt in, the three local `Tier` enums get deleted last. No big-bang refactor. + +--- + +## Phase 4 — Intra-bucket SIMD fills (P2, parallelizable) + +Each entry below is one PR-sized task. They can land independently in any order. + +| Task | Profile that unlocks it | What gets faster | File(s) | +|---|---|---|---| +| `crate::simd::BF16x16` native dispatch | CooperLake, Zen4, SPR, GNR | BF16 lane-wise ops, BF16 dot via `vdpbf16ps` | TD-T15 — `src/simd.rs:291-292, 531-532`; convert compile-time gate to runtime via `simd_profile()` | +| F16 native dispatch | Zen4, SPR, GNR (avx512fp16); A76 (fp16) | `F16x16::{add,sub,mul,fma}` via `_mm512_*_ph` or `vaddq_f16` | New `src/simd_fp16.rs` for AVX-512-FP16; extend `simd_neon_dotprod.rs` for aarch64 | +| AVX-VNNI-INT8 ymm GEMM | ArrowLake | int8 GEMM on no-AVX-512 silicon | `src/hpc/vnni_gemm.rs` add `int8_gemm_avxvnniint8_ymm` (256-bit `_mm256_dpbusd_*`) | +| VPOPCNTDQ paths | IceLakeSp, SPR, GNR, Zen4 | Hamming distance, bit counting | Audit `src/hpc/bitwise.rs` for sites currently using avx512bw popcount where `_mm512_popcnt_epi64` exists | +| VBMI byte permute | IceLakeSp, SPR, GNR, Zen4 | Sprite atlas reorder, palette remap > 16 entries | Already wired at `simd_avx512.rs:695` — sweep for other byte-permute sites | +| GFNI bitmatrix multiply | IceLakeSp+, Zen4 | 8×8 bit transpose, palette swaps | New finding — needs audit pass | +| VAES 4×-wide AES | IceLakeSp+, Zen3+ | AES-CTR / AES-GCM batch | Audit `src/hpc` for AES use; may not exist yet | +| Nibble AVX-512BW path | All AVX-512BW | `nibble_unpack`, `nibble_above_threshold` 2× width | TD-T16 — `src/hpc/nibble.rs` | +| Real `_mm256_*` nibble intrinsics | HaswellAvx2 | `nibble_unpack_avx2` proper pshufb | TD-T17 — `src/hpc/nibble.rs:59-94` | +| `simd_ln_f32` Remez polynomial | All AVX-512F | 16-wide log (currently scalar loop) | TD-T18 — `src/simd.rs:479-486` | +| `distance::squared_distances_f32` AVX-512 path | All AVX-512F | 16-wide 3D L2 distance | TD-T19 — `src/hpc/distance.rs:101` | +| `spatial_hash::batch_sq_dist` AVX-512 path | All AVX-512F | same | TD-T20 — `src/hpc/spatial_hash.rs:273` | +| AMX FP16 (when GNR ships) | GraniteRapids | `_tile_dpfp16ps` for FP16 GEMM | New, gated on `caps.amx_fp16` (needs CPUID leaf update in `simd_amx.rs:50`) | + +**Phase 4 total: 30–50h, parallelizable.** Each task is small, isolated, and has clear silicon-tier acceptance criteria. + +--- + +## What dispatches where — quick reference + +For each named primitive, the silicon-by-silicon route after all 4 phases land: + +### `bf16_gemm_f32` (BF16 × BF16 → f32 matmul) + +| Profile | Implementation | +|---|---| +| GraniteRapids, SapphireRapids | `tile_dpbf16ps` 16×16×k tile kernel (`hpc/bf16_tile_gemm.rs::amx_path`) | +| Zen4Avx512, CooperLake | `_mm512_dpbf16_ps` over 16-wide BF16 dot-products + f32 accumulate (new) | +| IceLakeSp, CascadeLake, SkylakeX | F32x16 mul_add over decoded BF16 rows (`hpc/bf16_tile_gemm.rs::fallback_path`) | +| ArrowLake, HaswellAvx2 | F32x8 mul_add over decoded BF16 rows (new) | +| A76DotProd | NEON BFMMLA via asm-byte (new in Phase 2 TD-T10) | +| A72Fast, A53Baseline | NEON F32x4 mul_add over decoded BF16 (new) | +| Scalar | Scalar triple loop (current `quantized.rs:444`) — kept as the reference | + +### `int8_gemm_i32` (u8 × i8 → i32 matmul) + +| Profile | Implementation | +|---|---| +| GraniteRapids, SapphireRapids | `tile_dpbusd` 16×16×k AMX tile kernel | +| Zen4Avx512, IceLakeSp, CooperLake, CascadeLake | `int8_gemm_vnni_avx512` (existing at `vnni_gemm.rs:94`) | +| SkylakeX | F32x16-accumulate-i32 scalar packing fallback | +| ArrowLake | `_mm256_dpbusd_epi32` (existing `vnni2_dot_u8_i8` at `simd_amx.rs:203`) | +| HaswellAvx2 | Scalar i32 accumulate (no VNNI pre-Cascade Lake) | +| A76DotProd | NEON SDOT (`vdotq_s32`, existing in `simd_neon.rs`) | +| A72Fast, A53Baseline | NEON int16x8 widen + multiply-accumulate | + +### `gemv_f32` (BLAS-2 matrix-vector) + +| Profile | Implementation | +|---|---| +| All AVX-512F+ | `kernels_avx512::gemv_f32` over F32x16 row-dot-products (new in Phase 1 TD-T7) | +| ArrowLake, HaswellAvx2 | `avx2::gemv_f32` over F32x8 (new in Phase 1 TD-T7) | +| aarch64 | NEON F32x4 gemv (new in Phase 2) | +| Scalar | Existing `scalar::gemv_f32` | + +### `BF16x16::add/sub/mul/fma/dot` + +| Profile | Implementation | +|---|---| +| Zen4Avx512, CooperLake, SPR, GNR | Native `__m256bh` via `vdpbf16ps` / scalar conversion | +| All other AVX-512 (SKX, CLX, ICX) | F32x16 upcast → op → BF16 downcast | +| AVX2 | F32x8 upcast → op → BF16 downcast (currently scalar polyfill) | +| A76DotProd | NEON `bfloat16x8_t` BFDOT for dot product, upcast-via-f32 for other ops | +| Scalar | `simd_half::BF16x16` (existing) | + +### `F16x16::add/sub/mul/fma` + +| Profile | Implementation | +|---|---| +| Zen4Avx512, SPR, GNR (avx512fp16) | Native `__m256h` (Phase 4) | +| Other AVX-512 with F16C | F16C `_mm256_cvtph_ps` upcast → F32x16 op → `_mm256_cvtps_ph` downcast | +| ArrowLake, HaswellAvx2 (F16C) | Same F16C upcast/downcast pattern | +| A76DotProd | Native `float16x8_t` via asm-byte (Phase 2 TD-T11) | +| Scalar | `simd_half::F16x16` (existing) | + +--- + +## Risks and open questions + +1. **Cargo feature combinations.** If a user sets `cpu-spr` AND has runtime detection on Zen 4, the binary SIGILLs on AMX instructions. Solution: make `cpu-*` features mutually exclusive via a `compile_error!` in the crate, AND document that `cpu-*` binaries are NOT portable across silicon. The `runtime-dispatch` default stays the safe option. + +2. **The `Zen4Avx512` profile question.** AMD's BF16 / FP16 implementation on Zen 4 has different latency than Intel's CPL / SPR. The dispatch picks the same `vdpbf16ps` path on both, but the surrounding tile sizes may need to differ. Out of scope for the initial integration; revisit when benchmarks show divergence. + +3. **Detection robustness across hypervisors.** Cloud hypervisors sometimes mask CPUID. The `simd_amx.rs::amx_available()` function already handles this (XCR0 + `arch_prctl` checks). Extend the same defense-in-depth to `simd_profile::detect()` — if a feature is CPUID-reported but XSAVE-state-disabled, demote to the next-best profile. + +4. **No GNR-specific detection yet.** `caps.amx_fp16` doesn't exist as a `SimdCaps` field. Add it before TD-T3.1, gated on CPUID.07H.1H:EAX bit 21 (AMX-FP16) — this is a separate leaf from the existing AMX-INT8 / AMX-BF16 bits at leaf 7,0. + +5. **Three Tier enums → one SimdProfile.** Phase 3 deletes the local `Tier` enums in `simd.rs`, `backend/native.rs`, and `simd_dispatch.rs`. Migration is additive (new alongside old) until all callers move. Risk of stale `Tier` callers re-emerging during the migration window — mitigate via a deprecation lint. + +6. **AVX2 polyfill audit (TD-T22).** The audit deferred verification of whether the 256-bit polyfills in `simd_avx2.rs` are real `__m256i` intrinsics or scalar-storage arrays under `#[target_feature]`. If they're scalar storage, that's a Phase 4 entry: real `__m256i` implementations for the 5 new 256-bit int types (U16x16, U32x8, U64x4, I32x8, I64x4) plus the existing ones. + +7. **BLAS-1/2/3 / LAPACK / statistics scalar implementations.** Audit grep showed no `crate::simd::*` use in `blas_level{1,2,3}.rs`, `statistics.rs`, `lapack.rs`, `quantized.rs`. These are flagship public surfaces. Phase 4 needs to either (a) route them through `gemm_dispatch()` / `blas1_dispatch()` / etc., or (b) replace them with `crate::simd::*`-using implementations directly. Decision deferred until the dispatch tables exist. + +--- + +## Sequencing and dependencies + +``` +Phase 1 (wiring) ──────────► ships independently, no deps +Phase 2 (aarch64) ──────────► ships independently, no deps +Phase 3 (SimdProfile) ──────► builds on Phase 1 (uses Phase 1 kernels as table entries) +Phase 4 (intra-bucket) ─────► builds on Phase 3 (each entry is a new profile-specific kernel) +``` + +Phase 1 and Phase 2 can run in parallel — they touch disjoint files. Phase 3 needs Phase 1's wired kernels as the function pointers in its dispatch tables (otherwise the tables would point at the same scalar stub for every profile). Phase 4 is fully parallelizable; each entry is one PR. + diff --git a/.claude/knowledge/td-simd-tier-audit.md b/.claude/knowledge/td-simd-tier-audit.md new file mode 100644 index 00000000..c57268dc --- /dev/null +++ b/.claude/knowledge/td-simd-tier-audit.md @@ -0,0 +1,357 @@ +# SIMD Tier Technical Debt Audit + +> **Design principle:** `crate::simd::*` exposes the maximum hardware performance available on the current silicon via runtime-detected polyfill. Every CPU trick is applied at its tier. Consumers must never get scalar code when the hardware offers a SIMD path. The polyfill **is** the dispatch layer, not the fallback. + +## Audit scope (2026-05-20) + +Every finding below was verified by reading the file at the cited line range. Files read end-to-end or in dispatch-relevant sections: + +| File | LoC read | Why | +|---|---|---| +| `src/simd.rs` | 1–720 (full) | Top-level dispatch | +| `src/simd_amx.rs` | 1–421 (full) | AMX detection + VNNI dispatch | +| `src/hpc/amx_matmul.rs` | 1–671 (full) | Public ndarray-typed matmul API | +| `src/hpc/bf16_tile_gemm.rs` | 1–205 (full) | AMX tile kernel | +| `src/hpc/simd_caps.rs` | 1–514 (full) | Capability singleton | +| `src/hpc/simd_dispatch.rs` | 1–361 (full) | Frozen dispatch table | +| `src/backend/native.rs` | 1–763 (full) | Backend BLAS-1 + GEMM dispatch | +| `src/backend/kernels_avx512.rs` | 1–100 + grep | AVX-512 BLAS-1 kernels | +| `src/simd_neon_bf16.rs:130–204` | stub section | BF16 NEON stubs | +| `src/simd_neon_dotprod.rs:96–157` | stub section | F16 NEON stub | +| `src/simd_avx512.rs:680–720, 2360–2420` | VBMI + BF16 conv | VBMI permute, BF16 batch | +| `src/hpc/bgz17_bridge.rs:35–135` | dispatch sites | bgz17 L1 kernels | +| `src/hpc/nibble.rs:1–270` | dispatch sites | Nibble ops | +| `src/hpc/quantized.rs:444–630` | GEMM kernels | bf16/int8 GEMM | +| `src/hpc/vnni_gemm.rs:1–130` | VNNI INT8 GEMM | VNNI dispatch | + +Files NOT yet read for this audit (next sweep): + +- `src/simd_avx512.rs` remainder (~3700 LoC unread) +- `src/simd_avx2.rs` (2805 LoC unread) +- `src/simd_neon.rs` (1917 LoC unread) +- `src/simd_scalar.rs` (1308 LoC unread) +- `src/simd_half.rs` (762 LoC unread) +- `src/simd_nightly/*` +- HPC modules: `vml.rs`, `activations.rs`, `reductions.rs`, `kernels.rs`, `fft.rs`, `statistics.rs`, `lapack.rs`, `blas_level{1,2,3}.rs`, `cam_pq.rs`, `palette_distance.rs`, `aabb.rs`, `distance.rs`, `bitwise.rs`, `p64_bridge.rs`, `spatial_hash.rs`, `jitson_cranelift/detect.rs`, all of `src/hpc/styles/*` + +## Microscopic silicon tier matrix + +| CPU | AVX-512F | VNNI | VBMI | BF16 | FP16 | AMX-INT8 | AMX-BF16 | AVX-VNNI-INT8 | +|---|---|---|---|---|---|---|---|---| +| Skylake-X / SP / W (2017) | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | +| Cascade Lake (2019) | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | +| Cooper Lake (2020) | ✓ | ✓ | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | +| Ice Lake-SP / Tiger Lake (2021) | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | +| Sapphire Rapids (2023) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | +| Granite Rapids (2024) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | (+ AMX-FP16, AMX-COMPLEX) | +| Zen 4 (Genoa, Ryzen 7000, 2022) | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ | +| Zen 5 (2024) | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ | +| Arrow Lake / Lunar Lake (2024) | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ | +| Pi 5 / Orange Pi 5 (A76, ARMv8.2) | (NEON) | (dotprod) | – | (bf16+) | (fp16) | – | – | – | +| Pi 4 (A72, ARMv8.0) | (NEON) | – | – | – | – | – | – | – | + +--- + +## Findings — CRITICAL + +### TD-T1 · `src/hpc/amx_matmul.rs:319-327` · stub fast path + +`matmul_bf16_to_f32`: `if amx_available() { bf16_gemm_f32(...) } else { bf16_gemm_f32(...) }` — both arms identical. Comment 320-323 admits "Future: AMX-tiled fast path. Today we route through the same f32 reference kernel; correctness is identical regardless of hardware. The `amx_available()` branch is preserved so callers can be sure the AMX detection runs." + +Working AMX kernel exists at `src/hpc/bf16_tile_gemm.rs::bf16_tile_gemm_16x16` (lines 39-87) — full TDPBF16PS dispatch, tested at lines 151-204. + +**Hit on Sapphire Rapids / Granite Rapids:** scalar instead of 256-mul-add/instr tile op. + +### TD-T2 · `src/hpc/amx_matmul.rs:351-356` · stub fast path + +`matmul_f32` AMX branch: converts f32 → BF16 and calls `bf16_gemm_f32` (scalar). Same shape as TD-T1. + +### TD-T3 · `src/hpc/amx_matmul.rs:395-412` · stub fast path + wrong fallback + +`matmul_i8_to_i32` AMX branch: shifts LHS i8 → u8 (+128) and calls `int8_gemm_i32` (scalar reference). + +Two debts here: +1. AMX path never reaches `tile_dpbusd` (the working primitive at `amx_matmul.rs:146-150`). +2. The fallback when AMX is absent should be `int8_gemm_vnni` (at `src/hpc/vnni_gemm.rs:46`), which dispatches AVX-512 VNNI `VPDPBUSD` (64 MACs/instr) — but it calls the scalar `int8_gemm_i32` directly. + +**Hit on Sapphire Rapids:** ~256× slower than AMX TDPBUSD. +**Hit on Cascade Lake / Ice Lake-SP / Zen 4 (AVX-512 + VNNI but no AMX):** ~64× slower than VNNI. + +### TD-T4 · `src/hpc/quantized.rs:444-481` · scalar kernel labeled GEMM + +`bf16_gemm_f32` is a triple-nested scalar loop with per-element `.to_f32()` upcast. No `crate::simd::*` types, no F32x16 mul_add, no FMA. This is the function `matmul_bf16_to_f32` falls back to — so the entire BF16 GEMM public surface bottoms out in scalar. + +**Hit on every CPU:** even AVX-512F-only Skylake-X loses the F32x16 mul_add (16-wide FMA per instr) that would lift this 16×. + +### TD-T5 · `src/hpc/quantized.rs:618-630` · scalar kernel labeled GEMM + +`int8_gemm_i32` is a triple-nested scalar loop. The VNNI dispatch path `int8_gemm_vnni` (lines 46-61 of `vnni_gemm.rs`) exists and is correct (uses `simd_caps().has_avx512_vnni()` and calls `int8_gemm_vnni_avx512`), but it's a separate symbol — nothing routes the public `int8_gemm_i32` callers through it. + +### TD-T6 · `src/backend/native.rs:544-561` · scalar-only "avx2" implementations + +The `avx2` module's `scal_f32`, `scal_f64`, `nrm2_f32`, `nrm2_f64`, `asum_f32`, `asum_f64` all unconditionally delegate to `super::scalar::*`. The dispatch macro thinks it's dispatching to AVX2 but the body is scalar. + +```rust +pub fn scal_f32(alpha: f32, x: &mut [f32]) { + super::scalar::scal_f32(alpha, x); // ← line 545 +} +``` + +Effect: on Haswell–Coffee Lake / Zen 1-3 (AVX2 + FMA but no AVX-512), all of `scal_*`, `nrm2_*`, `asum_*` run scalar. The dispatch macro at lines 92-165 routes through `avx2::name()` which is itself scalar. + +### TD-T7 · `src/backend/native.rs:271-278` · GEMV scalar everywhere + +`gemv_f32` and `gemv_f64` skip the `dispatch!` macro entirely and call `scalar::gemv_*` unconditionally. No AVX-512, no AVX2, no NEON. Every consumer of the backend GEMV path runs the scalar nested loop on every CPU. + +```rust +pub fn gemv_f32(...) { + scalar::gemv_f32(...); // ← line 272 +} +``` + +--- + +## Findings — HIGH + +### TD-T8 · `src/hpc/simd_dispatch.rs:150-163` · aarch64 dispatch = scalar + +```rust +#[cfg(target_arch = "aarch64")] +fn detect() -> Self { + let caps = simd_caps(); + let tier = if caps.asimd_dotprod { SimdTier::NeonDotProd } else { SimdTier::Neon }; + // NEON uses the same scalar wrapper signatures — NEON intrinsics + // will be wired when simd_neon.rs types are activated. For now, + // dispatch to scalar which auto-vectorizes well on aarch64 with + // `-C target-feature=+neon` (mandatory on aarch64). + Self { tier, ..Self::scalar() } +} +``` + +The frozen dispatch table reports `NeonDotProd` or `Neon` tier to consumers but every function pointer in the struct is the scalar wrapper. Pi 5 / Pi 4 / M2 get the scalar implementations for `byte_find_all`, `byte_count`, `squared_distances_f32`, `nibble_unpack`, `nibble_above_threshold`, `batch_sq_dist`. + +### TD-T9 · `src/hpc/simd_dispatch.rs:128-134` · AVX-512 dispatch falls to AVX2 wrappers + +Even when `caps.avx512bw` is true, the AVX-512 tier branch fills in 4 of 6 function pointers with AVX2 wrappers: + +```rust +if caps.avx512bw { + Self { + tier: SimdTier::Avx512, + byte_find_all: byte_find_all_avx512_wrapper, // ← real + byte_count: byte_count_avx512_wrapper, // ← real + squared_distances_f32: squared_distances_avx2_wrapper, // ← AVX2! + nibble_unpack: nibble_unpack_avx2_wrapper, // ← AVX2! + nibble_above_threshold: nibble_above_threshold_avx2_wrapper, // ← AVX2! + batch_sq_dist: batch_sq_dist_avx2_wrapper, // ← AVX2! + } +} +``` + +Comment at line 130 admits `// no avx512 variant for 3D dist`. For `nibble_*`, the variant is missing per TD-T17. + +### TD-T10 · `src/simd_neon_bf16.rs:149-177` · stub structs that panic + +`BF16x8Stub` (line 149) and `BF16x16Stub` (line 156) are placeholder structs whose only method is `unimplemented()` panicking with the message documenting the BFMMLA / BFDOT asm-byte encoding still to wire up: `BFMMLA = 0x6e40_ec00 | (Vm << 16) | (Vn << 5) | Vd`, `BFDOT = 0x4e40_ec00 | (Vm << 16) | (Vn << 5) | Vd`. Module docs at lines 187-204 spell out the implementation plan; nothing is wired. + +**Hit on Pi 5 A76, Apple M2+, Snapdragon 8 Gen 2+:** consumers reaching for BF16 NEON ops panic or fall through to scalar `simd_half::BF16x16`. + +### TD-T11 · `src/simd_neon_dotprod.rs:115-148` · F16x16 stub + +`F16x16Stub` (line 136) is a placeholder; `unimplemented()` panics (line 141-147). Module docs at lines 96-113 give the full intrinsic map (`vfmaq_f16`, `vaddvq_f16`, `vsqrtq_f16`, `vcgtq_f16`) and the stable-Rust asm-byte encoding `0x0e40_cc20` for `fmla v0.8h, v1.8h, v2.8h`. + +**Hit on Pi 5 A76, Apple M-series:** consumers reaching `crate::simd::F16x16` get `simd_avx2::F16Scaler` scalar polyfill (line 134 comment) or `simd_nightly::F16x16`. + +### TD-T12 · `src/simd.rs:18-26` + `:49-88` · top-level Tier enum collapses + +```rust +enum Tier { + Avx512 = 1, + Avx2 = 2, + NeonDotProd = 3, + Neon = 4, + Scalar = 5, +} + +fn detect_tier() -> Tier { + if is_x86_feature_detected!("avx512f") { return Tier::Avx512; } + if is_x86_feature_detected!("avx2") { return Tier::Avx2; } + ... +} +``` + +Skylake-X (no VNNI / VBMI / BF16 / FP16 / AMX) and Granite Rapids (all of them) both → `Tier::Avx512`. Arrow Lake (`avxvnniint8`, no AVX-512F) → `Tier::Avx2`. Every caller of `tier()` (line 97) gets a coarse answer. + +Mitigation: `simd_caps()` at `src/hpc/simd_caps.rs:98` exists with 20 per-feature bits — but it's a separate dispatch channel, and consumers who use `tier()` don't see the sub-features. + +### TD-T13 · `src/backend/native.rs:22-26` · second Tier enum, same collapse + +Backend defines its own `Tier { Avx512, Avx2, Scalar }` enum (line 21-26), independent of the one in `simd.rs:18`. Same 3-bucket collapse. Same lack of VNNI / VBMI / BF16 / FP16 / AMX awareness. + +### TD-T14 · `src/hpc/simd_dispatch.rs:30-49` · third Tier enum, same collapse + +`SimdTier { Avx512, Avx2, Sse2, NeonDotProd, Neon, Scalar, WasmSimd128 }` — 7 variants, but `detect()` at lines 121-148 only branches on `caps.avx512bw` and `caps.avx2`. SSE2 never selected. No AVX-512-VNNI / VBMI / BF16 / FP16 / AMX paths. + +Three independent Tier enums total (TD-T12, TD-T13, TD-T14). + +### TD-T15 · `src/simd.rs:291-292 + 531-532` · BF16x16 polyfill-not-max under default config + +```rust +// 291: hardware-native, ONLY if compile-time avx512bf16 is on +#[cfg(all(target_arch = "x86_64", target_feature = "avx512bf16", not(feature = "nightly-simd")))] +pub use crate::simd_avx512::{BF16x16, BF16x8}; + +// 531: scalar polyfill, the default +#[cfg(all(feature = "std", not(all(target_arch = "x86_64", target_feature = "avx512bf16"))))] +pub use crate::simd_half::BF16x16; +``` + +The cargo default is `x86-64-v3` (per `.cargo/config.toml:25`), which is AVX2 only — no AVX-512F, definitely no avx512bf16. So even on Sapphire Rapids / Zen 4 silicon under default cargo, `crate::simd::BF16x16` resolves to scalar `simd_half::BF16x16`. + +Compile-time gate where runtime dispatch would lift the entire AVX-512 + BF16 install base out of the scalar polyfill. + +--- + +## Findings — MEDIUM + +### TD-T16 · `src/hpc/nibble.rs:23-41, 227-237` · nibble ops cap at AVX2 + +`nibble_unpack` (line 23) and `nibble_above_threshold` (line 227) check `caps.avx2` only — no AVX-512 path. Sapphire Rapids / Ice Lake / Zen 4 process 32 nibbles per AVX2 iteration when 64 per AVX-512BW iteration would be possible. + +### TD-T17 · `src/hpc/nibble.rs:59-94, 169-189, 257-278` · "AVX2" funcs are scalar loops + +`nibble_unpack_avx2` (line 59), `nibble_sub_clamp_avx2` (line 170), `nibble_above_threshold_avx2` (line 258) all carry `#[target_feature(enable = "avx2")]` but their bodies are plain scalar loops: + +```rust +#[target_feature(enable = "avx2")] +pub(crate) unsafe fn nibble_unpack_avx2(packed: &[u8], count: usize, out: &mut Vec) { + // ... + for j in 0..16 { + lo[j] = data[j] & 0x0F; // ← scalar loop + hi[j] = (data[j] >> 4) & 0x0F; + } + // ... +} +``` + +The autovectorizer may emit reasonable code, but this is not true `_mm256_*` intrinsics. `nibble_sub_clamp_avx512` at line 197 IS real (uses `U8x64::saturating_sub`). So nibble has one real SIMD path and two pretend-SIMD paths. + +### TD-T18 · `src/simd.rs:479-486` · simd_ln_f32 is a scalar loop + +```rust +pub fn simd_ln_f32(x: F32x16) -> F32x16 { + let arr = x.to_array(); + let mut out = [0.0f32; 16]; + for i in 0..16 { + out[i] = arr[i].ln(); // ← scalar per-lane + } + F32x16::from_array(out) +} +``` + +`simd_exp_f32` at lines 419-450 is a real Remez polynomial with FMA via `mul_add` chain. `simd_ln_f32` is its asymmetric scalar twin. A consumer thinking they're getting 16-wide log gets 16× scalar `ln`. + +### TD-T19 · `src/hpc/distance.rs:101` · single tier, no AVX-512 + +The 3D `squared_distances` function checks `caps.avx2` only — line 101: `if super::simd_caps::simd_caps().avx2`. No AVX-512F variant. Sapphire Rapids etc. fall to AVX2 8-wide instead of AVX-512 16-wide. + +### TD-T20 · `src/hpc/spatial_hash.rs:273` · same as TD-T19 + +`batch_sq_dist` checks `caps.avx2` only. No AVX-512F variant. + +### TD-T21 · `src/simd.rs:351-354` · aarch64 integers come from scalar + +```rust +#[cfg(all(target_arch = "aarch64", not(feature = "nightly-simd")))] +pub use scalar::{ + f32x8, f64x4, i32x16, i32x8, i64x4, i64x8, u16x16, u32x16, u32x8, u64x4, u64x8, u8x64, + F32x8, F64x4, I32x16, I32x8, I64x4, I64x8, U16x16, U16x32, U32x16, U32x8, U64x4, U64x8, U8x64, +}; +``` + +On aarch64, the only types from `simd_neon::aarch64_simd` (line 349) are `f32x16, f64x8, F32Mask16, F32x16, F64Mask8, F64x8`. Every integer width — `I32x16`, `I8x32`, `U8x64`, `U16x32`, etc. — comes from `scalar::*`. Pi 5 / M2 get scalar integer SIMD even though NEON has `int32x4_t`, `uint8x16_t`, etc. + +### TD-T22 · `src/simd.rs:310, 318-321` · 256-bit int types in AVX2 build come from `simd_avx512` + +```rust +// 310: AVX2-baseline arm uses simd_avx512 for the 256-bit shapes +pub use crate::simd_avx512::{f32x8, f64x4, i16x16, i8x32, F32x8, F64x4, I16x16, I8x32}; +``` + +Inverted naming: `I32x8` / `U32x8` / `I64x4` / `U64x4` (the natural AVX2 widths) come from `simd_avx2.rs` (which polyfills them as scalar storage with `[u32; 8]` arrays per the comment in the AMX matmul work), not from native `__m256i`. The polyfill IS the AVX2 module on AVX2 builds — verify whether the AVX2 module's polyfills wrap real `_mm256_*` intrinsics or scalar arrays. (Audit pending — requires reading `src/simd_avx2.rs`.) + +--- + +## Verified — code is correct (rejected agent claims) + +### `src/simd_amx.rs:282-301` · AVX-VNNI-INT8 dispatch IS done + +`matvec_dispatch` correctly routes `is_x86_feature_detected!("avxvnniint8")` to `vnni2_matvec` (256-bit VPDPBUSD path) when avx512vnni is absent. No debt. + +### `src/simd_avx512.rs:695-710` · VBMI dispatch IS done + +`permute_bytes` checks `if simd_caps().avx512vbmi { permute_bytes_vbmi(...) } else { scalar }`. Native `_mm512_permutexvar_epi8` reaches Ice Lake / SPR / Zen 4. The scalar branch is correct fallback for Skylake-X / Cascade Lake / Cooper Lake. No debt. + +### `src/hpc/bgz17_bridge.rs:43-86` · multi-versioning is correct + +5 dispatch sites at lines 78, 142, 197, 250, 349 each route `avx512f → avx2 → scalar` with proper `#[target_feature]` annotations on the inner functions. The `is_x86_feature_detected!("avx512f")` vs `avx2` granularity is appropriate for L1 absolute-difference kernels — VNNI / VBMI / BF16 don't help on `abs(a-b)` reductions. No tier-collapse debt here. + +### `src/hpc/vnni_gemm.rs:46-61` · VNNI dispatch is correct + +`int8_gemm_vnni` checks `simd_caps().has_avx512_vnni()` and calls `int8_gemm_vnni_avx512`. The only debt is that other paths (TD-T3, TD-T5) don't route through this function. + +### `src/hpc/p64_bridge.rs:109` · VPOPCNTDQ dispatch is correct + +`simd_caps().avx512vpopcntdq` runtime-detected. No debt at this site. + +### `src/hpc/cam_pq.rs:202, 215` · AVX-512F dispatch is correct + +`simd_caps().avx512f` runtime-detected. No debt at this site. + +### `src/hpc/aabb.rs:284, 440` · AVX-512F + SSE2 dispatch present + +Dispatches `avx512f` then falls to `sse2`. Missing intermediate AVX2 — `aabb` uses AVX-512F at one tier and SSE2 at the other. Probably acceptable for AABB (BV-shape ops), but a sub-finding to investigate on next sweep. + +--- + +## Prioritized action list + +| ID | Severity | Effort | Description | +|---|---|---|---| +| TD-T1 | CRIT | 1h | Wire `matmul_bf16_to_f32` to `bf16_tile_gemm_16x16` | +| TD-T2 | CRIT | 30m (after T1) | Same for `matmul_f32` | +| TD-T3 | CRIT | 1.5h | Wire `matmul_i8_to_i32` to AMX tile / VNNI fallback | +| TD-T5 | CRIT | 30m | Route `int8_gemm_i32` callers through `int8_gemm_vnni` | +| TD-T4 | CRIT | 3-4h | Rewrite `bf16_gemm_f32` with F32x16 mul_add + tiling | +| TD-T6 | CRIT | 2h | Implement `avx2::{scal,nrm2,asum}_*` with real AVX2 intrinsics | +| TD-T7 | CRIT | 2h | Implement `gemv_f32`/`gemv_f64` with tier dispatch | +| TD-T8 | HIGH | 4-6h | Wire `simd_dispatch.rs` aarch64 tier to real NEON impls | +| TD-T9 | HIGH | 2-3h | Add AVX-512 variants for `squared_distances`, `nibble_*`, `batch_sq_dist` | +| TD-T10 | HIGH | 3-4h | Implement `BF16x8/16` NEON via asm-byte BFMMLA/BFDOT | +| TD-T11 | HIGH | 3-4h | Implement `F16x16` NEON via asm-byte fmla v.8h | +| TD-T15 | HIGH | 4-6h | Convert `BF16x16` from compile-time `target_feature` gate to runtime dispatch | +| TD-T16 | MED | 1.5h | Add AVX-512BW variants for `nibble_unpack` / `nibble_above_threshold` | +| TD-T17 | MED | 2h | Replace scalar-loop "avx2" funcs in nibble with `_mm256_*` intrinsics | +| TD-T18 | MED | 2h | Rewrite `simd_ln_f32` as real Remez polynomial like `simd_exp_f32` | +| TD-T19 | MED | 1h | Add AVX-512F path to `distance::squared_distances_f32` | +| TD-T20 | MED | 1h | Same for `spatial_hash::batch_sq_dist` | +| TD-T21 | HIGH | 8-12h | Replace aarch64 scalar integer types in `simd.rs` with NEON impls | +| TD-T22 | – | – | Investigation only — needs `simd_avx2.rs` read first | +| TD-T12/T13/T14 | HIGH | (audit-wide) | Consolidate three Tier enums OR route all callers through `simd_caps()` for sub-feature dispatch | + +## Next-sweep targets (unread) + +These files are listed in the dispatch site grep but not yet read for this audit. Findings in them are unverified: + +- Full `src/simd_avx512.rs`, `simd_avx2.rs`, `simd_neon.rs`, `simd_scalar.rs`, `simd_half.rs` +- HPC SIMD-consuming: `vml.rs`, `activations.rs`, `reductions.rs`, `kernels.rs`, `fft.rs` +- HPC suspected scalar: `statistics.rs`, `lapack.rs`, `blas_level{1,2,3}.rs` +- HPC dispatch sites with `is_x86_feature_detected!`: `cam_pq.rs`, `palette_distance.rs`, `aabb.rs`, `distance.rs`, `bitwise.rs`, `p64_bridge.rs`, `spatial_hash.rs`, `jitson_cranelift/detect.rs` +- All 34 `src/hpc/styles/*` primitives + +The most likely-debt-rich unread targets: + +1. `src/hpc/blas_level{1,2,3}.rs` — grep showed NO use of `crate::simd::*` types. The flagship BLAS public API may be entirely scalar (separate audit needed). +2. `src/hpc/statistics.rs`, `lapack.rs` — same, no `crate::simd::*` use. +3. `src/simd_avx2.rs` — the 256-bit polyfills for 512-bit types. TD-T22 needs this read to know whether the polyfills are real `__m256i` intrinsics or scalar arrays under `#[target_feature]`. +