simd_runtime: CpuOps DTO (third dispatch pattern) + GCC-scraped CPU table#187
Two additions, scoped together because they're the same idea — using
scraped CPU metadata to drive runtime dispatch:
# Piece A: matrix doc § M (GCC-grounded aarch64 enumeration)
The matrix had three aarch64 columns (A53 / A72 / A76) covering
*dispatch tiers* (multiple physical cores share each tier's SIMD
primitive set). The authoritative per-core feature membership lives
in GCC's `gcc/config/aarch64/aarch64-cores.def` — scraped 2026-05-21
and recorded as a new § M table covering 28 cores:
* V8.0-A baseline (A53, A72)
* V8.2-A dotprod+fp16 (A76, A78, X1, Neoverse-N1, Apple M1)
* V8.5-A baseline (Apple M1 specifically — V8.5 includes V8.2's
fp16+dotprod but NOT bf16+i8mm; corrects a wrong "+bf16" claim
on the existing A76 row of the column legend)
* V8.6-A baseline incl. bf16+i8mm (Apple M2/M3, Oryon-1 / Snapdragon
X Elite, Ampere1+, Cortex-A510/A710/A715, X2/X3, Neoverse-N2/V2)
* V8.7-A (Apple M4, Ampere1B)
* V9.0-A SVE2 baseline + explicit bf16+i8mm flags (Cortex-A510-A715,
X2/X3, Neoverse-N2/V2, Grace)
* V8.4-A SVE tier (Neoverse-V1 / Graviton 3 — only V8.4 core with
explicit SVE+bf16+i8mm)
* V9.2-A (Cortex-A520/A720/A725, X4, X925, Neoverse-N3/V3)
Each entry is copied verbatim from the GCC FEATURE_STRING column.
Cross-referencing with the V8.X-A baseline rules (V8.6+ includes
bf16+i8mm implicitly; V9.0 includes SVE2 implicitly) gives the
canonical "which silicon has what" table. The note flags that a new
dispatch column for the V8.6+/V9-bf16-i8mm tier needs to land
alongside the NEON BFMMLA / BFDOT asm-byte arm in Phase 3b.
The A76 column legend (line 26 of the matrix) was corrected: removed
the wrong "+bf16" (A76 itself is V8.2-A, NO bf16 — bf16 came in
V8.6-A).
# Piece B: CpuOps DTO — third dispatch pattern
Adds `src/simd_runtime/cpu_ops.rs` exposing a per-CPU operations DTO
distinct from the existing patterns:
  Pattern 1 (`crate::simd::*`):
      compile-time `#[cfg(target_feature)]` cascade. Direct
      monomorphized calls.
  Pattern 2 (`crate::simd_runtime::vnni_dot_u8_i8` etc., from #185):
      per-op LazyLock<fn ptr>. One CPUID + atomic-load per op the
      first time called.
  Pattern 3 (THIS COMMIT):
      per-CPU `&'static CpuOps` selected once at first access. Every
      op is a fn-ptr field on the struct.
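For orientation, Pattern 2's per-op lazy dispatch can be sketched roughly as below. This is a minimal illustration, not the code from #185: the `fn(&[u8], &[i8]) -> i32` signature, the `DotU8I8` alias, and the static's name are assumptions, and the scalar kernel stands in for every backend.

```rust
use std::sync::LazyLock;

/// Hypothetical signature for the u8 x i8 dot product (an assumption).
type DotU8I8 = fn(&[u8], &[i8]) -> i32;

/// Portable kernel: sum of a[i] * b[i], widened to i32.
fn vnni_dot_u8_i8_scalar(a: &[u8], b: &[i8]) -> i32 {
    a.iter()
        .zip(b)
        .map(|(&x, &y)| i32::from(x) * i32::from(y))
        .sum()
}

/// One lazily initialised fn pointer per op. A real backend would probe
/// CPU features inside the closure; this sketch always picks the scalar
/// kernel so that it runs on any host.
static VNNI_DOT_U8_I8: LazyLock<DotU8I8> = LazyLock::new(|| {
    let chosen: DotU8I8 = vnni_dot_u8_i8_scalar;
    chosen
});

/// Public per-op entry point: the first call initialises the pointer,
/// later calls only load it and jump.
pub fn vnni_dot_u8_i8(a: &[u8], b: &[i8]) -> i32 {
    (*VNNI_DOT_U8_I8)(a, b)
}
```

Each additional op in this style gets its own `LazyLock` static, which is the per-op setup cost the third pattern is meant to avoid.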
Why the third pattern?
* Per-op LazyLock: N ops touched = N atomic-load setup costs over
the process lifetime.
* CpuOps DTO: ONE atomic-load total at first `cpu_ops()` call;
every subsequent op is a direct fn-ptr deref through the cached
`&'static CpuOps`. The OpenBLAS / MKL dispatch model — wins for
dense-op consumers (linear-algebra pipelines touching every
BLAS-1/2/3 kernel).
* All three coexist. Consumers pick by import path.
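The DTO idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the contents of `cpu_ops.rs`: the `add_mul_f32` signature, the `tier` field, the `AddMulF32` alias, and the `SELECTED` static are hypothetical, and the scalar kernel is reused as a stand-in for the SIMD backends.

```rust
use std::sync::LazyLock;

/// Hypothetical kernel signature: acc[i] += a[i] * b[i].
type AddMulF32 = fn(&mut [f32], &[f32], &[f32]);

/// Per-CPU operations table: every op is a fn-pointer field.
pub struct CpuOps {
    pub tier: &'static str,
    pub add_mul_f32: AddMulF32,
}

/// Portable kernel, also used below as a stand-in for SIMD backends.
fn add_mul_f32_scalar(acc: &mut [f32], a: &[f32], b: &[f32]) {
    for ((out, x), y) in acc.iter_mut().zip(a).zip(b) {
        *out += x * y;
    }
}

static CPU_OPS_SCALAR: CpuOps = CpuOps {
    tier: "scalar",
    add_mul_f32: add_mul_f32_scalar,
};

#[cfg(target_arch = "x86_64")]
static CPU_OPS_AVX2_FMA: CpuOps = CpuOps {
    tier: "avx2_fma",
    add_mul_f32: add_mul_f32_scalar, // stand-in: no real AVX2 kernel here
};

/// The whole table is selected once; every later call reuses the result.
static SELECTED: LazyLock<&'static CpuOps> = LazyLock::new(|| {
    #[cfg(target_arch = "x86_64")]
    {
        if std::is_x86_feature_detected!("avx2") && std::is_x86_feature_detected!("fma") {
            return &CPU_OPS_AVX2_FMA;
        }
    }
    &CPU_OPS_SCALAR
});

pub fn cpu_ops() -> &'static CpuOps {
    *SELECTED
}
```

A caller would hold the reference and call through its fields, for example `let ops = cpu_ops(); (ops.add_mul_f32)(&mut acc, &a, &b);`, so feature detection runs only once no matter how many ops are used.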
Seven tiers baked as static const `CpuOps` instances:
  x86_64:    amx_int8, avx512vnni, avx512f, avxvnni, avx2_fma
  aarch64:   neon
  universal: scalar
Each instance points at the existing trampolines in
`crate::simd_runtime::{vnni_dot, add_mul}` — no kernel duplication;
this module is pure dispatch glue. Backend ops referenced:
vnni_dot_u8_i8 (3 backends: avx512+tail / avxvnni / scalar)
add_mul_f32 (4 backends: avx512 / avx2+fma / neon / scalar)
add_mul_f64 (4 backends: avx512 / avx2+fma / neon / scalar)
# The naughty data-driven part
`cpu_ops_for_cpu(name: &str) -> Option<&'static CpuOps>` maps GCC
CPU codenames to the dispatch tier they land in, sourced from § M's
scrape. Spot-checks (each verified by the test suite):
  sapphirerapids / graniterapids / emeraldrapids       → amx_int8
  cascadelake / cooperlake / icelake-* / tigerlake
    / rocketlake / znver4 / znver5                     → avx512vnni
  alderlake / raptorlake / meteorlake / arrowlake
    / arrowlake-s / lunarlake / pantherlake
    / sierraforest                                     → avxvnni
  haswell / broadwell / skylake / znver1-3             → avx2_fma
  apple-m1..m4 / oryon-1 / cortex-a76..a725
    / cortex-x1..x925 / neoverse-n1..v3 / grace
    / ampere1..1b                                      → neon
Returns `None` for unknown CPUs — caller can fall back to
`cpu_ops_for_tier("scalar")` if a "best-effort" answer is needed.
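The name-to-tier mapping described above can be sketched as a plain `match`. This is an illustrative subset built from the spot-check names listed here, not the full table; the function body is an assumption about how such a lookup might be written, and only the helper name `cpu_to_tier` and the tier strings come from this PR.

```rust
/// Map a GCC CPU codename to the dispatch tier it lands in.
/// Only a handful of the names from the spot-check list are included.
pub fn cpu_to_tier(name: &str) -> Option<&'static str> {
    let tier = match name {
        "sapphirerapids" | "graniterapids" | "emeraldrapids" => "amx_int8",
        "cascadelake" | "cooperlake" | "tigerlake" | "rocketlake" | "znver4" | "znver5" => {
            "avx512vnni"
        }
        "alderlake" | "raptorlake" | "meteorlake" | "sierraforest" => "avxvnni",
        "haswell" | "broadwell" | "skylake" | "znver1" | "znver2" | "znver3" => "avx2_fma",
        "apple-m1" | "apple-m2" | "oryon-1" | "grace" => "neon",
        // Unknown names are reported as None rather than guessed.
        _ => return None,
    };
    Some(tier)
}
```

Because the lookup returns a tier name rather than a function table, a caller that wants a best-effort answer can decide for itself whether to substitute `"scalar"` when the result is `None`.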
Use cases for `cpu_ops_for_cpu`:
* "What would $CPU pick?" introspection without running on $CPU.
* Cross-compilation reports + deployment-planning tools.
* Integration tests asserting tier selection for named targets.
* Explicit-tier-pinning ("force AVX2 even though AMX is available,
to measure overhead").
Future: code-gen the table from a `build.rs` that fetches GCC's
latest core list. Today the table is hand-rolled from the scrape
recorded in matrix doc § M.
# Verification
* `cargo test --lib --features runtime-dispatch`: 2147 tests pass
(was 2105 — +5 new cpu_ops tests + 37 carried over from prior
feature-gated tests now compiled-in too).
* 5 new cpu_ops tests:
cpu_ops_resolves_on_this_host
cpu_ops_stable_across_calls (LazyLock fires once)
cpu_ops_for_tier_known_names
cpu_ops_for_cpu_data_driven_lookup (spot-checks the GCC scrape)
cpu_ops_call_through_dto (full indirect-call exercise)
* cargo clippy --lib --tests --features rayon,native,runtime-dispatch
-- -D warnings clean.
* cargo fmt --all --check clean.
* Default build (no feature) unchanged: zero impact on existing
paths — the entire `simd_runtime` module is gated out.
# Backward-compat for the existing per-op LazyLock surface
The pub(super) wrappers in `vnni_dot.rs` and `add_mul.rs`
(`*_safe` / `*_safe_wrapper` / `*_scalar_wrapper`) are new but
purely additive — every existing public function in `simd_runtime`
keeps its prior signature and dispatch behavior.
https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d50caaf578
    pub fn cpu_ops_for_cpu(name: &str) -> Option<&'static CpuOps> {
        cpu_ops_for_tier(cpu_to_tier(name)?)
Preserve cpu-name lookup across target architectures
cpu_ops_for_cpu currently resolves a CPU name via cpu_to_tier and then immediately calls cpu_ops_for_tier, but cpu_ops_for_tier is #[cfg(target_arch)]-gated. On an x86_64 build, ARM tiers like "neon" are compiled out, so known names such as "apple-m2" map to None even though cpu_to_tier recognizes them. This breaks the documented “what would this CPU pick?” cross-target introspection use case and makes lookup results depend on the build host architecture rather than the CPU name input.
…on (codex P2)
Codex flagged on PR #187 that `cpu_ops_for_cpu` is cfg-gated through `cpu_ops_for_tier`, so cross-arch lookups silently return None. For example, `cpu_ops_for_cpu("apple-m2")` on an x86_64 build maps "apple-m2" → "neon" via `cpu_to_tier`, but then `cpu_ops_for_tier("neon")` is compiled out because `CPU_OPS_NEON` is `cfg(target_arch = "aarch64")`. This broke the documented "what would this CPU pick?" introspection use case, which is supposed to work for deployment-planning tools and cross-target reports regardless of the build host.
Fix: promote the previously-private `cpu_to_tier` to `pub fn cpu_tier_for_cpu`. It returns `Option<&'static str>` and is cfg-free, so `cpu_tier_for_cpu("apple-m2")` reliably returns `Some("neon")` on every build target. `cpu_ops_for_cpu` keeps its current semantics (current-arch only), but the docstring now explicitly says so and points cross-arch callers at `cpu_tier_for_cpu`. Returning a phantom CpuOps with scalar fn ptrs for cross-arch lookups would lie about behavior; it is better to return None and force callers to use the honest tier-name surface.
Added regression test `cpu_tier_for_cpu_is_cross_arch` that asserts the cross-arch CPU names resolve on every build host.
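The split between the cfg-free name lookup and the cfg-gated table lookup can be illustrated with a reduced sketch. The function and static names follow this commit message, but the bodies, the two-entry CPU table, and the one-field `CpuOps` struct are assumptions made for illustration only.

```rust
/// Reduced stand-in for the real DTO: only the tier name is kept.
pub struct CpuOps {
    pub tier: &'static str,
}

static CPU_OPS_SCALAR: CpuOps = CpuOps { tier: "scalar" };
#[cfg(target_arch = "x86_64")]
static CPU_OPS_AVX2_FMA: CpuOps = CpuOps { tier: "avx2_fma" };
#[cfg(target_arch = "aarch64")]
static CPU_OPS_NEON: CpuOps = CpuOps { tier: "neon" };

/// cfg-free: answers the same way on every build target.
pub fn cpu_tier_for_cpu(name: &str) -> Option<&'static str> {
    match name {
        "haswell" => Some("avx2_fma"),
        "apple-m2" => Some("neon"),
        _ => None,
    }
}

/// cfg-gated: tiers for other architectures are compiled out, so they
/// fall through to `None`.
pub fn cpu_ops_for_tier(tier: &str) -> Option<&'static CpuOps> {
    match tier {
        "scalar" => Some(&CPU_OPS_SCALAR),
        #[cfg(target_arch = "x86_64")]
        "avx2_fma" => Some(&CPU_OPS_AVX2_FMA),
        #[cfg(target_arch = "aarch64")]
        "neon" => Some(&CPU_OPS_NEON),
        _ => None,
    }
}

/// Current-arch only: `None` for unknown CPUs and for foreign-arch tiers.
pub fn cpu_ops_for_cpu(name: &str) -> Option<&'static CpuOps> {
    cpu_ops_for_tier(cpu_tier_for_cpu(name)?)
}
```

On an x86_64 build of this sketch, `cpu_tier_for_cpu("apple-m2")` still yields `Some("neon")` while `cpu_ops_for_cpu("apple-m2")` yields `None`, which is the behaviour the docstring change describes.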
Summary
Two scoped-together additions, both driven by the same insight — using scraped CPU metadata to drive runtime dispatch.
Piece A: matrix doc § M (GCC-grounded aarch64 enumeration)
The matrix doc had three aarch64 columns (A53 / A72 / A76) covering dispatch tiers but no per-core authoritative source. § M now records the canonical core list scraped from GCC's `gcc/config/aarch64/aarch64-cores.def`: 28 cores spanning V8.0-A through V9.2-A, each with its verbatim FEATURE_STRING.
Bug fix as a side effect: the A76 column legend claimed "+bf16", but A76 is V8.2-A and BF16 came in V8.6-A. The wrong claim was removed, and Apple M1 is listed under the A76 tier (the V8.5-A baseline includes V8.2's dotprod+fp16 but NOT bf16/i8mm).
New tier groupings should become matrix columns in Phase 3b (when the NEON BFMMLA / BFDOT / FMLA.8h asm-byte arms land).
Piece B: `CpuOps` DTO (the third dispatch pattern)
Adds `src/simd_runtime/cpu_ops.rs` exposing a third dispatch pattern that coexists with the existing two:
* `crate::simd::*`: compile-time `#[cfg(target_feature)]` cascade
* `crate::simd_runtime::vnni_dot_u8_i8` etc.: per-op `LazyLock<fn ptr>`
* `cpu_ops()` → `&'static CpuOps` DTO (THIS PR)
The OpenBLAS / MKL dispatch model. All three coexist; consumers pick by import path.
Seven static `CpuOps` instances are baked at compile time, one per tier (`amx_int8`, `avx512vnni`, `avx512f`, `avxvnni`, `avx2_fma`, `neon`, `scalar`). Each references the existing trampolines in `vnni_dot.rs` / `add_mul.rs`, so there is no kernel duplication; this module is pure dispatch glue.
The naughty data-driven part
`cpu_ops_for_cpu(name: &str) -> Option<&'static CpuOps>` maps GCC CPU codenames to the dispatch tier they land in. Source: § M's GCC scrape. Spot-checks, all verified by the test suite, cover the `amx_int8`, `avx512vnni`, `avxvnni`, `avx2_fma`, and `neon` tiers. Returns `None` for unknown CPUs.
Use cases: "what would this CPU pick?" introspection without running on it; cross-compilation reports; deployment-planning tools; integration tests asserting tier selection for named targets; explicit-tier-pinning ("force AVX2 to measure overhead").
Future: code-gen the table from a `build.rs` that fetches GCC's latest core list. Today it's hand-rolled from the scrape.
Test plan
* `cargo test --lib --features runtime-dispatch`: 2147 tests pass (was 2105: +5 new cpu_ops tests, +37 previously feature-gated tests now compiled in).
* `cpu_ops` tests:
  * `cpu_ops_resolves_on_this_host`
  * `cpu_ops_stable_across_calls` (LazyLock fires once)
  * `cpu_ops_for_tier_known_names`
  * `cpu_ops_for_cpu_data_driven_lookup` (spot-checks the GCC scrape)
  * `cpu_ops_call_through_dto` (full indirect-call exercise via `(ops.vnni_dot_u8_i8)(a, b)` and `(ops.add_mul_f32)(...)`)
* `cargo clippy --lib --tests --features rayon,native,runtime-dispatch -- -D warnings` clean.
* `cargo fmt --all --check` clean.
Backward-compat
The `pub(super)` wrappers in `vnni_dot.rs` and `add_mul.rs` (`*_safe` / `*_safe_wrapper` / `*_scalar_wrapper`) are new but purely additive: every existing public function in `simd_runtime` keeps its prior signature and dispatch behavior.
Out of scope (separate PRs)
* `build.rs` automation of the GCC scrape (today the data is hand-rolled).
* Extending `CpuOps` to cover the matmul / cast surfaces (today only `vnni_dot_u8_i8`, `add_mul_f32`, `add_mul_f64`, the only surfaces with explicit per-tier kernels in `simd_runtime/`; the matmul/cast trampolines delegate to functions that runtime-dispatch internally, so they're a `pub fn` per-op call, not a `CpuOps` field).
* … (`neon_bf16` / `neon_dotprod`): lands with the Phase 3b asm-byte arms.
https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u
Generated by Claude Code