simd_runtime: CpuOps DTO (third dispatch pattern) + GCC-scraped CPU table#187

Merged: AdaWorldAPI merged 1 commit into master from claude/continue-ndarray-x0Oaw on May 21, 2026

@AdaWorldAPI
Owner

Summary

Two scoped-together additions, both driven by the same insight — using scraped CPU metadata to drive runtime dispatch.

Piece A — matrix doc § M (GCC-grounded aarch64 enumeration)

The matrix doc had three aarch64 columns (A53 / A72 / A76) covering dispatch tiers but no per-core authoritative source. § M now records the canonical core list scraped from GCC's gcc/config/aarch64/aarch64-cores.def — 28 cores spanning V8.0-A through V9.2-A, each with its verbatim FEATURE_STRING.

Bug fix as a side effect: the A76 column legend claimed "+bf16", but A76 is V8.2-A and BF16 arrived in V8.6-A. The wrong claim is removed, and Apple M1 is listed under the A76 tier (its V8.5-A baseline includes V8.2's dotprod+fp16 but NOT bf16/i8mm).

New tier groupings that should become matrix columns in Phase 3b (when NEON BFMMLA / BFDOT / FMLA.8h asm-byte arms land):

  • V8.6+/V9 with bf16+i8mm: Apple M2+, Oryon-1 (Snapdragon X), Cortex-A510+, Neoverse-N2/V2, Grace, Ampere1+
  • V8.4-A SVE outlier: Neoverse-V1 (Graviton 3)

Piece B — CpuOps DTO (the third dispatch pattern)

Adds src/simd_runtime/cpu_ops.rs exposing a third dispatch pattern that coexists with the existing two:

| Pattern | Cost model | Wins when |
|---|---|---|
| 1. crate::simd::* (compile-time #[cfg(target_feature)] cascade) | Direct monomorphized call, no runtime branch | Bench / fixed-target builds |
| 2. crate::simd_runtime::vnni_dot_u8_i8 etc. (per-op LazyLock<fn ptr>) | One CPUID + atomic-load per op, first call | Sparse-op consumers |
| 3. cpu_ops() → &'static CpuOps DTO (THIS PR) | ONE CPUID at startup; every op is a fn-ptr field | Dense-op consumers (linear-algebra pipelines) |

Pattern 3 is the OpenBLAS / MKL dispatch model. All three coexist; consumers pick by import path:

// Pattern 1
crate::simd_ops::add_mul_f32(acc, a, b);

// Pattern 2 (per-op LazyLock — PR #185)
crate::simd_runtime::add_mul_f32(acc, a, b);

// Pattern 3 (this PR — one LazyLock total)
let ops = crate::simd_runtime::cpu_ops();
unsafe { (ops.add_mul_f32)(acc, a, b); }

Seven static CpuOps instances are baked at compile time, one per tier (amx_int8, avx512vnni, avx512f, avxvnni, avx2_fma, neon, scalar). Each references the existing trampolines in vnni_dot.rs / add_mul.rs, so there is no kernel duplication; this module is pure dispatch glue.
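
To make the Pattern 3 shape concrete, here is a minimal sketch under stated assumptions. The field set, the static names, and the two backends are hypothetical; the real `cpu_ops.rs` points at the existing trampolines and covers more tiers. The "AVX2+FMA" kernel here is just the scalar loop compiled with those features enabled, not a hand-written intrinsic kernel.

```rust
use std::sync::LazyLock;

/// Hypothetical DTO shape: a tier label plus one fn-pointer field per op.
pub struct CpuOps {
    pub tier: &'static str,
    /// Computes acc[i] += a[i] * b[i]. Unsafe because a SIMD backend is only
    /// sound on CPUs that support its target features.
    pub add_mul_f32: unsafe fn(&mut [f32], &[f32], &[f32]),
}

/// Portable fallback kernel.
unsafe fn add_mul_f32_scalar(acc: &mut [f32], a: &[f32], b: &[f32]) {
    for ((x, &y), &z) in acc.iter_mut().zip(a).zip(b) {
        *x += y * z;
    }
}

/// Same loop, compiled with AVX2+FMA enabled so the compiler may vectorize it.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2,fma")]
unsafe fn add_mul_f32_avx2_fma(acc: &mut [f32], a: &[f32], b: &[f32]) {
    for ((x, &y), &z) in acc.iter_mut().zip(a).zip(b) {
        *x += y * z;
    }
}

// One static instance per tier, built at compile time.
static CPU_OPS_SCALAR: CpuOps = CpuOps {
    tier: "scalar",
    add_mul_f32: add_mul_f32_scalar,
};

#[cfg(target_arch = "x86_64")]
static CPU_OPS_AVX2_FMA: CpuOps = CpuOps {
    tier: "avx2_fma",
    add_mul_f32: add_mul_f32_avx2_fma,
};

/// Feature detection runs once; later calls only read the cached reference.
static SELECTED: LazyLock<&'static CpuOps> = LazyLock::new(|| {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") && is_x86_feature_detected!("fma") {
            return &CPU_OPS_AVX2_FMA;
        }
    }
    &CPU_OPS_SCALAR
});

/// Returns the tier selected for the host CPU.
pub fn cpu_ops() -> &'static CpuOps {
    *SELECTED
}
```

A caller fetches `cpu_ops()` once and then invokes fields such as `(ops.add_mul_f32)(acc, a, b)` inside an `unsafe` block, which matches the call shape shown for Pattern 3.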

The naughty data-driven part

cpu_ops_for_cpu(name: &str) -> Option<&'static CpuOps> maps GCC CPU codenames to the dispatch tier they land in. Source: § M's GCC scrape. The spot-checks below are all verified by the test suite:

| Pattern | CPUs | Tier |
|---|---|---|
| AMX-INT8 hosts | sapphirerapids, graniterapids, emeraldrapids | amx_int8 |
| AVX-512 + VNNI | cascadelake, cooperlake, icelake-*, tigerlake, rocketlake, znver4, znver5 | avx512vnni |
| AVX-VNNI (no AVX-512) | alderlake, raptorlake, meteorlake, arrowlake, arrowlake-s, lunarlake, pantherlake, sierraforest | avxvnni |
| Plain AVX2+FMA | haswell, broadwell, skylake, znver1-3 | avx2_fma |
| All aarch64 | apple-m1..m4, oryon-1, cortex-a76..a725, cortex-x1..x925, neoverse-n1..v3, grace, ampere1..1b | neon |

Returns None for unknown CPUs.
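
As a rough illustration of the name-to-tier mapping in the table above, the lookup could be a plain `match`. This is a sketch, not the PR's code: the function name follows the `cpu_to_tier` helper mentioned in the review, the aarch64 arm lists only a few of the names from the table, and treating `icelake-*` as a prefix match is an assumption.

```rust
/// Hypothetical lookup from a GCC-style CPU codename to a dispatch-tier label.
/// Returns None for names that are not in the table.
pub fn cpu_to_tier(name: &str) -> Option<&'static str> {
    let tier = match name {
        "sapphirerapids" | "graniterapids" | "emeraldrapids" => "amx_int8",
        "cascadelake" | "cooperlake" | "tigerlake" | "rocketlake" | "znver4" | "znver5" => {
            "avx512vnni"
        }
        // Assumption: "icelake-*" in the table means any name with this prefix.
        n if n.starts_with("icelake-") => "avx512vnni",
        "alderlake" | "raptorlake" | "meteorlake" | "arrowlake" | "arrowlake-s"
        | "lunarlake" | "pantherlake" | "sierraforest" => "avxvnni",
        "haswell" | "broadwell" | "skylake" | "znver1" | "znver2" | "znver3" => "avx2_fma",
        // Only a subset of the aarch64 names is spelled out in this sketch.
        "apple-m1" | "apple-m2" | "apple-m3" | "apple-m4" | "oryon-1" | "cortex-a76"
        | "grace" | "ampere1" => "neon",
        _ => return None,
    };
    Some(tier)
}
```

Keeping this step free of `#[cfg]` gates means the same answer comes back on every build host, which is what the introspection use cases need.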

Use cases: "what would this CPU pick?" introspection without running on it; cross-compilation reports; deployment-planning tools; integration tests asserting tier selection for named targets; explicit-tier-pinning ("force AVX2 to measure overhead").

Future: code-gen the table from a build.rs that fetches GCC's latest core list. Today it's hand-rolled from the scrape.

Test plan

  • cargo test --lib --features runtime-dispatch: 2147 tests pass (was 2105; +5 new cpu_ops tests, +37 previously feature-gated tests that are now compiled in).
  • 5 new cpu_ops tests:
    • cpu_ops_resolves_on_this_host
    • cpu_ops_stable_across_calls (LazyLock fires once)
    • cpu_ops_for_tier_known_names
    • cpu_ops_for_cpu_data_driven_lookup (spot-checks the GCC scrape)
    • cpu_ops_call_through_dto (full indirect-call exercise via (ops.vnni_dot_u8_i8)(a, b) and (ops.add_mul_f32)(...))
  • cargo clippy --lib --tests --features rayon,native,runtime-dispatch -- -D warnings clean.
  • cargo fmt --all --check clean.
  • Default build (no feature) unchanged — zero impact on existing paths.

Backward-compat

The pub(super) wrappers in vnni_dot.rs and add_mul.rs (*_safe / *_safe_wrapper / *_scalar_wrapper) are new but purely additive — every existing public function in simd_runtime keeps its prior signature and dispatch behavior.

Out of scope (separate PRs)

  • build.rs automation of the GCC scrape (today the data is hand-rolled).
  • Extending CpuOps to cover the matmul / cast surfaces. Today it covers only vnni_dot_u8_i8, add_mul_f32, and add_mul_f64, the only surfaces with explicit per-tier kernels in simd_runtime/. The matmul/cast trampolines delegate to functions that runtime-dispatch internally, so they remain per-op pub fn calls rather than CpuOps fields.
  • NEON tier expansion (neon_bf16 / neon_dotprod) — lands with the Phase 3b asm-byte arms.

https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u


Generated by Claude Code

… CPU table

Two additions, scoped together because they're the same idea — using
scraped CPU metadata to drive runtime dispatch:

# Piece A: matrix doc § M (GCC-grounded aarch64 enumeration)

The matrix had three aarch64 columns (A53 / A72 / A76) covering
*dispatch tiers* (multiple physical cores share each tier's SIMD
primitive set). The authoritative per-core feature membership lives
in GCC's `gcc/config/aarch64/aarch64-cores.def` — scraped 2026-05-21
and recorded as a new § M table covering 28 cores:

  * V8.0-A baseline (A53, A72)
  * V8.2-A dotprod+fp16 (A76, A78, X1, Neoverse-N1, Apple M1)
  * V8.5-A baseline (Apple M1 specifically — V8.5 includes V8.2's
    fp16+dotprod but NOT bf16+i8mm; corrects a wrong "+bf16" claim
    on the existing A76 row of the column legend)
  * V8.6-A baseline incl. bf16+i8mm (Apple M2/M3, Oryon-1 / Snapdragon
    X Elite, Ampere1+, Cortex-A510/A710/A715, X2/X3, Neoverse-N2/V2)
  * V8.7-A (Apple M4, Ampere1B)
  * V9.0-A SVE2 baseline + explicit bf16+i8mm flags (Cortex-A510-A715,
    X2/X3, Neoverse-N2/V2, Grace)
  * V8.4-A SVE tier (Neoverse-V1 / Graviton 3 — only V8.4 core with
    explicit SVE+bf16+i8mm)
  * V9.2-A (Cortex-A520/A720/A725, X4, X925, Neoverse-N3/V3)

Each entry verbatim from the GCC FEATURE_STRING column. Cross-
referencing with the V8.X-A baseline rules (V8.6+ includes bf16+i8mm
implicitly; V9.0 includes SVE2 implicitly) gives the canonical
"which silicon has what" table. The note flags that a new dispatch
column for the V8.6+/V9-bf16-i8mm tier needs to land alongside the
NEON BFMMLA / BFDOT asm-byte arm in Phase 3b.

The A76 column legend (line 26 of the matrix) was corrected: removed
the wrong "+bf16" (A76 itself is V8.2-A, NO bf16 — bf16 came in
V8.6-A).
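
The cross-referencing step described above (architecture baseline plus explicit feature flags) can be sketched as follows. The `Core` record and `has_bf16_i8mm` helper are hypothetical, and the sample rows in the usage note paraphrase the summary in this message rather than quoting GCC's FEATURE_STRING column. Only the "V8.6+ includes bf16+i8mm" rule stated above is modeled; V9.x baselines are handled through explicit flags only.

```rust
/// Hypothetical record mirroring one row of the § M table.
pub struct Core {
    pub name: &'static str,
    /// Architecture baseline as (major, minor), e.g. (8, 2) for V8.2-A.
    pub arch: (u8, u8),
    /// Explicit "+feature" flags listed for the core.
    pub explicit: &'static [&'static str],
}

/// True when the core has both bf16 and i8mm, either implied by a V8.6+
/// baseline or listed explicitly. V9.x minors are not modeled in this sketch,
/// so V9 cores must carry the flags explicitly to return true.
pub fn has_bf16_i8mm(core: &Core) -> bool {
    let implied = core.arch.0 == 8 && core.arch.1 >= 6;
    let explicit = ["bf16", "i8mm"].iter().all(|f| core.explicit.contains(f));
    implied || explicit
}
```

With rows such as A76 (V8.2-A, `dotprod` and `fp16`), Apple M2 (V8.6-A), Neoverse-V1 (V8.4-A with explicit `sve`, `bf16`, `i8mm`), and Grace (V9.0-A with explicit `bf16`, `i8mm`), the helper reports false only for A76, which matches the corrected column legend.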

# Piece B: CpuOps DTO — third dispatch pattern

Adds `src/simd_runtime/cpu_ops.rs` exposing a per-CPU operations DTO
distinct from the existing patterns:

  Pattern 1 (`crate::simd::*`):  compile-time `#[cfg(target_feature)]`
                                 cascade. Direct monomorphized calls.
  Pattern 2 (`crate::simd_runtime::vnni_dot_u8_i8` etc., from #185):
                                 per-op LazyLock<fn ptr>. One CPUID +
                                 atomic-load per op the first time
                                 called.
  Pattern 3 (THIS COMMIT):       per-CPU `&'static CpuOps` selected
                                 once at first access. Every op is a
                                 fn-ptr field on the struct.

Why the third pattern?
  * Per-op LazyLock: N ops touched = N atomic-load setup costs over
    the process lifetime.
  * CpuOps DTO: ONE atomic-load total at first `cpu_ops()` call;
    every subsequent op is a direct fn-ptr deref through the cached
    `&'static CpuOps`. The OpenBLAS / MKL dispatch model — wins for
    dense-op consumers (linear-algebra pipelines touching every
    BLAS-1/2/3 kernel).
  * All three coexist. Consumers pick by import path.
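
The setup-cost argument above can be illustrated with a small sketch that counts how often a stand-in detection routine runs under each pattern. Everything here is hypothetical scaffolding: `detect_tier` replaces real CPUID probing, only scalar kernels are wired in, and safe fn pointers are used for brevity.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::LazyLock;

pub type AddMulF32 = fn(&mut [f32], &[f32], &[f32]);
pub type AddMulF64 = fn(&mut [f64], &[f64], &[f64]);

// Counters recording how often the stand-in detection runs for each pattern.
pub static PER_OP_DETECTIONS: AtomicUsize = AtomicUsize::new(0);
pub static DTO_DETECTIONS: AtomicUsize = AtomicUsize::new(0);

/// Stand-in for CPUID probing; always reports the scalar tier.
fn detect_tier(counter: &AtomicUsize) -> &'static str {
    counter.fetch_add(1, Ordering::Relaxed);
    "scalar"
}

fn add_mul_f32_scalar(acc: &mut [f32], a: &[f32], b: &[f32]) {
    for ((x, &y), &z) in acc.iter_mut().zip(a).zip(b) {
        *x += y * z;
    }
}

fn add_mul_f64_scalar(acc: &mut [f64], a: &[f64], b: &[f64]) {
    for ((x, &y), &z) in acc.iter_mut().zip(a).zip(b) {
        *x += y * z;
    }
}

// Pattern 2: one LazyLock per op, so each op pays its own first-call setup.
static ADD_MUL_F32: LazyLock<AddMulF32> = LazyLock::new(|| {
    let _tier = detect_tier(&PER_OP_DETECTIONS);
    add_mul_f32_scalar as AddMulF32
});
static ADD_MUL_F64: LazyLock<AddMulF64> = LazyLock::new(|| {
    let _tier = detect_tier(&PER_OP_DETECTIONS);
    add_mul_f64_scalar as AddMulF64
});

pub fn add_mul_f32(acc: &mut [f32], a: &[f32], b: &[f32]) {
    (*ADD_MUL_F32)(acc, a, b)
}

pub fn add_mul_f64(acc: &mut [f64], a: &[f64], b: &[f64]) {
    (*ADD_MUL_F64)(acc, a, b)
}

// Pattern 3: one LazyLock for the whole struct, so setup runs once in total.
pub struct CpuOps {
    pub add_mul_f32: AddMulF32,
    pub add_mul_f64: AddMulF64,
}

static CPU_OPS: LazyLock<CpuOps> = LazyLock::new(|| {
    let _tier = detect_tier(&DTO_DETECTIONS);
    CpuOps {
        add_mul_f32: add_mul_f32_scalar,
        add_mul_f64: add_mul_f64_scalar,
    }
});

pub fn cpu_ops() -> &'static CpuOps {
    &*CPU_OPS
}
```

After both ops have been called through each path, `PER_OP_DETECTIONS` holds one count per op touched while `DTO_DETECTIONS` holds a single count, which is the difference the bullet list describes.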

Seven tiers baked as static const `CpuOps` instances:
  x86_64:  amx_int8, avx512vnni, avx512f, avxvnni, avx2_fma
  aarch64: neon
  universal: scalar

Each instance points at the existing trampolines in
`crate::simd_runtime::{vnni_dot, add_mul}` — no kernel duplication;
this module is pure dispatch glue. Backend ops referenced:
  vnni_dot_u8_i8  (3 backends: avx512+tail / avxvnni / scalar)
  add_mul_f32     (4 backends: avx512 / avx2+fma / neon / scalar)
  add_mul_f64     (4 backends: avx512 / avx2+fma / neon / scalar)

# The naughty data-driven part

`cpu_ops_for_cpu(name: &str) -> Option<&'static CpuOps>` maps GCC
CPU codenames to the dispatch tier they land in, sourced from § M's
scrape. Spot-checks (each verified by the test suite):

  sapphirerapids / graniterapids / emeraldrapids → amx_int8
  cascadelake / cooperlake / icelake-* / tigerlake / rocketlake
    / znver4 / znver5                            → avx512vnni
  alderlake / raptorlake / meteorlake / arrowlake / arrowlake-s
    / lunarlake / pantherlake / sierraforest     → avxvnni
  haswell / broadwell / skylake / znver1-3       → avx2_fma
  apple-m1..m4 / oryon-1 / cortex-a76..a725
    / cortex-x1..x925 / neoverse-n1..v3 / grace
    / ampere1..1b                                → neon

Returns `None` for unknown CPUs — caller can fall back to
`cpu_ops_for_tier("scalar")` if a "best-effort" answer is needed.

Use cases for `cpu_ops_for_cpu`:
  * "What would $CPU pick?" introspection without running on $CPU.
  * Cross-compilation reports + deployment-planning tools.
  * Integration tests asserting tier selection for named targets.
  * Explicit-tier-pinning ("force AVX2 even though AMX is available,
    to measure overhead").

Future: code-gen the table from a `build.rs` that fetches GCC's
latest core list. Today the table is hand-rolled from the scrape
recorded in matrix doc § M.

# Verification

  * `cargo test --lib --features runtime-dispatch`: 2147 tests pass
    (was 2105 — +5 new cpu_ops tests + 37 carried over from prior
    feature-gated tests now compiled-in too).
  * 5 new cpu_ops tests:
      cpu_ops_resolves_on_this_host
      cpu_ops_stable_across_calls (LazyLock fires once)
      cpu_ops_for_tier_known_names
      cpu_ops_for_cpu_data_driven_lookup (spot-checks the GCC scrape)
      cpu_ops_call_through_dto (full indirect-call exercise)
  * cargo clippy --lib --tests --features rayon,native,runtime-dispatch
    -- -D warnings clean.
  * cargo fmt --all --check clean.
  * Default build (no feature) unchanged: zero impact on existing
    paths — the entire `simd_runtime` module is gated out.

# Backward-compat for the existing per-op LazyLock surface

The pub(super) wrappers in `vnni_dot.rs` and `add_mul.rs`
(`*_safe` / `*_safe_wrapper` / `*_scalar_wrapper`) are new but
purely additive — every existing public function in `simd_runtime`
keeps its prior signature and dispatch behavior.

https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d50caaf578

Comment on lines +261 to +262
pub fn cpu_ops_for_cpu(name: &str) -> Option<&'static CpuOps> {
cpu_ops_for_tier(cpu_to_tier(name)?)
[P2] Preserve cpu-name lookup across target architectures

cpu_ops_for_cpu currently resolves a CPU name via cpu_to_tier and then immediately calls cpu_ops_for_tier, but cpu_ops_for_tier is #[cfg(target_arch)]-gated. On an x86_64 build, ARM tiers like "neon" are compiled out, so known names such as "apple-m2" map to None even though cpu_to_tier recognizes them. This breaks the documented “what would this CPU pick?” cross-target introspection use case and makes lookup results depend on the build host architecture rather than the CPU name input.

@AdaWorldAPI AdaWorldAPI merged commit bc307ec into master May 21, 2026
17 checks passed
AdaWorldAPI pushed a commit that referenced this pull request May 21, 2026
…on (codex P2)

Codex flagged on PR #187 that `cpu_ops_for_cpu` is cfg-gated through
`cpu_ops_for_tier`, so cross-arch lookups silently return None — e.g.
`cpu_ops_for_cpu("apple-m2")` on an x86_64 build maps "apple-m2" → "neon"
via `cpu_to_tier`, but then `cpu_ops_for_tier("neon")` is compiled out
because `CPU_OPS_NEON` is `cfg(target_arch = "aarch64")`.

This broke the documented "what would this CPU pick?" introspection use
case, which is supposed to work for deployment-planning tools and
cross-target reports regardless of the build host.

Fix: promote the previously-private `cpu_to_tier` to `pub fn
cpu_tier_for_cpu`. It returns `Option<&'static str>` and is cfg-free,
so `cpu_tier_for_cpu("apple-m2")` reliably returns `Some("neon")` on
every build target.

`cpu_ops_for_cpu` keeps its current semantics (current-arch only) but
the docstring now explicitly says so and points cross-arch callers
at `cpu_tier_for_cpu`. Returning a phantom CpuOps with scalar fn ptrs
for cross-arch lookups would lie about behavior — better to return
None and force callers to use the honest tier-name surface.

Added regression test `cpu_tier_for_cpu_is_cross_arch` that asserts
the cross-arch CPU names resolve on every build host.
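
The split described in this commit message can be sketched as below, under assumptions: the `CpuOps` struct is reduced to a tier label (fn-pointer fields omitted), the name table holds two entries only, and the static names other than `CPU_OPS_NEON` are hypothetical. The point is that the name-to-tier step carries no `cfg` gates while the tier-to-ops step does.

```rust
/// Reduced stand-in for the DTO; the real struct carries fn-pointer fields.
pub struct CpuOps {
    pub tier: &'static str,
}

static CPU_OPS_SCALAR: CpuOps = CpuOps { tier: "scalar" };

#[cfg(target_arch = "aarch64")]
static CPU_OPS_NEON: CpuOps = CpuOps { tier: "neon" };

#[cfg(target_arch = "x86_64")]
static CPU_OPS_AVX2_FMA: CpuOps = CpuOps { tier: "avx2_fma" };

/// cfg-free: resolves a CPU name to a tier label on every build host.
pub fn cpu_tier_for_cpu(name: &str) -> Option<&'static str> {
    match name {
        "apple-m2" => Some("neon"),
        "haswell" => Some("avx2_fma"),
        _ => None,
    }
}

/// cfg-gated: only tiers compiled for the current target resolve.
pub fn cpu_ops_for_tier(tier: &str) -> Option<&'static CpuOps> {
    match tier {
        "scalar" => Some(&CPU_OPS_SCALAR),
        #[cfg(target_arch = "aarch64")]
        "neon" => Some(&CPU_OPS_NEON),
        #[cfg(target_arch = "x86_64")]
        "avx2_fma" => Some(&CPU_OPS_AVX2_FMA),
        _ => None,
    }
}

/// Current-arch only: returns None when the tier is compiled out, instead of
/// handing back a struct whose fn pointers would not match the named CPU.
pub fn cpu_ops_for_cpu(name: &str) -> Option<&'static CpuOps> {
    cpu_ops_for_tier(cpu_tier_for_cpu(name)?)
}
```

A deployment-planning tool would call `cpu_tier_for_cpu` and work with the tier label, while code that actually executes kernels keeps using `cpu_ops_for_cpu` and treats `None` as "not available in this build".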