simd_runtime: CpuOps DTO (third dispatch pattern) + GCC-scraped CPU table by AdaWorldAPI · Pull Request #187 · AdaWorldAPI/ndarray

AdaWorldAPI · 2026-05-21T09:10:54Z

Summary

Two scoped-together additions, both driven by the same insight — using scraped CPU metadata to drive runtime dispatch.

Piece A — matrix doc § M (GCC-grounded aarch64 enumeration)

The matrix doc had three aarch64 columns (A53 / A72 / A76) covering dispatch tiers but no per-core authoritative source. § M now records the canonical core list scraped from GCC's gcc/config/aarch64/aarch64-cores.def — 28 cores spanning V8.0-A through V9.2-A, each with its verbatim FEATURE_STRING.

Bug fix as a side effect: the A76 column legend claimed "+bf16", but A76 is V8.2-A and BF16 only arrived in V8.6-A. The wrong claim is removed, and Apple M1 is listed under the A76 tier (its V8.5-A baseline includes V8.2's dotprod+fp16 but NOT bf16/i8mm).

New tier groupings that should become matrix columns in Phase 3b (when NEON BFMMLA / BFDOT / FMLA.8h asm-byte arms land):

V8.6+/V9 with bf16+i8mm: Apple M2+, Oryon-1 (Snapdragon X), Cortex-A510+, Neoverse-N2/V2, Grace, Ampere1+
V8.4-A SVE outlier: Neoverse-V1 (Graviton 3)
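The baseline rule behind these groupings (V8.6-A implies bf16+i8mm, older baselines need explicit flags) can be sketched as below. This is only an illustration: the `Arch` and `Core` types, the `core` helper, and the five-entry table are hypothetical and cover just cores named in this PR, not the 28-core § M table.

```rust
/// Architecture baselines used by the example cores (a subset of § M).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Arch {
    V8_2,
    V8_4,
    V8_5,
    V8_6,
    V9_0,
}

/// Hypothetical per-core record: baseline plus explicitly listed flags.
#[derive(Clone, Copy, Debug)]
pub struct Core {
    pub name: &'static str,
    pub arch: Arch,
    pub explicit_bf16: bool,
    pub explicit_i8mm: bool,
}

impl Core {
    /// bf16+i8mm are present if listed explicitly or implied by a V8.6-A baseline.
    pub fn has_bf16_i8mm(&self) -> bool {
        let implied = self.arch == Arch::V8_6;
        (self.explicit_bf16 || implied) && (self.explicit_i8mm || implied)
    }
}

/// Example entries taken from the groupings described in this PR.
pub static CORES: [Core; 5] = [
    Core { name: "cortex-a76", arch: Arch::V8_2, explicit_bf16: false, explicit_i8mm: false },
    Core { name: "apple-m1", arch: Arch::V8_5, explicit_bf16: false, explicit_i8mm: false },
    Core { name: "apple-m2", arch: Arch::V8_6, explicit_bf16: false, explicit_i8mm: false },
    Core { name: "neoverse-v1", arch: Arch::V8_4, explicit_bf16: true, explicit_i8mm: true },
    Core { name: "neoverse-n2", arch: Arch::V9_0, explicit_bf16: true, explicit_i8mm: true },
];

/// Look up an example core by name.
pub fn core(name: &str) -> Option<&'static Core> {
    CORES.iter().find(|c| c.name == name)
}
```

Under this model the A76 and Apple M1 entries report no bf16, which matches the legend correction above.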

Piece B — `CpuOps` DTO (the third dispatch pattern)

Adds src/simd_runtime/cpu_ops.rs exposing a third dispatch pattern that coexists with the existing two:

| Pattern | Cost model | Wins when |
| --- | --- | --- |
| 1. `crate::simd::*` compile-time `#[cfg(target_feature)]` cascade | Direct monomorphized call, no runtime branch | Bench / fixed-target builds |
| 2. `crate::simd_runtime::vnni_dot_u8_i8` etc. (per-op `LazyLock<fn ptr>`) | One CPUID + atomic-load per op, on that op's first call | Sparse-op consumers |
| 3. `cpu_ops()` → `&'static CpuOps` DTO (THIS PR) | ONE CPUID at the first `cpu_ops()` call; every op is a fn-ptr field | Dense-op consumers (linear-algebra pipelines) |

Pattern 3 is the OpenBLAS / MKL dispatch model. All three patterns coexist; consumers pick one by import path:

// Pattern 1
crate::simd_ops::add_mul_f32(acc, a, b);

// Pattern 2 (per-op LazyLock — PR #185)
crate::simd_runtime::add_mul_f32(acc, a, b);

// Pattern 3 (this PR — one LazyLock total)
let ops = crate::simd_runtime::cpu_ops();
unsafe { (ops.add_mul_f32)(acc, a, b); }

Seven static CpuOps instances are baked at compile time, one per tier (amx_int8, avx512vnni, avx512f, avxvnni, avx2_fma, neon, scalar). Each references the existing trampolines in vnni_dot.rs / add_mul.rs, so there is no kernel duplication; this module is pure dispatch glue.
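A minimal sketch of the Pattern 3 shape, assuming a simplified `CpuOps` with a single safe fn-pointer field. The `tier` field, the `detect` helper, and the safe signature are illustrative assumptions (the PR calls its fields through `unsafe`), and the AVX2+FMA entry reuses the scalar kernel as a stand-in for a real SIMD kernel.

```rust
use std::sync::LazyLock;

/// Hypothetical kernel signature: acc[i] += a[i] * b[i].
type AddMulF32 = fn(&mut [f32], &[f32], &[f32]);

/// Portable reference kernel.
fn add_mul_f32_scalar(acc: &mut [f32], a: &[f32], b: &[f32]) {
    for ((dst, x), y) in acc.iter_mut().zip(a).zip(b) {
        *dst += x * y;
    }
}

/// One fn-pointer field per op; `tier` is an illustrative extra field.
pub struct CpuOps {
    pub tier: &'static str,
    pub add_mul_f32: AddMulF32,
}

static CPU_OPS_SCALAR: CpuOps = CpuOps {
    tier: "scalar",
    add_mul_f32: add_mul_f32_scalar,
};

// Stand-in: a real table would point at an AVX2+FMA kernel here.
#[cfg(target_arch = "x86_64")]
static CPU_OPS_AVX2_FMA: CpuOps = CpuOps {
    tier: "avx2_fma",
    add_mul_f32: add_mul_f32_scalar,
};

/// Pick the best table for the running CPU (only two tiers in this sketch).
fn detect() -> &'static CpuOps {
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("avx2")
            && std::arch::is_x86_feature_detected!("fma")
        {
            return &CPU_OPS_AVX2_FMA;
        }
    }
    &CPU_OPS_SCALAR
}

/// The single LazyLock: detection runs once, on first access.
static CPU_OPS: LazyLock<&'static CpuOps> = LazyLock::new(detect);

pub fn cpu_ops() -> &'static CpuOps {
    *CPU_OPS
}
```

After the first `cpu_ops()` call, each op costs one field load plus an indirect call, which is the trade-off the table above attributes to dense-op consumers.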

The naughty data-driven part

cpu_ops_for_cpu(name: &str) -> Option<&'static CpuOps> — maps GCC CPU codenames to the dispatch tier they land in. Source: § M's GCC scrape. Spot-checks all verified by the test suite:

| CPU class | CPUs | Tier |
| --- | --- | --- |
| AMX-INT8 hosts | sapphirerapids, graniterapids, emeraldrapids | `amx_int8` |
| AVX-512 + VNNI | cascadelake, cooperlake, icelake-*, tigerlake, rocketlake, znver4, znver5 | `avx512vnni` |
| AVX-VNNI, no AVX-512 | alderlake, raptorlake, meteorlake, arrowlake, arrowlake-s, lunarlake, pantherlake, sierraforest | `avxvnni` |
| Plain AVX2+FMA | haswell, broadwell, skylake, znver1-3 | `avx2_fma` |
| All aarch64 | apple-m1..m4, oryon-1, cortex-a76..a725, cortex-x1..x925, neoverse-n1..v3, grace, ampere1..1b | `neon` |

Returns None for unknown CPUs.
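The name-to-tier half of this lookup can be sketched as a plain `match`. The function name `tier_for_cpu` is a placeholder, and the arms list only names spelled out in the table above (ranges such as `cortex-a76..a725` are represented by their endpoints), so this is a partial illustration rather than the PR's table.

```rust
/// Map a GCC CPU codename to a dispatch tier name (partial, illustrative).
pub fn tier_for_cpu(name: &str) -> Option<&'static str> {
    let tier = match name {
        "sapphirerapids" | "graniterapids" | "emeraldrapids" => "amx_int8",
        "cascadelake" | "cooperlake" | "tigerlake" | "rocketlake" | "znver4" | "znver5" => {
            "avx512vnni"
        }
        // The table lists the family as `icelake-*`.
        n if n.starts_with("icelake-") => "avx512vnni",
        "alderlake" | "raptorlake" | "meteorlake" | "arrowlake" | "arrowlake-s"
        | "lunarlake" | "pantherlake" | "sierraforest" => "avxvnni",
        "haswell" | "broadwell" | "skylake" | "znver1" | "znver2" | "znver3" => "avx2_fma",
        "apple-m1" | "apple-m2" | "apple-m3" | "apple-m4" | "oryon-1" | "grace"
        | "cortex-a76" | "cortex-a725" | "cortex-x1" | "cortex-x925"
        | "neoverse-n1" | "neoverse-v3" | "ampere1" | "ampere1b" => "neon",
        // Unknown names stay unknown rather than guessing a tier.
        _ => return None,
    };
    Some(tier)
}
```

Returning `None` for unknown names leaves the fallback decision to the caller.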

Use cases: "what would this CPU pick?" introspection without running on it; cross-compilation reports; deployment-planning tools; integration tests asserting tier selection for named targets; explicit-tier-pinning ("force AVX2 to measure overhead").

Future: code-gen the table from a build.rs that fetches GCC's latest core list. Today it's hand-rolled from the scrape.

Test plan

cargo test --lib --features runtime-dispatch: 2147 tests pass (was 2105; +5 new cpu_ops tests, +37 previously feature-gated tests that are now compiled in).
5 new cpu_ops tests:
- cpu_ops_resolves_on_this_host
- cpu_ops_stable_across_calls (LazyLock fires once)
- cpu_ops_for_tier_known_names
- cpu_ops_for_cpu_data_driven_lookup (spot-checks the GCC scrape)
- cpu_ops_call_through_dto (full indirect-call exercise via (ops.vnni_dot_u8_i8)(a, b) and (ops.add_mul_f32)(...))
cargo clippy --lib --tests --features rayon,native,runtime-dispatch -- -D warnings clean.
cargo fmt --all --check clean.
Default build (no feature) unchanged — zero impact on existing paths.
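As a rough idea of what a check like cpu_ops_stable_across_calls could assert, the sketch below counts initializer runs and compares the returned references. The `DETECT_CALLS` counter, the `detect_calls` helper, and the one-field `CpuOps` are assumptions made for the example; the PR's actual test body is not shown on this page.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::LazyLock;

/// Counts how many times the initializer ran (illustrative instrumentation).
static DETECT_CALLS: AtomicUsize = AtomicUsize::new(0);

/// Reduced stand-in for the real DTO.
pub struct CpuOps {
    pub tier: &'static str,
}

static CPU_OPS_SCALAR: CpuOps = CpuOps { tier: "scalar" };

/// The initializer runs on first access only; later accesses reuse the value.
static CPU_OPS: LazyLock<&'static CpuOps> = LazyLock::new(|| {
    DETECT_CALLS.fetch_add(1, Ordering::Relaxed);
    &CPU_OPS_SCALAR
});

pub fn cpu_ops() -> &'static CpuOps {
    *CPU_OPS
}

/// Number of initializer runs observed so far.
pub fn detect_calls() -> usize {
    DETECT_CALLS.load(Ordering::Relaxed)
}
```

A test can then call `cpu_ops()` twice, compare the two references with `std::ptr::eq`, and check that `detect_calls()` reports one run.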

Backward-compat

The pub(super) wrappers in vnni_dot.rs and add_mul.rs (*_safe / *_safe_wrapper / *_scalar_wrapper) are new but purely additive — every existing public function in simd_runtime keeps its prior signature and dispatch behavior.

Out of scope (separate PRs)

build.rs automation of the GCC scrape (today the data is hand-rolled).
Extending CpuOps to cover the matmul / cast surfaces (today only vnni_dot_u8_i8, add_mul_f32, add_mul_f64 — the only surfaces with explicit per-tier kernels in simd_runtime/; the matmul/cast trampolines delegate to functions that runtime-dispatch internally, so they're a pub fn per-op call, not a CpuOps field).
NEON tier expansion (neon_bf16 / neon_dotprod) — lands with the Phase 3b asm-byte arms.

https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u

Generated by Claude Code

… CPU table

Two additions, scoped together because they're the same idea: using scraped CPU metadata to drive runtime dispatch.

# Piece A: matrix doc § M (GCC-grounded aarch64 enumeration)

The matrix had three aarch64 columns (A53 / A72 / A76) covering *dispatch tiers* (multiple physical cores share each tier's SIMD primitive set). The authoritative per-core feature membership lives in GCC's `gcc/config/aarch64/aarch64-cores.def`, scraped 2026-05-21 and recorded as a new § M table covering 28 cores:

* V8.0-A baseline (A53, A72)
* V8.2-A dotprod+fp16 (A76, A78, X1, Neoverse-N1, Apple M1)
* V8.5-A baseline (Apple M1 specifically; V8.5 includes V8.2's fp16+dotprod but NOT bf16+i8mm, which corrects a wrong "+bf16" claim on the existing A76 row of the column legend)
* V8.6-A baseline incl. bf16+i8mm (Apple M2/M3, Oryon-1 / Snapdragon X Elite, Ampere1+, Cortex-A510/A710/A715, X2/X3, Neoverse-N2/V2)
* V8.7-A (Apple M4, Ampere1B)
* V9.0-A SVE2 baseline + explicit bf16+i8mm flags (Cortex-A510-A715, X2/X3, Neoverse-N2/V2, Grace)
* V8.4-A SVE tier (Neoverse-V1 / Graviton 3; the only V8.4 core with explicit SVE+bf16+i8mm)
* V9.2-A (Cortex-A520/A720/A725, X4, X925, Neoverse-N3/V3)

Each entry is verbatim from the GCC FEATURE_STRING column. Cross-referencing with the V8.X-A baseline rules (V8.6+ includes bf16+i8mm implicitly; V9.0 includes SVE2 implicitly) gives the canonical "which silicon has what" table. The note flags that a new dispatch column for the V8.6+/V9-bf16-i8mm tier needs to land alongside the NEON BFMMLA / BFDOT asm-byte arm in Phase 3b.

The A76 column legend (line 26 of the matrix) was corrected: removed the wrong "+bf16" (A76 itself is V8.2-A, NO bf16; bf16 came in V8.6-A).

# Piece B: CpuOps DTO, the third dispatch pattern

Adds `src/simd_runtime/cpu_ops.rs` exposing a per-CPU operations DTO distinct from the existing patterns:

* Pattern 1 (`crate::simd::*`): compile-time `#[cfg(target_feature)]` cascade. Direct monomorphized calls.
* Pattern 2 (`crate::simd_runtime::vnni_dot_u8_i8` etc., from #185): per-op LazyLock<fn ptr>. One CPUID + atomic-load per op the first time called.
* Pattern 3 (THIS COMMIT): per-CPU `&'static CpuOps` selected once at first access. Every op is a fn-ptr field on the struct.

Why the third pattern?

* Per-op LazyLock: N ops touched = N atomic-load setup costs over the process lifetime.
* CpuOps DTO: ONE atomic-load total at first `cpu_ops()` call; every subsequent op is a direct fn-ptr deref through the cached `&'static CpuOps`. This is the OpenBLAS / MKL dispatch model, which wins for dense-op consumers (linear-algebra pipelines touching every BLAS-1/2/3 kernel).
* All three coexist. Consumers pick by import path.

Seven tiers baked as static const `CpuOps` instances:

    x86_64: amx_int8, avx512vnni, avx512f, avxvnni, avx2_fma
    aarch64: neon
    universal: scalar

Each instance points at the existing trampolines in `crate::simd_runtime::{vnni_dot, add_mul}`, so there is no kernel duplication; this module is pure dispatch glue.

Backend ops referenced:

    vnni_dot_u8_i8 (3 backends: avx512+tail / avxvnni / scalar)
    add_mul_f32 (4 backends: avx512 / avx2+fma / neon / scalar)
    add_mul_f64 (4 backends: avx512 / avx2+fma / neon / scalar)

# The naughty data-driven part

`cpu_ops_for_cpu(name: &str) -> Option<&'static CpuOps>` maps GCC CPU codenames to the dispatch tier they land in, sourced from § M's scrape. Spot-checks (each verified by the test suite):

    sapphirerapids / graniterapids / emeraldrapids → amx_int8
    cascadelake / cooperlake / icelake-* / tigerlake / rocketlake / znver4 / znver5 → avx512vnni
    alderlake / raptorlake / meteorlake / arrowlake / arrowlake-s / lunarlake / pantherlake / sierraforest → avxvnni
    haswell / broadwell / skylake / znver1-3 → avx2_fma
    apple-m1..m4 / oryon-1 / cortex-a76..a725 / cortex-x1..x925 / neoverse-n1..v3 / grace / ampere1..1b → neon

Returns `None` for unknown CPUs; the caller can fall back to `cpu_ops_for_tier("scalar")` if a "best-effort" answer is needed.

Use cases for `cpu_ops_for_cpu`:

* "What would $CPU pick?" introspection without running on $CPU.
* Cross-compilation reports + deployment-planning tools.
* Integration tests asserting tier selection for named targets.
* Explicit-tier-pinning ("force AVX2 even though AMX is available, to measure overhead").

Future: code-gen the table from a `build.rs` that fetches GCC's latest core list. Today the table is hand-rolled from the scrape recorded in matrix doc § M.

# Verification

* `cargo test --lib --features runtime-dispatch`: 2147 tests pass (was 2105; +5 new cpu_ops tests + 37 carried over from prior feature-gated tests now compiled-in too).
* 5 new cpu_ops tests:
  - cpu_ops_resolves_on_this_host
  - cpu_ops_stable_across_calls (LazyLock fires once)
  - cpu_ops_for_tier_known_names
  - cpu_ops_for_cpu_data_driven_lookup (spot-checks the GCC scrape)
  - cpu_ops_call_through_dto (full indirect-call exercise)
* cargo clippy --lib --tests --features rayon,native,runtime-dispatch -- -D warnings clean.
* cargo fmt --all --check clean.
* Default build (no feature) unchanged: zero impact on existing paths, since the entire `simd_runtime` module is gated out.

# Backward-compat for the existing per-op LazyLock surface

The pub(super) wrappers in `vnni_dot.rs` and `add_mul.rs` (`*_safe` / `*_safe_wrapper` / `*_scalar_wrapper`) are new but purely additive: every existing public function in `simd_runtime` keeps its prior signature and dispatch behavior.

https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d50caaf578


chatgpt-codex-connector · 2026-05-21T09:14:22Z

+pub fn cpu_ops_for_cpu(name: &str) -> Option<&'static CpuOps> {
+    cpu_ops_for_tier(cpu_to_tier(name)?)


Preserve cpu-name lookup across target architectures

cpu_ops_for_cpu currently resolves a CPU name via cpu_to_tier and then immediately calls cpu_ops_for_tier, but cpu_ops_for_tier is #[cfg(target_arch)]-gated. On an x86_64 build, ARM tiers like "neon" are compiled out, so known names such as "apple-m2" map to None even though cpu_to_tier recognizes them. This breaks the documented “what would this CPU pick?” cross-target introspection use case and makes lookup results depend on the build host architecture rather than the CPU name input.


…on (codex P2)

Codex flagged on PR #187 that `cpu_ops_for_cpu` is cfg-gated through `cpu_ops_for_tier`, so cross-arch lookups silently return None. For example, `cpu_ops_for_cpu("apple-m2")` on an x86_64 build maps "apple-m2" → "neon" via `cpu_to_tier`, but then `cpu_ops_for_tier("neon")` is compiled out because `CPU_OPS_NEON` is `cfg(target_arch = "aarch64")`. This broke the documented "what would this CPU pick?" introspection use case, which is supposed to work for deployment-planning tools and cross-target reports regardless of the build host.

Fix: promote the previously-private `cpu_to_tier` to `pub fn cpu_tier_for_cpu`. It returns `Option<&'static str>` and is cfg-free, so `cpu_tier_for_cpu("apple-m2")` reliably returns `Some("neon")` on every build target. `cpu_ops_for_cpu` keeps its current semantics (current-arch only), but the docstring now explicitly says so and points cross-arch callers at `cpu_tier_for_cpu`. Returning a phantom CpuOps with scalar fn ptrs for cross-arch lookups would lie about behavior; it is better to return None and force callers to use the honest tier-name surface.

Added regression test `cpu_tier_for_cpu_is_cross_arch` that asserts the cross-arch CPU names resolve on every build host.
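The split described in this fix can be sketched as follows, assuming a reduced two-entry table and a `CpuOps` with only a `tier` field (both are simplifications for illustration). The name lookup carries no `cfg`, while the ops lookup only knows the tiers compiled for the current target.

```rust
/// Reduced stand-in for the real DTO.
pub struct CpuOps {
    pub tier: &'static str,
}

#[cfg(target_arch = "aarch64")]
static CPU_OPS_NEON: CpuOps = CpuOps { tier: "neon" };

#[cfg(target_arch = "x86_64")]
static CPU_OPS_AVX2_FMA: CpuOps = CpuOps { tier: "avx2_fma" };

/// cfg-free: answers the same on every build host.
pub fn cpu_tier_for_cpu(name: &str) -> Option<&'static str> {
    match name {
        "apple-m2" => Some("neon"),
        "haswell" => Some("avx2_fma"),
        _ => None,
    }
}

/// cfg-gated: tiers for other architectures are compiled out.
pub fn cpu_ops_for_tier(tier: &str) -> Option<&'static CpuOps> {
    match tier {
        #[cfg(target_arch = "aarch64")]
        "neon" => Some(&CPU_OPS_NEON),
        #[cfg(target_arch = "x86_64")]
        "avx2_fma" => Some(&CPU_OPS_AVX2_FMA),
        _ => None,
    }
}

/// Current-arch only: a known foreign CPU name yields None here.
pub fn cpu_ops_for_cpu(name: &str) -> Option<&'static CpuOps> {
    cpu_ops_for_tier(cpu_tier_for_cpu(name)?)
}
```

Cross-target tooling would call `cpu_tier_for_cpu` and treat the tier name as the answer, reserving `cpu_ops_for_cpu` for code that intends to run the kernels on the current host.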

chatgpt-codex-connector Bot reviewed May 21, 2026


AdaWorldAPI merged commit bc307ec into master May 21, 2026
17 checks passed

AdaWorldAPI mentioned this pull request May 21, 2026

simd: SimdProfile fine-grained detection + cpu-* pinning + leaf 7,1 CPUID #190

Closed

7 tasks

AdaWorldAPI mentioned this pull request May 21, 2026

fix+test: cpu_tier_for_cpu cross-arch + Pillar 12/13/14 drift-checks #191

Merged

AdaWorldAPI merged 1 commit into master from claude/continue-ndarray-x0Oaw
