Integrates four analysis passes into one executable sequence:
- REFACTOR_HPC_INTEGRATION.md — Type bridges & extension traits
- SOA_KERNEL_ARCHITECTURE.md — Columnar cascade & field-separated storage
- Transformer session feedback — API conventions, namespace, codegen macros
- lance-graph W1a consumer contract — 5 SIMD primitives blocking 158-violation remediation
Each wave is self-contained. Later waves depend on earlier ones. Within a wave, items are independent and can run in parallel.
- Dispatch resolves at compile time via
cfg(target_feature)insimd.rs. No per-call runtime checks. simd_avx2.rsis dead on x86-64-v4. Don't add to it.- NEON does 256-bit as 2×128-bit paired.
- VPABSB does NOT saturate i8::MIN.
- Palette-256 is the dominant gather use case.
- Rayon work-stealing is not the lever if typed SIMD integration is the global lever.
- The
hpc/simd_caps.rsandhpc/simd_dispatch.rsfiles are dead code under the pin — 877 lines to delete.
Unlocks everything else. No code changes to hot paths. Pure contract + tooling.
| ID | Item | Source | Effort | Why First |
|---|---|---|---|---|
| W0.1 | Dual-form signature convention (_into + Vec wrapper + _ptr) |
R1.1 | 1d | Fixes root cause of PP-15 WS-1/2/4 ❌. All future kernels follow this shape. |
| W0.2 | Unified HpcError enum (src/hpc/error.rs) |
R1.2 | 4h | Replaces 4 error conventions with 1. |
| W0.3 | #![deny(warnings)] → targeted denies |
R1.3 | 1h | Unblocks doctest authoring for all subsequent waves. |
| W0.4 | Feature flag rename (hpc-extras → research, backend-* prefix) |
R4.1 | 4h | Clean slate for feature-gated additions. |
| W0.5 | Prelude module (hpc::prelude::*) |
R1.4 | 1h | Single discovery surface for downstream consumers. |
Gate: All downstream crates (burn, candle, tract, ort, lance-graph) still compile unchanged. The dual-form is additive; existing call sites continue working.
From lance-graph W1a consumer contract. Five primitives blocking 158 raw-intrinsic
violations across 5 lance-graph crates. Each is a tight-scope PR. Parallel review.
The lance-graph simd-savant agent runs PRE-MERGE against each PR.
| ID | Item | Source | Effort | Consumer Site |
|---|---|---|---|---|
| W1.1 | I8x16::from_i4_packed_u64() + batch_packed_i4_16() |
W1a-#1 | 1d | lance-graph-contract/src/mul.rs::i4_eval::batch — 5 batch fns over QualiaI4_16D(u64). Closes PR #398 codex P1 (NEON OOB at len==2). |
| W1.2 | I8x16::saturating_abs() + I8x32::saturating_abs() |
W1a-#2 | 4h | lance-graph-contract/src/mul.rs — Direction-B fix. abs(i8::MIN) must return i8::MAX, not wrap. See §VPABSB correction. |
| W1.3 | U16x8::gather_u16() + palette_lookup_u8x8() |
W1a-#3 | 4h | bgz17/src/simd.rs:88 — currently inlines _mm256_i32gather_epi32 (AP-SIMD-1 violation). |
| W1.4 | prefetch_read_t0/t1/t2() |
W1a-#4 | 2h | bgz17/src/prefetch.rs:96,100 — currently inlines _mm_prefetch directly. |
| W1.5 | U64x8::popcnt() + U64x8::xor_popcount() |
W1a-#5 | 4h | holograph/hamming.rs + lance-graph/graph/blasgraph/types.rs — inline _mm512_popcnt_epi64. |
Every PR MUST:
- Impl in
simd_avx512.rs+simd_neon.rs+ scalar fallback. - Edge-case semantics documented (
i8::MIN, empty slices, OOB indices). - Parity test: all backends produce identical output on randomized corpus.
- Bench against scalar: speedup ratios in PR body.
// SAFETY:on everyunsafeblock.- No
is_x86_feature_detected!. No#[target_feature(enable=...)]. - Consumer site cited in PR description.
impl I8x16 {
/// Unpack 16 signed i4 nibbles from a u64 into 16 i8 lanes (sign-extended).
/// Nibble layout: lane[i] = sign_extend_4((packed >> (4*i)) & 0xf).
pub fn from_i4_packed_u64(packed: u64) -> Self;
/// Const-folded lane extract.
pub fn lane_i8<const N: usize>(self) -> i8;
}
/// Closure-parameterized batch: run `f` over each (unpacked_i8x16, aux[i]) pair.
/// Bounds-aware tail handling; scalar fallback on unsupported arch.
pub fn batch_packed_i4_16<E, F>(packed: &[u64], aux: &[i8], out: &mut [E], f: F)
where F: Fn(I8x16, i8) -> E + Sync + Send, E: Copy;AVX-512: _mm_cvtsi64_si128 + nibble shuffle via VPSHUFB mask LUT, then sign-extend.
Alt: PDEP (_pdep_u64 × 2) into two u64 halves → load → vpmovsxbw. Bench both on Zen4 + SPR.
NEON: vld1_u8 8 bytes → vshl_n_s8(v, 4) + vshr_n_s8(v, 4) for nibble split + sign-extend.
Scalar: ((packed >> (4*i)) & 0xf) as i8, with if x > 7 { x - 16 } else { x } for sign-extend.
The batch_packed_i4_16 closure-batch is the foundation for W1.5-#7 (randomized signatures).
Critical: _mm512_abs_epi8 does NOT saturate i8::MIN. Returns 0x80 (= -128).
impl I8x16 { pub fn saturating_abs(self) -> Self; }
impl I8x32 { pub fn saturating_abs(self) -> Self; }AVX-512: _mm512_min_epu8(_mm512_abs_epi8(x), _mm512_set1_epi8(0x7f)).
NEON: vqabsq_s8(x) — hardware-saturating (the q suffix).
Scalar: i8::saturating_abs() (stdlib, well-defined).
Mandatory test:
let input = I8x16::splat(i8::MIN);
let result = input.saturating_abs();
assert_eq!(result.lane_i8::<0>(), i8::MAX); // 127, NOT -128impl U16x8 {
/// Gather 8 u16 values from `table` at given indices.
/// Debug: panics if max(indices) >= table.len(). Release: scalar fallback.
pub fn gather_u16(indices: U16x8, table: &[u16]) -> Self;
}
pub fn palette_lookup_u8x8(idx_v: U16x8, lut: &[u8]) -> U8x8;Palette-256 is the dominant use case — every index fits in u8, table is always 256 or 512 bytes. No bounds risk at palette-256 widths. The gather_u16 API must handle arbitrary tables too, but palette-256 should be the fast path (no OOB possible when table.len() == 256 and indices are u8-sourced).
Codex P2 fix (gather_u16 OOB): _mm256_i32gather_epi32 reads 4 bytes per slot
from a &[u16] table — overreads 2 bytes at table[len-1]. For palette-256 this
is harmless (256 × 2 = 512 bytes, 4-byte read at index 255 reads bytes 510-513,
which is within a 512-byte aligned allocation + padding). For arbitrary tables,
use scalar fallback or pad the table allocation.
AVX-512: _mm512_i32gather_epi32 with index widening + mask to u16.
NEON / Scalar: loop (0..8).map(|i| table[indices.lane(i)]).
pub fn prefetch_read_t0(ptr: *const u8); // L1
pub fn prefetch_read_t1(ptr: *const u8); // L2
pub fn prefetch_read_t2(ptr: *const u8); // L3x86: _mm_prefetch(ptr, _MM_HINT_T0/T1/T2).
aarch64: prfm pldl1keep/pldl2keep/pldl3keep.
Other: no-op (prefetch is a hint; silent no-op is correct per ISA).
impl U64x8 {
/// Lane-wise population count. Each lane returns its u64 bit-count (0..=64).
pub fn popcnt(self) -> Self;
/// XOR + lane-wise popcount + horizontal sum. Optimized for Hamming distance.
pub fn xor_popcount(self, other: Self) -> u64;
}
impl U64x4 { pub fn popcnt(self) -> Self; } // NEON/scalar parityAVX-512 VPOPCNTDQ: _mm512_popcnt_epi64 directly (feature avx512vpopcntdq —
available on sapphirerapids, enabled by the target-cpu pin).
NEON popcount per-u64: vcntq_u8 → vpaddlq_u8 → vpaddlq_u16 → vpaddlq_u32
(NOT vaddvq_u8 which merges ALL lanes to a single scalar — Codex P2 fix).
Scalar: u64::count_ones() fused loop.
Three more primitives queued behind lance-graph:crates/sigker certification
(Hambly-Lyons 2010 path-signature uniqueness). No code needed today. Listed
so W1a additions are designed broad enough to compose with these later.
| ID | Primitive | Gate |
|---|---|---|
| W1.5-#6 | signature_pde_sweep() — Goursat PDE for 〈S(X), S(Y)〉 |
jc Pillar 11 activation |
| W1.5-#7 | Randomized projection (Cuchiero-Schmocker-Teichmann 2021) | sigker bench at production widths |
| W1.5-#8 | Lyndon-pack log-signature (7-13× compression) on I16x16 |
sigker bench at production widths |
The closure-batch shape introduced in W1.1 is the foundation for W1.5-#7.
Gate: All 5 primitives pass parity tests across AVX-512/NEON/scalar.
lance-graph simd-savant agent verifies compliance PRE-MERGE.
Consumer-side migration PRs (#398-#402) unblocked.
Eliminates copy-paste before adding new code. Every subsequent wave benefits. Wave 1 primitives become the first consumers of these macros (retrofit optional).
| ID | Item | Source | Effort | What It Produces |
|---|---|---|---|---|
| W2.1 | Dtype-parity macro (reductions_for!) |
R3.1 | 1d | One line = 7 reductions for a dtype. Cuts 700→150 lines. |
| W2.2 | Per-arch impl macro (simd_impl!) |
R3.2 | 4h | Generates the struct + methods for a type in both simd_avx512.rs and simd_neon.rs from one body. |
| W2.3 | Reduction kernel template (reduce_simd()) |
R3.3 | 4h | Generic chunk-loop; sum/max/nrm2 become 5-line callers. |
| W2.4 | Dual-form fusion (kernel_simd_dual!) |
R3.4 | 1d | One body → _into, Vec, _ptr, all arch variants. |
Gate: cargo test -p ndarray passes. Macro output matches existing hand-rolled functions.
Run benches to verify no regression.
From REFACTOR_HPC_INTEGRATION.md Tier 1. Makes domain types ndarray-native.
| ID | Item | Source | Effort | Impact |
|---|---|---|---|---|
| W3.1 | Fingerprint ↔ ArrayView1 | Tier 1.1 | 4h | Zero-copy bridge, unblocks Zip/broadcast on fingerprints |
| W3.2 | VsaVector ↔ ArrayView1 | Tier 1.2 | 2h | Same pattern |
| W3.3 | BF16/F16 → BlasFloat | Tier 1.3 | 4h | Enables Array<BF16>.dot(), unlocks BLAS dispatch |
| W3.4 | CogRecord channel_as_words() | Tier 1.4 | 2h | ArrayView1 per channel |
| W3.5 | Arrow bridge → ArrayView factories | Tier 4.1 | 2h | binary_column_view() returns ArrayView2 |
Gate: New From/Into impls compile. Existing code unchanged. Add tests for each bridge.
From REFACTOR_HPC_INTEGRATION.md Tier 2. hpc operations feel native on ndarray types.
| ID | Item | Source | Effort | Impact |
|---|---|---|---|---|
| W4.1 | HdcOps trait (K0/K1/K2 on Array1) | Tier 2.1 | 4h | query.cascade_distance(&candidate, &gate) |
| W4.2 | Quantize trait (Array → BF16/I8 with Array output) | Tier 2.2 | 4h | array.to_bf16(), array.to_i8_symmetric() |
| W4.3 | Prefilter on ArrayView2 | Tier 2.3 | 4h | Eliminates (slice, rows, cols) pattern |
| W4.4 | ActivationsNd (softmax_axis, log_softmax_axis) | Tier 2.4 | 4h | batch.softmax_axis(Axis(1)) for inference |
| W4.5 | SimdMath trait (VML on any Array) | Tier 2.5 | 4h | array.simd_exp() — 10-50x for transcendentals |
| W4.6 | Int8Matmul trait (Array2 × Array2 → Array2) | Tier 3.3 | 4h | Quantized inference via ndarray types |
Gate: Each trait has tests matching the raw-slice function's test suite. Benchmark: trait overhead vs. direct raw-slice call ≤ 1%.
From REFACTOR_HPC_INTEGRATION.md Tier 3. Core ndarray operations silently accelerate.
| ID | Item | Source | Effort | Impact |
|---|---|---|---|---|
| W5.1 | Delete duplicate detection code (simd_caps + simd_dispatch → dead code under cfg pin) | Tier 3.2 | 4h | Deletes 877 lines that are unreachable when target-cpu is pinned |
| W5.2 | Core sum/mean → SIMD dispatch | Tier 3.1 | 4h | 16x faster .sum() on contiguous f32/f64 |
| W5.3 | SIMD axis reductions (sum_axis with SIMD lanes) | Tier 6.1 | 1d | ML training hot path |
Note on W5.1: With target-cpu=sapphirerapids in .cargo/config.toml, the
cfg(target_feature = "avx512f") branch in simd.rs is the only live path.
hpc/simd_caps.rs (515 lines) and hpc/simd_dispatch.rs (362 lines) exist for
a multi-binary world that doesn't apply when the target is pinned. These can be
deleted or feature-gated behind cfg(not(target_feature = "avx512f")) for CI
fallback builds.
Gate: cargo bench shows measurable improvement on contiguous arrays.
Non-contiguous arrays unchanged (fallback to generic fold).
From SOA_KERNEL_ARCHITECTURE.md. The genius-level structural change.
Wave 1's U64x8::popcnt() and U64x8::xor_popcount() are direct prerequisites
for the columnar XOR+popcount scan in W6.2.
| ID | Item | Source | Effort | Impact |
|---|---|---|---|---|
| W6.1 | TieredDatabase struct (K0/K1/K2 as F-order Array2) | SoA §5 | 1d | Physical separation matching cache hierarchy |
| W6.2 | k0_columnar_simd (scan N u64s with broadcast+XOR+popcount) | SoA §1 | 1d | Uses U64x8::xor_popcount from W1.5. 8 candidates per cycle. |
| W6.3 | Bitmask survivor propagation (mask narrowing, no branching) | SoA §2 | 4h | Replace per-candidate continue with mask ops |
| W6.4 | k1_accumulate_columnar (8-column scan with mask) | SoA §1 | 4h | K1 on contiguous columns |
| W6.5 | k2_exact_masked (row extraction for survivors only) | SoA §1 | 4h | Stack-copy 2KB per survivor |
| W6.6 | BF16FieldDatabase (sign/exp/mantissa separated at ingest) | SoA §6 | 1d | Awareness without runtime decomposition |
| W6.7 | QualiaColumns (16 × Array1) | SoA §7 | 4h | 18x dimension scan throughput. Uses I8x16::from_i4_packed_u64 from W1.1. |
| W6.8 | Arrow native path (RecordBatch columns → columnar scan) | SoA §4 | 4h | Zero-copy from Lance/Arrow |
| W6.9 | Benchmark: AoS vs SoA on 1M SKU-16K containers | SoA §perf | 4h | Proof of 4-8x claim |
Gate: SoA cascade produces identical results to AoS cascade on full test suite. Benchmark confirms ≥3x throughput improvement on 1M containers.
From transformer feedback R2.1. Enforces the architecture rule from CLAUDE.md.
| ID | Item | Source | Effort | Impact |
|---|---|---|---|---|
| W7.1 | Split hpc/ → hpc/ + cog/ + ext/ + io/ | R2.1 | 2-3d | 30 numeric modules in hpc/, 35 cognitive in cog/, 20 experimental in ext/, 6 I/O in io/ |
| W7.2 | SIMD directory consolidation (simd_*.rs → src/simd/) | R2.2 | 1d | One directory for all SIMD code. W1 primitives land in new locations. |
| W7.3 | Quantized module split (quantized.rs → hpc/quant/) | R2.3 | 4h | BlockQ4_0 packed struct for candle compat |
Gate: pub use deprecation shims for all moved modules.
All existing use ndarray::hpc::* paths still resolve (with deprecation warning).
| ID | Item | Source | Effort | Impact |
|---|---|---|---|---|
| W8.1 | Extract integration tests to tests/hpc/, tests/cog/ |
R4.2 | 2d | Module files lose 30-50% of line count |
| W8.2 | HPC bench harness (dot, reductions, softmax, quantized) | R4.5 | 1d | Reproducible perf claims |
| W8.3 | Cross-comparison bench (ndarray vs candle vs tract kernel) | R4.5 ext | 4h | The whole reason for B1 |
| W8.4 | W1a primitive parity bench (AVX-512 vs NEON vs scalar speedup) | lance-graph | 4h | Acceptance criteria §4 |
| ID | Item | Source | Effort | Impact |
|---|---|---|---|---|
| W9.1 | Tag v0.18.0 | R4.4 | 2h | Clean cut |
| W9.2 | CHANGELOG with migration guide | R4.4 | 2h | Downstream knows what changed |
| W9.3 | Downstream Cargo.toml pins (burn, candle, tract, ort, lance-graph) | R4.4 | 2h | tag = "v0.18.0" |
| W9.4 | Remove deprecation shims (schedule for v0.19) | R2.1 | — | Future cleanup |
Wave 0 (conventions)
│
├──→ Wave 1 (SIMD primitives) ←── P0 for lance-graph
│ │
│ ├──→ Wave 6.2 (k0_columnar uses U64x8::xor_popcount from W1.5)
│ ├──→ Wave 6.7 (QualiaColumns uses I8x16::from_i4_packed from W1.1)
│ └──→ Wave 8.4 (primitive parity benches)
│
├���─→ Wave 2 (codegen macros)
│ │
│ └──→ Wave 5 (backend wiring uses macros)
│
├──→ Wave 3 (type bridges)
│ │
│ ├──→ Wave 4 (extension traits need bridges)
│ ├���─→ Wave 5 (backend dispatch needs BlasFloat for BF16)
│ └──→ Wave 6 (SoA needs Fingerprint↔Array, Arrow views)
│
├──→ Wave 7 (namespace) — independent of 1-6
│
└──→ Wave 8 (tests/benches) — after 1+5+6
│
└──→ Wave 9 (release)
For lance-graph (P0 consumer): 0 → 1 = 6 days. Consumer migration PRs unblocked.
For full release: 0 → 1 → 3 → 5 → 6 �� 8 → 9 = 18 days serial. With parallelism (1∥2, 3∥4, 6∥7, 8∥backlog): ~12 working days.
The 5 SIMD primitives aren't isolated additions — they're load-bearing for later waves:
| Primitive | Immediate consumer (lance-graph) | Later consumer (ndarray) |
|---|---|---|
I8x16::from_i4_packed_u64 |
mul.rs::i4_eval::batch (5 batch fns) |
W6.7 QualiaColumns — unpack i4 qualia from packed storage without scalar loop |
I8x16::saturating_abs |
Direction-B fix, ValleyOfDespair classifier | W4.2 Quantize trait — safe abs in quantization error metrics |
U16x8::gather_u16 |
bgz17/simd.rs palette lookup |
W6.6 BF16FieldDatabase �� gather exponent fields from lookup table |
prefetch_read_t0/t1/t2 |
bgz17/prefetch.rs tile prefetch |
W6.2 k0_columnar_simd — prefetch next column chunk during K0 scan |
U64x8::popcnt + xor_popcount |
holograph/hamming.rs + blasgraph/types.rs |
W6.2-W6.5 entire SoA cascade — columnar XOR+popcount is THE operation |
The SoA cascade (Wave 6) is fundamentally built on U64x8::xor_popcount(). Without Wave 1,
the SoA implementation would need to drop back to hpc::bitwise::hamming_distance_raw() on
slices, losing the typed-wrapper discipline and duplicating dispatch logic.
| Wave 6 needs | Provided by |
|---|---|
U64x8::xor_popcount() for columnar scan |
Wave 1.5 (SIMD primitives) |
I8x16::from_i4_packed_u64() for qualia |
Wave 1.1 (SIMD primitives) |
prefetch_read_t0() for column prefetch |
Wave 1.4 (SIMD primitives) |
| F-order Array2 database | Wave 3 (type bridges — Fingerprint converts to Array) |
| Impl macro for k0_columnar_simd | Wave 2 (codegen macros) |
_into form for columnar kernels |
Wave 0 (signature convention) |
Extension trait: database.cascade_soa(&query, &gate) |
Wave 4 (HdcOps trait extended) |
| BF16FieldDatabase uses Quantize trait | Wave 4 (Quantize extension) |
| Arrow columns → direct scan | Wave 3 (Arrow view factories) |
| Benchmark harness to prove 4-8x | Wave 8 (bench infrastructure) |
The SoA insight goes beyond just "transpose the database." It's a design principle that applies recursively across the entire surface:
- Fingerprint database: column-per-word → cascade becomes column scan
- BF16 fields: separated sign/exp/mantissa → awareness is native layout
- Qualia: column-per-dimension → per-dim queries are sequential scans
- The dual-form convention (W0.1) IS SoA thinking applied to function signatures:
- AoS API:
fn foo(input) -> output(allocates inside, hides layout) - SoA API:
fn foo_into(input, output)(caller controls memory layout)
- AoS API:
- The typed-wrapper-as-method convention IS SoA for the SIMD surface:
- AoS: raw intrinsics scattered across consumer crates (158 violations)
- SoA: struct methods on typed wrappers, consumers import ONE namespace
- The closure-batch pattern (W1.1
batch_packed_i4_16) IS SoA for computation:- AoS: each consumer writes its own unpack+process+pack loop
- SoA: one batch primitive accepts a closure, owns chunking/tailing/dispatch
- AoS: one module (
hpc/) contains everything → grep to find anything - SoA: separated by concern (
hpc/,cog/,ext/,io/) → contiguous columns
Array2<u64>in F-order IS the SoA database.column(i)IS the field-access pattern.row(i)IS the record-access pattern- Strides encode the duality without duplication
- Arrow IS columnar storage (SoA by definition)
- Lance stores Arrow columns on disk
- The SoA kernel reads Arrow columns directly — no ETL, no transpose, no copy
- Storage format = memory layout = compute pattern = cache access order
- lance-graph declares what it needs (W1a primitives), ndarray provides
- The contract is columnar: one primitive per concern, not a monolithic "SIMD library"
- Each primitive is independently testable, deployable, and benchmarkable
The architecture collapses all intermediate transformations at every level.
| Wave | Days | Parallel? | Key Outcome |
|---|---|---|---|
| 0 | 3 | Yes | Conventions locked |
| 1 | 3 | Yes (after 0) | lance-graph unblocked (P0) |
| 2 | 2 | ∥ with 1 | Codegen macros ready |
| 3 | 2 | After 0 | Domain types ↔ Array |
| 4 | 3 | After 3 | Extension traits live |
| 5 | 2 | After 2+3 | Core ops accelerated |
| 6 | 5 | After 1+3+5 | SoA cascade (the big win) |
| 7 | 3 | ∥ with 4-6 | Namespace clean |
| 8 | 2 | After 1+6 | Proof via benchmarks |
| 9 | 1 | After all | Release |
lance-graph unblocked: Wave 0+1 = 6 days. Full release critical path: 18 days serial, ~12 days with parallelism.
- Don't delete hpc/ modules — raw-slice functions stay (FFI, embedded, zero-overhead)
- Don't premature-v1.0 — surface is too young; candle/tract wiring will shake it again
- Don't gate SoA behind a feature flag — it's the default hot path, not optional
- Don't couple SoA with module restructure — they're independent; merge separately
- Don't break downstream in one shot — deprecation shims for one release minimum
- Don't ship W1a primitives without parity tests — the codex P2 i8::MIN divergence happened because no test existed
- Don't implement W1.5+ (deferred primitives) until sigker certification — they're gated