Unified Refactor Sequence: ndarray Fork

Integrates four analysis passes into one executable sequence:

REFACTOR_HPC_INTEGRATION.md — Type bridges & extension traits

SOA_KERNEL_ARCHITECTURE.md — Columnar cascade & field-separated storage

Transformer session feedback — API conventions, namespace, codegen macros

lance-graph W1a consumer contract — 5 SIMD primitives blocking 158-violation remediation

Each wave is self-contained. Later waves depend on earlier ones. Within a wave, items are independent and can run in parallel.

Known constraints (from prior sessions)

Dispatch resolves at compile time via cfg(target_feature) in simd.rs. No per-call runtime checks.
simd_avx2.rs is dead on x86-64-v4. Don't add to it.
NEON does 256-bit as 2×128-bit paired.
VPABSB does NOT saturate i8::MIN.
Palette-256 is the dominant gather use case.
Rayon work-stealing is not the lever if typed SIMD integration is the global lever.
The hpc/simd_caps.rs and hpc/simd_dispatch.rs files are dead code under the pin — 877 lines to delete.

Wave 0 — Conventions & Foundations (3 days)

Unlocks everything else. No code changes to hot paths. Pure contract + tooling.

ID	Item	Source	Effort	Why First
W0.1	Dual-form signature convention (`_into` + Vec wrapper + `_ptr`)	R1.1	1d	Fixes root cause of PP-15 WS-1/2/4 ❌. All future kernels follow this shape.
W0.2	Unified HpcError enum (`src/hpc/error.rs`)	R1.2	4h	Replaces 4 error conventions with 1.
W0.3	`#![deny(warnings)]` → targeted denies	R1.3	1h	Unblocks doctest authoring for all subsequent waves.
W0.4	Feature flag rename (`hpc-extras` → `research`, `backend-*` prefix)	R4.1	4h	Clean slate for feature-gated additions.
W0.5	Prelude module (`hpc::prelude::*`)	R1.4	1h	Single discovery surface for downstream consumers.

Gate: All downstream crates (burn, candle, tract, ort, lance-graph) still compile unchanged. The dual-form is additive; existing call sites continue working.

Wave 1 — SIMD Consumer Primitives (3 days) ← lance-graph P0

From lance-graph W1a consumer contract. Five primitives blocking 158 raw-intrinsic violations across 5 lance-graph crates. Each is a tight-scope PR. Parallel review.

The lance-graph simd-savant agent runs PRE-MERGE against each PR.

ID	Item	Source	Effort	Consumer Site
W1.1	`I8x16::from_i4_packed_u64()` + `batch_packed_i4_16()`	W1a-#1	1d	`lance-graph-contract/src/mul.rs::i4_eval::batch` — 5 batch fns over `QualiaI4_16D(u64)`. Closes PR #398 codex P1 (NEON OOB at `len==2`).
W1.2	`I8x16::saturating_abs()` + `I8x32::saturating_abs()`	W1a-#2	4h	`lance-graph-contract/src/mul.rs` — Direction-B fix. `abs(i8::MIN)` must return `i8::MAX`, not wrap. See §VPABSB correction.
W1.3	`U16x8::gather_u16()` + `palette_lookup_u8x8()`	W1a-#3	4h	`bgz17/src/simd.rs:88` — currently inlines `_mm256_i32gather_epi32` (AP-SIMD-1 violation).
W1.4	`prefetch_read_t0/t1/t2()`	W1a-#4	2h	`bgz17/src/prefetch.rs:96,100` — currently inlines `_mm_prefetch` directly.
W1.5	`U64x8::popcnt()` + `U64x8::xor_popcount()`	W1a-#5	4h	`holograph/hamming.rs` + `lance-graph/graph/blasgraph/types.rs` — inline `_mm512_popcnt_epi64`.

Per-primitive implementation contract

Every PR MUST:

Impl in simd_avx512.rs + simd_neon.rs + scalar fallback.
Edge-case semantics documented (i8::MIN, empty slices, OOB indices).
Parity test: all backends produce identical output on randomized corpus.
Bench against scalar: speedup ratios in PR body.
// SAFETY: on every unsafe block.
No is_x86_feature_detected!. No #[target_feature(enable=...)].
Consumer site cited in PR description.

W1.1 — I8x16::from_i4_packed_u64 (nibble unpack + sign extend)

impl I8x16 {
    /// Unpack 16 signed i4 nibbles from a u64 into 16 i8 lanes (sign-extended).
    /// Nibble layout: lane[i] = sign_extend_4((packed >> (4*i)) & 0xf).
    pub fn from_i4_packed_u64(packed: u64) -> Self;

    /// Const-folded lane extract.
    pub fn lane_i8<const N: usize>(self) -> i8;
}

/// Closure-parameterized batch: run `f` over each (unpacked_i8x16, aux[i]) pair.
/// Bounds-aware tail handling; scalar fallback on unsupported arch.
pub fn batch_packed_i4_16<E, F>(packed: &[u64], aux: &[i8], out: &mut [E], f: F)
where F: Fn(I8x16, i8) -> E + Sync + Send, E: Copy;

AVX-512: _mm_cvtsi64_si128 + nibble shuffle via VPSHUFB mask LUT, then sign-extend. Alt: PDEP (_pdep_u64 × 2) into two u64 halves → load → vpmovsxbw. Bench both on Zen4 + SPR. NEON: vld1_u8 8 bytes → vshl_n_s8(v, 4) + vshr_n_s8(v, 4) for nibble split + sign-extend. Scalar: ((packed >> (4*i)) & 0xf) as i8, with if x > 7 { x - 16 } else { x } for sign-extend.

The batch_packed_i4_16 closure-batch is the foundation for W1.5-#7 (randomized signatures).

W1.2 — I8x16::saturating_abs (VPABSB correction)

Critical: _mm512_abs_epi8 does NOT saturate i8::MIN. Returns 0x80 (= -128).

impl I8x16 { pub fn saturating_abs(self) -> Self; }
impl I8x32 { pub fn saturating_abs(self) -> Self; }

AVX-512: _mm512_min_epu8(_mm512_abs_epi8(x), _mm512_set1_epi8(0x7f)). NEON: vqabsq_s8(x) — hardware-saturating (the q suffix). Scalar: i8::saturating_abs() (stdlib, well-defined).

Mandatory test:

let input = I8x16::splat(i8::MIN);
let result = input.saturating_abs();
assert_eq!(result.lane_i8::<0>(), i8::MAX);  // 127, NOT -128

W1.3 — U16x8::gather_u16 (palette lookup)

impl U16x8 {
    /// Gather 8 u16 values from `table` at given indices.
    /// Debug: panics if max(indices) >= table.len(). Release: scalar fallback.
    pub fn gather_u16(indices: U16x8, table: &[u16]) -> Self;
}
pub fn palette_lookup_u8x8(idx_v: U16x8, lut: &[u8]) -> U8x8;

Palette-256 is the dominant use case — every index fits in u8, table is always 256 or 512 bytes. No bounds risk at palette-256 widths. The gather_u16 API must handle arbitrary tables too, but palette-256 should be the fast path (no OOB possible when table.len() == 256 and indices are u8-sourced).

Codex P2 fix (gather_u16 OOB): _mm256_i32gather_epi32 reads 4 bytes per slot from a &[u16] table — overreads 2 bytes at table[len-1]. For palette-256 this is harmless (256 × 2 = 512 bytes, 4-byte read at index 255 reads bytes 510-513, which is within a 512-byte aligned allocation + padding). For arbitrary tables, use scalar fallback or pad the table allocation.

AVX-512: _mm512_i32gather_epi32 with index widening + mask to u16. NEON / Scalar: loop (0..8).map(|i| table[indices.lane(i)]).

W1.4 — Prefetch hints (cross-arch)

pub fn prefetch_read_t0(ptr: *const u8);  // L1
pub fn prefetch_read_t1(ptr: *const u8);  // L2
pub fn prefetch_read_t2(ptr: *const u8);  // L3

x86: _mm_prefetch(ptr, _MM_HINT_T0/T1/T2). aarch64: prfm pldl1keep/pldl2keep/pldl3keep. Other: no-op (prefetch is a hint; silent no-op is correct per ISA).

W1.5 — U64x8::popcnt (lane-wise popcount)

impl U64x8 {
    /// Lane-wise population count. Each lane returns its u64 bit-count (0..=64).
    pub fn popcnt(self) -> Self;
    /// XOR + lane-wise popcount + horizontal sum. Optimized for Hamming distance.
    pub fn xor_popcount(self, other: Self) -> u64;
}
impl U64x4 { pub fn popcnt(self) -> Self; }  // NEON/scalar parity

AVX-512 VPOPCNTDQ: _mm512_popcnt_epi64 directly (feature avx512vpopcntdq — available on sapphirerapids, enabled by the target-cpu pin). NEON popcount per-u64: vcntq_u8 → vpaddlq_u8 → vpaddlq_u16 → vpaddlq_u32 (NOT vaddvq_u8 which merges ALL lanes to a single scalar — Codex P2 fix). Scalar: u64::count_ones() fused loop.

W1.5+ — Deferred primitives (gated on sigker certification)

Three more primitives queued behind lance-graph:crates/sigker certification (Hambly-Lyons 2010 path-signature uniqueness). No code needed today. Listed so W1a additions are designed broad enough to compose with these later.

ID	Primitive	Gate
W1.5-#6	`signature_pde_sweep()` — Goursat PDE for `〈S(X), S(Y)〉`	`jc Pillar 11` activation
W1.5-#7	Randomized projection (Cuchiero-Schmocker-Teichmann 2021)	sigker bench at production widths
W1.5-#8	Lyndon-pack log-signature (7-13× compression) on `I16x16`	sigker bench at production widths

The closure-batch shape introduced in W1.1 is the foundation for W1.5-#7.

Gate: All 5 primitives pass parity tests across AVX-512/NEON/scalar. lance-graph simd-savant agent verifies compliance PRE-MERGE. Consumer-side migration PRs (#398-#402) unblocked.

Wave 2 — Codegen Macros (2 days)

Eliminates copy-paste before adding new code. Every subsequent wave benefits. Wave 1 primitives become the first consumers of these macros (retrofit optional).

ID	Item	Source	Effort	What It Produces
W2.1	Dtype-parity macro (`reductions_for!`)	R3.1	1d	One line = 7 reductions for a dtype. Cuts 700→150 lines.
W2.2	Per-arch impl macro (`simd_impl!`)	R3.2	4h	Generates the struct + methods for a type in both simd_avx512.rs and simd_neon.rs from one body.
W2.3	Reduction kernel template (`reduce_simd()`)	R3.3	4h	Generic chunk-loop; sum/max/nrm2 become 5-line callers.
W2.4	Dual-form fusion (`kernel_simd_dual!`)	R3.4	1d	One body → `_into`, Vec, `_ptr`, all arch variants.

Gate: cargo test -p ndarray passes. Macro output matches existing hand-rolled functions. Run benches to verify no regression.

Wave 3 — Type Bridges (2 days)

From REFACTOR_HPC_INTEGRATION.md Tier 1. Makes domain types ndarray-native.

ID	Item	Source	Effort	Impact
W3.1	Fingerprint ↔ ArrayView1	Tier 1.1	4h	Zero-copy bridge, unblocks Zip/broadcast on fingerprints
W3.2	VsaVector ↔ ArrayView1	Tier 1.2	2h	Same pattern
W3.3	BF16/F16 → BlasFloat	Tier 1.3	4h	Enables `Array<BF16>.dot()`, unlocks BLAS dispatch
W3.4	CogRecord channel_as_words()	Tier 1.4	2h	ArrayView1 per channel
W3.5	Arrow bridge → ArrayView factories	Tier 4.1	2h	`binary_column_view()` returns ArrayView2

Gate: New From/Into impls compile. Existing code unchanged. Add tests for each bridge.

Wave 4 — Extension Traits (3 days)

From REFACTOR_HPC_INTEGRATION.md Tier 2. hpc operations feel native on ndarray types.

ID	Item	Source	Effort	Impact
W4.1	HdcOps trait (K0/K1/K2 on Array1)	Tier 2.1	4h	`query.cascade_distance(&candidate, &gate)`
W4.2	Quantize trait (Array → BF16/I8 with Array output)	Tier 2.2	4h	`array.to_bf16()`, `array.to_i8_symmetric()`
W4.3	Prefilter on ArrayView2	Tier 2.3	4h	Eliminates `(slice, rows, cols)` pattern
W4.4	ActivationsNd (softmax_axis, log_softmax_axis)	Tier 2.4	4h	`batch.softmax_axis(Axis(1))` for inference
W4.5	SimdMath trait (VML on any Array)	Tier 2.5	4h	`array.simd_exp()` — 10-50x for transcendentals
W4.6	Int8Matmul trait (Array2 × Array2 → Array2)	Tier 3.3	4h	Quantized inference via ndarray types

Gate: Each trait has tests matching the raw-slice function's test suite. Benchmark: trait overhead vs. direct raw-slice call ≤ 1%.

Wave 5 — Backend Wiring (2 days)

From REFACTOR_HPC_INTEGRATION.md Tier 3. Core ndarray operations silently accelerate.

ID	Item	Source	Effort	Impact
W5.1	Delete duplicate detection code (simd_caps + simd_dispatch → dead code under cfg pin)	Tier 3.2	4h	Deletes 877 lines that are unreachable when target-cpu is pinned
W5.2	Core sum/mean → SIMD dispatch	Tier 3.1	4h	16x faster `.sum()` on contiguous f32/f64
W5.3	SIMD axis reductions (sum_axis with SIMD lanes)	Tier 6.1	1d	ML training hot path

Note on W5.1: With target-cpu=sapphirerapids in .cargo/config.toml, the cfg(target_feature = "avx512f") branch in simd.rs is the only live path. hpc/simd_caps.rs (515 lines) and hpc/simd_dispatch.rs (362 lines) exist for a multi-binary world that doesn't apply when the target is pinned. These can be deleted or feature-gated behind cfg(not(target_feature = "avx512f")) for CI fallback builds.

Gate: cargo bench shows measurable improvement on contiguous arrays. Non-contiguous arrays unchanged (fallback to generic fold).

Wave 6 — SoA Kernel Architecture (1 week)

From SOA_KERNEL_ARCHITECTURE.md. The genius-level structural change. Wave 1's U64x8::popcnt() and U64x8::xor_popcount() are direct prerequisites for the columnar XOR+popcount scan in W6.2.

ID	Item	Source	Effort	Impact
W6.1	TieredDatabase struct (K0/K1/K2 as F-order Array2)	SoA §5	1d	Physical separation matching cache hierarchy
W6.2	k0_columnar_simd (scan N u64s with broadcast+XOR+popcount)	SoA §1	1d	Uses `U64x8::xor_popcount` from W1.5. 8 candidates per cycle.
W6.3	Bitmask survivor propagation (mask narrowing, no branching)	SoA §2	4h	Replace per-candidate `continue` with mask ops
W6.4	k1_accumulate_columnar (8-column scan with mask)	SoA §1	4h	K1 on contiguous columns
W6.5	k2_exact_masked (row extraction for survivors only)	SoA §1	4h	Stack-copy 2KB per survivor
W6.6	BF16FieldDatabase (sign/exp/mantissa separated at ingest)	SoA §6	1d	Awareness without runtime decomposition
W6.7	QualiaColumns (16 × Array1)	SoA §7	4h	18x dimension scan throughput. Uses `I8x16::from_i4_packed_u64` from W1.1.
W6.8	Arrow native path (RecordBatch columns → columnar scan)	SoA §4	4h	Zero-copy from Lance/Arrow
W6.9	Benchmark: AoS vs SoA on 1M SKU-16K containers	SoA §perf	4h	Proof of 4-8x claim

Gate: SoA cascade produces identical results to AoS cascade on full test suite. Benchmark confirms ≥3x throughput improvement on 1M containers.

Wave 7 — Namespace Restructure (3 days)

From transformer feedback R2.1. Enforces the architecture rule from CLAUDE.md.

ID	Item	Source	Effort	Impact
W7.1	Split hpc/ → hpc/ + cog/ + ext/ + io/	R2.1	2-3d	30 numeric modules in hpc/, 35 cognitive in cog/, 20 experimental in ext/, 6 I/O in io/
W7.2	SIMD directory consolidation (simd_*.rs → src/simd/)	R2.2	1d	One directory for all SIMD code. W1 primitives land in new locations.
W7.3	Quantized module split (quantized.rs → hpc/quant/)	R2.3	4h	BlockQ4_0 packed struct for candle compat

Gate: pub use deprecation shims for all moved modules. All existing use ndarray::hpc::* paths still resolve (with deprecation warning).

Wave 8 — Test & Bench Infrastructure (2 days)

ID	Item	Source	Effort	Impact
W8.1	Extract integration tests to `tests/hpc/`, `tests/cog/`	R4.2	2d	Module files lose 30-50% of line count
W8.2	HPC bench harness (dot, reductions, softmax, quantized)	R4.5	1d	Reproducible perf claims
W8.3	Cross-comparison bench (ndarray vs candle vs tract kernel)	R4.5 ext	4h	The whole reason for B1
W8.4	W1a primitive parity bench (AVX-512 vs NEON vs scalar speedup)	lance-graph	4h	Acceptance criteria §4

Wave 9 — Version Bump & Downstream Pin (1 day)

ID	Item	Source	Effort	Impact
W9.1	Tag v0.18.0	R4.4	2h	Clean cut
W9.2	CHANGELOG with migration guide	R4.4	2h	Downstream knows what changed
W9.3	Downstream Cargo.toml pins (burn, candle, tract, ort, lance-graph)	R4.4	2h	tag = "v0.18.0"
W9.4	Remove deprecation shims (schedule for v0.19)	R2.1	—	Future cleanup

Dependency Graph

Wave 0 (conventions)
  │
  ├──→ Wave 1 (SIMD primitives) ←── P0 for lance-graph
  │       │
  │       ├──→ Wave 6.2 (k0_columnar uses U64x8::xor_popcount from W1.5)
  │       ├──→ Wave 6.7 (QualiaColumns uses I8x16::from_i4_packed from W1.1)
  │       └──→ Wave 8.4 (primitive parity benches)
  │
  ├���─→ Wave 2 (codegen macros)
  │       │
  │       └──→ Wave 5 (backend wiring uses macros)
  │
  ├──→ Wave 3 (type bridges)
  │       │
  │       ├──→ Wave 4 (extension traits need bridges)
  │       ├���─→ Wave 5 (backend dispatch needs BlasFloat for BF16)
  │       └──→ Wave 6 (SoA needs Fingerprint↔Array, Arrow views)
  │
  ├──→ Wave 7 (namespace) — independent of 1-6
  │
  └──→ Wave 8 (tests/benches) — after 1+5+6
          │
          └──→ Wave 9 (release)

Critical Path

For lance-graph (P0 consumer): 0 → 1 = 6 days. Consumer migration PRs unblocked.

For full release: 0 → 1 → 3 → 5 → 6 �� 8 → 9 = 18 days serial. With parallelism (1∥2, 3∥4, 6∥7, 8∥backlog): ~12 working days.

How Wave 1 Feeds Everything Downstream

The 5 SIMD primitives aren't isolated additions — they're load-bearing for later waves:

Primitive	Immediate consumer (lance-graph)	Later consumer (ndarray)
`I8x16::from_i4_packed_u64`	`mul.rs::i4_eval::batch` (5 batch fns)	W6.7 QualiaColumns — unpack i4 qualia from packed storage without scalar loop
`I8x16::saturating_abs`	Direction-B fix, ValleyOfDespair classifier	W4.2 Quantize trait — safe abs in quantization error metrics
`U16x8::gather_u16`	`bgz17/simd.rs` palette lookup	W6.6 BF16FieldDatabase �� gather exponent fields from lookup table
`prefetch_read_t0/t1/t2`	`bgz17/prefetch.rs` tile prefetch	W6.2 k0_columnar_simd — prefetch next column chunk during K0 scan
`U64x8::popcnt` + `xor_popcount`	`holograph/hamming.rs` + `blasgraph/types.rs`	W6.2-W6.5 entire SoA cascade — columnar XOR+popcount is THE operation

The SoA cascade (Wave 6) is fundamentally built on U64x8::xor_popcount(). Without Wave 1, the SoA implementation would need to drop back to hpc::bitwise::hamming_distance_raw() on slices, losing the typed-wrapper discipline and duplicating dispatch logic.

The SoA Integration with All Waves

Wave 6 needs	Provided by
`U64x8::xor_popcount()` for columnar scan	Wave 1.5 (SIMD primitives)
`I8x16::from_i4_packed_u64()` for qualia	Wave 1.1 (SIMD primitives)
`prefetch_read_t0()` for column prefetch	Wave 1.4 (SIMD primitives)
F-order Array2 database	Wave 3 (type bridges — Fingerprint converts to Array)
Impl macro for k0_columnar_simd	Wave 2 (codegen macros)
`_into` form for columnar kernels	Wave 0 (signature convention)
Extension trait: `database.cascade_soa(&query, &gate)`	Wave 4 (HdcOps trait extended)
BF16FieldDatabase uses Quantize trait	Wave 4 (Quantize extension)
Arrow columns → direct scan	Wave 3 (Arrow view factories)
Benchmark harness to prove 4-8x	Wave 8 (bench infrastructure)

The Meta-Level SoA Observation

The SoA insight goes beyond just "transpose the database." It's a design principle that applies recursively across the entire surface:

Level 1: Database layout (SOA_KERNEL_ARCHITECTURE.md)

Fingerprint database: column-per-word → cascade becomes column scan
BF16 fields: separated sign/exp/mantissa → awareness is native layout
Qualia: column-per-dimension → per-dim queries are sequential scans

Level 2: API surface

The dual-form convention (W0.1) IS SoA thinking applied to function signatures:
- AoS API: fn foo(input) -> output (allocates inside, hides layout)
- SoA API: fn foo_into(input, output) (caller controls memory layout)

Level 3: SIMD primitives (lance-graph W1a)

The typed-wrapper-as-method convention IS SoA for the SIMD surface:
- AoS: raw intrinsics scattered across consumer crates (158 violations)
- SoA: struct methods on typed wrappers, consumers import ONE namespace
The closure-batch pattern (W1.1 batch_packed_i4_16) IS SoA for computation:
- AoS: each consumer writes its own unpack+process+pack loop
- SoA: one batch primitive accepts a closure, owns chunking/tailing/dispatch

Level 4: Module structure (W7.1 namespace split)

AoS: one module (hpc/) contains everything → grep to find anything
SoA: separated by concern (hpc/, cog/, ext/, io/) → contiguous columns

Level 5: ndarray's type system

Array2<u64> in F-order IS the SoA database
.column(i) IS the field-access pattern
.row(i) IS the record-access pattern
Strides encode the duality without duplication

Level 6: Arrow + Lance alignment

Arrow IS columnar storage (SoA by definition)
Lance stores Arrow columns on disk
The SoA kernel reads Arrow columns directly — no ETL, no transpose, no copy
Storage format = memory layout = compute pattern = cache access order

Level 7: The consumer contract itself

lance-graph declares what it needs (W1a primitives), ndarray provides
The contract is columnar: one primitive per concern, not a monolithic "SIMD library"
Each primitive is independently testable, deployable, and benchmarkable

The architecture collapses all intermediate transformations at every level.

Effort Summary

Wave	Days	Parallel?	Key Outcome
0	3	Yes	Conventions locked
1	3	Yes (after 0)	lance-graph unblocked (P0)
2	2	∥ with 1	Codegen macros ready
3	2	After 0	Domain types ↔ Array
4	3	After 3	Extension traits live
5	2	After 2+3	Core ops accelerated
6	5	After 1+3+5	SoA cascade (the big win)
7	3	∥ with 4-6	Namespace clean
8	2	After 1+6	Proof via benchmarks
9	1	After all	Release

lance-graph unblocked: Wave 0+1 = 6 days. Full release critical path: 18 days serial, ~12 days with parallelism.

What NOT to Do

Don't delete hpc/ modules — raw-slice functions stay (FFI, embedded, zero-overhead)
Don't premature-v1.0 — surface is too young; candle/tract wiring will shake it again
Don't gate SoA behind a feature flag — it's the default hot path, not optional
Don't couple SoA with module restructure — they're independent; merge separately
Don't break downstream in one shot — deprecation shims for one release minimum
Don't ship W1a primitives without parity tests — the codex P2 i8::MIN divergence happened because no test existed
Don't implement W1.5+ (deferred primitives) until sigker certification — they're gated

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unified Refactor Sequence: ndarray Fork

Known constraints (from prior sessions)

Wave 0 — Conventions & Foundations (3 days)

Wave 1 — SIMD Consumer Primitives (3 days) ← lance-graph P0

Per-primitive implementation contract

W1.1 — I8x16::from_i4_packed_u64 (nibble unpack + sign extend)

W1.2 — I8x16::saturating_abs (VPABSB correction)

W1.3 — U16x8::gather_u16 (palette lookup)

W1.4 — Prefetch hints (cross-arch)

W1.5 — U64x8::popcnt (lane-wise popcount)

W1.5+ — Deferred primitives (gated on sigker certification)

Wave 2 — Codegen Macros (2 days)

Wave 3 — Type Bridges (2 days)

Wave 4 — Extension Traits (3 days)

Wave 5 — Backend Wiring (2 days)

Wave 6 — SoA Kernel Architecture (1 week)

Wave 7 — Namespace Restructure (3 days)

Wave 8 — Test & Bench Infrastructure (2 days)

Wave 9 — Version Bump & Downstream Pin (1 day)

Dependency Graph

Critical Path

How Wave 1 Feeds Everything Downstream

The SoA Integration with All Waves

The Meta-Level SoA Observation

Level 1: Database layout (SOA_KERNEL_ARCHITECTURE.md)

Level 2: API surface

Level 3: SIMD primitives (lance-graph W1a)

Level 4: Module structure (W7.1 namespace split)

Level 5: ndarray's type system

Level 6: Arrow + Lance alignment

Level 7: The consumer contract itself

Effort Summary

What NOT to Do

FilesExpand file tree

UNIFIED_REFACTOR_SEQUENCE.md

Latest commit

History

UNIFIED_REFACTOR_SEQUENCE.md

File metadata and controls

Unified Refactor Sequence: ndarray Fork

Known constraints (from prior sessions)

Wave 0 — Conventions & Foundations (3 days)

Wave 1 — SIMD Consumer Primitives (3 days) ← lance-graph P0

Per-primitive implementation contract

W1.1 — I8x16::from_i4_packed_u64 (nibble unpack + sign extend)

W1.2 — I8x16::saturating_abs (VPABSB correction)

W1.3 — U16x8::gather_u16 (palette lookup)

W1.4 — Prefetch hints (cross-arch)

W1.5 — U64x8::popcnt (lane-wise popcount)

W1.5+ — Deferred primitives (gated on sigker certification)

Wave 2 — Codegen Macros (2 days)

Wave 3 — Type Bridges (2 days)

Wave 4 — Extension Traits (3 days)

Wave 5 — Backend Wiring (2 days)

Wave 6 — SoA Kernel Architecture (1 week)

Wave 7 — Namespace Restructure (3 days)

Wave 8 — Test & Bench Infrastructure (2 days)

Wave 9 — Version Bump & Downstream Pin (1 day)

Dependency Graph

Critical Path

How Wave 1 Feeds Everything Downstream

The SoA Integration with All Waves

The Meta-Level SoA Observation

Level 1: Database layout (SOA_KERNEL_ARCHITECTURE.md)

Level 2: API surface

Level 3: SIMD primitives (lance-graph W1a)

Level 4: Module structure (W7.1 namespace split)

Level 5: ndarray's type system

Level 6: Arrow + Lance alignment

Level 7: The consumer contract itself

Effort Summary

What NOT to Do