| 1 | +# Cognitive Shader Foundation — ndarray's Role in the 7-Layer Stack |
| 2 | + |
| 3 | +> READ BY: all ndarray agents (arm-neon-specialist, cascade-architect, |
| 4 | +> cognitive-architect, l3-strategist, migration-tracker, product-engineer, |
| 5 | +> savant-architect, sentinel-qa, truth-architect, vector-synthesis) |
| 6 | +> |
| 7 | +> Parallel doc in lance-graph: `.claude/knowledge/cognitive-shader-architecture.md` |
| 8 | + |
| 9 | +## ndarray's Role: Layer 0 + parts of Layer 1 |
| 10 | + |
| 11 | +ndarray is the HARDWARE FOUNDATION of the cognitive shader stack. |
| 12 | +It provides the primitives that Layer 1 (BindSpace columns) and higher |
| 13 | +layers build on. ndarray never depends on lance-graph or any cognitive |
| 14 | +crate — the dependency is one-way. |
| 15 | + |
| 16 | +``` |
| 17 | +Layer 6: LanceDB (cold persistence) ← lance-graph |
| 18 | +Layer 5: GPU/APU meta ops (optional) ← future |
| 19 | +Layer 4: Planner strategies (16-19) ← lance-graph-planner |
| 20 | +Layer 3: CollapseGate write protocol ← ndarray (enum) + contract |
| 21 | +Layer 2: CognitiveShader dispatch ← p64-bridge |
| 22 | +Layer 1: BindSpace columns + multi-lane ← ndarray + contract |
| 23 | +Layer 0: SIMD primitives ← ndarray (THIS CRATE) |
| 24 | +``` |
| 25 | + |
| 26 | +## Public Surface: `ndarray::simd::*` |
| 27 | + |
| 28 | +All consumers import from `ndarray::simd::*`, NOT from `ndarray::hpc::*`. |
| 29 | +The hpc/ paths are private implementation detail. The simd/ module is the |
| 30 | +stable public API. |
| 31 | + |
| 32 | +Items (types and functions) that MUST be in `ndarray::simd::*`: |
| 33 | +- `F32x16, F64x8, U8x64, F16x32, U64x8, I16x32, I8x64` |
| 34 | +- `Fingerprint<N>` — const-generic, N×64 bits |
| 35 | +- `MultiLaneColumn<T>` — same bytes, multiple SIMD lane views |
| 36 | +- `array_window(data, N)` — aligned batch iterator |
| 37 | +- `VectorWidth, vector_config()` — the LazyLock width singleton |
| 38 | +- `CollapseGate` — Flow/Block/Hold enum (exists in hpc/bnn_cross_plane) |
| 39 | + |
| 40 | +If an item isn't in `ndarray::simd::*`, consumers can't use it. |
| 41 | +This keeps the API surface small, and internal refactors in hpc/ |
| 42 | +don't break downstream consumers. |
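The facade rule above can be pictured with a small, self-contained sketch. The module names `hpc_sketch` and `simd_sketch` and the helper `gate_name` are hypothetical stand-ins, not the crate's real layout; only the `Flow` / `Block` / `Hold` variant names come from this document.

```rust
// Hypothetical stand-in for the private implementation tree (`hpc/` in the text).
mod hpc_sketch {
    pub mod bnn_cross_plane {
        /// Three-state gate named in the text; variant semantics are not modeled here.
        #[derive(Debug, Clone, Copy, PartialEq, Eq)]
        pub enum CollapseGate {
            Flow,
            Block,
            Hold,
        }
    }
}

// Hypothetical stand-in for the stable facade (`simd` in the text).
pub mod simd_sketch {
    // If `CollapseGate` moves inside `hpc_sketch`, only this line changes.
    pub use super::hpc_sketch::bnn_cross_plane::CollapseGate;
}

/// Downstream-style helper that names only the facade path.
pub fn gate_name(gate: simd_sketch::CollapseGate) -> &'static str {
    match gate {
        simd_sketch::CollapseGate::Flow => "Flow",
        simd_sketch::CollapseGate::Block => "Block",
        simd_sketch::CollapseGate::Hold => "Hold",
    }
}
```

Because `gate_name` never spells out the `hpc_sketch` path, relocating the enum is invisible to it, which is the property the public-surface rule is after.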
| 43 | + |
| 44 | +## What Layer 0 Provides |
| 45 | + |
| 46 | +### SIMD Primitives (hardware abstraction) |
| 47 | +- Runtime dispatch: `simd_caps()` frozen singleton |
| 48 | +- AVX-512 (F32x16, VPOPCNTDQ, VPGATHERDD) |
| 49 | +- AVX2 + FMA |
| 50 | +- NEON (A53 / A72 / A76 dotprod tiers) |
| 51 | +- AMX via `asm!(".byte ...")` — TDPBF16PS, TDPBUSD |
| 52 | +- F16C hardware conversion |
| 53 | +- BF16 bit-exact RNE matching VCVTNEPS2BF16 |
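As an illustration of the "runtime dispatch, frozen singleton" item in the list above, here is a minimal sketch built on `std::sync::OnceLock` and the standard feature-detection macros. `CapsSketch`, `caps_sketch()`, and `pick_kernel()` are hypothetical names; the real `simd_caps()` signature and field set are not given in this document and may differ.

```rust
use std::sync::OnceLock;

/// Hypothetical capability record; the field set is illustrative only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct CapsSketch {
    pub avx512f: bool,
    pub avx2_fma: bool,
    pub neon: bool,
}

#[cfg(target_arch = "x86_64")]
fn detect() -> CapsSketch {
    CapsSketch {
        avx512f: std::arch::is_x86_feature_detected!("avx512f"),
        avx2_fma: std::arch::is_x86_feature_detected!("avx2")
            && std::arch::is_x86_feature_detected!("fma"),
        neon: false,
    }
}

#[cfg(target_arch = "aarch64")]
fn detect() -> CapsSketch {
    CapsSketch {
        avx512f: false,
        avx2_fma: false,
        neon: std::arch::is_aarch64_feature_detected!("neon"),
    }
}

#[cfg(not(any(target_arch = "x86_64", target_arch = "aarch64")))]
fn detect() -> CapsSketch {
    CapsSketch { avx512f: false, avx2_fma: false, neon: false }
}

/// Detection runs once; every later call returns the same frozen record.
pub fn caps_sketch() -> &'static CapsSketch {
    static CAPS: OnceLock<CapsSketch> = OnceLock::new();
    CAPS.get_or_init(detect)
}

/// Picks the widest tier that the frozen record reports as available.
pub fn pick_kernel() -> &'static str {
    let caps = caps_sketch();
    if caps.avx512f {
        "avx512"
    } else if caps.avx2_fma {
        "avx2"
    } else if caps.neon {
        "neon"
    } else {
        "scalar"
    }
}
```

Freezing the result keeps the dispatch decision stable for the life of the process, so hot loops can branch on a plain field read instead of re-querying the CPU.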
| 54 | + |
| 55 | +### Fingerprint<N> — the BindSpace atom |
| 56 | +- `[u64; N]` backing, 64-byte aligned |
| 57 | +- `get/set/toggle_bit`, `bind` (XOR), `and`, `not` |
| 58 | +- `hamming_distance` via SIMD popcount |
| 59 | +- `popcount`, `density` |
| 60 | +- `random` (xorshift128+), `from_content` (hash expansion) |
| 61 | +- `permute` (circular bit shift for sequence encoding) |
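A minimal sketch of the data layout and a few of the operations listed above, assuming a plain scalar implementation. `FingerprintSketch` is a hypothetical name, the popcount here uses `u64::count_ones` rather than a SIMD path, and `random`, `from_content`, and `permute` are left out.

```rust
/// Hypothetical stand-in for `Fingerprint<N>`: N words of 64 bits, 64-byte aligned.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(align(64))]
pub struct FingerprintSketch<const N: usize> {
    words: [u64; N],
}

impl<const N: usize> FingerprintSketch<N> {
    /// All-zero fingerprint.
    pub fn zero() -> Self {
        Self { words: [0; N] }
    }

    /// Sets bit `i`, counting from bit 0 of word 0. Panics if `i >= N * 64`.
    pub fn set_bit(&mut self, i: usize) {
        self.words[i / 64] |= 1u64 << (i % 64);
    }

    /// Reads bit `i`. Panics if `i >= N * 64`.
    pub fn get_bit(&self, i: usize) -> bool {
        (self.words[i / 64] >> (i % 64)) & 1 == 1
    }

    /// `bind` is word-wise XOR, so binding twice with the same key is the identity.
    pub fn bind(&self, other: &Self) -> Self {
        let mut out = *self;
        for (o, w) in out.words.iter_mut().zip(other.words.iter()) {
            *o ^= *w;
        }
        out
    }

    /// Number of set bits across all words.
    pub fn popcount(&self) -> u32 {
        self.words.iter().map(|w| w.count_ones()).sum::<u32>()
    }

    /// Hamming distance is the popcount of the XOR of the two fingerprints.
    pub fn hamming_distance(&self, other: &Self) -> u32 {
        self.bind(other).popcount()
    }
}
```

Expressing `hamming_distance` as `bind` followed by `popcount` mirrors the list above: XOR marks the differing bits, and the popcount (SIMD in the real crate, per the text) counts them.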
| 62 | + |
| 63 | +### MultiLaneColumn — same object, multiple SIMD widths |
| 64 | +- One `Arc<[u8]>` backing store |
| 65 | +- View as U8x64 / F16x32 / F32x16 / F64x8 without copy |
| 66 | +- Consumer picks lane width per operation |
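Since this type does not exist yet, the following is only one possible shape for the idea, not a proposal the document commits to. It deviates from the `Arc<[u8]>` backing named above: the sketch stores `Arc<[Block64]>` (a hypothetical 64-byte-aligned block type) so that the alignment needed for the `f32` view is guaranteed by the type rather than checked at runtime. `MultiLaneSketch`, `as_u8`, and `as_f32` are likewise hypothetical names.

```rust
use std::sync::Arc;

/// One 64-byte block with 64-byte alignment (hypothetical helper type).
#[derive(Clone, Copy)]
#[repr(C, align(64))]
pub struct Block64(pub [u8; 64]);

/// Hypothetical column: one shared allocation, several typed views over it.
pub struct MultiLaneSketch {
    blocks: Arc<[Block64]>,
}

impl MultiLaneSketch {
    pub fn from_blocks(blocks: Vec<Block64>) -> Self {
        Self { blocks: blocks.into() }
    }

    /// Byte view: 64 bytes per block, no copy.
    pub fn as_u8(&self) -> &[u8] {
        // SAFETY: Block64 is repr(C) over [u8; 64] with size 64 and no padding,
        // so the allocation is exactly `len * 64` initialized bytes.
        unsafe {
            std::slice::from_raw_parts(self.blocks.as_ptr() as *const u8, self.blocks.len() * 64)
        }
    }

    /// f32 view over the same bytes: 16 lanes per block, no copy.
    pub fn as_f32(&self) -> &[f32] {
        // SAFETY: 64-byte alignment satisfies f32's 4-byte alignment, every bit
        // pattern is a valid f32, and each 64-byte block holds exactly 16 of them.
        unsafe {
            std::slice::from_raw_parts(self.blocks.as_ptr() as *const f32, self.blocks.len() * 16)
        }
    }
}
```

Both views borrow from the same `Arc`, so a consumer can choose the lane width per operation while the bytes stay in place.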
| 67 | + |
| 68 | +### array_window — SIMD batch iterator |
| 69 | +- Yields N-aligned chunks from a slice |
| 70 | +- Zero-copy: window IS a `&[T]` view |
| 71 | +- One cascade level = one array_window pattern |
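One way to read the bullets above, assuming "N-aligned" means fixed-length windows of `N` items, is a thin wrapper over the standard `chunks_exact`. `array_window_sketch` is a hypothetical name, and returning the leftover tail alongside the windows is a choice made for this sketch only.

```rust
/// Hypothetical analogue of `array_window(data, N)`: full windows of `n` items,
/// plus the tail that does not fill a window. Panics if `n == 0`.
pub fn array_window_sketch<T>(data: &[T], n: usize) -> (std::slice::ChunksExact<'_, T>, &[T]) {
    let windows = data.chunks_exact(n);
    // The tail borrows from `data`, not from the iterator, so both can be returned.
    let tail = windows.remainder();
    (windows, tail)
}
```

Each yielded window is a `&[T]` borrowed from the input, so no element is copied; a caller would typically hand full windows to a SIMD kernel and process the tail with scalar code.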
| 72 | + |
| 73 | +### CollapseGate enum |
| 74 | +- `Flow` / `Block` / `Hold` (already in `hpc/bnn_cross_plane`) |
| 75 | +- Consumers (L3) extend with MergeMode (Xor/Bundle/Superposition) |
| 76 | + |
| 77 | +## What ndarray DOES NOT Provide |
| 78 | + |
| 79 | +These live UP the stack, not in ndarray: |
| 80 | +- BindSpace address types (lance-graph-contract) |
| 81 | +- CognitiveShader dispatch (p64-bridge) |
| 82 | +- Planner strategies (lance-graph-planner) |
| 83 | +- CausalEdge64 (causal-edge) |
| 84 | +- NARS inference (causal-edge + contract) |
| 85 | +- GGUF parsing (bgz-tensor / consumer) |
| 86 | + |
| 87 | +Keep ndarray free of cognitive logic. It's the foundation, not the cortex. |
| 88 | + |
| 89 | +## Current Gaps (next session targets) |
| 90 | + |
| 91 | +1. **MultiLaneColumn type doesn't exist yet** — add to `src/hpc/column.rs`, |
| 92 | + re-export from `src/simd.rs` |
| 93 | +2. **Fingerprint<N> missing `as_u8x64()`** — add SIMD view methods |
| 94 | +3. **simd.rs re-exports incomplete** — add Fingerprint, MultiLaneColumn, |
| 95 | + array_window, VectorWidth |
| 96 | +4. **VectorWidth LazyLock not consumed** — any module that serializes |
| 97 | + fingerprints should read it for width config |
| 98 | +5. **Hamming popcount hasn't been exposed via multi-lane view** — |
| 99 | + combine with MultiLaneColumn for the Layer 1 cascade path |
| 100 | + |
| 101 | +## Migration Tracking (from ladybug-rs) |
| 102 | + |
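| 103 | +ladybug-rs depended on `rustynum` as its HPC crate. rustynum was |
| 104 | +ported INTO this ndarray fork as `src/hpc/` (55 modules, 880 tests). |
| 105 | +Downstream consumers (lance-graph-cognitive, learning crate) still |
| 106 | +reference `rustynum_core::*` types. They need the substitutions below. The `ndarray::hpc::*` targets are implementation paths; per the public-surface rule, consumers should import through `ndarray::simd::*` wherever a re-export exists: |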
| 107 | + |
| 108 | +| ladybug-rs path (`rustynum_core::*` unless qualified) | ndarray equivalent | |
| 109 | +|---|---| |
| 110 | +| `Fingerprint` | `ndarray::simd::Fingerprint<256>` | |
| 111 | +| `hamming_distance` | `ndarray::hpc::bitwise::hamming_distance_raw` | |
| 112 | +| `simd_level` | `ndarray::hpc::simd_caps::simd_caps()` | |
| 113 | +| `cascade::*` | `ndarray::hpc::cascade::*` | |
| 114 | +| `bf16_*` | `ndarray::hpc::quantized::BF16` | |
| 115 | +| `rustynum_bnn::CollapseGate` | `ndarray::hpc::bnn_cross_plane::CollapseGate` | |
| 116 | +| `rustynum_holo::*` | `ndarray::hpc::holo::*` | |
| 117 | + |
| 118 | +**migration-tracker agent** owns this substitution table. |
| 119 | + |
| 120 | +## The Endgame (ndarray's view) |
| 121 | + |
| 122 | +Each token of LLM inference in the cognitive shader system runs: |
| 123 | + |
| 124 | +``` |
| 125 | +1. Read BindSpace column slice → &[u64; N] (Layer 1) |
| 126 | +2. Hamming popcount via SIMD dispatch → [u32; N] (Layer 0) |
| 127 | +3. Base17 L1 distance on survivors → [u16; M] (Layer 0) |
| 128 | +4. Palette table lookup (256×256) → [u8; K] (Layer 0) |
| 129 | +5. Gather via VPGATHERDD → [u8; K] (Layer 0) |
| 130 | +``` |
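Steps 2 and 4 of the list above, plus the survivor filter implied by step 3, can be sketched in scalar Rust. Everything here is an assumption made for illustration: the function names, the `max_dist` cutoff rule, and the reading of the palette as a flat 256×256 byte table indexed by a pair of codes. The Base17 L1 distance and the `VPGATHERDD` gather are not modeled.

```rust
/// Step 2: Hamming distance from `query` to every row (scalar popcount).
pub fn hamming_scan<const N: usize>(query: &[u64; N], rows: &[[u64; N]]) -> Vec<u32> {
    rows.iter()
        .map(|row| {
            row.iter()
                .zip(query.iter())
                .map(|(a, b)| (a ^ b).count_ones())
                .sum::<u32>()
        })
        .collect()
}

/// Indices of rows whose distance is at most `max_dist` (hypothetical cutoff rule).
pub fn survivors(dists: &[u32], max_dist: u32) -> Vec<usize> {
    dists
        .iter()
        .enumerate()
        .filter(|(_, d)| **d <= max_dist)
        .map(|(i, _)| i)
        .collect()
}

/// Step 4, read as a flat 256x256 byte table indexed by two codes (assumption).
pub fn palette_lookup(table: &[u8; 65536], a: u8, b: u8) -> u8 {
    table[((a as usize) << 8) | (b as usize)]
}
```

The whole path uses integer XOR, popcount, comparison, and table indexing only, which matches the "Zero FP. Zero matmul." claim for these steps.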
| 131 | + |
| 132 | +Steps 2-5 are all Layer 0, all ndarray. Zero FP. Zero matmul. 611M lookups/sec. |
| 133 | + |
| 134 | +The cognitive layers above coordinate WHICH columns to scan and HOW |
| 135 | +to combine results. ndarray just executes the primitives as fast as |
| 136 | +the hardware allows. Pi Zero to Sapphire Rapids, same API, same |
| 137 | +correctness, different throughput. |