Commit d66b4ba

Merge pull request #108 from AdaWorldAPI/claude/teleport-session-setup-wMZfb
feat(simd): cognitive shader re-exports + agent knowledge
2 parents 6d67de2 + 06dbae0 commit d66b4ba

2 files changed

Lines changed: 173 additions & 0 deletions

Lines changed: 137 additions & 0 deletions
@@ -0,0 +1,137 @@
# Cognitive Shader Foundation: ndarray's Role in the 7-Layer Stack

> READ BY: all ndarray agents (arm-neon-specialist, cascade-architect,
> cognitive-architect, l3-strategist, migration-tracker, product-engineer,
> savant-architect, sentinel-qa, truth-architect, vector-synthesis)
>
> Parallel doc in lance-graph: `.claude/knowledge/cognitive-shader-architecture.md`

## ndarray's Role: Layer 0 + parts of Layer 1

ndarray is the HARDWARE FOUNDATION of the cognitive shader stack.
It provides the primitives that Layer 1 (BindSpace columns) and higher
layers build on. ndarray never depends on lance-graph or any cognitive
crate; the dependency is one-way.

```
Layer 6: LanceDB (cold persistence)       ← lance-graph
Layer 5: GPU/APU meta ops (optional)      ← future
Layer 4: Planner strategies (16-19)       ← lance-graph-planner
Layer 3: CollapseGate write protocol      ← ndarray (enum) + contract
Layer 2: CognitiveShader dispatch         ← p64-bridge
Layer 1: BindSpace columns + multi-lane   ← ndarray + contract
Layer 0: SIMD primitives                  ← ndarray (THIS CRATE)
```

## Public Surface: `ndarray::simd::*`

All consumers import from `ndarray::simd::*`, NOT from `ndarray::hpc::*`.
The hpc/ paths are a private implementation detail. The simd/ module is the
stable public API.

Types that MUST be in `ndarray::simd::*`:
- `F32x16, F64x8, U8x64, F16x32, U64x8, I16x32, I8x64`
- `Fingerprint<N>`: const-generic, N×64 bits
- `MultiLaneColumn<T>`: same bytes, multiple SIMD lane views
- `array_window(data, N)`: aligned batch iterator
- `VectorWidth, vector_config()`: the LazyLock width singleton
- `CollapseGate`: Flow/Block/Hold enum (exists in hpc/bnn_cross_plane)

If a type isn't in `ndarray::simd::*`, consumers can't use it.
This keeps the API surface small, and internal refactors in hpc/ don't
break downstream crates.

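The facade rule above can be sketched with two modules. This is a minimal, self-contained illustration and not the crate's actual layout: the body of `popcount_raw` is a scalar stand-in, and `count_set_bits` is a hypothetical consumer. Only the `pub use` pattern mirrors what the text describes.

```rust
// Private implementation module: free to be reorganized at any time.
mod hpc {
    pub mod bitwise {
        /// Scalar stand-in (assumption) for a SIMD-dispatched popcount.
        pub fn popcount_raw(bytes: &[u8]) -> u64 {
            bytes.iter().map(|b| u64::from(b.count_ones())).sum()
        }
    }
}

// Stable public surface: consumers only ever name `simd::...`.
pub mod simd {
    pub use super::hpc::bitwise::popcount_raw;
}

/// Hypothetical consumer code: it goes through the facade, never through `hpc`.
pub fn count_set_bits(payload: &[u8]) -> u64 {
    simd::popcount_raw(payload)
}
```

If `hpc::bitwise` is later renamed or split, only the `pub use` line changes and `count_set_bits` keeps compiling.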
## What Layer 0 Provides

### SIMD Primitives (hardware abstraction)
- Runtime dispatch: `simd_caps()` frozen singleton
- AVX-512 (F32x16, VPOPCNTDQ, VPGATHERDD)
- AVX2 + FMA
- NEON (A53 / A72 / A76 dotprod tiers)
- AMX via `asm!(".byte ...")`: TDPBF16PS, TDPBUSD
- F16C hardware conversion
- BF16 bit-exact RNE matching VCVTNEPS2BF16

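A frozen capability singleton of the kind listed above might look like the sketch below. The `Caps` struct, `caps()` and `tier()` are hypothetical names; the real `simd_caps()` layout and tier list are not shown in this commit. The sketch uses `std::sync::OnceLock` and the standard feature-detection macros, so detection runs once and every later call reads the same frozen value.

```rust
use std::sync::OnceLock;

/// Hypothetical capability record (assumption, not the crate's real type).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Caps {
    pub avx512f: bool,
    pub avx2: bool,
    pub neon: bool,
}

#[cfg(target_arch = "x86_64")]
fn detect() -> Caps {
    Caps {
        avx512f: std::arch::is_x86_feature_detected!("avx512f"),
        avx2: std::arch::is_x86_feature_detected!("avx2"),
        neon: false,
    }
}

#[cfg(target_arch = "aarch64")]
fn detect() -> Caps {
    Caps {
        avx512f: false,
        avx2: false,
        neon: std::arch::is_aarch64_feature_detected!("neon"),
    }
}

#[cfg(not(any(target_arch = "x86_64", target_arch = "aarch64")))]
fn detect() -> Caps {
    Caps { avx512f: false, avx2: false, neon: false }
}

static CAPS: OnceLock<Caps> = OnceLock::new();

/// Runs detection on the first call only; later calls return the same reference.
pub fn caps() -> &'static Caps {
    CAPS.get_or_init(detect)
}

/// Picks the widest tier the current CPU reports, falling back to scalar.
pub fn tier() -> &'static str {
    let c = caps();
    if c.avx512f {
        "avx512"
    } else if c.avx2 {
        "avx2"
    } else if c.neon {
        "neon"
    } else {
        "scalar"
    }
}
```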
### Fingerprint<N>: the BindSpace atom
- `[u64; N]` backing, 64-byte aligned
- `get/set/toggle_bit`, `bind` (XOR), `and`, `not`
- `hamming_distance` via SIMD popcount
- `popcount`, `density`
- `random` (xorshift128+), `from_content` (hash expansion)
- `permute` (circular bit shift for sequence encoding)

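The operations listed above can be illustrated with a scalar stand-in. `FingerprintSketch` is a hypothetical type written for this note; it is not the crate's `Fingerprint<N>`, it uses `count_ones` instead of SIMD popcount, and the bit order and rotation direction in `permute` are assumptions.

```rust
/// Scalar sketch of a const-generic binary vector with N×64 bits.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
#[repr(align(64))]
pub struct FingerprintSketch<const N: usize> {
    words: [u64; N],
}

impl<const N: usize> FingerprintSketch<N> {
    pub const BITS: usize = N * 64;

    pub fn zero() -> Self {
        Self { words: [0; N] }
    }

    pub fn get_bit(&self, i: usize) -> bool {
        (self.words[i / 64] >> (i % 64)) & 1 == 1
    }

    pub fn set_bit(&mut self, i: usize) {
        self.words[i / 64] |= 1u64 << (i % 64);
    }

    /// Bind is a word-wise XOR, so binding twice with the same key undoes it.
    pub fn bind(&self, other: &Self) -> Self {
        let mut out = *self;
        for (o, w) in out.words.iter_mut().zip(other.words.iter()) {
            *o ^= *w;
        }
        out
    }

    pub fn popcount(&self) -> u32 {
        self.words.iter().map(|w| w.count_ones()).sum()
    }

    /// Hamming distance is the popcount of the XOR.
    pub fn hamming_distance(&self, other: &Self) -> u32 {
        self.bind(other).popcount()
    }

    /// Fraction of set bits.
    pub fn density(&self) -> f64 {
        f64::from(self.popcount()) / Self::BITS as f64
    }

    /// Circular shift of the whole bit vector by `k` positions (bit i moves to i + k).
    pub fn permute(&self, k: usize) -> Self {
        let mut out = Self::zero();
        for i in 0..Self::BITS {
            if self.get_bit(i) {
                out.set_bit((i + k) % Self::BITS);
            }
        }
        out
    }
}
```

The bit-by-bit `permute` is deliberately naive; a production version would shift whole words and carry between them.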
### MultiLaneColumn: same object, multiple SIMD widths
- One `Arc<[u8]>` backing store
- View as U8x64 / F16x32 / F32x16 / F64x8 without copy
- Consumer picks lane width per operation

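One way to get several views over one shared buffer is sketched below. `ColumnSketch` is hypothetical, and it deviates from the description in one labeled respect: it stores `Arc<[u64]>` rather than `Arc<[u8]>`, so that 8-byte alignment is guaranteed by the element type. The views are plain `&[u8]` / `&[f32]` slices standing in for the SIMD lane types.

```rust
use std::sync::Arc;

/// Hypothetical shared column; cloning shares the storage instead of copying it.
#[derive(Clone)]
pub struct ColumnSketch {
    // Assumption: u64 storage (not u8) so every narrower view is aligned.
    words: Arc<[u64]>,
}

impl ColumnSketch {
    pub fn from_words(words: Vec<u64>) -> Self {
        Self { words: Arc::from(words) }
    }

    /// 64-bit lane view: the storage itself.
    pub fn as_u64(&self) -> &[u64] {
        &self.words
    }

    /// Byte view over the same memory, no copy.
    pub fn as_bytes(&self) -> &[u8] {
        // SAFETY: u8 has alignment 1 and every bit pattern is a valid u8.
        let (head, body, tail) = unsafe { self.words.align_to::<u8>() };
        debug_assert!(head.is_empty() && tail.is_empty());
        body
    }

    /// f32 view over the same memory, no copy.
    pub fn as_f32(&self) -> &[f32] {
        // SAFETY: the storage holds u64, so it is 8-byte aligned, which
        // satisfies f32's alignment; every bit pattern is a valid f32.
        let (head, body, tail) = unsafe { self.words.align_to::<f32>() };
        debug_assert!(head.is_empty() && tail.is_empty());
        body
    }

    /// True when both handles point at the same allocation.
    pub fn shares_storage_with(&self, other: &Self) -> bool {
        Arc::ptr_eq(&self.words, &other.words)
    }
}
```

The consumer picks the view per operation: `as_u64()` for popcount-style scans, `as_f32()` for float math, both over the same bytes.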
### array_window: SIMD batch iterator
- Yields N-aligned chunks from a slice
- Zero-copy: window IS a `&[T]` view
- One cascade level = one array_window pattern

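The windowing idea can be approximated with the standard library's `chunks_exact`. The function below is a hypothetical stand-in: the real `array_window` signature is not shown in this commit, and returning the leftover tail separately is an assumption about how a caller would handle lengths that are not a multiple of `n`.

```rust
use std::slice::ChunksExact;

/// Splits `data` into full windows of `n` items plus the leftover tail.
/// Each window is a borrowed `&[T]` view, so nothing is copied.
pub fn array_window_sketch<'a, T>(data: &'a [T], n: usize) -> (ChunksExact<'a, T>, &'a [T]) {
    assert!(n > 0, "window size must be non-zero");
    let windows = data.chunks_exact(n);
    let tail = windows.remainder();
    (windows, tail)
}

/// Example consumer: one pass over full windows, then a scalar pass over the tail.
pub fn sum_windows(data: &[u32], n: usize) -> (Vec<u32>, u32) {
    let (windows, tail) = array_window_sketch(data, n);
    let per_window = windows.map(|w| w.iter().sum()).collect();
    let tail_sum = tail.iter().sum();
    (per_window, tail_sum)
}
```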
### CollapseGate enum
- `Flow` / `Block` / `Hold` (already in `hpc/bnn_cross_plane`)
- Consumers (L3) extend with MergeMode (Xor/Bundle/Superposition)

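The source names the three gate variants but does not define their behavior. The sketch below assumes one plausible reading (Flow applies a write immediately, Block discards it, Hold buffers it for a later release) and uses XOR as the only merge mode. `GateSketch` and `GatedWord` are hypothetical names, and the semantics are an assumption, not the crate's contract.

```rust
/// Hypothetical mirror of the Flow/Block/Hold variants.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum GateSketch {
    Flow,
    Block,
    Hold,
}

/// A single gated 64-bit word with XOR merge (assumed semantics).
pub struct GatedWord {
    committed: u64,
    pending: Vec<u64>,
}

impl GatedWord {
    pub fn new(initial: u64) -> Self {
        Self { committed: initial, pending: Vec::new() }
    }

    /// Routes one write through the gate.
    pub fn write(&mut self, gate: GateSketch, delta: u64) {
        match gate {
            GateSketch::Flow => self.committed ^= delta, // apply now
            GateSketch::Block => {}                      // drop the write
            GateSketch::Hold => self.pending.push(delta), // keep for later
        }
    }

    /// Applies every held write in arrival order.
    pub fn release(&mut self) {
        for delta in self.pending.drain(..) {
            self.committed ^= delta;
        }
    }

    pub fn value(&self) -> u64 {
        self.committed
    }
}
```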
## What ndarray DOES NOT Provide

These live UP the stack, not in ndarray:
- BindSpace address types (lance-graph-contract)
- CognitiveShader dispatch (p64-bridge)
- Planner strategies (lance-graph-planner)
- CausalEdge64 (causal-edge)
- NARS inference (causal-edge + contract)
- GGUF parsing (bgz-tensor / consumer)

Keep ndarray free of cognitive logic. It's the foundation, not the cortex.

## Current Gaps (next session targets)

1. **MultiLaneColumn type doesn't exist yet**: add to `src/hpc/column.rs`,
   re-export from `src/simd.rs`
2. **Fingerprint<N> missing `as_u8x64()`**: add SIMD view methods
3. **simd.rs re-exports incomplete**: add Fingerprint, MultiLaneColumn,
   array_window, VectorWidth
4. **VectorWidth LazyLock not consumed**: any module that serializes
   fingerprints should read it for width config
5. **Hamming popcount hasn't been exposed via multi-lane view**:
   combine with MultiLaneColumn for the Layer 1 cascade path

## Migration Tracking (from ladybug-rs)

ladybug-rs depended on `rustynum` as its HPC crate. rustynum was
ported INTO this ndarray fork as `src/hpc/` (55 modules, 880 tests).
Downstream consumers (lance-graph-cognitive, learning crate) still
reference `rustynum_core::*` types. They need these substitutions:

| ladybug-rs `rustynum_*` path | ndarray equivalent |
|---|---|
| `Fingerprint` | `ndarray::simd::Fingerprint<256>` |
| `hamming_distance` | `ndarray::hpc::bitwise::hamming_distance_raw` |
| `simd_level` | `ndarray::hpc::simd_caps::simd_caps()` |
| `cascade::*` | `ndarray::hpc::cascade::*` |
| `bf16_*` | `ndarray::hpc::quantized::BF16` |
| `rustynum_bnn::CollapseGate` | `ndarray::hpc::bnn_cross_plane::CollapseGate` |
| `rustynum_holo::*` | `ndarray::hpc::holo::*` |

**migration-tracker agent** owns this substitution table.

## The Endgame (ndarray's view)

Each token of LLM inference in the cognitive shader system runs:

```
1. Read BindSpace column slice         → &[u64; N]   (Layer 1)
2. Hamming popcount via SIMD dispatch  → [u32; N]    (Layer 0)
3. Base17 L1 distance on survivors     → [u16; M]    (Layer 0)
4. Palette table lookup (256×256)      → [u8; K]     (Layer 0)
5. Gather via VPGATHERDD               → [u8; K]     (Layer 0)
```

Steps 2-5: all Layer 0. All ndarray. Zero FP. Zero matmul. 611M lookups/sec.

The cognitive layers above coordinate WHICH columns to scan and HOW
to combine results. ndarray just executes the primitives as fast as
the hardware allows. Pi Zero to Sapphire Rapids, same API, same
correctness, different throughput.
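A scalar sketch of steps 2, 4 and 5 is given below to make the data flow concrete. Everything here is a hypothetical stand-in: the function names are invented for this note, the threshold-based survivor filter is an assumption about what "survivors" means, step 3 (Base17 L1 distance) is omitted because the source does not define it, and the gather is a plain indexed load rather than VPGATHERDD.

```rust
/// Step 2 (scalar stand-in): Hamming distance as popcount of XOR.
pub fn hamming(a: &[u64], b: &[u64]) -> u32 {
    a.iter().zip(b.iter()).map(|(x, y)| (x ^ y).count_ones()).sum()
}

/// Keeps the indices of rows within `max_dist` of the query (assumed filter rule).
pub fn survivors(query: &[u64], rows: &[Vec<u64>], max_dist: u32) -> Vec<usize> {
    rows.iter()
        .enumerate()
        .filter(|(_, row)| hamming(query, row) <= max_dist)
        .map(|(i, _)| i)
        .collect()
}

/// Step 4 setup: precompute a 256×256 byte table so lookups need no arithmetic.
pub fn build_palette(f: impl Fn(u8, u8) -> u8) -> Vec<u8> {
    let mut table = vec![0u8; 256 * 256];
    for a in 0..=255u8 {
        for b in 0..=255u8 {
            table[((a as usize) << 8) | (b as usize)] = f(a, b);
        }
    }
    table
}

/// Steps 4-5: one indexed load per (a, b) pair.
pub fn gather(table: &[u8], pairs: &[(u8, u8)]) -> Vec<u8> {
    pairs
        .iter()
        .map(|&(a, b)| table[((a as usize) << 8) | (b as usize)])
        .collect()
}
```

The point of the table is that the per-token hot path only does integer XOR, popcount and byte loads, which matches the "zero FP, zero matmul" claim above.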

src/simd.rs

Lines changed: 36 additions & 0 deletions
@@ -949,6 +949,42 @@ pub fn simd_ln_f32(x: F32x16) -> F32x16 {
     F32x16::from_array(out)
 }
 
+// ============================================================================
+// Cognitive shader foundation re-exports
+// ============================================================================
+
+// Fingerprint<N>: const-generic binary vector, the BindSpace atom
+pub use crate::hpc::fingerprint::{
+    Fingerprint,
+    Fingerprint2K, Fingerprint1K, Fingerprint64K,
+    VectorWidth, VectorConfig, vector_config,
+};
+
+// CollapseGate: Flow/Block/Hold write gate (Layer 3 in the 7-layer stack)
+pub use crate::hpc::bnn_cross_plane::CollapseGate;
+
+// Bitwise: SIMD-dispatched Hamming distance + popcount
+pub use crate::hpc::bitwise::{
+    hamming_distance_raw, popcount_raw,
+};
+
+// WHT: Walsh-Hadamard Transform (SIMD butterfly)
+pub use crate::hpc::fft::{wht_f32, wht_f32_new};
+
+// Quantization: i4/i2/i8 pack/unpack + BF16
+pub use crate::hpc::quantized::{
+    quantize_f32_to_i4, dequantize_i4_to_f32,
+    quantize_f32_to_i2, dequantize_i2_to_f32,
+    quantize_f32_to_i8, dequantize_i8_to_f32,
+    QuantParams,
+};
+
+// K-means + L2 distance
+pub use crate::hpc::cam_pq::{kmeans, squared_l2};
+
+// SIMD cosine
+pub use crate::hpc::heel_f64x8::cosine_f32_to_f64_simd;
+
 // ============================================================================
 // Tests
 // ============================================================================
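The `wht_f32` re-export in this hunk is described as a Walsh-Hadamard transform with a SIMD butterfly. The scalar sketch below shows only the butterfly structure. The real signature of `wht_f32` is not visible in this diff, so the function name `wht_in_place_sketch`, the in-place `&mut [f32]` interface, and the unnormalized output are all assumptions.

```rust
/// Unnormalized fast Walsh-Hadamard transform, in place.
/// Panics unless the length is a power of two (an empty slice is rejected).
pub fn wht_in_place_sketch(data: &mut [f32]) {
    let n = data.len();
    assert!(n.is_power_of_two(), "length must be a power of two");
    let mut half = 1;
    while half < n {
        // Each block of 2*half items is combined pairwise: (a, b) -> (a + b, a - b).
        for block in (0..n).step_by(half * 2) {
            for i in block..block + half {
                let a = data[i];
                let b = data[i + half];
                data[i] = a + b;
                data[i + half] = a - b;
            }
        }
        half *= 2;
    }
}
```

Applying the transform twice returns the input scaled by the length, which gives a cheap self-check for any faster implementation.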
