Commit a4dc792

docs(arithmetic): consolidated math inventory across splat3d / splat4d / cognitive

Review of the three uploaded sprint prompts (splat3d_sprint_prompt, splat4d_cascade_sprint, splat4d_skeleton_anchored_sprint) in context of the cognitive-shader work drafted in PR-X4 / PR-X9 / PR-Z1. Tags every arithmetic primitive shipped / drafted / gap across 9 layers (L0 SPD substrate → L8 cognitive overlay), flags 3 precision classes (EXACT / FAST OK / VERIFY), and identifies 5 concrete gaps that gate the joint sprint:

1. Hilbert-3D encode/decode (mentioned in splat4d cascade but not specified anywhere; single shared dependency of medical AND cognitive paths)
2. INT4×32 packed dot product (PR-X7 thinking-style + qualia signature; needs VNNI/dotprod strategy decision)
3. NARS truth-revision kernel + precision class (replaces alpha-compose in W7 closure swap)
4. x265-style CTU mode encoder (skip/merge/delta/escape for PR-X9 lazy storage)
5. fast_exp_x16 precision audit for NARS context (3% rel err is OK for alpha but suspect for cognitive confidence cascade)

Five new cross-cutting research items consolidated (atop the five from the three sprint docs):

- Hilbert-3D algorithm choice (Butz vs Skilling vs precomputed table)
- INT4×N hardware strategy (VNNI vs software unpack vs AMX widening)
- NARS revise precision class decision (G5 (a/b/c); lean toward (b), drop exp from cognitive path entirely)
- CTU mode encoder λ-RDO calibration
- Codebook size const-generic strategy

Recommended ordering:

- Phase 0 (Hilbert-3D + INT4×N) unblocks BOTH the medical sprint (splat4d skeleton-anchored) AND the cognitive sprint (PR-X4 + PR-X9). Build the shared substrate first; both stacks accelerate together.
- Phase 1: medical+cognitive co-substrate (Pillar-8 + moment-match + mesh-fit).
- Phase 2: cognitive-only (basin XOR-popcount + CTU + NARS).
- Phase 3: W7 closure swap.

Recommended 30-min math workshop before the joint plan-review savant to lock σ_temporal values, Hilbert-3D algorithm, and NARS precision class. This removes 3 open questions per design doc and accelerates the sprint.

Key strategic claim: Pillar-7 SPD-sandwich is the most-reused single math op in the entire stack. It's the projection (J·W·Σ·Wᵀ·Jᵀ), the temporal cascade (Σ_{t+1} = M·Σ_t·Mᵀ), the moment-match aggregate-up (via Δμ·Δμᵀ outer products), and the cognitive-spacetime evolution. Shipped in splat3d PR #153. Everything else is a semantic reinterpretation of M.
1 parent ea957d5 commit a4dc792

1 file changed

Lines changed: 313 additions & 0 deletions

@@ -0,0 +1,313 @@
# Arithmetic Inventory — splat3d / splat4d / cognitive shader stack

> READ BY: all agents touching `crate::hpc::*` math kernels
> (savant-architect, splat3d-architect, cascade-architect, cognitive-architect,
> arm-neon-specialist, sentinel-qa, truth-architect, vector-synthesis).
>
> Status: review v1, drafted 2026-05-18 in response to the three uploaded
> sprint prompts (`splat3d_sprint_prompt.md`, `splat4d_cascade_sprint.md`,
> `splat4d_skeleton_anchored_sprint.md`).
>
> Purpose: enumerate every arithmetic primitive required by the
> splat3d → splat4d → cognitive-shader stack, tag each `shipped | drafted |
> gap`, flag precision-class concerns, and identify the ordering blockers
> for the joint sprint.
>
> Parallel docs:
> - `.claude/knowledge/pr-x3-cognitive-grid-design.md`: BlockedGrid substrate (shipped, PR #158)
> - `.claude/knowledge/pr-x4-design.md`: Gaussian splat cascade onto BlockedGrid
> - `.claude/knowledge/pr-x9-design.md`: lazy basin-codebook storage
> - `.claude/knowledge/pr-z1-ogit-cognitive-bootstrap.md`: OGIT Cognitive namespace bootstrap

## TL;DR

| Layer | Primitives | Status |
|---|---|---|
| **L0 SPD substrate** (Pillar-6/7) | Smith-1961 eig, Σ^t, sandwich, Spd3, sandwich_x16 | ✅ shipped (splat3d PR #153) |
| **L1 Projection / EWA** | W·Σ·Wᵀ, J·Σ·Jᵀ, 2D conic inverse, 3σ radius | ✅ shipped |
| **L2 SH eval** (deg-3) | 16-basis × 3-channel, Inria convention | ✅ shipped |
| **L3 Tile bin + rasterize** | radix sort, Mahalanobis², fast_exp_x16, alpha compose | ✅ shipped |
| **L4 Cascade addressing** | CascadeAddr bit-pack, parent/children, **Hilbert-3D** | ⬜ **GAP**: Hilbert-3D not specified anywhere |
| **L5 Gaussian-mixture moment-match** | Σ_parent = (1/n)·Σ(Σ_i + Δμ·Δμᵀ) | ⬜ drafted in splat4d cascade PR 1 |
| **L6 Pillar-8 temporal sandwich** | Σ_{t+1} = M·Σ_t·Mᵀ with M = sqrt(σ_temporal) | ⬜ drafted in splat4d cascade |
| **L7 Mesh→splat fitting** | PCA over vertex positions, fiber-direction Σ alignment | ⬜ drafted in skeleton-anchored PR 2a/2b |
| **L8 Cognitive overlay** | INT4×N dot, NARS revision, basin XOR-popcount, CTU modes | ⬜⬜ **GAP**: not in any splat doc |

**Five concrete gaps gate the joint sprint:**
1. Hilbert-3D curve encode/decode (~16 ops/coord, needed by L4 cascade addressing)
2. INT4×32 packed dot product (needed by cognitive cell signature)
3. NARS truth-revision kernel + precision class (replaces alpha-compositing in W7)
4. x265-style CTU mode encoder (skip/merge/delta/escape; needed by PR-X9 lazy storage)
5. fast_exp_x16 precision audit (3% relative error: OK for alpha, **suspect for NARS confidence**)

## Layer-by-layer detail

### L0 — SPD substrate (Pillar-6 2D / Pillar-7 3D)

**Shipped in splat3d PR #153**. The single most-reused arithmetic primitive in the stack: the temporal-sandwich (Pillar-8), the splat-cascade aggregate-up, and the EWA projection ALL reduce to `M·Σ·Mᵀ` on Spd3. No new SPD machinery needed downstream; only new semantic interpretations of M.

```
Smith-1961 closed-form eigendecomp: O(1) per matrix, ~30 ops, scalar
Σ^t = V · diag(λᵢᵗ) · Vᵀ: O(eig) + 3 pow, scalar inner
sandwich(M, N): M·N·Mᵀ symmetric: 21 mul + 12 add for 3×3 sym
sandwich_x16: AVX-512 batched, 10× over scalar
from_scale_quat: Σ = R·diag(s²)·Rᵀ: 9 mul + 6 add (R) + 9 mul (sandwich)
is_spd, frobenius², det, log_spd: constant-fold scalar
```

**Precision class: EXACT** for all downstream compute. No approximations. `Spd3::eig` uses branchless acos clamp + Gram-Schmidt orthonormalization on near-degenerate covariances.
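
To make the `M·N·Mᵀ` shape concrete, here is a minimal scalar sketch. It is not the shipped `Spd3` code: the `Mat3` alias, the helper names, and the final symmetrization step are assumptions chosen for illustration, and it uses plain 3×3 products rather than the reduced op count listed above.

```rust
/// Row-major 3x3 matrix. Hypothetical stand-in for the crate's `Spd3` storage.
type Mat3 = [[f32; 3]; 3];

fn matmul(a: &Mat3, b: &Mat3) -> Mat3 {
    let mut out = [[0.0f32; 3]; 3];
    for i in 0..3 {
        for j in 0..3 {
            for k in 0..3 {
                out[i][j] += a[i][k] * b[k][j];
            }
        }
    }
    out
}

fn transpose(a: &Mat3) -> Mat3 {
    let mut t = [[0.0f32; 3]; 3];
    for i in 0..3 {
        for j in 0..3 {
            t[j][i] = a[i][j];
        }
    }
    t
}

/// Computes M * N * M^T. For symmetric N the exact result is symmetric;
/// the averaging pass only removes floating-point asymmetry (an assumption
/// of this sketch, not a documented step of the shipped kernel).
fn sandwich(m: &Mat3, n: &Mat3) -> Mat3 {
    let mut s = matmul(&matmul(m, n), &transpose(m));
    for i in 0..3 {
        for j in (i + 1)..3 {
            let avg = 0.5 * (s[i][j] + s[j][i]);
            s[i][j] = avg;
            s[j][i] = avg;
        }
    }
    s
}
```

The same function serves the projection (M = J·W) and the temporal step (M = sqrt(σ_temporal)); only the meaning of M changes.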
59+
60+
### L1 — Projection + EWA (3D world → 2D conic)
61+
62+
**Shipped in splat3d PR #153** (the math heat of PR 3).
63+
64+
```
65+
μ_cam = V · μ_world: 16 FMA/gaussian (3×4 mat-vec)
66+
Frustum cull (depth + AABB): F32x16 mask, branchless
67+
J = [[fx/z, 0, -fx·x/z²], [0, fy/z, -fy·y/z²]]: 6 div by z (vrcp14ps)
68+
Σ_image = J · W · Σ_world · Wᵀ · Jᵀ: ~50 FMA/gaussian — THE hottest single op
69+
2D conic = Σ_image⁻¹: 4 mul + 1 div + 3 mul
70+
3σ radius = 3·sqrt(λ_max(Σ_image)): closed-form 2×2 root + sqrt
71+
View dir = normalize(μ - cam_pos): 3-vec norm (vrsqrt14ps)
72+
```
73+
74+
**Precision class: FAST OK** for graphics (Inria parity SSIM ≥ 0.97 with vrcp14/vrsqrt14). **VERIFY** if reused for cognitive distance — the perspective division ε accumulates over the cascade.
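
The last two projection steps (2D conic and 3σ radius) have closed forms for a symmetric 2×2 Σ_image = [[a, b], [b, c]]. The sketch below shows one way to write them; the function name, the `Option` return for a non-positive determinant, and the use of an exact division instead of `vrcp14ps` are assumptions of this example.

```rust
/// Inverse (conic) and 3-sigma radius of a symmetric 2x2 covariance
/// [[a, b], [b, c]]. Returns None when the determinant is not positive.
/// Hypothetical helper, not the shipped splat3d function.
fn conic_and_radius(a: f32, b: f32, c: f32) -> Option<([f32; 3], f32)> {
    let det = a * c - b * b;
    if det <= 0.0 {
        return None;
    }
    let inv_det = 1.0 / det;
    // Inverse of a symmetric 2x2: swap the diagonal, negate the off-diagonal.
    let conic = [c * inv_det, -b * inv_det, a * inv_det];
    // Largest eigenvalue from the characteristic polynomial:
    // lambda = mid + sqrt(mid^2 - det), with mid = trace / 2.
    let mid = 0.5 * (a + c);
    let lambda_max = mid + (mid * mid - det).max(0.0).sqrt();
    Some((conic, 3.0 * lambda_max.sqrt()))
}
```

For a = 4, b = 0, c = 1 this gives the conic (0.25, 0, 1) and a radius of 6, since λ_max = 4.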

### L2 — Spherical harmonics evaluation (deg 0–3)

**Shipped in splat3d PR #153**.

```
SH_C0, SH_C1, SH_C2[5], SH_C3[7]: baked f32 constants
sh_eval_deg3(sh[48], d): 17 mul-add per channel × 3 = 51 FMA per gaussian
sh_eval_deg3_x16: AVX-512 batched, ~6× over scalar
Inria convention: (v + 0.5).clamp(0, 1)
```

**Cognitive reframe**: same math gives "appearance under different cognitive inquiries", i.e. the `vocab_idx × thinking_style` projection per PR-X4's `SplatCell`. **Drafted but not shipped** in cognitive form.
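
As a reference point for the SH evaluation listed in this section, the degree 0 and 1 terms for a single colour channel can be sketched as below. The sign pattern follows the Inria reference implementation as I understand it; the function name `sh_eval_deg1` and the 4-coefficient-per-channel input layout are assumptions, and the shipped `sh_eval_deg3` additionally covers the degree 2 and 3 bases.

```rust
// Real spherical-harmonic normalization constants for degree 0 and 1.
const SH_C0: f32 = 0.282_094_8;
const SH_C1: f32 = 0.488_602_5;

/// Evaluates the DC term plus the three linear terms for one channel.
/// `sh` = [dc, c1, c2, c3] (assumed layout), `d` = unit view direction.
fn sh_eval_deg1(sh: &[f32; 4], d: [f32; 3]) -> f32 {
    let [x, y, z] = d;
    let v = SH_C0 * sh[0] - SH_C1 * y * sh[1] + SH_C1 * z * sh[2] - SH_C1 * x * sh[3];
    // Offset and clamp as stated in the table above.
    (v + 0.5).clamp(0.0, 1.0)
}
```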

### L3 — Tile binner + rasterizer (alpha-compositing)

**Shipped in splat3d PR #153**. Three precision-class flags:

```
Mahalanobis² power = -0.5·(ca·dx² + 2·cb·dx·dy + cc·dy²): 4 FMA/pixel/splat — EXACT
fast_exp_x16 (Schraudolph 1999): 1 cast + 1 FMA — **3% rel err**
alpha = min(0.99, op · fast_exp(power)): EXACT after exp
T·alpha compose, T *= (1−α): 4 FMA/pixel/splat — EXACT
Saturation early-exit T < 1e-4: single compare
```

**Precision class: FAST OK for alpha-compositing. FLAG: fast_exp's 3% relative error is graphics-suitable but suspect for NARS truth-revision convergence** (see G5 under L8).

Radix sort on packed u64 (tile_id << 32 | depth_bits) → 2M instances in ≤8 ms. **EXACT** by construction (integer key).
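
The exp-then-composite inner loop described in this section can be sketched in scalar form as follows. This is an f32 adaptation of Schraudolph's bit trick, not the shipped `fast_exp_x16`; the two magic constants (2²³/ln 2 and a bias-corrected exponent offset), the early return for very negative inputs, and the `(opacity, power, value)` tuple layout are assumptions of this example. It is only meaningful for `x ≤ 0`, which is the range the Mahalanobis power produces.

```rust
/// Schraudolph-style exp approximation: build the IEEE-754 bit pattern
/// directly from a scaled and shifted input. Relative error stays within
/// a few percent. Intended for x <= 0 only.
fn fast_exp(x: f32) -> f32 {
    if x < -87.0 {
        return 0.0; // below the smallest normal f32 exponent
    }
    // 12_102_203 ~= 2^23 / ln(2); 1_064_866_805 = 127 * 2^23 minus a
    // correction that centres the error band.
    let i = (12_102_203.0_f32 * x + 1_064_866_805.0_f32) as i32;
    f32::from_bits(i as u32)
}

/// Front-to-back compositing of (opacity, power, value) triples for one
/// pixel. Returns (accumulated value, remaining transmittance T).
fn composite(splats: &[(f32, f32, f32)]) -> (f32, f32) {
    let mut t = 1.0f32;
    let mut acc = 0.0f32;
    for &(op, power, value) in splats {
        let alpha = (op * fast_exp(power)).min(0.99);
        acc += t * alpha * value;
        t *= 1.0 - alpha;
        if t < 1e-4 {
            break; // saturation early-exit
        }
    }
    (acc, t)
}
```

Because each alpha is clamped and multiplied into T, a few percent of error in the exponential shifts the blend weights slightly but cannot push T outside [0, 1], which is why this class of error is tolerable for graphics.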

### L4 — Cascade addressing (splat4d cascade PR 1) — **PARTIAL GAP**

```
CascadeAddr::level(l): (bits >> (l*4)) & 0xF: 1 shift + 1 and
parent(): bits & !0xF000: 1 and
children(): [parent | (i<<12) for i in 0..16]: 16 ors
from_position(p, bbox, level): **Hilbert-3D encode** ⬜ GAP
to_position_center(addr, bbox): **Hilbert-3D decode** ⬜ GAP
```

**The Hilbert-3D curve at 4 bits per axis per level isn't specified in any of the three docs.** Splat4d cascade PR 1 says "Hilbert-3D order at L4 for cache locality" but doesn't sketch the math. We need:
- `position → (l0, l1, l2, l3)` nibble path: ~16 conditional swaps per coordinate per level (Butz's algorithm)
- Inverse decode: same shape
- Possibly a precomputed 16-entry rotation table for Gray-code-order branching

**Precision class: EXACT** (integer-only encode/decode; no float ops). Estimated ~64 ops per address conversion.
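
As one candidate for the missing encode/decode, the sketch below follows Skilling's transpose-based formulation (one of the options named in research item 6), specialised to 3 axes with 4 bits each, which yields a 12-bit index. It is an illustration under stated assumptions, not the algorithm the sprint has chosen: the function names and `BITS` constant are hypothetical, and the mapping from this index onto the `(l0, l1, l2, l3)` nibble path and the bbox quantization are deliberately left out.

```rust
const BITS: u32 = 4; // bits per axis (assumption for this sketch)

/// Grid coordinates -> "transposed" Hilbert index (Skilling's forward pass).
fn axes_to_transpose(mut x: [u32; 3]) -> [u32; 3] {
    let m = 1u32 << (BITS - 1);
    let mut q = m;
    while q > 1 {
        let p = q - 1;
        for i in 0..3 {
            if x[i] & q != 0 {
                x[0] ^= p; // invert low bits of axis 0
            } else {
                let t = (x[0] ^ x[i]) & p; // exchange low bits with axis 0
                x[0] ^= t;
                x[i] ^= t;
            }
        }
        q >>= 1;
    }
    for i in 1..3 {
        x[i] ^= x[i - 1]; // Gray encode
    }
    let mut t = 0;
    let mut q = m;
    while q > 1 {
        if x[2] & q != 0 {
            t ^= q - 1;
        }
        q >>= 1;
    }
    for v in x.iter_mut() {
        *v ^= t;
    }
    x
}

/// Inverse of `axes_to_transpose`.
fn transpose_to_axes(mut x: [u32; 3]) -> [u32; 3] {
    let n = 2u32 << (BITS - 1);
    let t = x[2] >> 1;
    for i in (1..3).rev() {
        x[i] ^= x[i - 1]; // Gray decode
    }
    x[0] ^= t;
    let mut q = 2;
    while q != n {
        let p = q - 1;
        for i in (0..3).rev() {
            if x[i] & q != 0 {
                x[0] ^= p;
            } else {
                let t = (x[0] ^ x[i]) & p;
                x[0] ^= t;
                x[i] ^= t;
            }
        }
        q <<= 1;
    }
    x
}

/// Interleaves the transposed form into one index, most significant bit first.
fn hilbert_encode(p: [u32; 3]) -> u32 {
    let x = axes_to_transpose(p);
    let mut h = 0;
    for j in (0..BITS).rev() {
        for &xi in x.iter() {
            h = (h << 1) | ((xi >> j) & 1);
        }
    }
    h
}

/// Splits the index back into the transposed form and undoes the transform.
fn hilbert_decode(h: u32) -> [u32; 3] {
    let mut x = [0u32; 3];
    for k in 0..(3 * BITS) {
        let bit = (h >> (3 * BITS - 1 - k)) & 1;
        x[(k % 3) as usize] |= bit << (BITS - 1 - k / 3);
    }
    transpose_to_axes(x)
}
```

Everything here is shifts, masks, and XORs on integers, consistent with the EXACT class claimed above. A table-driven variant would trade these loops for lookups at the same bit-exact result.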

### L5 — Gaussian-mixture moment-match (cascade aggregate-up) — **DRAFTED**

```
Σ_parent = (1/n) · Σᵢ(Σᵢ + Δμᵢ · Δμᵢᵀ) where Δμᵢ = μᵢ − μ_parent
μ_parent = (1/n) · Σᵢ μᵢ
opacity_parent = mean or alpha-composite of children
```

For N=16 children at each tier:
```
16 outer products Δμᵢ·Δμᵢᵀ: 16 × 6 mul = 96 mul/parent (3×3 sym)
16 Spd3 additions: 16 × 6 = 96 add/parent
1 scalar division by 16: 1 div (or shift)
```

≈ 200 ops per parent. Every tier aggregates exactly 16 children, so the per-parent cost is the same at each level; an L2 parent therefore sits on top of ≈ 16 × 200 = 3200 ops of L3 work. With 65,536 leaves there are 4096 + 256 + 16 + 1 = 4369 parents, so the total cascade aggregate-up is on the order of 1 M ops (a few M at most once the μ_parent and opacity passes are counted), well within frame budget.

**Drafted in splat4d cascade PR 1.** Same math is PR-X4's `compose_cascade` operator from our prior conversation. **EXACT** precision class: no approximations.
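
The aggregate-up formulas above translate directly into a two-pass loop: one pass for μ_parent, one for Σ_parent. The sketch below assumes equal child weights (as in the formula), a hypothetical 6-element symmetric storage order, and leaves opacity out.

```rust
/// Symmetric 3x3 stored as [xx, xy, xz, yy, yz, zz]. Hypothetical layout.
type Sym3 = [f32; 6];

fn outer_sym(d: [f32; 3]) -> Sym3 {
    [d[0] * d[0], d[0] * d[1], d[0] * d[2], d[1] * d[1], d[1] * d[2], d[2] * d[2]]
}

/// Equal-weight moment match of child Gaussians (mean, covariance) into
/// one parent Gaussian: Sigma_parent = (1/n) * sum(Sigma_i + dmu_i dmu_i^T).
fn moment_match(children: &[([f32; 3], Sym3)]) -> ([f32; 3], Sym3) {
    assert!(!children.is_empty(), "need at least one child");
    let inv_n = 1.0 / children.len() as f32;

    // Pass 1: parent mean.
    let mut mu = [0.0f32; 3];
    for (m, _) in children {
        for k in 0..3 {
            mu[k] += m[k] * inv_n;
        }
    }

    // Pass 2: child covariance plus the spread of child means.
    let mut sigma = [0.0f32; 6];
    for (m, s) in children {
        let d = [m[0] - mu[0], m[1] - mu[1], m[2] - mu[2]];
        let o = outer_sym(d);
        for k in 0..6 {
            sigma[k] += (s[k] + o[k]) * inv_n;
        }
    }
    (mu, sigma)
}
```

Two unit-covariance children at x = −1 and x = +1 produce a parent centred at the origin with variance 2 along x and 1 along y and z: the Δμ·Δμᵀ term is what widens the parent along the axis separating its children.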

### L6 — Pillar-8 temporal sandwich (Σ_{t+1} = M·Σ_t·Mᵀ) — **DRAFTED**

**Reuses Spd3 + sandwich_x16. Only the σ_temporal table is new.** The table is stratified into three motion bands:

| Band | Frequency | Amplitude | σ_temporal (Frobenius) |
|---|---|---|---|
| Cardiac | ~6 Hz | ~5 mm | needs literature value |
| Respiratory | ~0.3 Hz | ~20 mm | needs literature value |
| Micro-motion | ~120 Hz | ~0.1 mm | needs literature value |

**Cross-cutting research item #5 from splat4d cascade.** PASS gate is arbitrary until echocardiography literature pins these down.

**For the cognitive shader path**: σ_temporal represents NARS truth-confidence decay across frames, NOT physical motion. Same math, different calibration. Both interpretations share the substrate.

### L7 — Mesh → splat fitting (skeleton-anchored PR 2a/2b) — **DRAFTED**

```
Per-bone-segment PCA:
  μ_seg = (1/n) · Σ vᵢ                          Welford accumulator, ~3n FMA
  Σ_seg = (1/n) · Σ (vᵢ − μ)(vᵢ − μ)ᵀ           single-pass via Welford, ~6n FMA
  Quaternion from rotation matrix:              ~30 ops (Shepperd's method, sign-tracking)
Fiber-direction Σ alignment (muscles):
  axis = normalize(insertion − origo)           3-vec normalize
  Σ_major = axis · axisᵀ · σ_length²            3 mul + outer product
  Σ_minor from cross_section_mm²                2 scalar fills
```

**Precision class: EXACT** for the offline mesh→splat conversion (cached in build.rs); doesn't run on the hot path.
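
The single-pass μ_seg / Σ_seg accumulation named above can be sketched with the standard Welford co-moment update. The struct name, the 6-element symmetric storage order, and the choice of population (1/n) normalization to match the formula are assumptions of this example.

```rust
/// Streaming mean and covariance of 3D points (Welford-style update).
/// `m2` holds co-moments in the order [xx, xy, xz, yy, yz, zz].
#[derive(Default)]
struct Welford3 {
    n: u32,
    mean: [f32; 3],
    m2: [f32; 6],
}

impl Welford3 {
    fn push(&mut self, v: [f32; 3]) {
        self.n += 1;
        let inv_n = 1.0 / self.n as f32;
        let mut d_old = [0.0f32; 3];
        let mut d_new = [0.0f32; 3];
        for k in 0..3 {
            d_old[k] = v[k] - self.mean[k]; // deviation from the previous mean
            self.mean[k] += d_old[k] * inv_n;
            d_new[k] = v[k] - self.mean[k]; // deviation from the updated mean
        }
        let idx = [(0, 0), (0, 1), (0, 2), (1, 1), (1, 2), (2, 2)];
        for (s, &(i, j)) in idx.iter().enumerate() {
            self.m2[s] += d_old[i] * d_new[j];
        }
    }

    /// Population covariance (1/n), or None before any point was pushed.
    fn covariance(&self) -> Option<[f32; 6]> {
        if self.n == 0 {
            return None;
        }
        let inv_n = 1.0 / self.n as f32;
        Some(self.m2.map(|m| m * inv_n))
    }
}
```

Feeding the four corners of a 2×2 square in the xy-plane gives mean (1, 1, 0) and covariance diag(1, 1, 0), without ever storing the vertices.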

### L8 — Cognitive overlay — **THE GAP CATEGORY**

The five primitives that the three splat docs DON'T cover but our PR-X4/X9/Z1 docs require:

#### G1 — INT4×N packed dot product

For thinking-style (32-dim INT4) and qualia (16-dim INT4) cell signatures.

```
thinking_dot(a: [u8; 16], b: [u8; 16]) -> i32:
  unpack u4 pairs → i8 (table lookup or shift-and-mask)
  vpdpbusd-style i8×u8 → i32 accumulator (AVX-512 VNNI / NEON dotprod)
```

**Hardware path:**
- AVX-512 VNNI `vpdpbusd`: i8 × u8 → i32 accumulator, 64 ops per instruction → 2 instructions for 32-dim INT4
- ARM NEON `sdot`: i8×4 · i8×4 → i32 per lane, i.e. 16 multiply-accumulates per 128-bit instruction → 2 instructions for 32-dim after unpacking. `sdot` is signed × signed; the mixed-sign u8 × i8 form is `usdot` and requires the I8MM extension
- AMX: the INT8 tile ops take INT8, not INT4, so INT4 operands would need software unpacking first
- Scalar fallback: 32-way unrolled

**Precision class: EXACT** (integer dot product). **GAP**: not in any sprint doc; needs to land before PR-X7 typed cell-DSL.
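
A scalar reference for `thinking_dot` is useful as the oracle that any VNNI or NEON path gets tested against. The sketch below assumes both operands are signed two's-complement nibbles in [−8, 7], packed two per byte with the low nibble first; the doc does not fix the signedness or packing order, so treat both as placeholders.

```rust
/// Sign-extends the low 4 bits of `n` to an i32 in [-8, 7].
fn sext4(n: u8) -> i32 {
    (((n << 4) as i8) >> 4) as i32
}

/// Dot product of two 32-dim INT4 vectors packed two lanes per byte.
/// Signedness and nibble order are assumptions of this sketch.
fn thinking_dot(a: &[u8; 16], b: &[u8; 16]) -> i32 {
    let mut acc = 0i32;
    for (&pa, &pb) in a.iter().zip(b.iter()) {
        acc += sext4(pa & 0x0F) * sext4(pb & 0x0F); // low nibbles
        acc += sext4(pa >> 4) * sext4(pb >> 4); // high nibbles
    }
    acc
}
```

The largest possible magnitude is 32 × (−8 × −8) = 2048, so the i32 accumulator has ample headroom even when many cell pairs are summed.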

#### G2 — NARS truth-revision kernel

Replaces alpha-compositing in W7's PR-X4 closure swap.

```
revise(T1, T2) = (
  freq: (f1·c1 + f2·c2) / (c1 + c2),
  conf: (c1 + c2) / (c1 + c2 + k),    where k = 1 by NARS convention
)
```

4 FMA + 1 div per cell pair. **Precision class: NEEDS AUDIT**. The confidence numerator/denominator near c1+c2≈0 is the precision risk; `vrcp14ps` (14-bit mantissa, used by splat3d for perspective division) is likely insufficient, so one Newton-Raphson refinement step (which roughly doubles the accurate bits and brings the reciprocal close to full f32 precision) is probably required.

**GAP**: designed in PR-X4 §"W7 closure swap" but not implemented.
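
A direct scalar transcription of the formula as written in this section might look like the sketch below. The `Truth` struct, the guard threshold, and the fallback value returned when both confidences are effectively zero are assumptions; some NAL formulations first convert confidence to an evidence weight before summing, and that step is omitted here because the formula above does not include it.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
struct Truth {
    freq: f32,
    conf: f32,
}

/// Evidential horizon, k = 1 as stated above.
const K: f32 = 1.0;

/// Confidence-weighted average of frequencies plus saturating confidence.
/// Uses exact f32 division, so no approximate-reciprocal error enters.
fn revise(t1: Truth, t2: Truth) -> Truth {
    let c_sum = t1.conf + t2.conf;
    if c_sum <= f32::EPSILON {
        // No evidence on either side: frequency is undefined, so return a
        // neutral placeholder (assumption of this sketch).
        return Truth { freq: 0.5, conf: 0.0 };
    }
    Truth {
        freq: (t1.freq * t1.conf + t2.freq * t2.conf) / c_sum,
        conf: c_sum / (c_sum + K),
    }
}
```

Note that no exponential appears anywhere in this kernel, which is the observation behind option (b) in G5.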

#### G3 — Basin XOR-popcount (OGIT-schema-driven)

Per-cell basin matching against the 4096-atom CAM codebook, gated by the OGIT family bitmap.

```
For cell with edge u64:
  family = ogit_schema.family_of(cell.basin_hint)   // O(1)
  for basin_idx in family_bitmap.iter_ones():        // ~16-64 candidates
    delta = cell.edge XOR codebook[basin_idx].edge   // 1 XOR
    dist = delta.count_ones()                        // 1 popcnt
    track min dist + idx
  return best_basin_idx
```

**Hardware**: popcount is cheap on both targets: scalar `popcnt` handles one u64 edge, and the vector forms are `vpopcntq` on AVX-512 (requires the VPOPCNTDQ extension) and `cnt` + `addv` on NEON. **EXACT** by construction. **GAP**: drafted in PR-X9 §"Encoding modes" but not implemented; needs OGIT-rs hydrate path (PR-Z1 + PR-Z2).
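
The matching loop above can be written against plain slices as a reference implementation. In the sketch below the codebook is reduced to a `&[u64]` of edges and the family bitmap to a `&[u64]` of 64-bit words; both are stand-ins for the real OGIT schema types, which this document does not define.

```rust
/// Returns (basin index, Hamming distance) of the closest codebook edge
/// among the candidates whose bit is set in `family_bitmap`, or None if
/// the bitmap selects no valid entry. Ties keep the lowest index.
fn best_basin(edge: u64, codebook: &[u64], family_bitmap: &[u64]) -> Option<(usize, u32)> {
    let mut best: Option<(usize, u32)> = None;
    for (w, &word) in family_bitmap.iter().enumerate() {
        let mut bits = word;
        while bits != 0 {
            let idx = w * 64 + bits.trailing_zeros() as usize;
            bits &= bits - 1; // clear the lowest set bit
            if let Some(&atom) = codebook.get(idx) {
                let dist = (edge ^ atom).count_ones();
                if best.map_or(true, |(_, d)| dist < d) {
                    best = Some((idx, dist));
                }
            }
        }
    }
    best
}
```

For a 4096-atom codebook the bitmap is 64 words, and with ~16-64 candidates per family most words are zero and are skipped with a single compare.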

#### G4 — x265-style CTU mode encoder (skip/merge/delta/escape)

The per-cell rate-distortion loop that picks encoding mode in PR-X9's lazy storage.

```
For each cell (basin_idx, true_value):
  skip_cost   = 0 if true_value == basin else INF
  merge_cost  = 2 if delta(neighbor) decodes to true_value within ε else INF
  delta_cost  = 8 + |true_value - basin - decode(quantized_delta)|·λ
  escape_cost = 64 (always available)

  pick min(skip, merge, delta, escape)
```

Per-cell: ~4 compares + 1 subtract + 1 quantize. Inner loop of PR-X9's `encode_from_dense`. **EXACT** integer arithmetic. **GAP**: drafted in PR-X9 §"Encoding modes" but not implemented.
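
To show how the four costs interact, here is an integer-only sketch of the mode decision. The bit costs (0, 2, 8, 64) come from the pseudocode above; everything else is an assumption made for illustration: values are plain `i64`, the delta quantizer is a clamp to the i8 range with step 1, the merge test compares directly against a neighbour's decoded value, and ties resolve to the cheaper-to-signal mode listed first.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Mode {
    Skip,
    Merge,
    Delta,
    Escape,
}

const INF: u64 = u64::MAX;

/// Picks the cheapest encoding mode for one cell and returns its cost.
/// `lambda` weights distortion against the fixed per-mode bit costs.
fn pick_mode(true_value: i64, basin: i64, neighbor: i64, eps: i64, lambda: u64) -> (Mode, u64) {
    let skip = if true_value == basin { 0 } else { INF };
    let merge = if (neighbor - true_value).abs() <= eps { 2 } else { INF };

    // Delta mode: residual against the basin, quantized to a signed byte
    // (hypothetical quantizer; the real one is defined in PR-X9).
    let q = (true_value - basin).clamp(-128, 127);
    let residual = (true_value - basin - q).unsigned_abs();
    let delta = 8u64.saturating_add(residual.saturating_mul(lambda));

    let escape = 64;

    [
        (Mode::Skip, skip),
        (Mode::Merge, merge),
        (Mode::Delta, delta),
        (Mode::Escape, escape),
    ]
    .iter()
    .copied()
    .min_by_key(|&(_, cost)| cost)
    .unwrap() // the array is never empty
}
```

With λ = 4, a cell 3 away from its basin with no matching neighbour encodes as `Delta` at cost 8, while a cell 100,000 away overflows the byte-sized delta and falls through to `Escape` at cost 64.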

#### G5 — fast_exp precision audit for NARS

The splat3d `fast_exp_x16` (Schraudolph 1999) has 3% relative error. Acceptable for alpha attenuation (visual ε); **probably unacceptable for NARS confidence-cascade convergence** because the error compounds multiplicatively across tier propagation.

Decision needed:
- **(a)** Add `precise_exp_x16` path with 4th-order Padé polynomial (~7 FMA, accurate to 1e-7) and use it inside NARS revise closures
- **(b)** Re-derive NARS revise as a closed-form rational that avoids exp entirely (truth-revision is fractional, not exponential; exp only enters if we use confidence-as-exponential-decay)
- **(c)** A/B test 3% fast_exp against precise_exp on a synthetic NARS cascade and measure convergence drift

**Lean: (b)**. NARS truth-revision doesn't actually need exp; it's a weighted average + saturation. The exp came in via splat alpha-compositing. If W7 closure-swap replaces alpha with NARS, the exp call goes away with it.

## Precision-class summary

| Class | Definition | Primitives |
|---|---|---|
| **EXACT** | Bit-exact across reorderings | Spd3 ops, Mahalanobis², radix sort, CascadeAddr, NARS revise (after G5 (b)), basin XOR-popcount, CTU mode encoder, Hilbert-3D |
| **FAST OK** | 1-5% relative error acceptable | fast_exp_x16 (alpha only), vrcp14ps (perspective div), vrsqrt14ps (view-dir norm), SH eval (deg-3 truncation) |
| **VERIFY** | A/B audit before cognitive use | fast_exp_x16 in NARS context (G5), `from_scale_quat` near-degenerate cases, near-singular Σ_image conic inverse |

## Cross-cutting research items (consolidating from all 3 docs + this review)

From the three splat sprint docs:
1. BodyParts3D coverage (skeleton-anchored)
2. Muscle attachment table (skeleton-anchored)
3. Clarius access (now optional via SyntheticBinding)
4. FMA license (CC-BY-SA implications)
5. Pillar-8 σ_temporal calibration (cardiac/respiratory/micro literature values)

From this arithmetic review (NEW):
6. **Hilbert-3D encode/decode algorithm choice** (Butz's algorithm vs Skilling's algorithm vs precomputed rotation table)
7. **INT4×N packed dot product strategy** (VNNI vs software unpack vs AMX with INT8 widening)
8. **NARS revise precision class decision** (G5 (a) / (b) / (c) above)
9. **CTU mode encoder λ-RDO calibration** (borrow x265 medium-preset λ table vs NARS-confidence-derived)
10. **Codebook size const-generic strategy** (PR-X9 Q7: u8 vs u16 basin_idx)

## Recommended ordering

**Phase 0 (substrate, parallel-safe):**
- L4 Hilbert-3D encode/decode: single-worker, ~3 days, ~200 LoC. Unblocks splat4d cascade PR 1 AND PR-X4 cascade addressing.
- G1 INT4×32 packed dot product: single-worker, ~3 days, ~150 LoC. Unblocks PR-X7 typed cell-DSL.

**Phase 1 (medical + cognitive co-substrate, sequential):**
- L6 Pillar-8 temporal sandwich: needs σ_temporal literature values first
- L5 Gaussian-mixture moment-match: needs L6
- L7 Mesh→splat fitting: needs L0 (shipped)

**Phase 2 (cognitive-only):**
- G3 Basin XOR-popcount: needs PR-Z1 + PR-Z2 OR embedded-TTL escape hatch
- G4 CTU mode encoder: needs G3
- G2 NARS truth-revision: needs G5 decision

**Phase 3 (closure swap):**
- W7: replace splat alpha-compose closure with NARS revise. Single-PR scope. Drops fast_exp from the cognitive path entirely.

The CRITICAL observation: **Phase 0 unblocks both the medical sprint (splat4d skeleton-anchored) AND the cognitive sprint (PR-X4 + PR-X9)** because Hilbert-3D + INT4×N are dependencies of both. Build them first; both downstream stacks accelerate together.

## Recommended workshop

Before the joint plan-review savant, do ONE 30-min math workshop that:
1. Confirms σ_temporal values from echo/respiratory literature (item 5)
2. Picks the Hilbert-3D algorithm (item 6; recommend Butz/Skilling table-driven)
3. Decides G5 NARS precision class (recommend (b): drop exp from NARS path)
4. Drafts the precise_exp_x16 path (FOR splat alpha; the cognitive path doesn't need exp after G5(b))

After that workshop, the joint savant has a clean math surface to rule on. Without it, the savant will surface these as open questions per design doc and slow the sprint.

## Cross-references

- `/root/.claude/uploads/.../7b0ea082-splat3d_sprint_prompt.md`: splat3d sprint (shipped as ndarray PR #153, 2026-05-18)
- `/root/.claude/uploads/.../cdcb7d3d-splat4d_cascade_sprint.md`: splat4d cascade sprint (proposed, superseded by skeleton-anchored)
- `/root/.claude/uploads/.../7071b77a-splat4d_skeleton_anchored_sprint.md`: splat4d skeleton-anchored sprint (current proposal)
- `.claude/knowledge/pr-x3-cognitive-grid-design.md`: BlockedGrid substrate, shipped
- `.claude/knowledge/pr-x4-design.md`: Gaussian splat cascade onto BlockedGrid
- `.claude/knowledge/pr-x9-design.md`: lazy basin-codebook storage
- `.claude/knowledge/pr-z1-ogit-cognitive-bootstrap.md`: OGIT Cognitive namespace bootstrap
