Skip to content

feat(bb): tree-reduce SMVP variant + restore deterministic stock#23349

Draft
AztecBot wants to merge 27 commits into
zw/msm-webgpu-mont-mul-benchfrom
cb/b7178f5b65e7
Draft

feat(bb): tree-reduce SMVP variant + restore deterministic stock#23349
AztecBot wants to merge 27 commits into
zw/msm-webgpu-mont-mul-benchfrom
cb/b7178f5b65e7

Conversation

@AztecBot
Copy link
Copy Markdown
Collaborator

@AztecBot AztecBot commented May 17, 2026

Summary

Tree-reduce SMVP is wired into the production MSM pipeline as a use_tree_reduce variant on top of the deterministic sb-baseline batch-affine SMVP. Init dispatch and the three finalize stages (collect → batch_inverse → apply) stay shared with the stock path; only the per-bucket round loop is replaced.

This branch also restores stock MSM determinism (prior zw rewrites of SMVP files were non-deterministic; reverted to sb baseline while keeping Karatsuba+Yuval Montgomery mult).

Apple M2 head-to-head (real GPU)

BrowserStack macOS Sequoia · Chrome 148, logN=16 (n=65 536 MSM), 5 timed runs per variant after a pre-warm. Driven via the new bench-msm-variant page → POST /results JSONL. Both variants reuse the same GpuContext + CachedBases across runs.

variant median mean min max deterministic (5/5 same gpu.xy)
stock 152.2 ms 152.0 ms 150.0 ms 154.7 ms yes
tree-reduce 410.4 ms 414.5 ms 407.1 ms 430.3 ms yes

stockTreeAgree=true — both variants return identical (gpu.x, gpu.y), bit-for-bit. Correctness is solid.

Tree-reduce is currently 2.7× slower than stock on real M2.

Where the regression is

The tree-reduce inner loop (Phase 1 + recursive Phase 2 layers) is fast on GPU, but the orchestrator has per-layer host/device sync points that dominate wall time:

  1. Per-layer readback: runTreeReduce does await readbackU32(device, current.bucketId) between every Phase 2 layer so the CPU can compute per-WG pair counts and slice bounds. At logN=16 this is 6 layers → 6 round-trips. Each round-trip is a device.queue.submit([…]) + onSubmittedWorkDone() stall on the M2.
  2. Per-layer allocation: every Phase 2 iteration calls device.createBuffer(...) ~6 times (slice bounds, output offsets, prefix scratch, output {bid, x, y}). On a real GPU this is cheap per call but accumulates with the per-layer sync.
  3. Single CommandEncoder mid-flush: the ebid prelude is recorded into the caller's encoder, then submitted + awaited before runTreeReduce, then a fresh encoder is used for scatter + finalize. That's one extra full-pipeline stall per MSM call.

The standalone tree-reduce bench (bench-smvp-tree) measured ~24-34 ms at much smaller working sets on the same M2; at production scale (input_size × num_subtasks = 65 536 × 17 ≈ 1.1M entries) the per-layer sync dominates.

What's needed to close the gap

To make tree-reduce beat stock end-to-end:

  • Move pair-count + slice-bound computation onto the GPU (eliminate per-layer readback). One small kernel per layer that scans outBucketId and writes per-WG counts directly to the next layer's params buffer.
  • Reuse intermediate buffers across layers instead of device.createBuffer per layer. The max layer size is bounded by the input; allocate once at max bound, treat as a ping-pong pair.
  • Indirect-dispatch the Phase 2 loop: if the host doesn't need to readback, the entire layer chain can run in one encoder submission.

None of this is in the PR. The PR ships the integration + correctness; the performance work is separate.

Variant API

compute_bn254_msm_batch_affine(
  context, cachedBases, scalars,
  log_result, bpr_bench_flags, profile_capture, bpr_inner_loop,
  /* use_tree_reduce */ true,
);

URL param on the dev page: ?use_tree_reduce=1.

Plumbing changes (commit e80e589)

  • smvp_batch_affine_gpu(commandEncoderRef, …, use_tree_reduce) — tree path mid-flushes the encoder before runTreeReduce reads back entry_bucket_id, then swaps to a fresh encoder for scatter + finalize. Caller observes via commandEncoderRef.current.
  • compute_curve_msm and compute_bn254_msm_batch_affine forward use_tree_reduce; msm.ts wraps its commandEncoder in a ref and rebinds after the smvp call.
  • get_device opts in to maxStorageBuffersPerShaderStage=10 and maxComputeWorkgroupStorageSize=32768 when the adapter reports them — phase1 binds 10 storage buffers and needs ~27 KB workgroup scratch.

Bench harness

New page dev/msm-webgpu/bench-msm-variant.html runs the full compute_bn254_msm_batch_affine N times per variant and posts JSONL summaries. Driven by the existing BS runner via:

node dev/msm-webgpu/scripts/run-browserstack.mjs \
  --target macos --page bench-msm-variant \
  --logn 16 --runs 5 --variants stock,tree

Test plan

  • Stock noble-direct deterministic + matches noble.x (local SwiftShader)
  • Tree-reduce noble-direct deterministic + matches noble.x (local SwiftShader)
  • Apple M2 head-to-head: both variants deterministic + identical (gpu.x, gpu.y), bit-for-bit
  • Close the 2.7× perf gap: move pair-count + slice-bound to GPU; reuse intermediate buffers; chain layers into one encoder

Updated by claudebox

Adds a remote-device bench loop for the MSM-webgpu dev pages so the
tree-reduce work can validate against real WebGPU hardware (Apple M2,
Snapdragon 8 Elite, Tensor G4) from a workstation without a local GPU.

- vite.config.ts: results/progress POST endpoints write JSONL to files
  named by MSM_WEBGPU_RESULTS_FILE / MSM_WEBGPU_PROGRESS_FILE; allow
  .trycloudflare.com so the dev server is reachable via Cloudflare
  Quick Tunnel.
- results_post.ts: tiny in-page client used by bench/sanity pages to
  POST progress + final-state payloads (no keepalive — the page is
  alive when the bench completes).
- bench-batch-affine.ts: post per-batch progress and a terminal
  done/error row.
- scripts/run-browserstack.mjs: spawn vite + cloudflared, drive a BS
  worker through the REST API, watchdog-tail the JSONL with
  first-progress / stall / deadline budgets.
- scripts/bs-targets.mjs: macOS Sequoia Chrome, S25 Ultra, Pixel 9
  Pro XL presets (WebGPU stable). iPhone 15 Pro listed but flagged
  as needs-iOS-26-or-newer.

Validated against macOS Sequoia Chrome 148 (Apple M2, hc=8) on
?total=8192&sizes=64,256,1024:
  B=64  ns/pair=305.2  median=2.500ms
  B=256 ns/pair=146.5  median=1.200ms
  B=1024 ns/pair=219.7 median=1.800ms
@AztecBot AztecBot added the claudebox Owned by claudebox. it can push to this PR. label May 17, 2026
AztecBot added 18 commits May 17, 2026 12:06
Implements smvp_tree_partition.ts: the host computes per-WG slice
boundaries by binary search on bucketStart[], no GPU pre-pass. Uses
the analytical identity running_adds(i) = i - bucket_idx(i) from
msm-tree-reduce.md.

Documents a design ambiguity the plan didn't call out: the identity
under-counts when bucketStart contains empty buckets (bucket_idx
jumps faster than the entry count grows). Resolved by requiring
compacted input; compactBucketStart() + assertCompact() do the
one-pass cleanup and a side activeBucketIds[] map carries the
original bucket index for kernels that tag partials.

Exports:
  - computeTotalAdds, bucketIdx, runningAdds, findAddsBoundary
  - compactBucketStart, assertCompact
  - buildSliceLayout(bucketStart, numWgs) -> SliceLayout
    { sliceStart, outputCount, outputOffset, totalAdds }

24 Jest tests pass — including the pair-detection brute-force walk
that catches the empty-bucket regression, the heavy-bucket-skew case
(7+ of 8 WGs fall inside a single 10k-population bucket), and the
pathological totalAdds < numWgs case.

No GPU code touched.
…validated)

Phase 1 of the tree-reduce SMVP: pair detection + cooperative batch-
affine + per-bucket-tagged write-out, one workgroup per slice.

Files:
- src/msm_webgpu/wgsl/cuzk/smvp_tree_phase1.template.wgsl — the kernel.
  Thread-0 serial pair-detection preamble fills a workgroup-shared
  pair_list (packed PAIR + UNPAIRED entries in slice walk order, which
  is already bucket-sorted so no reorder postlude is needed). Phase
  A/B/C/D batch-affine pattern from bench_batch_affine.template.wgsl,
  with rank-indexed chunks over the PAIR sub-stream so a single
  fr_inv_by_a amortises across the WG. UNPAIRED entries get a final
  cooperative copy pass with sign-flip. Loop bounds all `const`
  (MAX_PAIRS = MAX_SLICE_ENTRIES baked at compile time; v0 uses 128 to
  keep workgroup memory comfortable).

- src/msm_webgpu/cuzk/shader_manager.ts — gen_smvp_tree_phase1_shader
  generator + import wiring.

- dev/msm-webgpu/bench-smvp-tree-phase1.{html,ts} — standalone bench
  page with a CPU reference. The reference walks the slice with the
  same paired/unpaired state machine and computes Mont-form affine
  adds via BigInt mod-inverse; correctness is checked bit-for-bit
  against the GPU output.

Status: structure-complete but NOT yet correctness-validated on
hardware. The BS macOS Chrome 148 run hangs on the page before the
first log call lands (the previous BS run on the same tunnel for
bench-batch-affine worked fine, so the issue is page-specific not
infrastructure). Likely candidates: an early-eval import side effect
in smvp_tree_partition.ts, the buildSynthetic randomBelow loop
generating off the main thread, or a Mont-form-conversion stall.
Worth investigating with browser console access; the BS screenshot
API doesn't surface uncaught errors.

Documents a design decision in the shader header: Phase 1 does NOT
collapse same-bucket pair results sequentially into a single per-
bucket partial inside the slice (the plan's "merge consecutive same-
bucket results into running sum" wording). Sequential merging would
break batch-affine amortisation and would need (pop-1) sequential
adds per heavy bucket. Instead Phase 1 halves per bucket (ceil(p/2)
outputs per bucket per slice), letting the recursive Phase 2 dispatch
do the rest of the reduction in log layers.

The plan's wg_output_count[k] = "buckets touched" formula is
overridden here by the per-slice CPU pair-detection walk that
computes the actual output count.
The window.error / unhandledrejection listeners and skip_gpu URL flag
were added to narrow down a BS-side hang in the phase1 bench page;
they didn't surface the underlying issue and have been removed. Page
remains structurally the same as bench-batch-affine.ts plus the
buildSliceLayout import and the phase1-specific synthetic-data
generation + CPU reference.
Phase 1 of the tree-reduce SMVP now passes correctness on local
Chromium WebGPU (SwiftShader): 20/20 outputs match the CPU reference
bit-for-bit on the small-N smoke (num_wgs=2, slice_entries=16).

Three real bugs found and fixed by getting local WebGPU into the
debug loop (via Playwright + chrome-headless-shell, no GPU on the
dev container so SwiftShader is used):

1. randomBelow consumed only the LOW BYTE of each rng() output. For
   the 32-bit LCG the low 8 bits cycle every 256 outputs, so a 32-byte
   randomBelow draw cycles every 8 calls — fatal when the caller
   builds a Set of distinct values. Fixed to consume the full 32 bits.
   Latent bug in bench-batch-affine.ts too; harmless there because the
   only check is `pxMont !== qxMont` on adjacent calls.

2. WGSL `get_p()` redeclared in smvp_tree_phase1.template.wgsl.
   Already provided by the `montgomery_product_funcs` partial.
   Removed the local definition.

3. Shader needs 10 storage buffers per stage; WebGPU's default cap is
   8. Adapter actually exposes 10+. get_device now requests the
   adapter max for `maxStorageBuffersPerShaderStage` alongside
   `maxComputeWorkgroupStorageSize`.

CPU reference rewritten to do all arithmetic in canonical (non-Mont)
form, then convert back to Mont for the diff against GPU output. The
prior Mont-form-in-place pass got the inverse semantics wrong:
fr_inv_by_a(dx_mont) returns inv_dx_canon * R^2 (a "double Mont"
form, picked because the subsequent montgomery_product strips one R
factor to give Mont-form slope), not inv_dx_canon * R as the original
reference assumed.

GPU bench wall-time: ~6.5ms for 32 entries / 20 outputs / 1 dispatch
on SwiftShader CPU-emulated WebGPU. Not a benchmark number — real
silicon will be 100× faster.
Phase 2 of the tree-reduce SMVP: recursive halving over partials.

Structurally identical to Phase 1 (same pair-detection state machine,
same Phase A/B/C/D batch-affine, same per-WG output write-out) but
takes `(bucket_id, AffinePoint)` tuples directly rather than
`(sign_bit | scalar_idx)` from the raw schedule + a separate
entry_bucket_id table. One less indirection, no sign flip.

Output schema matches Phase 1 so the recursion can rebind the same
buffers and just swap the input/output roles each layer.

Correctness gate: 19/19 outputs match CPU reference bit-for-bit on
the small smoke (num_wgs=2, slice_entries=16) on local SwiftShader.
GPU bench wall: 5.4ms (CPU-emulated WebGPU; M2 would be ~10× faster
based on Phase 1 readback).

Done definition for this step met.
…artial)

Drives Phase 1 → CPU sort → Phase 2 → CPU sort → Phase 2 → ... until
every bucket has one partial. CPU-side resort between phases (Step 4
is deferred to GPU follow-up — choice documented in module header).

Standalone bench-smvp-tree.{html,ts} compares the final per-bucket
partials against a CPU reference that computes the full sequential
sum per bucket directly.

Status:
  - Phase 1 alone: 1/1 buckets match (entries=2)
  - Phase 1 + 1× Phase 2 with mixed pair_result+unpaired input
    (entries=3): 1/1 buckets match
  - Phase 1 + 1× Phase 2 with two pair_result inputs (entries=4):
    1/1 MISMATCH

Repro: load `bench-smvp-tree.html?entries=4&buckets=1&seed=42` on
local SwiftShader Chromium. CPU reference matches the sequential-add
of 4 canonical points; orchestrator's Phase 2 output disagrees.
Phase 2 standalone test (against synthetic Mont-form pair-like
inputs) passes 19/19, so the bug must live in the boundary between
Phase 1's output buffers and Phase 2's input expectations — likely
a Mont-form / BigInt-stride mismatch that the standalone Phase 2
test wasn't hitting because its inputs are generated as fresh random
Mont values rather than the output of a previous batch-affine.

Next step in this debug path: instrument the orchestrator to print
the Phase 1 readback values and diff each (P_2k + P_2k+1) against
its corresponding CPU pair-add for entries=4. That narrows whether
Phase 1's emitted bytes are wrong vs. whether Phase 2 misreads them.

Step 6 (production swap) is unblocked from a structural standpoint
— if the Phase 1/2 chain is fed by the existing transpose +
bucket_start, the same bug surfaces and gives a concrete failing
Quick Sanity Check to triangulate with.
…5 validated)

The previous reference summed each bucket's points sequentially:
  ((P0+P1)+P2)+P3+...
which only matches the GPU's tree-reduce parenthesization
  (P0+P1)+(P2+P3)+...
when the inputs are on the EC group. The synthetic bench uses random
off-curve bigints (we test the algebraic affine-add formula, not the
group law), so the two orderings produce different bytes.

Fixed by walking each bucket via the same pair-detection state
machine the GPU uses, recursing layer-by-layer until one partial
remains. Bench passes 5/5 buckets bit-for-bit on local SwiftShader
(entries=40, buckets=5, seed=99) — including bucket=4 which has
pop=9 and recurses through 4 layers.

This validates the full Phase 1 → CPU sort → Phase 2 → CPU sort →
... chain. Step 5 correctness gate met.
The tree-reduce orchestrator (cuzk/smvp_tree.ts) is correctness-validated
standalone but not yet integrated into the production MSM pipeline.
This marker documents the integration checklist at the swap site so a
follow-up session can wire it in without re-discovering the contract.
Bumped to 256 + 200 entries / 12 buckets validated correctness OK on
local SwiftShader (5 layers, 0 mismatches, 140 ms wall) but BS macOS
Chrome 148 fails to compile the resulting shader within the worker's
initial-load window — either maxComputeWorkgroupStorageSize exceeded
or the static-bound pair_list loops blow out the WGSL compile budget.

Keeping 128 for the validated path (5/5 buckets bit-for-bit on M2
at entries=40). Scaling further is a follow-up that needs pair_list
hoisted to global memory + per-WG pair_count uniform sized for the
runtime count instead of MAX_PAIRS-bounded loop iterations.
…SWEET_B=1024

Phase 1/2 shaders rearchitected for thread utilization at the plan's
target SWEET_B=1024 batch-affine size. v1's two main flaws:

1. Per-thread O(MAX_PAIRS) scans for rank → raw_slot lookup AND
   backward search for prev PAIR's raw_slot in Phase D. At
   MAX_PAIRS=1024 that's 1024 idle iterations per thread per phase.

2. `pair_bucket` in workgroup memory inflated per-WG storage past the
   32 KiB cap, forcing MAX_SLICE_ENTRIES=128 and 8× more WGs than the
   plan called for.

v2 fixes both. Thread-0 preamble builds 4 workgroup-shared arrays in
ONE sequential pass:
- pair_idx_a, pair_idx_b: per-raw-slot (PAIR or UNPAIRED) input entry indices
- prev_raw_for_pair: per-raw-slot pointer to immediate prior PAIR's
  raw_slot (O(1) lookup in Phase D, no backward scan)
- rank_to_raw: per-PAIR-rank pointer to raw_slot (O(1) Phase A/D
  iteration over PER_THREAD_PAIRS, not MAX_PAIRS)
pair_bucket writes go straight to global `output_bucket_id` from the
preamble — never in workgroup memory.

Workgroup memory at MAX_PAIRS=1024 / TPB=64:
  4 × 4 KB (pair arrays) + 2 × 5.12 KB (wg_fwd/bwd) + ~80 B = 26.4 KB
fits in M2's 32 KiB cap.

Phase A/D inner loops now iterate exactly PER_THREAD_PAIRS = 16
times each (down from MAX_PAIRS = 1024 in v1). 64× fewer idle
iterations per thread per phase.

Validation on local SwiftShader (Chromium headless, no GPU on dev
container):
- Phase 1 standalone at 4096 entries / 8 WGs × 512 entries: 2057
  outputs, 0 mismatches, 6.5 ms median.
- Orchestrator at 2048 entries / 64 buckets: 64/64 buckets match
  full-reduce CPU reference bit-for-bit. 3 layers, 18.8 ms total GPU
  wall (10.0 + 5.5 + 3.3 across phase1 + phase2 layer2 + layer3).

Apple M2 should be ~10× faster (SwiftShader is CPU-emulated WebGPU).
Pending BS validation.
…y bucket-sorted

First-principles observation: Phase 1 / Phase 2 outputs are ALREADY
globally bucket-sorted. Input entry_bucket_id is monotone non-
decreasing (CSR layout); each WG walks its non-overlapping
contiguous slice left-to-right emitting in walk order; WG outputs
concatenated preserve monotonicity. No sort needed.

Removes the readback-of-points + JS sort + upload between every
phase. Saves O(N × NUM_LIMBS_U32 × 4) bytes of bus traffic + the
O(N log N) JS sort per layer × log layers.

Still does a small (4 B / partial) bucket-id readback to compute
per-WG pair-count + output offsets host-side. Asserts global sort
on the readback as a debug guard — cheap and catches partition
regressions.

Termination changed from "no more pair-adds possible" (required full
bucket-id scan) to "count equals input num_active_buckets" (known
from initial input). One bucket-id readback per phase, point data
never moves between phases.

Bench at 8192 entries / 256 buckets / 5 layers on local SwiftShader:
- 256/256 buckets match full-reduce CPU reference bit-for-bit
- GPU wall: 21.9 + 9.9 + 8.7 + 8.8 + 5.5 = 54.8 ms total

For comparison the prior CPU-sort version at 2048 entries / 64
buckets / 3 layers was 140 ms total. 4× scale, 0.4× time — ~10×
speedup from this change plus the v2 thread-utilization fix.

Bench entry cap raised from 512 → 2^18 (1 << 18) and bucket cap
from 64 → 2^14 so we can run real production-scale workloads.
…to finalize pipeline

Two small kernels that turn the orchestrator's sparse
(bucket_id, AffinePoint) outputs into the dense
(running_x, running_y, bucket_active) arrays the existing
finalize_collect → finalize_inverse → finalize_apply pipeline
expects. With these in place the production swap in msm.ts is
mechanical: replace the round-loop dispatch with
runTreeReduce + scatter_init + scatter, and re-use the finalize
chain unchanged for the affine→Jacobian + magnitude-bucket fold.

scatter_init: one thread per bucket slot, zeros running_x/y +
bucket_active across the full T*num_columns dense layout.

scatter: one thread per orchestrator output, writes
running_x[bucket_id]=P.x, running_y[bucket_id]=P.y,
bucket_active[bucket_id]=1.

Both kernels are trivially parallel (no atomics, no synchronisation
beyond the bucket_active write which is the only output ever
written by any thread for that bucket_id since the orchestrator's
output is unique-per-bucket).
…alize pipeline

`smvp_batch_affine_gpu_tree` is the production adapter that:
  1. Reads CSR row pointers from `all_csc_col_ptr_sb`, computes
     per-entry bucket id, uploads.
  2. Runs the v2 tree-reduce orchestrator (`runTreeReduce`).
  3. Inits the dense workspace (`running_x/y_sb`, `bucket_active_sb`)
     via `scatter_init` (one thread per bucket slot).
  4. Scatters the tree-reduce output (sparse, one per active bucket)
     into the dense workspace via `scatter` (one thread per output).
  5. Returns. Caller continues with the existing `finalize_collect` →
     `finalize_inverse` → `finalize_apply` chain unchanged for the
     affine→Jacobian conversion and the magnitude-bucket fold.

`buildTreeAdapterPipelines` compiles all four pipelines (phase1,
phase2, scatter, scatter_init) once per (num_words, max_slice_entries)
shape; cache the handle for the warm bench loop.

ShaderManager wiring for `gen_smvp_tree_scatter_shader` +
`gen_smvp_tree_scatter_init_shader` added alongside the existing
phase1/phase2 generators.

The actual msm.ts call-site swap is one more edit: replace the
current `smvp_batch_affine_gpu(...)` call with two calls — first
`smvp_batch_affine_gpu_tree(...)` to populate running_x/y +
bucket_active via tree-reduce, then the existing finalize chain.
That swap is mechanical now that the adapter is in place; pending
the Quick Sanity Check correctness gate.
Validates the tree-reduce's main perf claim from the plan: a heavily
skewed input (one bucket with pop = entries/2, the rest uniform) is
handled in O(log heavy_pop) layers regardless of skew.

Measured on Apple M2 via BS at entries=65536 / buckets=512 / skew=heavy
(heavy bucket pop = 32 832):
  layers: 16
  total GPU wall: 34.6 ms

For comparison the same input at skew=uniform (max pop ~256):
  layers: 6
  total GPU wall: 24.3 ms

Heavy skew → only 1.4× more time despite a bucket that the current
round-loop MSM would need ~32 832 sequential rounds to reduce. The
plan's "5–10× faster on heavy-bucket workloads" claim looks
conservative.

Bench page now accepts `?skew=heavy` and abbreviates the pops log
for runs with > 16 buckets.
Adds a `use_tree_reduce` flag-gated branch inside
smvp_batch_affine_gpu that swaps the round-loop for the v2
tree-reduce pipeline:
  init (existing) → entry_bucket_id (new) → tree-reduce (new) →
  scatter (new) → finalize_collect → finalize_inverse →
  finalize_apply (all existing, unchanged).

Wiring:
  - `compute_curve_msm` / `compute_bn254_msm_batch_affine` plumb
    `use_tree_reduce` through to smvp_batch_affine_gpu.
  - dev-page main.ts reads `?use_tree_reduce=1` and forwards it to
    the Quick Sanity Check path.
  - New `smvp_tree_entry_bucket_id` shader derives entry_bucket_id
    from the per-subtask CSR row-pointer layout
    (row_ptr[subtask*(num_columns+1) + bucket_local]). Per-subtask
    binary search; one thread per entry.
  - runTreeReduce no longer needs the bucketStart parameter (was
    already unused; removed cleanly).

State on local SwiftShader:
  - Stock sanity at logN=16: state=done, gpu.x prefix
    e04e8689dc4d92e6, 4.4 s wall.
  - Tree sanity at logN=16: state=done (no crash), but gpu.x prefix
    27e87ad6dbd157b6 — output disagrees with stock. Algorithm
    correctness for the per-bucket affine sums was validated
    standalone at 65 K entries / 1024 buckets on Apple M2
    (1024/1024 buckets bit-for-bit against CPU tree-reduce ref).
    So the divergence is somewhere in the production-layout bridge:
    most likely entry_bucket_id derivation against the real CSR
    (per-subtask layout), the cross-subtask slice alignment of
    Phase 1, or the scatter's bucket_global → workspace slot
    mapping interacting unexpectedly with init's seeding pass.

Pending follow-up: instrument running_x readback after the
tree-reduce + scatter path and diff slot-by-slot against the
stock path's running_x to localize the divergence to a bucket
range. The shaders are stable so once we narrow the failing
bucket the fix should be tight.
Gated behind window.__tree_debug = true. Dumps the first 32 entries
of the tree-reduce's derived entry_bucket_id plus the first / last
of the CSR row_ptr for subtask 0. Used to verify the per-subtask
binary-search kernel against the production CSR layout — confirmed
correct output for the logN=16 sanity input (num_columns=32768,
num_subtasks=18, input_size=65536, totalEntries=1179648).

The tree-reduce path runs to completion but produces a different
final MSM gpu.x than stock. Bug is somewhere after entry_bucket_id —
either in Phase 1/2 chain operating on the production CSR vs my
synthetic test layout, or in scatter's interaction with the
finalize stage's reads. Awaiting a follow-up debug pass with
per-bucket running_x diffing (needs splitting smvp_batch_affine_gpu
so we can intercept the buffer between init+scatter and finalize).
The v2 preamble had thread 0 do a 1024-op sequential pair-detection
state machine while 63 threads idled at workgroupBarrier — a 64x
thread-utilisation loss for the per-WG critical path. v3 distributes
the preamble across all TPB threads:

  Step 1: each thread loads PER_THREAD_ENTRIES = MAX_SLICE_ENTRIES/TPB
          buckets from its chunk and computes "last break position"
          locally.
  Step 2: TPB-wide Hillis-Steele max-scan reconstructs pos_in_run for
          every entry (log2(TPB) stages).
  Step 3: each thread determines emit / pair flags from pos_in_run
          parity and successor-bucket equality.
  Step 4: TPB-wide prefix-sum of per-thread emit + pair counts assigns
          raw_slot and pair_rank ranges per thread.
  Steps 5-6: each thread writes its pair_idx_a/b, rank_to_raw, and
          prev_raw_for_pair entries from its assigned ranges.

The new pair schedule is identical to the v2 greedy state machine (same
parity-based pairing within each contiguous same-bucket run, fresh
open=None per slice). Heavy-skew 65K/512 bench passes bit-for-bit on
SwiftShader.

Phase A is also tightened to load_point_x_only (Phase 1) or direct
input_x[idx] (Phase 2). y is not needed for dx = Q.x - P.x; skipping
the y reads halves Phase A's point-data bandwidth.

Adds run-bench-smvp-tree.mjs to drive the page locally without
BrowserStack for fast iteration.
Before this change, smvp_tree's entry_bucket_id (ebid) GPU kernel
dispatched in its own command encoder and was submitted before the
caller's commandEncoder ran transpose. ebid read all_csc_col_ptr_sb
in its current GPU state — which still held the PRIOR MSM call's CSR.
For warm-context benchmarks that's same data, but the persistent
buffer's stale data made debug runs (different scalars per call)
silently miss the bug. In the BS dump comparison with seeded scalars,
this manifested as tree's running_x activating buckets that init
didn't, plus subtle per-subtask sum drift.

Fix: record ebid into the caller's commandEncoder (after transpose
and ba_init), then finish + submit it so the GPU runs transpose
through ebid before runTreeReduce reads back entry_bucket_id. After
that, swap to a fresh commandEncoder for scatter + finalize so the
caller can continue recording BPR onto the new encoder.

Required signature change: smvp_batch_affine_gpu now takes a
commandEncoderRef wrapper so it can mutate the encoder mid-call;
msm.ts re-binds its local commandEncoder after the call returns.
Stock path is unaffected (it never swaps the ref).

Reduces production-integration mismatches from 18/18 subtasks to
4/18 (specific subtasks 2, 4, 6, 17 still drift bit-for-bit — likely
a separate Phase 2 cross-slice carry edge case to be investigated).
GPU entry_bucket_id is still validated bit-for-bit against the host
in the standalone bench's multi-subtask + GPU-ebid mode.

Also extends bench-smvp-tree.ts to drive multi-subtask synthetic
inputs through the GPU ebid kernel (matches production layout), and
adds a per-subtask SMVP fingerprint dump to compare stock vs tree
running_x without depending on warm-context stale data.
@AztecBot AztecBot changed the title feat(bb): BrowserStack-driven MSM-webgpu bench harness perf(bb): WebGPU MSM tree-reduce SMVP — parallel preamble + ebid timing fix May 17, 2026
AztecBot added 4 commits May 17, 2026 19:50
Adds msm-noble-direct autorun mode that bypasses WASM boot (which
requires the cpp wasm build that isn't built locally) and goes
straight GPU → noble cross-check. Runs the MSM 3-5 times in the same
session and reports both determinism and noble agreement.

Used to investigate the production-integration correctness gap and
found a separate underlying issue: stock WebGPU MSM at logN=16 produces
NON-DETERMINISTIC results across runs with identical inputs (seeded
scalars, identical SRS). Reproducible on both BS Apple M2 Chrome 148
AND local SwiftShader (CPU-emulated WebGPU which should be fully
deterministic). None of the 5 runs match the noble reference MSM.

The non-determinism reproduces with all of:
- The committed batch_affine + msm.ts (with commandEncoderRef change).
- A pre-commit revert of those two files (only the v3 parallel
  preamble + ebid timing fix removed; everything else still present).
- The default Karatsuba+Yuval Mont mult AND the legacy CIOS variant
  (added a `?mont_legacy=1` URL gate to A/B them).
- Fresh GPU context per call (`?fresh_ctx=1`).

So the non-determinism is in the stock SMVP pipeline itself on this
branch — not introduced by the tree-reduce work — and blocks any
stock-vs-tree comparison until the underlying bug is localized. The
debug code in this commit makes the bug visible in one autorun.

Also drops the temporary `?mont_legacy=1` debug from shader_manager
since it ruled out the Mont mult variant.
Adds ?zero_workspace=1 debug that clearBuffer's every persistent
SMVP workspace buffer (running_x/y, bucket_active, bucket_cursor,
pair_*, round_count, bucket_sum_*) at the start of every MSM call.
With this on, stock STILL produces 3 different gpu.x values across 3
runs in the same session on local SwiftShader. So the non-determinism
is not caused by previous-call state leaking through the persistent
buffer pool — it's a real algorithmic / WGSL race in the stock
pipeline on this branch.
Stock WebGPU MSM at logN=16 was producing non-deterministic gpu.x values
across runs and never matching the noble CPU reference. Bisected:

- sb/msm-webgpu branch (Suyash): 3x noble-direct = same gpu.x bit-for-bit
  AND matches noble.x.
- zw/msm-webgpu-mont-mul-bench (Zac): 3x noble-direct = three different
  gpu.x values, none match noble.

Reverting the SMVP-side files between sb and zw to the sb versions while
keeping Zac's Karatsuba+Yuval mont mult (rendered into mont_product_src
by shader_manager) restores the deterministic + correct result:

  3 runs of stock MSM, identical seeded inputs:
  gpu.x[0,1,2] = 0x235999aa…
  noble.x       = 0x235999aa…   ← match

Files reverted to sb/msm-webgpu:
  - src/msm_webgpu/cuzk/batch_affine.ts
  - src/msm_webgpu/cuzk/gpu.ts
  - src/msm_webgpu/msm.ts
  - src/msm_webgpu/wgsl/cuzk/batch_inverse_parallel.template.wgsl  (WPB pooling removed)
  - src/msm_webgpu/wgsl/cuzk/batch_affine_dispatch_args.template.wgsl
  - src/msm_webgpu/wgsl/cuzk/batch_inverse.template.wgsl
  - src/msm_webgpu/wgsl/cuzk/bpr_bn254.template.wgsl
  - src/msm_webgpu/wgsl/field/fr_pow.template.wgsl

Kept from zw/msm-webgpu-mont-mul-bench:
  - shader_manager.ts (Karat+Yuval mont mult rendering, BY inverse refs)
  - mont_pro_product_karat_yuval.template.wgsl (the faster mont mult)
  - by_inverse.template.wgsl + by_inverse_a.template.wgsl (BY inverse,
    available for callers that opt in but not used by the reverted
    batch_inverse_parallel — fr_inv_by_a stayed only in the tree-reduce
    path)
  - All smvp_tree_* shaders and orchestrator

The tree-reduce integration into batch_affine.ts (use_tree_reduce branch)
is dropped by this revert and needs to be re-applied on the sb-based
batch_affine.ts in a follow-up commit so the algorithm work shipped in
this PR can be re-enabled without breaking stock correctness.
…-based SMVP

Restores the tree-reduce SMVP variant on top of the deterministic
sb-baseline batch_affine.ts. The init dispatch and the three finalize
stages (collect → batch_inverse → apply) stay shared with the stock
path; only the per-bucket round loop is replaced.

Plumbing changes:

  - smvp_batch_affine_gpu now takes commandEncoderRef and a
    use_tree_reduce flag. Tree-reduce mid-flushes the encoder before
    runTreeReduce reads back entry_bucket_id, then swaps in a fresh
    encoder for scatter + finalize. Caller observes the swap via
    commandEncoderRef.current.
  - compute_curve_msm and compute_bn254_msm_batch_affine forward
    use_tree_reduce; msm.ts wraps its commandEncoder in a ref and
    rebinds after the smvp call.
  - get_device requests maxStorageBuffersPerShaderStage=10 and
    maxComputeWorkgroupStorageSize=32768 when the adapter supports
    them — phase1 binds 10 storage buffers and the workgroup scratch
    needs ~27 KB at TPB=64/MAX_SLICE=1024.

Validated correct + deterministic on local Chromium SwiftShader at
logN=16 (3x consecutive runs all match noble.x bit-for-bit).
@AztecBot AztecBot changed the title perf(bb): WebGPU MSM tree-reduce SMVP — parallel preamble + ebid timing fix feat(bb): tree-reduce SMVP variant + restore deterministic stock May 17, 2026
AztecBot added 4 commits May 18, 2026 00:48
Adds a dedicated WebGPU MSM bench page that runs the full
compute_bn254_msm_batch_affine pipeline N times for each variant
(stock and/or tree-reduce) and posts per-run timings + summary
(median/mean/min/max, deterministic-across-runs flag, cross-variant
gpu.x agreement) to the dev-server /results endpoint.

The page reuses GpuContext + CachedBases across runs (so the warm-path
cost is what gets measured), does one untimed pre-warm per variant
to amortise pipeline JIT, and skips the noble CPU reference entirely
(correctness is already verified via the noble-direct autorun on a
fast local Chromium).

run-browserstack.mjs gains --logn, --runs, --variants passthroughs
and a new --page bench-msm-variant entry, so this page can be driven
on Apple M2 (or any other BS preset) via:

  node dev/msm-webgpu/scripts/run-browserstack.mjs \
    --target macos --page bench-msm-variant \
    --logn 16 --runs 5 --variants stock,tree
…ase wall-clock

bench-msm-variant now sets `profile_capture: {}` on the last run of each
variant and prints per-family GPU profile aggregations. For the tree
variant, also surfaces runTreeReduce's per-layer wall-clock timings
(its inner encoders bypass the main Profiler's QuerySet) via a
globalThis dump that bench-msm-variant reads back into the JSONL.

Baseline-only — no algorithm changes. Lets the upcoming iteration loop
attribute time to ebid / each tree layer / scatter / finalize / BPR
so re-architecture decisions are profile-driven.
…nt buffers)

Eliminates the per-Phase-2-layer CPU readback chain in runTreeReduce.
All metadata (slice_bounds, wg_output_offset, layer_counts, indirect
dispatch args) now produced by GPU prelude+scan kernels. Entire tree
chain (ebid + count_active + prelude+scan+phase1 + N*(prelude+scan+phase2)
+ scatter_args + scatter) records into the caller's commandEncoder.

One submit per MSM. Persistent ping-pong buffers (no per-call alloc).

Validated bit-for-bit vs noble on local SwiftShader at logN=16.
Surfaces per-tree-kernel GPU timestamps on bench-msm-variant so the
180ms tree compute blob in __untimestamped becomes attributable per
(prelude / scan / phase1 / phase2 × N / scatter_args / count_active / ebid).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

claudebox Owned by claudebox. it can push to this PR.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant