feat(ecc): VectorField Fq Mont-mul + K=5 MSM batch_affine_add #23353

Draft
notnotraju wants to merge 2 commits into rk/wasm-simd-03-accumulator from rk/wasm-simd-04-fq-mont-mul

@notnotraju

Stacked on top of #23210 (rk/wasm-simd-03-accumulator).

Two commits:

  1. VectorField Fq Mont-mul specialization: extracts the Mont-mul body into vector_field_mont_mul_body.inl.hpp and adds an explicit specialization for Bn254FqParams alongside the existing Bn254FrParams one. Each specialization remains a separate TU function, which preserves the register scope that lets V8 reproduce the gist's hand-scheduled WAT. 9 new VectorFieldFqTest cases mirror the Fr coverage.

  2. K=5 q1s1 path in batch_affine_add_interleaved — uses the new Fq specialization to run 5 independent batch-inversion chains in parallel through MSM's affine-add inner loop. Per group of 5 pairs (10 points), 30 scalar muls collapse to 6 width-5 vec muls (+ 12 amortized split-tree muls). Asymptotic ~5× kernel speedup on the mul work.

    Dispatch: __wasm_simd128__ && Fq == bb::fq && num_points >= 20. Below threshold, on native, or on non-BN254 curves: falls through to the original K=1 path unchanged.

    Includes snapshot-before-write logic: output slot for one lane can alias the input slot of a later lane in the same group (typical for large MSM bucket sums); buffering all 5 lanes' reads before any writes prevents y3 corruption.

Why this exists

The V8 chonk breakdown shows MSM evaluate_work_units is ~50% of WASM proving time. batch_affine_add_interleaved is its workhorse. Artem's PR #23004 hits the same surface at width-2 via paired-fp51 Mont-mul; per the Slack microbench discussion, the q1s1 (5-wide) kernel wins per-mul by ~50% over fp51 at width ≥ 4. This PR is the first consumer of that width advantage in MSM. The kernel is cross-engine deterministic (integer SIMD, not relaxed-SIMD), which avoids the Edge 147 / Safari class of bugs.

End-to-end measurement to follow (microbench + chonk under V8/Node + BrowserStack matrix). Marking draft.

Tests

  • Native ecc_tests: 865/865 PASS (K=5 dormant; K=1 fallback intact)
  • WASM ecc_tests under wasmtime: 865/905 PASS, 40 SKIPPED (intentional), 0 FAILED — K=5 actively exercised

Stack

  • rk/wasm-simd-01-vector-field → rk/wasm-simd-02-vectorized-for → rk/wasm-simd-03-accumulator → this PR

Lifts the operator* WASM kernel body into vector_field_mont_mul_body.inl.hpp
and stamps it for both Bn254FrParams and Bn254FqParams. The macros
(BB_VF_LOAD_LIMBS, BB_VF_KARATSUBA_STAGES_1_4, BB_VF_RUN_STAGES_6_THROUGH_10)
already reference unqualified R_INV_WASM / P_WASM / R_INV_MOD_2_29 — those
resolve in each specialization's enclosing class scope to the appropriate
Params constants, so the same body produces a correctly-bound kernel per
Params.
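The scope-binding trick described above can be illustrated with a minimal sketch. Everything here is a hypothetical stand-in: DEMO_REDUCE_BODY, DemoField, ParamsA/ParamsB and MODULUS are illustration names, not the real BB_VF_* macros or Params types, and the "kernel" is a plain modular reduction rather than a Mont-mul.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical shared body: it names MODULUS without qualification, so the
// name is looked up in whatever class scope the macro is expanded in.
#define DEMO_REDUCE_BODY(x) ((x) % MODULUS)

struct ParamsA { static constexpr std::uint64_t modulus = 97; };
struct ParamsB { static constexpr std::uint64_t modulus = 101; };

template <typename Params> struct DemoField;

// Explicit specialization 1: MODULUS binds to ParamsA's constant.
template <> struct DemoField<ParamsA> {
    static constexpr std::uint64_t MODULUS = ParamsA::modulus;
    static std::uint64_t reduce(std::uint64_t x) { return DEMO_REDUCE_BODY(x); }
};

// Explicit specialization 2: same body text, different binding.
template <> struct DemoField<ParamsB> {
    static constexpr std::uint64_t MODULUS = ParamsB::modulus;
    static std::uint64_t reduce(std::uint64_t x) { return DEMO_REDUCE_BODY(x); }
};

// Usage: the identical body yields a different result per specialization.
inline void check_bindings() {
    assert(DemoField<ParamsA>::reduce(200) == 6);  // 200 mod 97
    assert(DemoField<ParamsB>::reduce(200) == 99); // 200 mod 101
}
```

The sketch only shows the name-lookup behaviour; it says nothing about how the real macros schedule the Karatsuba stages.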

Each specialization remains explicit (rather than templating the body) so
LLVM emits each as a standalone TU function, preserving the register-scope
that lets V8 reproduce the gist's hand-scheduled WAT layout.

New VectorFieldFqTest suite (9 tests) mirrors the Fr coverage for the
operations exercised by curve arithmetic: ctor, add, sub, mul (150 random
trials), eq, is_zero, distributivity, mul-by-one, type alias. Verified
native ecc_tests 35/35 and wasm ecc_tests under wasmtime 35/35 PASS.

Prereq for MSM-side q1s1 integration in subsequent PRs.

Width-5 fast path for batch_affine_add_interleaved, using the
VectorField<Bn254FqParams> Mont-mul from the prior commit. Runs 5
independent batch-inversion chains in parallel, collapses each pass's
N scalar muls into N/5 width-5 vec muls (asymptotic ~5×).
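As a rough illustration of running K independent batch-inversion chains side by side, here is a toy sketch of Montgomery's batch-inversion trick split across 5 lanes. It is not the PR's kernel: the field is a small toy prime rather than the BN254 base field, lane_mul is a scalar loop standing in for a width-5 vector multiply, each lane gets its own inversion (the PR's split-tree handling is not modelled), and all names are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::uint64_t P = 1000003; // toy prime, NOT the BN254 base field
constexpr std::size_t K = 5;         // number of interleaved chains
using Lanes = std::array<std::uint64_t, K>;

std::uint64_t mul_mod(std::uint64_t a, std::uint64_t b) { return (a * b) % P; }

// Modular inverse via Fermat's little theorem: a^(P-2) mod P.
std::uint64_t inv_mod(std::uint64_t a) {
    std::uint64_t result = 1, base = a % P, exp = P - 2;
    while (exp > 0) {
        if (exp & 1) result = mul_mod(result, base);
        base = mul_mod(base, base);
        exp >>= 1;
    }
    return result;
}

// Stand-in for one width-K vector multiply: K independent lane products.
Lanes lane_mul(const Lanes& a, const Lanes& b) {
    Lanes out{};
    for (std::size_t k = 0; k < K; ++k) out[k] = mul_mod(a[k], b[k]);
    return out;
}

Lanes load_group(const std::vector<std::uint64_t>& v, std::size_t g) {
    Lanes out{};
    for (std::size_t k = 0; k < K; ++k) out[k] = v[g * K + k];
    return out;
}

// Inverts every element in place. Element i belongs to chain i % K.
// Requires a size that is a multiple of K and no zero elements.
void batch_invert_k_lanes(std::vector<std::uint64_t>& values) {
    assert(values.size() % K == 0);
    const std::size_t groups = values.size() / K;
    std::vector<Lanes> prefix(groups);
    Lanes acc;
    acc.fill(1);
    // Forward pass: one lane multiply per group builds K running products.
    for (std::size_t g = 0; g < groups; ++g) {
        prefix[g] = acc; // product of each chain's earlier elements
        acc = lane_mul(acc, load_group(values, g));
    }
    // One inversion per chain (simplification for this sketch).
    for (std::size_t k = 0; k < K; ++k) acc[k] = inv_mod(acc[k]);
    // Backward pass: two lane multiplies per group peel off the inverses.
    for (std::size_t g = groups; g-- > 0;) {
        const Lanes cur = load_group(values, g);
        const Lanes inv = lane_mul(acc, prefix[g]); // 1 / cur, per lane
        acc = lane_mul(acc, cur);                   // drop cur from the chain
        for (std::size_t k = 0; k < K; ++k) values[g * K + k] = inv[k];
    }
}
```

In this toy version each group of 5 elements costs 3 lane multiplies instead of 15 scalar ones, which is where the asymptotic factor of about 5 comes from; the PR's exact per-group counts also include the affine-add arithmetic, which is not modelled here.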

Dispatch: __wasm_simd128__ && Fq == bb::fq && num_points >= 20. Below
threshold or on native, falls through to the original K=1 path unchanged.
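The dispatch rule could be sketched roughly as follows. Only the `__wasm_simd128__` macro (predefined by Clang when targeting WASM with SIMD enabled) and the threshold of 20 come from the description above; ToyBn254Fq, ToyOtherFq, AddPath and select_batch_add_path are hypothetical names, and the real code may structure the check differently.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

struct ToyBn254Fq {}; // hypothetical stand-in for bb::fq
struct ToyOtherFq {}; // hypothetical stand-in for any other base field

constexpr std::size_t K5_MIN_POINTS = 20; // threshold quoted in the commit message

enum class AddPath { K1, K5 };

// Returns which path a call with this field type and point count would take.
template <typename Fq>
constexpr AddPath select_batch_add_path([[maybe_unused]] std::size_t num_points) {
#if defined(__wasm_simd128__)
    // Only reachable in WASM builds compiled with SIMD support.
    if constexpr (std::is_same_v<Fq, ToyBn254Fq>) {
        if (num_points >= K5_MIN_POINTS) {
            return AddPath::K5;
        }
    }
#endif
    // Native builds, other fields, and small inputs keep the original path.
    return AddPath::K1;
}

// Usage: these hold on every target, with or without WASM SIMD.
inline void check_dispatch() {
    assert(select_batch_add_path<ToyOtherFq>(1000) == AddPath::K1);
    assert(select_batch_add_path<ToyBn254Fq>(19) == AddPath::K1);
}
```

Keeping the field-type test in `if constexpr` means the K=5 branch is not even instantiated for other curves, which is one way to leave their code paths untouched.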

Snapshot-before-write per group: output slot for one lane can alias the
input slot of a later lane in the same group; buffering all 5 lanes'
reads before any writes prevents y3 corruption at large N.
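The aliasing hazard can be reproduced with a toy model. In this sketch a "point" is just a `long`, the "add" is integer addition, and LaneSlots, process_group_naive and process_group_snapshot are hypothetical names; the slot layout is contrived so that each lane's output slot is the next lane's input slot.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t K = 5;

// Slot indices one lane reads from and writes to.
struct LaneSlots {
    std::size_t lhs;
    std::size_t rhs;
    std::size_t out;
};

// Interleaved read/write: a later lane may read a slot that an earlier lane
// in the same group has already overwritten.
void process_group_naive(std::vector<long>& slots, const std::array<LaneSlots, K>& lanes) {
    for (std::size_t k = 0; k < K; ++k) {
        slots[lanes[k].out] = slots[lanes[k].lhs] + slots[lanes[k].rhs];
    }
}

// Snapshot-before-write: buffer every lane's inputs, then do all the writes.
void process_group_snapshot(std::vector<long>& slots, const std::array<LaneSlots, K>& lanes) {
    std::array<long, K> lhs{};
    std::array<long, K> rhs{};
    for (std::size_t k = 0; k < K; ++k) {
        lhs[k] = slots[lanes[k].lhs];
        rhs[k] = slots[lanes[k].rhs];
    }
    for (std::size_t k = 0; k < K; ++k) {
        slots[lanes[k].out] = lhs[k] + rhs[k];
    }
}

// Contrived layout: lane k reads slots (2k, 2k+1) and writes slot 2(k+1) mod 10.
constexpr std::array<LaneSlots, K> kAliasedLanes = {
    { { 0, 1, 2 }, { 2, 3, 4 }, { 4, 5, 6 }, { 6, 7, 8 }, { 8, 9, 0 } }
};
```

With ten slots all holding 1, the snapshot version writes 2 into every output slot, while the naive version lets each overwritten value leak into the next lane's sum.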

Tests: ecc_tests 37/37 PASS native + wasmtime (K=5 exercised under wasmtime).