feat(ecc): VectorField Fq Mont-mul + K=5 MSM batch_affine_add #23353
Draft
notnotraju wants to merge 2 commits into
Lifts the operator* WASM kernel body into vector_field_mont_mul_body.inl.hpp and stamps it out for both Bn254FrParams and Bn254FqParams. The macros (BB_VF_LOAD_LIMBS, BB_VF_KARATSUBA_STAGES_1_4, BB_VF_RUN_STAGES_6_THROUGH_10) already reference the unqualified names R_INV_WASM, P_WASM and R_INV_MOD_2_29. These resolve in each specialization's enclosing class scope to the matching Params constants, so the same body produces a correctly bound kernel per Params.

Each specialization stays explicit (rather than templating the body) so that LLVM emits each one as a standalone TU function, preserving the register scope that lets V8 reproduce the gist's hand-scheduled WAT layout.

New VectorFieldFqTest suite (9 tests) mirrors the Fr coverage for the operations exercised by curve arithmetic: ctor, add, sub, mul (150 random trials), eq, is_zero, distributivity, mul-by-one, type alias. Verified: ecc_tests 35/35 PASS both native and under wasmtime. Prereq for MSM-side q1s1 integration in subsequent PRs.
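The name-binding trick can be shown in miniature. This is a toy sketch under stated assumptions, not the PR's code: a macro (MONT_MUL_BODY) stands in for the shared .inl.hpp body, the two params structs (ParamsA, ParamsB) are hypothetical and use small 31-bit primes instead of the BN254 moduli, and a scalar 32-bit Montgomery reduction replaces the limb/Karatsuba stages. Because P_MOD and NEG_P_INV are left unqualified inside the body, each explicit specialization binds them to its own class-scope constants.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical params: small odd primes, NOT the BN254 moduli.
struct ParamsA {
    static constexpr uint64_t modulus = 2147483647ULL; // 2^31 - 1
};
struct ParamsB {
    static constexpr uint64_t modulus = 2013265921ULL; // 15 * 2^27 + 1
};

// Computes -p^{-1} mod 2^32 by Newton iteration (p must be odd).
constexpr uint64_t neg_inv_mod_2_32(uint64_t p)
{
    uint64_t inv = p; // p * p == 1 (mod 8), so inv starts correct to 3 bits
    for (int i = 0; i < 5; ++i) {
        inv = (inv * (2 - p * inv)) & 0xFFFFFFFFULL; // doubles the correct bits
    }
    return (0x100000000ULL - inv) & 0xFFFFFFFFULL;
}

// Stand-in for the shared .inl.hpp body. P_MOD and NEG_P_INV are deliberately
// unqualified: they bind to whatever class scope the body is expanded into.
// Montgomery reduction with R = 2^32; inputs a, b must be < P_MOD < 2^31.
#define MONT_MUL_BODY                                                     \
    const uint64_t t = a * b;                                             \
    const uint64_t m = ((t & 0xFFFFFFFFULL) * NEG_P_INV) & 0xFFFFFFFFULL; \
    const uint64_t r = (t + m * P_MOD) >> 32;                             \
    return r >= P_MOD ? r - P_MOD : r;

template <typename Params> struct MontField;

// Two explicit specializations stamped from the same body.
template <> struct MontField<ParamsA> {
    static constexpr uint64_t P_MOD = ParamsA::modulus;
    static constexpr uint64_t NEG_P_INV = neg_inv_mod_2_32(P_MOD);
    static uint64_t mont_mul(uint64_t a, uint64_t b) { MONT_MUL_BODY }
};

template <> struct MontField<ParamsB> {
    static constexpr uint64_t P_MOD = ParamsB::modulus;
    static constexpr uint64_t NEG_P_INV = neg_inv_mod_2_32(P_MOD);
    static uint64_t mont_mul(uint64_t a, uint64_t b) { MONT_MUL_BODY }
};

// Returns true if F::mont_mul(a, b) == a * b * 2^-32 (mod P), checked by
// multiplying the result back by 2^32 mod P.
template <typename F> bool check(uint64_t a, uint64_t b)
{
    const uint64_t p = F::P_MOD;
    const uint64_t r_mod_p = (1ULL << 32) % p;
    const uint64_t lhs = (F::mont_mul(a % p, b % p) * r_mod_p) % p;
    return lhs == ((a % p) * (b % p)) % p;
}
```

For example, check<MontField<ParamsA>>(123456789, 987654321) and the same call on MontField<ParamsB> both hold, even though the two specializations share one body and end up with different NEG_P_INV values.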
Width-5 fast path for batch_affine_add_interleaved, using the VectorField<Bn254FqParams> Mont-mul from the prior commit. Runs 5 independent batch-inversion chains in parallel and collapses each pass's N scalar muls into N/5 width-5 vector muls (asymptotically ~5×).

Dispatch: __wasm_simd128__ && Fq == bb::fq && num_points >= 20. Below the threshold, or on native, it falls through to the original K=1 path unchanged.

Snapshot-before-write per group: the output slot for one lane can alias the input slot of a later lane in the same group; buffering all 5 lanes' reads before any writes prevents y3 corruption at large N.

Tests: ecc_tests 37/37 PASS native and under wasmtime (the K=5 path is exercised under wasmtime).
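The K-chain idea can be sketched as Montgomery's batch-inversion trick run as K independent chains. This is a scalar emulation under stated assumptions: a toy prime (2^31 - 1) replaces BN254 Fq, plain % reduction replaces Mont-mul, and batch_invert_k is a hypothetical name. Element i is assigned to lane i % K, so each run of K consecutive elements touches every lane once; in a SIMD kernel each such run would map onto one width-K vector mul. The price is K field inversions instead of one, amortized over the whole batch.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy prime field modulus (2^31 - 1); a stand-in for BN254 Fq, NOT the real one.
constexpr uint64_t P = 2147483647ULL;

// Plain modular multiply; operands are < 2^31, so the product fits in 64 bits.
uint64_t mul_mod(uint64_t a, uint64_t b) { return (a * b) % P; }

// Fermat inversion x^(P-2) mod P; x must be non-zero mod P.
uint64_t inv_mod(uint64_t x)
{
    uint64_t result = 1;
    uint64_t base = x % P;
    uint64_t e = P - 2;
    while (e != 0) {
        if (e & 1) {
            result = mul_mod(result, base);
        }
        base = mul_mod(base, base);
        e >>= 1;
    }
    return result;
}

// Hypothetical helper: inverts every element of xs using K independent
// batch-inversion chains (element i belongs to lane i % K).
template <std::size_t K> std::vector<uint64_t> batch_invert_k(const std::vector<uint64_t>& xs)
{
    const std::size_t n = xs.size();
    std::vector<uint64_t> prefix(n), out(n);
    std::array<uint64_t, K> acc;
    acc.fill(1);
    // Forward pass: prefix[i] = product of the earlier elements in i's lane.
    for (std::size_t i = 0; i < n; ++i) {
        prefix[i] = acc[i % K];
        acc[i % K] = mul_mod(acc[i % K], xs[i]);
    }
    // One inversion per lane: K inversions in total instead of 1.
    for (auto& a : acc) {
        a = inv_mod(a);
    }
    // Backward pass: acc[lane] holds 1 / (product up to and including xs[i]),
    // so multiplying by prefix[i] isolates 1 / xs[i].
    for (std::size_t i = n; i-- > 0;) {
        out[i] = mul_mod(acc[i % K], prefix[i]);
        acc[i % K] = mul_mod(acc[i % K], xs[i]);
    }
    return out;
}
```

With K = 1 the function reduces to the ordinary single-chain trick, which gives an easy cross-check: batch_invert_k<1>(xs) and batch_invert_k<5>(xs) should return identical vectors.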
Stacked on top of #23210 (rk/wasm-simd-03-accumulator).
Two commits:
1. VectorField Fq Mont-mul specialization: extracts the Mont-mul body into vector_field_mont_mul_body.inl.hpp and adds an explicit specialization for Bn254FqParams alongside the existing Bn254FrParams one. Each specialization remains a separate TU function (this preserves register scope, so V8 reproduces the gist's hand-scheduled WAT). 9 new VectorFieldFqTest cases mirror the Fr coverage.
2. K=5 q1s1 path in batch_affine_add_interleaved: uses the new Fq specialization to run 5 independent batch-inversion chains in parallel through MSM's affine-add inner loop. Per group of 5 pairs (10 points), 30 scalar muls collapse to 6 width-5 vector muls (+ 12 amortized split-tree muls), an asymptotic ~5× kernel speedup on the mul work.
   Dispatch: __wasm_simd128__ && Fq == bb::fq && num_points >= 20. Below the threshold, on native, or on non-BN254 curves, it falls through to the original K=1 path unchanged.
   Includes snapshot-before-write logic: the output slot for one lane can alias the input slot of a later lane in the same group (typical for large MSM bucket sums); buffering all 5 lanes' reads before any writes prevents y3 corruption.
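The aliasing hazard that snapshot-before-write guards against can be reproduced with plain integers. In this sketch, integer addition stands in for the affine point add, Group and both process_group_* functions are hypothetical, and the slot layout (lane j writes the slot that lane j+1 reads first) is chosen to trigger the alias rather than taken from the MSM code.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical group descriptor: lane j reads slots in_a[j] and in_b[j]
// and writes slot out[j].
struct Group {
    std::array<std::size_t, 5> in_a, in_b, out;
};

// Unsafe: lane j is written before lane j+1 is read, so if out[j] aliases an
// input slot of a later lane, that lane sees the overwritten value.
void process_group_in_place(std::vector<long>& buf, const Group& g)
{
    for (std::size_t j = 0; j < 5; ++j) {
        buf[g.out[j]] = buf[g.in_a[j]] + buf[g.in_b[j]];
    }
}

// Safe: snapshot all five lanes' inputs first, then perform all writes.
void process_group_snapshot(std::vector<long>& buf, const Group& g)
{
    std::array<long, 5> a{}, b{};
    for (std::size_t j = 0; j < 5; ++j) {
        a[j] = buf[g.in_a[j]];
        b[j] = buf[g.in_b[j]];
    }
    for (std::size_t j = 0; j < 5; ++j) {
        buf[g.out[j]] = a[j] + b[j];
    }
}
```

Starting from {1, 2, 4, ..., 512, 0} with inputs (0,1), (2,3), ..., (8,9) and outputs 2, 4, 6, 8, 10, the snapshot version writes 3, 12, 48, 192, 768, while the in-place version lets lane 0's write leak into lane 1's input and every lane after it (3, 11, 43, 171, 683).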
Why this exists
The V8 chonk breakdown shows MSM evaluate_work_units is ~50% of WASM proving time, and batch_affine_add_interleaved is its workhorse. Artem's PR #23004 hits the same surface at width 2 via a paired-fp51 Mont-mul; per the Slack microbench discussion, the q1s1 (5-wide) kernel wins per-mul by ~50% over fp51 at width ≥ 4. This PR is the first consumer of that width advantage in MSM. It is cross-engine deterministic (integer SIMD, not relaxed-SIMD), so it avoids the Edge 147 / Safari class of bugs.

End-to-end measurement to follow (microbench + chonk under V8/Node + BrowserStack matrix). Marking draft.
Tests
Stack