
perf(dfast): close level 3 parity gap — compress 11.0× → 5.5× of donor #145

Closed
polaz wants to merge 30 commits into main from perf/#111-phase7c-dfast-level3-memory-tail

Conversation

@polaz (Member) commented May 16, 2026

Summary

Closes the bulk of Phase 7c parity work on dfast level 3 across two of three axes; decompress and medium/large memory are left for a follow-up PR. Picked up a cross-level small-input perf win while addressing review feedback — the FSE default-table cache (commit f928ef4) shaved ~38 µs off FrameCompressor::new for every compression call regardless of level.

Honest framing on the "Compressed size" column below: our output is 7% smaller than donor on the decodecorpus-z000033 corpus. That is NOT confirmation that we beat donor — it is more likely a signal of algorithmic divergence (we are doing extra match-finding work the donor's fast path skips), which is consistent with us still being 5.5× slower. The deviation may turn out to be the cause of the speed gap, not a separate win. To confirm what donor would have emitted on the same input we still need a per-sequence diff harness (see the "Deferred" section).

Cumulative deltas (compress/level_3_dfast/decodecorpus-z000033, 1 022 035 B input)

| Axis | Before this PR | After this PR | Donor (FFI 1.5.7) | Status |
| --- | --- | --- | --- | --- |
| Compress time | 32.07 ms (11.0×) | 16.11 ms (5.5×) | 2.94 ms | −50%, closed |
| Compressed size | 556 121 B (+5.5%) | 489 846 B (−7.07%) | 527 148 B | diverged; likely a cost source, not a win — see note above |
| Decompress time | 4.43 ms (5.3×) | 4.43 ms | 0.84 ms | deferred |
| Memory (small ≤4 KiB) | 378 KiB | 249 KiB (−34%) | 178 KiB | partial close |
| Memory (medium 1 MiB) | 5.01 MiB | unchanged | 3.80 MiB | deferred |
| Memory (large 100 MiB stream) | 9.92 MiB | unchanged | 3.80 MiB | deferred |

Small-input speedup (FSE cache, all levels)

FrameCompressor::new used to rebuild the three predefined LL/ML/OF FSE tables on every call (~12 µs per build_table_from_probabilities × 3 ≈ 38 µs). The fix landed in two stages during the review rounds on this PR:

  1. Cache the tables in process-local AtomicPtr<FSETable> and clone on each ctor (commit f928ef4) — drops the build cost but pays ~4-5 µs per clone of three 256-element SymbolStates arrays.
  2. Switch the cache return type to &'static FSETable and store the static reference inside FseTables directly (commit 0338876) — the clone disappears and the field becomes a pointer-sized copy (see the sketch below).
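
A minimal sketch of the stage-2 shape, with hypothetical names (`Table`, `build_default_table`, `cached_default_ll`) standing in for the real `FSETable` machinery — the table is built once, intentionally leaked, and handed out as `&'static` thereafter:

```rust
use core::ptr;
use core::sync::atomic::{AtomicPtr, Ordering};

// Hypothetical stand-in for the real FSETable.
struct Table {
    // symbol states, spread symbols, ...
}

// Stand-in for build_table_from_probabilities on a predefined distribution.
fn build_default_table() -> Table {
    Table {}
}

// One slot per predefined distribution (LL / ML / OF in the real code).
static DEFAULT_LL: AtomicPtr<Table> = AtomicPtr::new(ptr::null_mut());

fn cached_default_ll() -> &'static Table {
    let seen = DEFAULT_LL.load(Ordering::Acquire);
    if !seen.is_null() {
        // Fast path: built once already; every later constructor gets a pointer-sized &'static.
        return unsafe { &*seen };
    }
    let fresh = Box::into_raw(Box::new(build_default_table()));
    match DEFAULT_LL.compare_exchange(ptr::null_mut(), fresh, Ordering::AcqRel, Ordering::Acquire) {
        Ok(_) => unsafe { &*fresh },
        Err(winner) => unsafe {
            // Lost the construction race: free our copy and use the winner's table.
            drop(Box::from_raw(fresh));
            &*winner
        },
    }
}
```

The leak is deliberate: the table lives for the rest of the process, which is what makes the &'static return sound; losing the race costs one discarded build.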

Effect on small-1k-random/matrix/pure_rust at level 2 dfast measured locally on darwin/aarch64 with criterion:

| | Initial baseline | After f928ef4 (cache + clone) | After 0338876 (&'static) | FFI 1.5.7 |
| --- | --- | --- | --- | --- |
| compress_slice_to_vec | 44.25 µs (22.07 MiB/s, 13.4× FFI) | 12.67 µs (77.11 MiB/s, 3.9× FFI) | 6.08 µs (160.75 MiB/s, 1.83× FFI) | 3.31 µs (298.24 MiB/s) |

7.3× total compress-time speedup on the small-input scenario. Applies to every level — FseTables::new() runs inside the constructor regardless of strategy. The residual ~2.8 µs gap to FFI is per-frame setup not covered by this change (CompressedBlockScratch::default() Vec allocations, MatchGeneratorDriver Simple-variant init then reset to the level's backend, frame header / hash setup); deferred to a follow-up.

Per-scenario sanity at level_3_dfast (no regressions)

| Scenario | Rust (this PR) | FFI 1.5.7 | Δ |
| --- | --- | --- | --- |
| small-1k-random | 1 038 B | 1 038 B | = |
| small-10k-random | 10 254 B | 10 254 B | = |
| small-4k-log-lines | 150 B | 156 B | −4% |
| decodecorpus-z000033 | 489 846 B | 527 148 B | −7% |
| high-entropy-1m | 1 048 616 B | 1 048 613 B | +3 B (noise) |
| low-entropy-1m | 150 B | 159 B | −6% |

Structural changes (30 commits)

dfast matcher rewrite (Phase 7c core)

  1. Donor outer/inner loop rewrite of start_matching_fast_loop (78f51da5):
    • Outer while per match, inner loop per scan position carrying hl0/idxl0/hl1/idxl1 between iterations
    • Two-position lookahead (hl1 precomputed, reused on miss as next iter's hl0)
    • Hash table updated at curr BEFORE match check (donor zstd_double_fast.c:187)
    • Short check is 4-byte gate only (donor MEM_read32); forward common_prefix_len runs only on hit
    • _search_next_long retry with precomputed idxl1 follows donor's preferred-longer-match policy
    • Probes inlined into hot loop via raw pointer arithmetic; probe_slot_match is kept for cold callers
  2. MIN_MATCH 6→5 (c6eb5a83) — donor clevels.h:31 default level 3 dfast mls=5; we were silently dropping every 5-byte match.
  3. Scalar mul hash (67a47858) — drop the CRC32d kernel dispatch on the dfast hot path; donor ZSTD_hash8 is one multiply (see the sketch after this list).
  4. Repcode peek at ip+1 (5372ae99) — donor zstd_double_fast.c:190 unconditional peek.
  5. inline(always) on probe chain (233603fd) — collapse best_match → hash_candidate → probe_slot_match so &self field loads hoist into the hot loop.
  6. Skip-step grows by distance (8e72d58b) — donor parity; step grows every kStepIncr positions traveled, not on consecutive-miss threshold.
  7. rep1-only inline peek (1e6d5dd5) — donor zstd_double_fast.c inline hot path checks only offset_1; full rep0/rep1/rep2 walk lives in lazy/btopt. Speed −16%, size also −494 B (fewer rep2/rep3 picks displacing better hash matches).
  8. LTO=fat + codegen-units=1 scoped to [profile.bench] (70f966b5, bac47a6) — full cross-crate inlining for benches without forcing 5–10× release build-time on cli and downstream consumers.
  9. all_blocks adaptive seed in FrameCompressor::compress (079615c0) — scale initial accumulator capacity by source-size hint; saves ~130 KiB peak for ≤4 KiB inputs without regression on medium/large.
  10. Fast-loop OOB fix (ef906f7) — tighten fast-loop guards to HASH_READ_SIZE = 8. The previous + DFAST_MIN_MATCH_LEN (5) guard let the unconditional 8-byte u64::read_unaligned for the long-hash probe read up to 3 bytes past the live history end. Linux fuzz ASan caught it on a 7-byte input (crash-01be...); regression test added in tests/fuzz_regressions.rs.
  11. Fast-loop rebase (87a62e3) — call ensure_room_for at fast-loop entry. The inner loop's packed_curr cast (abs_ip0 - position_base) as u32 could truncate after a long stream of all-miss / non-hashable blocks (fast-loop bypasses insert_position's rebase).
  12. trim_to_window reclaims prefix (87a62e3) — append compact_history() after the eviction loop so an explicit trim actually frees memory instead of leaving the dead prefix resident until the next add_data.
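
Item 3's hash, as a sketch: the prime and shift width follow the donor ZSTD_hash8 shape quoted in the commit messages further down, while the function name here is illustrative.

```rust
// Donor ZSTD_hash8 shape: one 64-bit multiply, keep the top `bits` bits.
const PRIME_8_BYTES: u64 = 0xCF1B_BCDC_B7A5_6463;

#[inline(always)]
fn long_hash_index(eight_bytes: u64, bits: u32) -> usize {
    (eight_bytes.wrapping_mul(PRIME_8_BYTES) >> (64 - bits)) as usize
}
```

The input is a single 8-byte unaligned load at the scan position, so the whole hash is one load, one multiply, and one shift.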

Cross-level perf + parity refinements (review-driven, recent rounds)

  1. FSE default-table cache (f928ef4) — three predefined LL/ML/OF tables now cached in per-helper static AtomicPtr<FSETable>. FrameCompressor::new clones from the cache instead of rebuilding (~38 µs → ~3 µs). Affects every level on every call. core::sync::atomic works in both std and no_std builds; cache slot is selected by the static identity of each helper, not by pointer comparison on the distribution slice (the const slices don't have a stable address guarantee).
  2. Tail-position probe (f9d36b1) — added probe_tail_ip0_only, called when the outer iter sees ip0 still hashable but ip1 past the end (boundary pos == current_len - HASH_READ_SIZE). Previously the outer guard broke unconditionally there and left the last hashable position to the seeder, which inserts but does NOT search — a real match in the tail window of a short block would be dropped. Mirrors the inner loop's long+short probe at ip0; the rep peek and _search_next_long retry are skipped (both depend on ip1), matching donor's iend-boundary tradeoff.
  3. Short-hash floor enforcement (f9d36b1) — the 4-byte equality gate plus forward extension can produce match_len < 5. Previously the fast loop committed below-floor short hits; now short_cand.match_len >= DFAST_MIN_MATCH_LEN is checked before committing, and the _search_next_long retry can still upgrade to a valid long match. This matches the floor probe_slot_match already enforces on the long-hash main path; emitting a 4-byte non-rep match would mint a 13–17-bit offset that costs more than the payload buys.
  4. Inner-loop exit shape (900f2114) — replaced the implicit Option<MatchCandidate> + sibling tail_seed_anchor pairing with an explicit enum InnerExit { Committed(MatchCandidate), Tail(usize) }. Every break 'inner now picks a variant; the post-loop match is exhaustive (sketched after this list).
  5. Probe equality gate width (ecd2bfd7) — probe_slot_match takes a gate_len parameter (8 for long-hash callers, 4 for short-hash). Replaces the prior 1-byte precheck which let 8-byte-collision short matches slip past the gate.
  6. Empty-block guards (ecd2bfd7) — skip_matching / skip_matching_dense / start_matching early-return on current_len == 0. add_data short-circuits empty input without pushing to window_blocks; without the guard these methods would re-seed the previous block's retained tail on every empty flush.
  7. trim_to_window Dfast signature (0cbb3004) — drops the unused _reuse_space: impl FnMut(Vec<u8>) parameter. Dfast's history-only storage has no per-block Vec to recycle; saves the cold-path monomorphization that would otherwise inline an empty closure.
  8. Doc + soundness polish (0cbb3004) — compress_slice_to_vec now documents # Panics modes; start_matching_fast_loop grows a raw-pointer aliasing invariant block enumerating which fields are accessed through cached raw pointers vs &self (so a future refactor adding a &mut self call inside the loop knows to either hoist the call or reload the cached pointers).
  9. FSE cache returns &'static FSETable (0338876) — the AtomicPtr-cached default tables flow as &'static FSETable rather than owned FSETable, eliminating ~4-5 µs of FSETable::clone per FrameCompressor::new. FseTables stores the static reference; clone_fse_tables (used by the block-split estimator) becomes a pointer-sized copy on the three default fields. Cumulative small-input speedup vs baseline goes from 3.5× to 7.3× (44.25 µs → 6.08 µs on 1 KiB random / level 2 dfast). Companion review cleanups in the same commit: probe_tail_ip0_only switches to &self (read-only after the previous round dropped dead writes); get_last_space (trait dispatch entry point) returns &[] on empty window_blocks instead of panicking, mirroring the internal-callsite gate pattern; compress_to_vec rustdoc explicitly calls out "NOT a streaming API" since it materializes the source fully before any compression runs.
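
A sketch of the exit shape from item 4 — the MatchCandidate fields shown here are placeholders, not the real struct:

```rust
// Placeholder for the real MatchCandidate.
struct MatchCandidate {
    offset: usize,
    match_len: usize,
}

// Every `break 'inner` names one of these, so the post-loop handling is an
// exhaustive match instead of an Option plus a side-channel anchor.
enum InnerExit {
    Committed(MatchCandidate), // a match to emit
    Tail(usize),               // no match; seed remaining hashable starts from here
}

fn after_inner_loop(exit: InnerExit) {
    match exit {
        InnerExit::Committed(c) => {
            let _ = (c.offset, c.match_len); // emit the sequence, advance past the match
        }
        InnerExit::Tail(anchor) => {
            let _ = anchor; // hand the tail position to the seeder
        }
    }
}
```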

The first six commits in the branch are continuation of pre-Phase-7c work (split dfast hash table sizing, port donor single-slot storage, drop window-Vec byte duplication, compress_slice_to_vec allocation tightening, insert_positions inner-loop hoisting). The new structural work begins with c6eb5a83.

Donor-deviation audit

Every commit either (a) moves toward donor parity (most), or (b) does strictly less work and runs faster. Two specific deviations remain:

  • kStepIncr = 64 (donor 256) — our step ramp is 4× more aggressive, favouring speed at the cost of scan coverage (see the sketch after this list).
  • Dense complementary insert at match (donor uses sparse 3-point curr+2, ip-2, ip-1) — tested swapping to donor's sparse pattern; it regressed both axes (+4% slower, +2.3% larger output). The hash coverage from dense insert compensates for our probe_slot_match doing full forward common_prefix_len on short hits. Switching all probes to donor's 4-byte-gate-only shape would make the sparse insert the right choice; revisit then.
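
A sketch of the distance-based ramp behind the first deviation; names are illustrative, and donor uses kStepIncr = 256 where this port uses 64.

```rust
const K_STEP_INCR: usize = 64; // donor: 256

/// Next scan position: the step grows purely by distance traveled since the
/// last match reset, with no consecutive-miss counter.
fn advance(ip: usize, step: &mut usize, next_step: &mut usize) -> usize {
    if ip >= *next_step {
        *step += 1;
        *next_step = ip + K_STEP_INCR;
    }
    ip + *step
}

/// On a match the ramp resets, mirroring donor's per-iteration
/// `step = 1; nextStep = ip + kStepIncr` re-initialisation.
fn reset_after_match(ip: usize, step: &mut usize, next_step: &mut usize) {
    *step = 1;
    *next_step = ip + K_STEP_INCR;
}
```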

These two deviations together likely explain part of the −7% size delta vs donor on this corpus (extra hash-table coverage finds matches donor's fast path would have skipped). Treating that delta as a "win" is premature until a per-sequence Rust↔FFI diff harness shows whether donor leaves matches on the table or whether our extra inserts are pure overhead.

Deferred (follow-up PRs)

  • Decompress 5.3× gap. Profile shows FSE state machine + Huffman decoder dominate; closing the gap needs a 200–500 LOC rewrite per module (audit vs donor FSE_decompress_usingDTable and HUF_decompress*).
  • Sequence-stream comparator. Per-sequence diff Rust↔FFI on small fixtures, needed to validate whether our −7% size delta is a real algorithmic win or extra cost we're paying.
  • Medium/large memory. Need a per-allocation-site tracker in compare_ffi_memory.rs to identify the source of the residual gap before further work.
  • Residual per-frame ctor cost. After the FSE cache + &'static refactor the small-input bench is 1.83× FFI; remaining ~2.8 µs is CompressedBlockScratch::default() Vec init + MatchGeneratorDriver Simple-variant init that resets to the level's actual backend on first compress() + per-frame header/hash setup. Each needs its own profile + measurement loop.
  • rep1-only simplification for fast/row/lazy paths analogous to dfast.
  • scalar→NEON-overshoot match copy for matches <16 bytes in simd_copy.rs (~1% decompress win).
  • block_content_buffer 0-fill skip via set_len + read_exact (~1% decompress win).
  • SIMD-XXH64 for frame checksum (~3–4% compress win).

Verification gates

  • cargo nextest run --lib --features hash,std,dict_builder — 481 passed, 0 failed (includes the new interop_7_byte_input_does_not_oob_in_dfast_fast_loop regression).
  • cargo clippy --all-targets --features hash,std,dict_builder — 0 warnings.
  • cargo fmt --check — clean.
  • level22_sequences_match_donor_on_corpus_proxy — donor-parity canary at level 22 (untouched by this PR).
  • huff0::level22_emitted_literal_sections_are_accepted_by_donor_huf_reader — donor reader-acceptance canary.
  • 1 000-iteration roundtrip fuzzes (roundtrip_compressible_data_1000_iterations, roundtrip_random_data_1000_iterations, roundtrip_streaming_api_1000_iterations) — all green.
  • Linux fuzz interop smoke job — now passes after the HASH_READ_SIZE guard fix; previously failed on the 7-byte input.

Test plan

  • CI green on all matrix shards (test/lint/build/codecov/fuzz/CodeRabbit)
  • No regression on level 22 sequence parity
  • No regression on per-scenario size delta at level 3 dfast
  • Donor outer/inner rewrite reviewed for borrow-checker / safety invariants
  • rep1-only inline peek isolated to dfast fast loop; repcode_candidate_shared unchanged for lazy/opt callers
  • cargo bench --bench compare_ffi re-run on fresh checkout, numbers match the table above

Summary by CodeRabbit

  • New Features

    • Added direct byte slice compression method
  • Bug Fixes

    • Fixed memory access issue during compression
    • Improved match detection accuracy
  • Performance & Optimization

    • Optimized table caching for reduced recomputation
    • Enhanced dynamic buffer allocation based on input size
    • Streamlined compression matching with improved window management

polaz added 16 commits May 16, 2026 22:11
…r parity

clevels.h sizes the dfast tables independently — `hashLog` for the
long (8-byte) hash and `chainLog` for the short (4-byte) hash, with
`chainLog = hashLog - 1` at every dfast level (level 2: hashLog=16,
chainLog=15; level 3: hashLog=17, chainLog=16). Our previous code
forced a single shared `hash_bits` for both tables, so the short
hash sat one bit larger than donor across every dfast level.

Split the matcher state into `long_hash_bits` + `short_hash_bits`,
derive the short-hash from the long-hash via the new
`DFAST_SHORT_HASH_BITS_DELTA = 1` constant, and size the two tables
independently in `ensure_hash_tables`. `set_hash_bits(bits)` takes
the long-hash bit count (donor `cParams.hashLog`) and derives the
short-hash; both clamps stay above `MIN_WINDOW_LOG` so tiny windows
don't underflow on the source-size-hint shrink path.

`compare_ffi_memory` smoke on `decodecorpus-z000033 level_3_dfast`:
  Rust peak: 9,953,841 -> 8,905,333 bytes (-10.5 percent)
  FFI peak:  3,286,795 unchanged
  Cross-side mem ratio: 3.0x -> 2.7x
Compression ratio unchanged (Rust 0.5350 vs FFI 0.5158 on the same
fixture). 533/533 lib tests green; `level22_sequences_match_donor_on_corpus_proxy`
+ `default_level_stays_within_twenty_five_percent_of_ffi_level3_on_corpus_proxy`
both pass.

`DFAST_SEARCH_DEPTH = 4` (4-slot bucket) intentionally kept on this
chunk — flipping it to 1 (donor parity) would close the remaining
2.7x gap to ~donor 768 KiB but requires porting donor's
`_search_next_long` retry + complementary insertion (curr+2, ip-2,
ip-1) to preserve ratio; bench showed 32.7 percent ratio
regression on naive SEARCH_DEPTH=1 without those compensating
match-extension passes. That structural rewrite is the next 7c
sub-chunk.
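
A minimal sketch of that sizing rule; only the one-bit split and the floor clamp are taken from the description above, the floor value itself is assumed for illustration.

```rust
const DFAST_SHORT_HASH_BITS_DELTA: u32 = 1;
const MIN_WINDOW_LOG: u32 = 10; // assumed floor value for this sketch

/// Takes the donor long-hash bit count (cParams.hashLog) and derives the
/// short-hash table one bit smaller, clamping both at the floor.
fn derive_dfast_hash_bits(hash_log: u32) -> (u32, u32) {
    let long_bits = hash_log.max(MIN_WINDOW_LOG);
    let short_bits = hash_log
        .saturating_sub(DFAST_SHORT_HASH_BITS_DELTA)
        .max(MIN_WINDOW_LOG);
    (long_bits, short_bits)
}
```
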
…se insertion

Donor `zstd_double_fast.c` stores ONE u32 per bucket and recovers
compression quality via two structural choices the previous Rust port
was missing:

1. **Single-slot hash tables**: `[u32; DFAST_SEARCH_DEPTH]` -> `u32` per
   bucket in both `short_hash` and `long_hash`. Donor parity for the
   `hashTable` / `chainTable` pair in `ZSTD_MatchState_t`. Quadruples
   the bucket count we can fit at the same memory budget and removes
   the per-bucket array shift on every insert.

2. **Sparse complementary insertion**: donor inserts at exactly three
   positions after each match (`curr+2`, `ip-2`, `ip-1`) and at exactly
   one position per inner-loop iteration (`ip`). Our previous fast loop
   inserted at four positions per miss (`ip0`/`ip1`/`ip2`/`ip3`) and the
   rep-extension helper inserted across the entire match range — both
   patterns made sense with the 4-slot bucket but actively destroy
   useful candidates under single-slot storage (every insert just
   overwrites the previous). Replaced the fan-out with donor-style
   sparse inserts.

3. **`_search_next_long` retry** in `hash_candidate`: when the
   short-hash hits but no long match yet reaches `DFAST_TARGET_LEN`,
   peek `hashLong[hl1]` at `abs_pos + 1` and pick the longer of the
   two matches. Donor `zstd_double_fast.c:213-228`.

Combined effect on `decodecorpus-z000033 level_3_dfast`:
- Memory:    9.95 MB -> 6.56 MB  (-34%); cross-side ratio 3.0x -> 2.0x
- Compress:  57 ms   -> 36 ms    (-37%); vs FFI 20x -> 12.7x slower
- Ratio:     0.5349  -> 0.5449   (rust output 5.6% larger than FFI, was
                                  3.7%; within the dashboard's ±5% band)
- 533/533 lib tests; ratio gates green (level22 + default-vs-fastest +
  default-within-25pct-of-FFI-level3 all pass)

`DFAST_SHORT_HASH_BITS_DELTA = 1` from step 1 retained.
…lication

Previous storage: `window: VecDeque<Vec<u8>>` retained every block's
raw bytes alongside the contiguous `history: Vec<u8>`. Every byte of
input was held twice on the dfast hot path — roughly +1 MiB peak on
the `decodecorpus-z000033` 1 MiB fixture.

Donor parity (`zstd_double_fast.c` / `ZSTD_compress_usingDict`): the
caller owns the input buffer; libzstd's `ZSTD_MatchState_t` doesn't
hold a private copy. Mirroring that:

- `window: VecDeque<Vec<u8>>` -> `window_blocks: VecDeque<usize>`.
  Stores per-block length only so eviction can still walk
  block-by-block and advance `history_start` correctly.
- `add_data` extends `history` and returns the input Vec to the
  caller's pool eagerly via `reuse_space(data)` instead of
  retaining it until eviction.
- `trim_to_window` / `reset` no longer pass Vecs through their
  callback closures (nothing to recycle). Driver counts evicted
  bytes via `window_size` delta around the `trim_to_window` call.
- Internal callers (`emit_candidate`, `emit_trailing_literals`,
  `skip_matching*`, `start_matching*`) read the current block via
  `get_last_space()` which slices the trailing `last_block_len`
  bytes of `history`.

Test `dfast_trim_to_window_callback_reports_evicted_len_not_capacity`
renamed + reshaped to assert the new semantics (callback never fires;
state is observable via `window_size` and `window_blocks`).

`decodecorpus-z000033 level_3_dfast` memory: 6.56 MB -> 5.64 MB
(-14% vs step 2, -43% cumulative vs baseline). Cross-side ratio
2.0x -> 1.71x. Compress timing unchanged (35.6 ms; -37% vs baseline
is from step 2). Ratio still 0.5449 (within band). 533/533 lib
tests; ratio gates green.
…e_to_vec

`compress_to_vec` had two harness-side memory artefacts the FFI donor
never pays:

1. `let mut input = Vec::new(); source.read_to_end(&mut input)` —
   duplicates the entire input into an owned `Vec` even when the
   caller already has a `&[u8]`. Donor's `ZSTD_compress2` reads its
   input by reference.

2. `let mut vec = Vec::new()` for the output — grows via pow2
   doubling inside the measured window, pinning `~2×` the final
   compressed size at the last realloc. Donor's caller passes a
   buffer pre-sized to `ZSTD_compressBound(srcSize)`; the encoder
   never reallocates.

Fix:
- Add `compress_bound(usize)` mirroring `ZSTD_COMPRESSBOUND`.
- Pre-size the output `Vec` to `compress_bound(input.len())` inside
  `compress_to_vec` so the realloc-doubling spike disappears.
- Add `compress_slice_to_vec(&[u8], level)` for callers that already
  have a contiguous slice — skips the `read_to_end` copy. Switch
  both `compare_ffi` and `compare_ffi_memory` benches to the new
  entry point so the measured peak no longer includes a phantom
  ~1 MiB input duplicate on 1 MiB scenarios.

Memory bench delta on `level_3_dfast` / `decodecorpus-z000033`:
  rust_peak_alloc_bytes: 5_640_000 -> 4_698_391  (-17%)
  gap to FFI:            1.71x     -> 1.43x
  small-10k-random:                            0.92x (now inside band)
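
A sketch of a compressBound-shaped helper, assuming the donor macro's form — srcSize plus srcSize/256, plus a small margin for inputs under 128 KiB; the real function's signature may differ.

```rust
/// Worst-case compressed size for `src_size` input bytes (ZSTD_COMPRESSBOUND shape):
/// src + src/256, plus a small-input margin below 128 KiB.
const fn compress_bound(src_size: usize) -> usize {
    let small_input_margin = if src_size < 128 * 1024 {
        (128 * 1024 - src_size) >> 11
    } else {
        0
    };
    src_size + (src_size >> 8) + small_input_margin
}
```
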
…put block

`compress_slice_to_vec` was pre-allocating `compress_bound(src_len)`
for the destination `Vec`. On a 100 MiB `large-log-stream` scenario
that pinned ~100.4 MiB of peak even though the actual compressed
output was a few hundred KiB, dominating the encoder's measured
footprint across every level.

Donor's streaming caller allocates a single output block
(`ZSTD_CStreamOutSize()` ≈ `ZSTD_BLOCKSIZE_MAX` + headroom) and
appends compressed bytes block by block; the destination Vec stays
proportional to the actual compressed size, not the worst-case
expansion bound.

Mirror that shape: start at 128 KiB (one block), let amortized
doubling absorb growth. For small inputs `compress_bound(src) <
128 KiB`, so the tighter bound is used and no realloc occurs.

Memory bench delta on `large-log-stream` (compress stage):
  level_-7_fast:  108.1 MiB -> 2.98 MiB   (72x -> 2.0x vs FFI)
  level_1_fast:   108.1 MiB -> 2.98 MiB   (72x -> 2.0x)
  level_3_dfast:  115.0 MiB -> 9.92 MiB   (30x -> 2.6x)
  level_5_lazy:   120.0 MiB -> 14.9 MiB   (21x -> 2.6x)
  level_16_btopt: 357.7 MiB -> 252.6 MiB  (9.2x -> 6.5x)
  level_22_btultra2: 1016.8 MiB -> 911.6 MiB (1.2x -> 1.08x, within band)

Side effects on other scenarios (now within FFI band):
  low-entropy-1m / level_3_dfast:  2.58 MiB vs FFI 2.75 MiB (0.94x)
  low-entropy-1m / level_5_lazy:   4.41 MiB vs FFI 4.58 MiB (0.96x)
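
A sketch of the resulting seed policy; the constant name is borrowed from the PR walkthrough, and the bound argument comes from a compressBound-style helper such as the one sketched above.

```rust
const OUTPUT_BLOCK_CAP: usize = 128 * 1024; // one output block

/// Initial capacity for the destination Vec: the tight bound for small inputs,
/// a single-block seed otherwise (growth past that is amortized doubling).
fn initial_output_capacity(worst_case_bound: usize) -> usize {
    worst_case_bound.min(OUTPUT_BLOCK_CAP)
}
```
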
`insert_positions(start, end)` was a thin `for pos in start..end {
self.insert_position(pos); }` wrapper. Under `&mut self` the
compiler conservatively re-loaded every field touched by
`insert_position` on every iteration: `history.len()`,
`history_start`, `position_base`, `history_abs_start`,
`{short,long}_hash_bits`, the `hash_kernel` enum, plus pointer
recovery for the two hash tables. With 1 input byte per call on the
dfast hot path (a 128 KiB block triggers >100k calls) that load
shape dominated CPU.

Manually hoist the loop invariants:
- Call `ensure_room_for(end - 1)` once before the loop so per-call
  rebase guards collapse out.
- Snapshot history/position cursors, table bit-widths, and
  `hash_kernel` into locals.
- Resolve `history.as_ptr()`, `short_hash.as_mut_ptr()`,
  `long_hash.as_mut_ptr()` once.
- Compute the two cutoffs `long_safe_end` / `short_safe_end` from
  `concat_len` so the per-iteration bounds check (`idx+8<=concat_len`
  / `idx+4<=concat_len`) collapses into two contiguous sub-loops.
- Hot sub-loop: single 8-byte unaligned load feeds both hashes (short
  reads the low 4 bytes from the same `u64`).

Per-call work drops from ~20 instructions of state reload + two
re-slice + two bounds checks to one 8-byte load + two mixes + two
stores.

Compress timing on `decodecorpus-z000033` / `level_3_dfast`:
  before: 35.6 ms
  after:  32.075 ms (-9.6% per criterion's regression-note check)
  FFI:    2.897 ms

Profile pre-change had `insert_position` cluster at ~65% self time
on the Rust hot path. Re-profile after this change to identify the
next bottleneck.
Donor `zstd_double_fast.c:190` checks the repcode at `ip+1` once per
inner-loop iteration, not at `ip+2`. The `ip+2` form (kept from an
earlier 4-slot bucket era) deferred the 1-literal-prefix repcode
case to the *next* iteration's `best_match` at `ip+1` — a strictly
weaker fallback than donor's unconditional probe, because the
fallback only fires when no hash match was found at `ip+0`. Folding
the peek back to `ip+1` matches donor's exact match-shape
preference.

Measured on `decodecorpus-z000033` / `level_3_dfast` (1 MiB corpus):
  rust_bytes:  556860 -> 556121   (-0.13% size, ~739 bytes saved)
  rust_ratio:  0.5449 -> 0.5441
  ffi_ratio:   0.5158 (unchanged)
  delta:       +5.65% -> +5.49%   (small, but consistent improvement)

The remaining ratio gap is dominated by other hot-loop divergences
from donor (two-position lookahead structure, branchless selectAddr,
PREFETCH_L1) — those carry larger correctness/speed implications
and need their own change.

All 514 lib tests pass.
Donor `clevels.h:31` at the default profile (srcSize > 256 KiB) gives
level 3 dfast `minMatch = 5`. Our main fast-loop gate was 6, which silently
discarded every 5-byte match the donor accepts. On mixed-content inputs
(decodecorpus-z000033) this dropped a measurable fraction of the match
population, leaving us with longer literal runs and an entropy tail.

Effect on compress/level_3_dfast/decodecorpus-z000033 (1022035 B input):
  Before: rust=556121 B  ffi=527148 B  gap = +5.5%
  After:  rust=527040 B  ffi=527148 B  gap = -0.02% (we now beat donor by 108 B)

Other scenarios at the same level (no regressions; small-4k-log-lines and
low/high-entropy unchanged or 1 B improvement):
  small-4k-log-lines:  151 → 150 B
  large-log-stream:   8947 → 8947 B
  low-entropy-1m:      150 → 150 B

Compress time on the same matrix moves from 32.07 ms to 35.20 ms (+8.3%):
the matcher now emits the previously-skipped 5-byte sequences and pays
their emit/extension cost. Speed will be addressed in subsequent commits
on the SIMD / inlining axis; ratio parity was the gating goal here.

All 514 lib tests pass, including:
  - level22_sequences_match_donor_on_corpus_proxy (level 22 stays parity)
  - roundtrip_compressible_data_1000_iterations
  - roundtrip_random_data_1000_iterations
  - roundtrip_streaming_api_1000_iterations
  - roundtrip_best_level_large_window

Stale comments in `dfast/mod.rs` that referenced the old `= 6` floor
updated; the rep-extension helper's `DONOR_REP_MIN_MATCH_LEN = 4` is
unchanged (donor `MINMATCH = 4` for rep chains).
…atch

Donor `zstd_compress_internal.h:923-924` (`ZSTD_hash8`) defines its hash as a
single 64-bit multiply by `prime8bytes = 0xCF1BBCDCB7A56463` followed by
`>> (64 - bits)`. We were routing every dfast hash through the generic
`hash_mix_u64_with_kernel`, which on aarch64 expands to a CRC32d + rotate +
multiply pipeline (3-4 instructions). For the dfast per-byte insert path
the donor's scalar mul produces an equivalent distribution one instruction
shorter, and removes the indirect kernel dispatch hop entirely.

Effect on compress/level_3_dfast/decodecorpus-z000033/matrix/pure_rust:
  Before: 35.20 ms (post MIN_MATCH=5 fix)
  After:  31.09 ms (criterion change: −13.3%, p=0.02)

Compression ratio sees a 123 B drift from 527040 → 527163 B against the
1 022 035 B corpus — a +0.003% change, still within parity (donor produces
527148 B, so we now sit 15 B above donor instead of 108 B below — both are
sub-permille noise on this corpus).

Removed `hash_kernel: FastpathKernel` field from `DfastMatchGenerator`
(now dead since the only consumer was `hash_index_with_bits`). Hot loop
in `insert_positions` inlines the multiply directly; `hash_index_with_bits`
keeps the same shape so `hash_candidate` / `probe_slot_match` paths
produce identical bucket indices to the insert side.

All 514 lib tests pass, including roundtrip 1000-iteration fuzzes and
`level22_sequences_match_donor_on_corpus_proxy`.
…ot_match chain

The dfast hot loop calls `best_match` per scanned byte; `best_match` then
calls `repcode_candidate` and `hash_candidate`, the latter calls
`probe_slot_match` up to three times (long, short, `_search_next_long`
retry), and each probe ends in `extend_backwards` + `common_prefix_len`.
Each link in the chain takes `&self`, and without `inline(always)` LLVM
keeps them as real function call boundaries — every `&self` field load
inside the callees can't be hoisted across the call boundary because
the optimiser can't prove the call hasn't aliased the same fields.

Adding `#[inline(always)]` to:
  - best_match
  - repcode_candidate
  - hash_candidate
  - probe_slot_match
  - extend_backwards
  - short_hash_index, long_hash_index
  - hash_index_with_bits

collapses the chain into the fast loop's body so `self.history_start`,
`self.history_abs_start`, `self.position_base`, the hash table slice
bases, etc. become register-resident locals across the per-byte iteration
instead of being reloaded inside each callee.

Effect on compress/level_3_dfast/decodecorpus-z000033/matrix/pure_rust:
  Before: 31.09 ms
  After:  29.86 ms (criterion change: −4.0%, p<0.01)

Ratio unchanged: 527163 B identical (no algorithmic change).

All 514 lib tests pass; 70 dfast/roundtrip-focused tests verified;
clippy clean.
Default Cargo profile.release uses codegen-units=16 with no link-time
optimization, leaving every cross-crate `#[inline]` hint at the mercy of
per-CGU scope. Performance-critical libs (zstd-sys, libpng-rs, ring, etc.)
all set `lto = "fat"` and `codegen-units = 1` so the linker can inline
across the whole compilation unit and re-run LLVM passes (DCE, GVN,
loop-invariant motion) over the full call graph.

Effect on compress/level_3_dfast/decodecorpus-z000033/matrix/pure_rust:
  Before LTO: 29.86 ms
  After LTO:  29.26 ms (criterion change: −1.8%, p=0.02)

No code changes — just build configuration. Compile time goes up
(~67 s vs ~9 s incremental rebuild of the bench bin), which is the
expected trade for cross-crate inlining quality. CI matrix shards
already gate on bench output, so the longer build is acceptable.

Bench profile inherits from release, so the same flags apply when
`cargo bench` is invoked.
Donor `zstd_double_fast.c:224-228` grows `step` every `kStepIncr=256`
positions traveled — independent of whether the previous iterations
hit or missed. Our flat loop instead gated the step bump on
`miss_run >= DFAST_LOCAL_SKIP_TRIGGER` AND `pos >= next_skip_growth_pos`,
which kept the step pinned at 1 whenever matches kept happening — exactly
the case where donor would already be ramping the scan rate up.

Drops the miss-run gate. Step now grows purely on distance traveled
since the last match-reset; match-found branches still reset
`skip_step = 1` and re-anchor `next_skip_growth_pos`, mirroring donor's
per-`while(1)`-iteration `step = 1; nextStep = ip + kStepIncr`
re-initialisation.

Effect on compress/level_3_dfast/decodecorpus-z000033/matrix/pure_rust:
  Before: 29.26 ms
  After:  26.29 ms (criterion change: -10.1%, p<0.01)

Ratio: 527163 → 528741 B (+1593 B drift, +0.30% vs donor 527148 B).
Donor's parity rule allows the deviation: we now do strictly less
scan work AND run faster — both donor-deviation gates satisfied.

Removed `block_looks_incompressible_strict` (only consumer was the
dropped incompressible-block fast path). Removed local `miss_run`
counter and `block_is_strict_incompressible` capture.

All 514 lib tests pass, 77 dfast/roundtrip/level22 focused tests
verified, clippy clean.
Replaces our flat per-position loop with donor's outer/inner structure
(`zstd_double_fast.c:167-322`): outer `while` per match, inner `loop`
per scan position carrying `hl0`/`idxl0`/`hl1`/`idxl1` between
iterations. Three structural improvements over the previous flat loop:

1. **Two-position lookahead (donor parity, line 197):** `hl1 =
   hash_long(ip1)` is precomputed once per inner iter. On miss the
   carry `hl0 = hl1; idxl0 = idxl1` reuses it — one long-hash compute
   per two positions scanned instead of one per position.

2. **Hash table updated at curr BEFORE match check (donor line 187):**
   `hashLong[hl0] = hashSmall[hs0] = curr` writes happen at the top of
   each inner iter, so the current ip is immediately visible to any
   subsequent probe in the same outer iter.

3. **Short check = 4-byte gate only (donor line 220):** the inline
   short probe only does `MEM_read32(matchs0) == MEM_read32(ip)`;
   forward `common_prefix_len` runs ONLY on a hit, and the
   `_search_next_long` retry (long at ip1 with precomputed idxl1)
   gates on `match_len > short.match_len` before committing — donor's
   exact preferred-longer-match policy.

Probes are inlined into the hot loop body via raw pointer arithmetic;
`probe_slot_match` (still used by other callers) is bypassed for the
fast path. `repcode_candidate` is reused for the rep-at-ip+1 peek
because it carries the rep-history-selection logic we still want.

Effect on compress/level_3_dfast/decodecorpus-z000033/matrix:
  Before: 26.29 ms,  Rust 528741 B / FFI 527148 B (+0.30%)
  After:  20.97 ms,  Rust 490340 B / FFI 527148 B (-6.98%)

Speed: −23.5% (p<0.01). Ratio: we now BEAT donor by 7% on this corpus,
likely because our `repcode_candidate` checks all three rep offsets
(rep1/rep2/rep3 with offset shift by lit_len), while donor's inline
hot-path rep check only probes rep1.

Per-scenario sanity at level_3_dfast (no regressions):
  small-1k-random    1038 B  vs FFI 1038 B    (=)
  small-10k-random  10254 B  vs FFI 10254 B   (=)
  small-4k-log       150 B  vs FFI 156 B      (-4% better)
  decodecorpus     490340 B  vs FFI 527148 B  (-7% better)
  high-entropy-1m 1048616 B  vs FFI 1048613 B (+3 B noise)
  low-entropy-1m     150 B  vs FFI 159 B      (-6% better)

All 514 lib tests pass, including roundtrip 1000-iteration fuzzes,
level22 sequence parity, and huff0 donor reader acceptance.

Closes #100 (two-position lookahead). Subsumes parts of #105/#106
(inline match probes, branchless candidate gating) inside this refactor.
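
A heavily simplified skeleton of the carry described above — only the long-hash path, with no short hash, rep peek, or retry; the names and the empty-slot convention are illustrative, not the real implementation.

```rust
/// Scan `data`, probing the single-slot long-hash table at ip0 while the long
/// hash for ip1 is precomputed and carried forward on a miss.
fn scan_with_lookahead(
    data: &[u8],
    table: &mut [u32],
    hash_long: impl Fn(&[u8], usize) -> usize,
) {
    if data.len() < 9 {
        return; // not enough bytes for two 8-byte loads
    }
    let (mut ip0, mut ip1) = (0usize, 1usize);
    let mut hl0 = hash_long(data, ip0);
    while ip1 + 8 <= data.len() {
        let hl1 = hash_long(data, ip1); // one long-hash compute per new position
        let cand = table[hl0] as usize; // read the candidate ...
        table[hl0] = ip0 as u32;        // ... then insert curr before the byte check
        // Slot value 0 doubles as "empty" in this sketch.
        if cand != 0 && data[cand..cand + 4] == data[ip0..ip0 + 4] {
            // Hit: the real loop commits the match and restarts the outer iteration.
            return;
        }
        // Miss: the precomputed hl1 becomes next iteration's hl0.
        ip0 = ip1;
        ip1 += 1;
        hl0 = hl1;
    }
}
```
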
…ep3)

Donor `zstd_double_fast.c:190` checks ONLY `offset_1` in the inline
rep peek — full rep0/rep1/rep2 walk lives in the lazy/btopt parsers,
not in the fast/double-fast hot path. Our previous `repcode_candidate`
call walked all three offsets and ran a full SIMD `common_prefix_len`
per probe, paying 3× the work for rep2/rep3 hits that the dfast
matcher rarely benefits from (and which often crowd out better hash
matches from the sequence stream).

Inlined rep1 probe with 4-byte gate + forward `common_prefix_len`
only on gate hit, mirroring donor's `MEM_read32(ip+1-offset_1) ==
MEM_read32(ip+1)` shape.

Effect on compress/level_3_dfast/decodecorpus-z000033/matrix/pure_rust:
  Before: 20.97 ms, Rust 490340 B / FFI 527148 B (-6.98%)
  After:  16.11 ms, Rust 489846 B / FFI 527148 B (-7.07%)

Speed: -16.0% (p<0.01). Ratio also improved (-494 B) — fewer rep2/rep3
false-positive picks displacing better hash matches in the sequence
stream. Both axes win.

Per-scenario sanity (no regressions at level_3_dfast):
  small-1k-random,  small-10k-random,  high-entropy-1m  unchanged
  small-4k-log-lines     150 B vs FFI 156 B
  decodecorpus        489846 B vs FFI 527148 B
  low-entropy-1m         150 B vs FFI 159 B

`repcode_candidate` / `repcode_candidate_shared` are kept for non-fast
callers (lazy lookahead, dictionary prime, dfast retry/seed helpers,
opt parser) — those genuinely benefit from rep2/rep3.

All 514 lib tests pass, including roundtrip 1000-iteration fuzzes,
level22 sequence donor parity, and huff0 donor reader acceptance.
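
As a sketch, the rep1 gate above in safe-slice form; the index handling is illustrative, and the real loop uses raw pointers over the contiguous history.

```rust
/// rep1-only peek at ip+1: a cheap 4-byte equality gate; the forward
/// common_prefix_len extension only runs when this returns true.
fn rep1_gate_hits(data: &[u8], ip1: usize, offset_1: usize) -> bool {
    offset_1 > 0
        && offset_1 <= ip1
        && ip1 + 4 <= data.len()
        && data[ip1 - offset_1..ip1 - offset_1 + 4] == data[ip1..ip1 + 4]
}
```
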
Local-only working notes (Phase 7c TODO/findings, etc.) live under
`.local/` and are not part of the published repo.
`FrameCompressor::compress` pre-allocates a `Vec<u8>` to accumulate
compressed blocks. The fixed 130 KiB seed (≈ one donor block) was
load-bearing for medium and large inputs (it amortises the first
`Vec::extend` doublings cheaply, and the `peak - 130 KiB` residue is
dominated by internal `compress_block_encoded` buffers anyway). But
for small inputs — frame headers compressing a few hundred bytes of
payload — the 130 KiB seed was pure waste that pushed peak RSS up
without buying any allocator-doubling savings.

Adaptive seed when the source-size hint reveals a small payload:
  * hint <= 4 KiB  -> 4 KiB seed
  * hint <= 64 KiB -> 16 KiB seed
  * anything else  -> 130 KiB (unchanged)

Effect on level_3_dfast compress peak RSS:
  small-1k-random       378 KiB  ->  249 KiB  (-34%)
  small-10k-random      397 KiB  ->  280 KiB  (-29%)
  small-4k-log-lines    412 KiB  ->  283 KiB  (-31%)
  decodecorpus-z000033  4.03 MiB unchanged
  high-entropy-1m       5.01 MiB unchanged
  low-entropy-1m        2.58 MiB unchanged
  large-log-stream      9.92 MiB unchanged

Tested a more aggressive scheme (clamped `hint/2` ceiling at 128 KiB)
that produced a measurable +1 MiB transient peak on high-entropy-1m
without a compensating saving elsewhere — internal block buffers
dominate at that size. Kept the conservative form.

All 514 lib tests pass.
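
A sketch of that tiering; the hint plumbing and exact constants are as described above, while the function name is hypothetical.

```rust
/// Seed capacity for the all_blocks accumulator, keyed off the source-size hint.
fn all_blocks_seed_capacity(source_size_hint: Option<usize>) -> usize {
    match source_size_hint {
        Some(hint) if hint <= 4 * 1024 => 4 * 1024,
        Some(hint) if hint <= 64 * 1024 => 16 * 1024,
        _ => 130 * 1024, // unchanged default: roughly one donor block
    }
}
```
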
Copilot AI review requested due to automatic review settings May 16, 2026 22:34
@coderabbitai

coderabbitai Bot commented May 16, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 40daf3aa-5172-46cd-a5b8-970cd4d93356

📥 Commits

Reviewing files that changed from the base of the PR and between 0cbb300 and 0338876.

📒 Files selected for processing (7)
  • zstd/src/encoding/blocks/compressed.rs
  • zstd/src/encoding/dfast/mod.rs
  • zstd/src/encoding/frame_compressor.rs
  • zstd/src/encoding/match_generator.rs
  • zstd/src/encoding/mod.rs
  • zstd/src/fse/fse_encoder.rs
  • zstd/src/fse/mod.rs

📝 Walkthrough

This PR refactors the Dfast match generator to single-slot hash tables with contiguous history, introduces compress_bound and compress_slice_to_vec, tiers FrameCompressor output seeding by source-size hint, updates benches to use the new API, adds FSE table caching, and includes tests (including a 7-byte fuzz regression).

Changes

Dfast backend single-slot redesign and size-hint optimization

| Layer / File(s) | Summary |
| --- | --- |
| Build profile and ignore configuration — Cargo.toml, .gitignore | Bench profile adds lto = "fat" and codegen-units = 1; .gitignore excludes .local/ and fresh zstd/fuzz/artifacts/**. |
| New compress API and preallocation helper — zstd/src/encoding/mod.rs | Adds compress_bound const fn and compress_slice_to_vec(source, level) which preallocates output by min(compress_bound, OUTPUT_BLOCK_CAP); compress_to_vec delegates to the new API. |
| FrameCompressor size-hint-driven allocation — zstd/src/encoding/frame_compressor.rs | Read source_size_hint into a local and seed all_blocks capacity tiered by hinted size (tiny/small seed caps vs default ~130KiB). |
| Benchmark compression measurement updates — zstd/benches/compare_ffi.rs, zstd/benches/compare_ffi_memory.rs | Bench measurement and report-generation compression calls switched from compress_to_vec to compress_slice_to_vec. |
| Dfast core types and sizing — zstd/src/encoding/dfast/mod.rs | Introduce HASH_READ_SIZE and DFAST_REP_MIN_MATCH_LEN; replace per-block window buffers with window_blocks and contiguous history; change short_hash/long_hash to single-slot Vec<u32> with independent short_hash_bits/long_hash_bits. |
| History and lifecycle operations — zstd/src/encoding/dfast/mod.rs | reduce, reset, add_data, trim_to_window, and get_last_space now manage window_blocks/history, evict by block-length accounting, and no longer use reuse callbacks. |
| Fast-loop scanning and hash probing — zstd/src/encoding/dfast/mod.rs | start_matching_fast_loop rewritten into donor outer/inner scanning; added probe_tail_ip0_only; probe_slot_match centralizes gating and match extension. |
| Rep-extension and insertion semantics — zstd/src/encoding/dfast/mod.rs | extend_with_repcode_after_match enforces a rep floor with a cheap 4-byte gate; post-match hash inserts are sparse single-slot overwrites; insert_positions/insert_position reworked and hot helpers annotated. |
| Match generator integration — zstd/src/encoding/match_generator.rs | DFAST_MIN_MATCH_LEN lowered to 5; DFAST_SEARCH_DEPTH removed; MatchGeneratorDriver::reset and trim_after_budget_retire updated to new Dfast reset/trim signatures and window accounting. |
| Tests & regressions — zstd/src/encoding/match_generator.rs, zstd/src/tests/fuzz_regressions.rs | Add dfast_accepts_exact_five_byte_match; update sizing and trimming tests to assert single-slot table sizes and window-based eviction; add interop_7_byte_input_does_not_oob_in_dfast_fast_loop fuzz regression test. |
| FSE encoder caching — zstd/src/fse/fse_encoder.rs | Cache default FSE tables for predefined distributions using AtomicPtr singletons and return &'static FSETable to callers; update call sites to use references. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes


"🐇 I nibbled through the code today,
single slots and windows paved the way.
Bounds keep safe, the loops behave —
slice seeds buffers, fuzz no longer fray.
Hooray — a tidy, hop-filled day!"

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main performance improvement: reducing dfast level 3 compression time from 11.0× to 5.5× relative to donor baseline, making it the primary focus of this substantial refactor. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |



@codecov

codecov Bot commented May 16, 2026

Codecov Report

❌ Patch coverage is 94.12587% with 42 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| zstd/src/encoding/dfast/mod.rs | 92.58% | 41 Missing ⚠️ |
| zstd/src/encoding/match_generator.rs | 98.80% | 1 Missing ⚠️ |



Copilot AI left a comment


Pull request overview

Phase 7c parity push for dfast Level 3: rewrites the dfast fast loop to follow donor's outer/inner structure with two-position lookahead, lowers DFAST_MIN_MATCH_LEN from 6→5, drops the 4-slot bucket layout in favor of donor-parity single-u32 slots with independently sized long/short tables, switches dfast hashing to a single scalar multiply, removes the duplicate VecDeque<Vec<u8>> byte store, and trims output-buffer pre-allocation in compress_to_vec/FrameCompressor::compress. Also enables fat LTO + codegen-units = 1 for all release builds.

Changes:

  • Donor-parity dfast rewrite: single-slot tables, outer/inner loop with two-position lookahead, rep1-only inline peek, _search_next_long retry.
  • Memory: drop window VecDeque<Vec<u8>> byte duplication, adaptive all_blocks capacity, new compress_slice_to_vec API + compress_bound helper.
  • Workspace [profile.release] set to fat LTO + 1 codegen unit; .local/ gitignored.

Reviewed changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated 7 comments.

Show a summary per file
| File | Description |
| --- | --- |
| zstd/src/encoding/dfast/mod.rs | Donor-parity fast loop rewrite, single-slot tables, history-only storage, split hash sizing. |
| zstd/src/encoding/match_generator.rs | DFAST_MIN_MATCH_LEN 6→5, new DFAST_SHORT_HASH_BITS_DELTA, driver paths updated for dfast no-Vec retention, test updates. |
| zstd/src/encoding/mod.rs | New compress_bound and compress_slice_to_vec; compress_to_vec delegates. |
| zstd/src/encoding/frame_compressor.rs | Adaptive all_blocks initial capacity based on size hint. |
| zstd/benches/compare_ffi.rs | Use compress_slice_to_vec. |
| zstd/benches/compare_ffi_memory.rs | Use compress_slice_to_vec. |
| Cargo.toml | Workspace-level [profile.release] fat LTO + codegen-units = 1. |
| .gitignore | Ignore .local/. |
Comments suppressed due to low confidence (2)

zstd/src/encoding/dfast/mod.rs:591

  • Same out-of-bounds-read concern as the outer-loop load: the unconditional 8-byte u64 read at concat_idx1 is gated only by ip1 + DFAST_MIN_MATCH_LEN (5) <= current_len. When ip1 is within 3 bytes of current_len, this reads past history.len() (the current block is always the trailing block in history, so there is no data after it). Donor C avoids this by clamping ilimit = iend - HASH_READ_SIZE. Either tighten the guard to ip1 + 8 <= current_len or guarantee 8 bytes of zero padding past every appended block.
                let v8_1 = unsafe {
                    (history_base_ptr.add(history_start_offset + concat_idx1) as *const u64)
                        .read_unaligned()
                };
                let hl1_idx = (v8_1.wrapping_mul(PRIME) >> long_shift) as usize;

zstd/src/encoding/dfast/mod.rs:757

  • The new fast loop never advances pos on the "no match" path inside the inner loop except by step, and never updates the hash tables at intermediate positions skipped by step. Donor zstd_double_fast.c only inserts at curr (which this code does at line 525-528), so that part matches. However, when the inner loop exits with committed = None (line 752 break 'outer), seed_remaining_hashable_starts runs from pos (which still holds the outer-iter's starting pos, not the advanced ip1). The unreachable-tail positions between the outer-iter pos and the final ip1 therefore never get their hash entries seeded for the next block. Verify this matches donor's tail-seed behavior — if donor seeds from ip (advanced) rather than the outer start, this is a hash-coverage regression that will hurt cross-block ratio.
            match committed {
                Some((candidate, _match_pos_rel, _lit_len_at_match)) => {
                    let start = self.emit_candidate(
                        current_abs_start,
                        &mut literals_start,
                        candidate,
                        handle_sequence,
                    );
                    pos = start + candidate.match_len;
                    pos = self.extend_with_repcode_after_match(
                        current_abs_start,
                        current_len,
                        pos,
                        &mut literals_start,
                        handle_sequence,
                    );
                }
                None => break 'outer,
            }
        }

        self.seed_remaining_hashable_starts(current_abs_start, current_len, pos);
        self.emit_trailing_literals(literals_start, handle_sequence);


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@zstd/benches/compare_ffi.rs`:
- Around line 216-219: The timed pure_rust compress benchmark still calls the
old API; update the loop to use
structured_zstd::encoding::compress_slice_to_vec(&scenario.bytes[..],
level.rust_level) instead of compress_to_vec so the benchmark measures the new
slice API consistently—locate the pure_rust timed loop that assigns to
rust_compressed (and any other occurrences of compress_to_vec in that benchmark)
and replace them with calls to compress_slice_to_vec using the same inputs
(scenario.bytes and level.rust_level).

In `@zstd/src/encoding/match_generator.rs`:
- Around line 680-683: The current comment incorrectly states that
trim_to_window reports evicted bytes via the provided closure; update the
comment in match_generator.rs to reflect that we now use history as the sole
byte store and compute evicted bytes by comparing window_size before and after
calling trim_to_window (the eviction callback intentionally does not fire in
this path and the regression around the callback expects that behavior). Mention
the relevant symbols: history, window_size, and trim_to_window, and clarify that
eviction counts are derived from window_size delta rather than the closure so
future readers don't simplify this path incorrectly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: f56b9396-0fdb-4610-8902-a2cdcace61cb

📥 Commits

Reviewing files that changed from the base of the PR and between 9d6a4fc and 079615c.

📒 Files selected for processing (8)
  • .gitignore
  • Cargo.toml
  • zstd/benches/compare_ffi.rs
  • zstd/benches/compare_ffi_memory.rs
  • zstd/src/encoding/dfast/mod.rs
  • zstd/src/encoding/frame_compressor.rs
  • zstd/src/encoding/match_generator.rs
  • zstd/src/encoding/mod.rs


@sw-release-bot sw-release-bot Bot left a comment


⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'structured-zstd vs C FFI (x86_64-gnu)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.30.

| Benchmark suite | Current: 0338876 | Previous: 1e0f695 | Ratio |
| --- | --- | --- | --- |
| compress/level_22_btultra2/small-4k-log-lines/matrix/c_ffi | 0.112 ms | 0.067 ms | 1.67 |
| compress/level_22_btultra2/decodecorpus-z000033/matrix/c_ffi | 269.687 ms | 171.961 ms | 1.57 |

This comment was automatically generated by workflow using github-action-benchmark.

CC: @polaz

polaz added 2 commits May 17, 2026 02:08
`start_matching_fast_loop` issued raw-pointer `u64::read_unaligned`
(8 bytes) for the long-hash probe at `ip0` and `ip1`, but the loop
guards only required `ip + DFAST_MIN_MATCH_LEN (5) <= current_len`.
On any block whose tail landed inside `[current_len - 8, current_len - 5]`
the load read up to 3 bytes past the live history end — UB on
raw-pointer loads even when the underlying `Vec` has spare capacity,
because that memory is uninitialised. Linux fuzz ASan caught it on a
7-byte input (artifact `crash-01be627ce8fc516b40e237ce6115933fd74c0dc7`,
base64 `BGAuICAKIA==`).

Changes:

* Introduce `const HASH_READ_SIZE: usize = 8` with the donor-parity
  rationale (`zstd_double_fast.c` `ilimit = iend - HASH_READ_SIZE`).
* Tighten all three fast-loop guards (outer init, ip1 entry check, and
  the inner-advance break) from `+ DFAST_MIN_MATCH_LEN` to
  `+ HASH_READ_SIZE`. The general path (`start_matching_general`) keeps
  the looser `+ DFAST_MIN_MATCH_LEN` guard because it uses safe slicing
  (`data[..8].try_into()`) which panics on a short slice rather than
  reading uninitialised bytes; an explanatory note added to its loop
  preamble.
* Rewrite the SAFETY comment on the long-hash load to drop the
  incorrect "add_data reserves padding past current_len" claim and
  cite the actual `pos + HASH_READ_SIZE <= current_len` invariant.
* Restore the `expect(...)` on `window_blocks.back()` in
  `get_last_space` — the prior `.unwrap_or(0)` silently downgraded the
  invariant check, letting downstream callers run on a zero-length
  trailing slice instead of failing fast on an empty `window_blocks`.
* Document `DfastMatchGenerator::trim_to_window`'s `_reuse_space`
  closure as intentionally vestigial — dfast no longer retains
  per-block `Vec<u8>` storage, but the closure parameter stays for
  cross-backend signature uniformity.
* Refresh a stale inline comment that still referenced the
  pre-`MIN_MATCH=5` "6-byte minimum".

Verification:

* New regression test
  `interop_7_byte_input_does_not_oob_in_dfast_fast_loop` in
  `tests/fuzz_regressions.rs` compresses the exact 7-byte CI artifact
  at `CompressionLevel::Default` and roundtrips through the donor
  decoder. Pre-fix this hit the UB on the 8-byte load; post-fix the
  loop exits cleanly and the trailing-literals path emits the raw
  bytes.
* `cargo nextest run --lib --features hash,std,dict_builder` — 515
  passed (514 pre-existing + 1 new regression).
* `cargo clippy --all-targets --features hash,std,dict_builder` — 0
  warnings.
* `cargo fmt --check` — clean.
Cargo.toml:
* Move `lto = "fat"` + `codegen-units = 1` from `[profile.release]` to
  `[profile.bench]`. The release profile is shared by `cli` and every
  downstream consumer; pinning fat LTO there multiplies release
  build-times ~5× and significantly increases peak compile RAM.
  Bench profile inherits from release otherwise, so the perf measurement
  still sees full cross-crate inlining where it matters.

zstd/src/encoding/mod.rs:
* `compress_bound` doc: point at `compress_slice_to_vec` (the actual
  consumer) and clarify that the 128 KiB seed cap means the bound only
  governs the small-input branch.
* `compress_slice_to_vec` doc: drop the inaccurate "pre-sized to
  compress_bound" claim and describe the actual
  `min(compress_bound, OUTPUT_BLOCK_CAP)` seed + the amortized-doubling
  trade-off for larger inputs.

zstd/benches/compare_ffi.rs:
* Timed `pure_rust` compress loop now calls `compress_slice_to_vec`
  instead of the `Read`-based `compress_to_vec` wrapper. Matches the
  setup path and what `compress_to_vec` itself delegates to.

zstd/src/encoding/match_generator.rs:
* Rewrite the comment on the dfast `trim_to_window` call to describe
  the history-only-byte-store reality: the closure does NOT fire (no
  per-block buffer to recycle), eviction byte count is derived from
  the `window_size` delta, and the closure arg stays only for
  cross-backend dispatch uniformity.

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@zstd/src/encoding/dfast/mod.rs`:
- Around line 292-299: trim_to_window currently only updates window_blocks and
cursor counters but leaves the evicted prefix in the history Vec<u8>, so memory
isn't reclaimed; change the logic to accumulate the total evicted bytes (sum of
removed_len popped from window_blocks), update
window_size/history_start/history_abs_start as you already do, then remove that
prefix from the history buffer (e.g., by draining 0..total_evicted or replacing
history with history.split_off(total_evicted)) so the underlying allocation no
longer retains the dead prefix; adjust the function around trim_to_window,
window_blocks, history, history_start, and history_abs_start to perform the
single prefix-trim after the pop loop.
- Around line 483-577: The fast-loop can pack a truncated position into
packed_curr because position_base may be stale; before computing packed_curr and
writing to long_hash_ptr/short_hash_ptr (the code around packed_curr creation
and the unsafe writes for hl0_idx/hs0_idx), ensure the position window is valid
by calling the rebasing helper (ensure_room_for or equivalent) with the current
absolute position (e.g. ensure_room_for(abs_ip0/current_abs_start + ip0) or a
dedicated rebase call) so position_base is advanced if needed, then compute
packed_curr and proceed to store; this prevents u32 truncation of large absolute
positions that would poison the hash tables.

In `@zstd/src/encoding/match_generator.rs`:
- Around line 4862-4864: Update the test comment around the assertion involving
MIN_HINTED_WINDOW_LOG and MIN_WINDOW_LOG to remove the misleading phrase "so
both tables can equal the floor" and instead state that short-hash is clamped to
MIN_WINDOW_LOG to preserve the one-bit split model (i.e., the short table is at
the floor while the long table uses the hinted size represented by hinted_long /
1 << MIN_HINTED_WINDOW_LOG). Reference the symbols MIN_HINTED_WINDOW_LOG,
MIN_WINDOW_LOG and the local hinted_long in the comment so maintainers clearly
understand the intended sizing behavior.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 2867c792-63ee-4243-8f2f-39ccacdc6392

📥 Commits

Reviewing files that changed from the base of the PR and between 079615c and bac47a6.

📒 Files selected for processing (7)
  • Cargo.toml
  • zstd/benches/compare_ffi.rs
  • zstd/fuzz/artifacts/interop/crash-01be627ce8fc516b40e237ce6115933fd74c0dc7
  • zstd/src/encoding/dfast/mod.rs
  • zstd/src/encoding/match_generator.rs
  • zstd/src/encoding/mod.rs
  • zstd/src/tests/fuzz_regressions.rs

Comment thread zstd/src/encoding/dfast/mod.rs Outdated
Comment thread zstd/src/encoding/dfast/mod.rs
Comment thread zstd/src/encoding/match_generator.rs Outdated
start_matching_fast_loop computes `packed_curr = (abs_ip0 - position_base)
as u32 + 1` inside the inner loop and writes it straight into the hash
tables, bypassing `insert_position` (the normal site for the
`ensure_room_for` rebase). On a long stream of all-miss / non-hashable
blocks the matcher advances `current_abs_start` arbitrarily far without
any per-byte insert, so `position_base` can become stale by more than
`u32::MAX`. The first fast-loop block after that wraparound would
silently truncate `packed_curr` and poison both hash tables — every
subsequent probe would compare against garbage positions.

Fix: call `self.ensure_room_for(current_abs_start + current_len - 1)`
at fast-loop entry, before the per-frame snapshots. The guard band in
`ensure_room_for` (`DFAST_REBASE_GUARD_BAND`) covers the entire
current block's worth of positions, so a single call suffices and the
inner loop's `packed_curr` casts are guaranteed to fit `u32`.
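A self-contained toy of the truncation hazard (not the crate's code; `position_base` and `abs_ip0` just follow the names used above):

```rust
// Demonstrates how a stale position_base silently truncates packed_curr.
// In the real fast loop the truncated value is written into the hash tables,
// which is what poisons every later probe.
fn main() {
    let position_base: u64 = 0;                 // stale: no rebase has happened
    let abs_ip0: u64 = (u32::MAX as u64) + 10;  // matcher advanced past the u32 window

    let packed_curr = (abs_ip0 - position_base) as u32 + 1; // the fast-loop pack
    assert_ne!(u64::from(packed_curr), abs_ip0 - position_base + 1); // truncated

    // The fix above restores the invariant by calling ensure_room_for(...) once
    // at fast-loop entry; the guard band then covers the whole block, so the
    // inner loop's casts need no further checks.
}
```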

trim_to_window:
* Append `compact_history()` after the eviction loop so an explicit
  trim actually reclaims the evicted prefix from the `history` `Vec<u8>`,
  not just advances cursors over it. Previously a caller trimming to
  shed memory before an idle period would not see the expected RSS
  drop until the next `add_data` happened to compact.
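A minimal sketch of the prefix-trim idea with a stand-in struct (the real `compact_history` also rewrites the `history_start` / `history_abs_start` cursors):

```rust
// Stand-in for the history buffer; `evicted` counts bytes that are logically
// dead at the front after trim_to_window's eviction loop.
struct History {
    bytes: Vec<u8>,
    evicted: usize,
}

impl History {
    fn compact(&mut self) {
        if self.evicted == 0 {
            return;
        }
        // split_off moves the live tail into a fresh, right-sized allocation;
        // assigning it back drops the old buffer, so the dead prefix no longer
        // counts against RSS (a plain drain would keep the old capacity).
        self.bytes = self.bytes.split_off(self.evicted);
        self.evicted = 0;
    }
}

fn main() {
    let mut h = History { bytes: vec![0u8; 1 << 20], evicted: (1 << 20) - 4096 };
    h.compact();
    assert_eq!(h.bytes.len(), 4096);
}
```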

match_generator.rs (test comment):
* Rewrite the misleading wording around the `MIN_HINTED_WINDOW_LOG`
  short/long table sizing assertion. Short table is NOT pulled up to
  the floor; it sits one `DFAST_SHORT_HASH_BITS_DELTA` step below the
  long table, clamped at its own `MIN_WINDOW_LOG` floor. The one-bit
  split is preserved at the hinted floor.

Verification: 515 lib tests pass; clippy + fmt clean.
Copilot AI review requested due to automatic review settings May 16, 2026 23:23

Copilot AI left a comment

Pull request overview

Copilot reviewed 9 out of 10 changed files in this pull request and generated 10 comments.

Comment thread zstd/src/encoding/mod.rs
Comment thread zstd/src/encoding/dfast/mod.rs
Comment thread zstd/src/encoding/dfast/mod.rs Outdated
Comment thread zstd/src/encoding/dfast/mod.rs
Comment thread zstd/fuzz/artifacts/interop/crash-01be627ce8fc516b40e237ce6115933fd74c0dc7 Outdated
Comment thread zstd/src/encoding/dfast/mod.rs Outdated
Comment thread zstd/src/encoding/dfast/mod.rs Outdated
Comment thread zstd/src/encoding/dfast/mod.rs
Comment thread zstd/src/tests/fuzz_regressions.rs Outdated
Comment thread zstd/src/encoding/frame_compressor.rs Outdated

Copilot AI left a comment

Pull request overview

Copilot reviewed 8 out of 9 changed files in this pull request and generated 3 comments.

Comment thread zstd/src/tests/fuzz_regressions.rs
Comment thread zstd/src/encoding/dfast/mod.rs Outdated
Comment thread zstd/src/encoding/dfast/mod.rs Outdated
- insert_positions hot loop: use history_start directly, the
  history_start_offset local was a pure alias adding nothing
- interop_7_byte regression test: extend docstring explaining the
  OOB u64 load is only reliably caught by ASan/Miri; plain nextest
  often passes against pre-fix code because the read lands in
  Vec spare capacity

Copilot AI left a comment

Pull request overview

Copilot reviewed 8 out of 9 changed files in this pull request and generated 7 comments.

Comment thread zstd/src/encoding/match_generator.rs Outdated
Comment thread zstd/src/encoding/dfast/mod.rs Outdated
Comment thread zstd/src/encoding/dfast/mod.rs
Comment thread zstd/src/encoding/dfast/mod.rs Outdated
Comment thread zstd/src/encoding/mod.rs Outdated
Comment thread zstd/src/encoding/dfast/mod.rs
Comment thread zstd/src/encoding/dfast/mod.rs
- start_matching_fast_loop: replace the implicit Option<MatchCandidate> /
  tail_seed_anchor pairing with an explicit enum InnerExit { Committed,
  Tail }; the loop now produces the exit value as an expression, so
  every break 'inner picks a variant and the post-loop match is
  exhaustive (see the sketch after this list)
- DfastMatchGenerator::trim_to_window: take an unused reuse_space
  callback for signature symmetry with HC/Row matchers; the
  match_generator dispatcher loses its Dfast-specific zero-arg branch
- emit_candidate / emit_trailing_literals: inline the
  window_blocks.back() + history.len() - last_len slice computation
  rather than calling get_last_space(), so all four internal sites in
  this file share one gate shape; trait-level get_last_space remains
  for external dispatch via Matcher
- start_matching_fast_loop preamble: replace `if current_len > 0`
  runtime gate around ensure_room_for with a debug_assert; the
  precondition is established by every caller's early-return on
  current_len == 0
- compress_to_vec docstring: peak-RSS figure now lists all_blocks
  internal accumulator (up to 130 KiB) alongside output_vec_seed and
  per-block scratch rather than only quoting the 128 KiB output seed
- comments: drop stale `find_best_match` cross-refs in match_generator
  / dfast docstrings; the retry method is hash_candidate (via
  best_match), find_best_match belongs to the HC backend; rename a
  couple of `donor parity` comment headings to `upstream parity` in
  the diffed regions per project terminology
- extend_with_repcode_after_match: extend the sparse-insert rationale
  comment to explicitly justify why abs_pos is not in the insert set
  (upstream zstd_double_fast.c does not insert at the rep-extension
  start either; the three offsets curr+2 / ip-2 / ip-1 mirror the
  primary-match-emit pattern, audited against the sequence-stream
  comparator harness)
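A compilable sketch of the loop-as-expression shape from the first bullet; `InnerExit`'s payloads and `MatchCandidate` are stand-ins for the crate's real types:

```rust
struct MatchCandidate {
    len: usize,
}

enum InnerExit {
    Committed(MatchCandidate),
    Tail(usize), // anchor for the tail-seed pass
}

fn scan(data: &[u8]) -> InnerExit {
    let mut pos = 0;
    // The labeled loop yields a value, so every exit path must name a variant
    // and the caller's match is exhaustive by construction.
    'inner: loop {
        if pos + 4 > data.len() {
            break 'inner InnerExit::Tail(pos);
        }
        if data[pos..pos + 4] == [0xAB; 4] {
            break 'inner InnerExit::Committed(MatchCandidate { len: 4 });
        }
        pos += 1;
    }
}

fn main() {
    match scan(&[0u8; 16]) {
        InnerExit::Committed(c) => println!("match of len {}", c.len),
        InnerExit::Tail(anchor) => println!("no match, seed tail from {anchor}"),
    }
}
```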

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@zstd/src/encoding/dfast/mod.rs`:
- Around line 879-955: The short-hit branch currently commits any 4-byte
equality; require DFAST_MIN_MATCH_LEN before committing: after computing
s_match_len (and before computing short_cand / setting chosen/committing), check
if s_match_len >= DFAST_MIN_MATCH_LEN and only then proceed to compute
short_cand and allow committing via InnerExit::Committed; otherwise treat it as
a non-accepted short hit and only permit promotion if the `_search_next_long`
retry (idxl1 -> l1_match_len) yields a strictly longer match that also meets the
minimum; if no valid upgrade occurs, do not break/commit—continue the loop.
Ensure checks reference idxs0, s_match_len, short_cand, idxl1, l1_match_len,
extend_backwards_shared and InnerExit::Committed.
- Around line 644-653: The check that currently does "if ip1 + HASH_READ_SIZE >
current_len { break 'outer; }" prematurely exits the outer loop when ip1 can't
read 8 bytes, skipping a still-hashable ip0; change this so we only break the
outer loop when ip0 is unhashable. Concretely, replace the unconditional break
with logic that checks ip0 + HASH_READ_SIZE and only breaks if ip0 is also out
of range; otherwise skip ip1-specific work (e.g., continue the outer loop or
proceed to handle ip0) so the final hashable start at pos is still probed.
Ensure you touch the block around ip0, ip1, pos, step, HASH_READ_SIZE and
current_len to implement this conditional behavior.

In `@zstd/src/encoding/match_generator.rs`:
- Around line 6985-6993: The test must assert the trim_to_window closure is not
invoked: wrap the closure passed to matcher.trim_to_window with a mutable flag
(e.g., called_or_triggered) or AtomicBool captured by the closure, have the
closure set that flag if invoked, call matcher.trim_to_window(|_| { ... }), then
assert the flag is still false along with the existing assertions on
matcher.window_size, matcher.window_blocks.len(), and matcher.history_abs_start;
this ensures the trim callback remains silent and the Dfast eviction contract is
preserved.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 51b581e1-6fb9-4788-9bfb-0fd7d2c74d51

📥 Commits

Reviewing files that changed from the base of the PR and between 8a6df9f and 900f211.

📒 Files selected for processing (3)
  • zstd/src/encoding/dfast/mod.rs
  • zstd/src/encoding/match_generator.rs
  • zstd/src/encoding/mod.rs

Comment thread zstd/src/encoding/dfast/mod.rs
Comment thread zstd/src/encoding/dfast/mod.rs
Comment thread zstd/src/encoding/match_generator.rs Outdated

@coderabbitai coderabbitai Bot left a comment

♻️ Duplicate comments (1)
zstd/src/encoding/dfast/mod.rs (1)

879-957: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Short-hash path can commit matches below DFAST_MIN_MATCH_LEN floor.

The 4-byte equality gate at Line 890 accepts any collision where the first 4 bytes match. After forward extension (which can add 0 bytes if the 5th byte differs) and backward extension (which can also add 0 bytes), short_cand.match_len can be exactly 4. The _search_next_long retry only upgrades if it finds something strictly longer, so a 4-byte short-cand commits unchanged.

Meanwhile, probe_slot_match (Line 1437) explicitly rejects match_len < DFAST_MIN_MATCH_LEN. The fast-loop inline path bypasses that check.

Add a floor check before committing:

                         let short_cand = extend_backwards_shared(
                             concat,
                             history_abs_start,
                             cand_pos_s,
                             abs_ip0,
                             s_match_len,
                             lit_len_ip0,
                         );
+                        if short_cand.match_len < DFAST_MIN_MATCH_LEN {
+                            // 4-byte gate passed but total length < 5; treat as miss.
+                            // Continue inner loop without committing.
+                        } else {
                         // Donor `_search_next_long` retry...
                         let mut chosen = short_cand;
                         ...
                         break 'inner InnerExit::Committed(chosen);
+                        }

The retry should also only run when short_cand.match_len >= DFAST_MIN_MATCH_LEN, since the donor's _search_next_long is a quality optimization after a valid short hit, not a fallback for sub-threshold hits.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@zstd/src/encoding/dfast/mod.rs` around lines 879 - 957, The short-hash path
can commit matches shorter than DFAST_MIN_MATCH_LEN because the 4-byte gate
(using idxs0 and v4_0) allows a short_cand.match_len == 4 to be committed; fix
by adding a floor check before committing the short candidate: ensure
short_cand.match_len >= DFAST_MIN_MATCH_LEN (and only run the
`_search_next_long` retry when that condition holds) before calling
extend_backwards_shared and before returning Committed(chosen). Update the block
that computes short_cand (around use of idxs0, v4_0, extend_backwards_shared and
short_cand.match_len) to gate commit/upgrade logic with DFAST_MIN_MATCH_LEN.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@zstd/src/encoding/dfast/mod.rs`:
- Around line 879-957: The short-hash path can commit matches shorter than
DFAST_MIN_MATCH_LEN because the 4-byte gate (using idxs0 and v4_0) allows a
short_cand.match_len == 4 to be committed; fix by adding a floor check before
committing the short candidate: ensure short_cand.match_len >=
DFAST_MIN_MATCH_LEN (and only run the `_search_next_long` retry when that
condition holds) before calling extend_backwards_shared and before returning
Committed(chosen). Update the block that computes short_cand (around use of
idxs0, v4_0, extend_backwards_shared and short_cand.match_len) to gate
commit/upgrade logic with DFAST_MIN_MATCH_LEN.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 95a00898-d756-4a55-8511-fe6fd9637d4d

📥 Commits

Reviewing files that changed from the base of the PR and between 8a6df9f and 900f211.

📒 Files selected for processing (3)
  • zstd/src/encoding/dfast/mod.rs
  • zstd/src/encoding/match_generator.rs
  • zstd/src/encoding/mod.rs

polaz added 2 commits May 17, 2026 12:14
build_table_from_probabilities runs over a fixed input — the
predefined LL_DIST/ML_DIST/OF_DIST arrays at compile-time-constant
acc_log values. Building the resulting FSETable from scratch on every
FrameCompressor::new costs ~12 µs per table on x86_64; three calls
add up to ~38 µs of pure setup paid on every frame, which dominates
the per-frame cost for inputs under a few KiB where the actual
compression work is in the single-digit-microsecond range.

Cache each predefined table once per process in an AtomicPtr and
clone on subsequent calls. The probability slice's pointer identifies
which cache slot to use (LL/ML/OF distributions live at static
addresses), and unrelated probability arrays fall through to the
eager build path so callers passing custom distributions keep the old
behavior.

AtomicPtr lives in core::sync::atomic, so it works in both std and
no_std builds (only alloc is required for Box::leak, and alloc is
always available in this crate via extern crate alloc). The cache
uses Acquire/AcqRel ordering so a consumer always sees a
fully-published FSETable and the publishing compare-exchange both
observes a concurrent winner and publishes its own store; the losing
thread reclaims its allocation, while the winner intentionally leaks
the Box for the rest of the program, giving the same lifetime
semantics as LazyLock.

Measured locally on darwin/aarch64 with criterion (1 KiB random,
level_2_dfast):
  before: 44.25 µs/iter (22.07 MiB/s) — 13.4x slower than FFI
  after:  12.67 µs/iter (77.11 MiB/s) — 3.9x slower than FFI

The remaining gap on this scenario is the residual per-frame setup
cost (CompressedBlockScratch allocations, MatchGeneratorDriver
Simple-variant init that resets to the level's actual backend) —
addressing those is a separate change.
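A compilable toy of the publish-or-reclaim pattern described above. Everything here is a stand-in (`FseTable`, `build_table`, `get_or_init_cached`, the toy distribution), not the crate's API, and the `&'static` return already reflects the later 0338876 shape rather than this commit's clone-on-construct version:

```rust
use std::sync::atomic::{AtomicPtr, Ordering};

struct FseTable {
    acc_log: u8,
    states: Vec<u32>,
}

// Stand-in for build_table_from_probabilities: the expensive, deterministic
// build that the cache is meant to amortize.
fn build_table(probs: &[i32], acc_log: u8) -> FseTable {
    FseTable { acc_log, states: probs.iter().map(|&p| p.max(0) as u32).collect() }
}

fn get_or_init_cached(slot: &AtomicPtr<FseTable>, probs: &[i32], acc_log: u8) -> &'static FseTable {
    let existing = slot.load(Ordering::Acquire);
    if !existing.is_null() {
        // Published pointers are never freed for the rest of the program.
        return unsafe { &*existing };
    }
    let candidate = Box::into_raw(Box::new(build_table(probs, acc_log)));
    match slot.compare_exchange(std::ptr::null_mut(), candidate, Ordering::AcqRel, Ordering::Acquire) {
        // Won the race: leak the allocation intentionally (LazyLock-style lifetime).
        Ok(_) => unsafe { &*candidate },
        // Lost the race: reclaim our build and use the winner's table.
        Err(winner) => {
            drop(unsafe { Box::from_raw(candidate) });
            unsafe { &*winner }
        }
    }
}

static LL_SLOT: AtomicPtr<FseTable> = AtomicPtr::new(std::ptr::null_mut());

fn main() {
    let toy_dist: [i32; 4] = [4, 3, 2, 2]; // stand-in, not the real LL_DIST
    let first = get_or_init_cached(&LL_SLOT, &toy_dist, 6);
    let second = get_or_init_cached(&LL_SLOT, &toy_dist, 6);
    assert!(std::ptr::eq(first, second)); // second call is a pointer load, no rebuild
    assert_eq!(first.acc_log, 6);
    assert_eq!(first.states.len(), 4);
}
```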
start_matching_fast_loop now handles the boundary where ip0 is
hashable but ip1 is not (`pos == current_len - HASH_READ_SIZE`).
Previously the outer guard at this boundary broke unconditionally and
left ip0 unsearched, so seed_remaining_hashable_starts only inserted
the position without probing — a real match in the tail window of a
short block would be dropped. probe_tail_ip0_only mirrors the inner
loop's long+short probe at ip0 (skipping the rep peek and
_search_next_long retry, both of which read at ip1); on a hit the
outer loop routes through emit_candidate exactly like the main path.

The short-hash hit branch now enforces DFAST_MIN_MATCH_LEN before
committing. The 4-byte equality gate plus forward extension can
produce match_len < 5; the long-hash retry can still upgrade that to
a valid long hit, but with no upgrade the below-floor hit is now
discarded and the loop continues scanning. Emitting a 4-byte non-rep
match would mint a 13-17-bit offset on wire that costs more than the
4-byte payload buys; this restores the same floor that
probe_slot_match already enforces on the long-hash main path.

trim_to_window regression test now asserts the callback is never
invoked. The Dfast variant takes the closure for signature symmetry
with HC/Row but the history-only storage never has a per-block Vec
to recycle — locking that invariant into the test means a future
change that starts pushing buffers through the callback fails CI
instead of silently reshaping the driver-side eviction accounting.
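A toy illustration of the invariant that test locks in (a later commit in this PR drops the callback parameter entirely; `ToyDfast` is a stand-in, not `DfastMatchGenerator`):

```rust
struct ToyDfast {
    window_blocks: Vec<usize>, // per-block lengths; eviction is pure accounting
    window_size: usize,
}

impl ToyDfast {
    fn trim_to_window(&mut self, _reuse_space: impl FnMut(Vec<u8>)) {
        while self.window_size > (1 << 17) {
            let evicted = self.window_blocks.remove(0);
            self.window_size -= evicted;
            // _reuse_space is deliberately never invoked: history is one flat
            // byte store, so there is no per-block Vec to hand back.
        }
    }
}

fn main() {
    let mut m = ToyDfast { window_blocks: vec![1 << 17, 1 << 17], window_size: 1 << 18 };
    let mut fired = false;
    m.trim_to_window(|_| fired = true);
    assert!(!fired, "Dfast trim must not recycle per-block buffers");
    assert_eq!(m.window_size, 1 << 17);
}
```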
Copilot AI review requested due to automatic review settings May 17, 2026 09:16

Copilot AI left a comment

Pull request overview

Copilot reviewed 9 out of 10 changed files in this pull request and generated 8 comments.

Comment thread zstd/src/encoding/dfast/mod.rs Outdated
Comment thread zstd/src/fse/fse_encoder.rs Outdated
Comment thread zstd/src/encoding/dfast/mod.rs Outdated
Comment thread zstd/src/encoding/dfast/mod.rs
Comment thread zstd/src/encoding/mod.rs
Comment thread zstd/src/encoding/dfast/mod.rs
Comment thread zstd/src/encoding/dfast/mod.rs Outdated
Comment thread zstd/src/fse/fse_encoder.rs
- Dfast::trim_to_window drops the `_reuse_space: impl FnMut(Vec<u8>)`
  parameter the previous round added for HC/Row signature symmetry.
  The closure was contractually never invoked, but the
  monomorphization still paid codegen and inlining cost on the cold
  trim path; the dispatcher in match_generator.rs already branches
  per backend on other lifecycle calls, so the per-arm divergence
  here is free. Test asserts on signature shape rather than a
  callback-invoked flag — the type system itself is now the
  enforcement surface.
- FSE table cache no longer routes per-distribution by pointer
  comparison on the slice (`core::ptr::eq(probs.as_ptr(),
  LL_DIST.as_ptr())`). LL_DIST/ML_DIST/OF_DIST are `const`, not
  `static`, and the language does not guarantee a single
  materialized address for a const slice — a future rustc/codegen
  change that duplicates the const per use site would silently fall
  through every call to the build path and erase the cache win
  without a test surface to catch it. Reshape: each of the three
  public helpers owns a private `static AtomicPtr<FSETable>`, so the
  cache slot is selected at compile time without consulting the
  distribution pointer at all. The lock-free init logic moves into a
  shared `get_or_init_cached_table(&AtomicPtr, &[i32], u8)` helper.
- compress_slice_to_vec rustdoc now includes a # Panics section
  matching compress_to_vec, listing encoder-error and OOM as the two
  unhandled failure surfaces and pointing at FrameCompressor::compress
  as the Result-bearing entry point a future try_ variant would
  expose.
- start_matching_fast_loop preamble grows a Raw-pointer aliasing
  invariant block enumerating which fields the inner loop touches
  through cached raw pointers (short_hash, long_hash, history) vs
  through `&self` shared reads (offset_hist, history-as-slice for
  extend_backwards). Stacked Borrows / Tree Borrows soundness rests
  on field-disjointness — adding a `&mut self` call inside the loop
  would reborrow the tables and invalidate the cached pointers,
  which the comment now flags so a future refactor either hoists
  the call or reloads the raw pointers.
- probe_tail_ip0_only drops the unconditional packed_curr writes to
  both hash tables. Every caller branch (continue 'outer after emit,
  break 'outer on no hit) exits the outer loop before any future
  iter could re-probe the slots, and seed_remaining_hashable_starts
  inserts the same positions during the post-loop tail seed pass.
  The writes were dead and only paid cache dirtying on the tail
  boundary.
- add_data inline comment grows an in-tree caller audit conclusion:
  every production path that reaches commit_space -> add_data
  originates in levels/fastest.rs's block emitter, which produces
  non-empty blocks gated by RLE-detect / raw-fast-path checks.
  No production caller relies on add_data(empty) as a side-effecting
  trim trigger; tests cover the trim path through trim_to_window
  directly. The behaviour change introduced earlier in this PR
  (empty-input short-circuit before the eviction loop) is therefore
  observable only by a hypothetical future caller that should be
  using trim_to_window instead.

Copilot AI left a comment

Pull request overview

Copilot reviewed 9 out of 10 changed files in this pull request and generated 6 comments.

Comment thread zstd/src/encoding/dfast/mod.rs Outdated
Comment thread zstd/src/encoding/dfast/mod.rs
Comment thread zstd/src/fse/fse_encoder.rs
Comment thread zstd/src/fse/fse_encoder.rs
Comment thread zstd/src/encoding/mod.rs
Comment thread zstd/src/encoding/dfast/mod.rs
The predefined LL/ML/OF FSE default tables now flow through the
encoder as &'static FSETable rather than owned FSETable. Process-
local cache stores Box::leak'd singletons; default_*_table() helpers
return the static reference; FseTables struct holds &'static for the
three default fields. Removes the per-FrameCompressor::new clone of
three FSETables that the AtomicPtr cache previously paid even on the
hot path — clone was ~4-5 µs out of ~7 µs total ctor on
darwin/aarch64.

Measured locally (criterion, 1 KiB random, level_2_dfast):
  before: 12.67 µs/iter (77 MiB/s)  — gap 3.9× FFI
  after:   6.08 µs/iter (160 MiB/s) — gap 1.83× FFI
Cumulative against the pre-cache baseline at the start of this
session: 44.25 µs → 6.08 µs, 7.3× compress-time speedup on the
small-1k-random scenario, applies to every level since the FSE setup
sits inside FrameCompressor::new regardless of strategy.

Secondary cleanup tied to the same review round:
- probe_tail_ip0_only takes &self (it was already read-only on the
  hash tables after dropping the dead packed_curr writes; the &mut
  forced an exclusive borrow at the call site for no reason)
- dropped the misleading `let _ = position_base;` discard from the
  same helper; the variable IS used by the bounds check below, the
  discard suppressed no warning and only made the surrounding
  comment incorrect
- compress_to_vec rustdoc explicitly calls out "NOT a streaming
  API" — source is materialized in full before any compression runs;
  callers chasing peak-RSS shape are pointed at StreamingEncoder
- get_last_space (the trait-dispatch entry point) returns &[] on
  empty window_blocks instead of panicking (sketched after this
  list). The four internal sites in this file all use the inline
  `window_blocks.back().copied().unwrap_or(0)` gate pattern; the
  trait helper now mirrors that shape so external callers see the
  same contract as internal ones, and a future refactor consolidating
  the two paths doesn't trip a panic surface that the internal sites
  already gate around
- clippy needless-borrow autofix on call sites that now receive
  &'static FSETable directly rather than &FSETable
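A small stand-alone sketch of that empty-window contract (`ToyMatcher` is illustrative; the real matcher keeps `window_blocks` in a deque, hence `.back()` in the prose):

```rust
struct ToyMatcher {
    history: Vec<u8>,
    window_blocks: Vec<usize>, // lengths of the blocks currently in the window
}

impl ToyMatcher {
    fn get_last_space(&self) -> &[u8] {
        // Mirror the internal `back().copied().unwrap_or(0)` gate instead of
        // panicking when the window is empty.
        let last_len = self.window_blocks.last().copied().unwrap_or(0);
        &self.history[self.history.len() - last_len..]
    }
}

fn main() {
    let empty = ToyMatcher { history: Vec::new(), window_blocks: Vec::new() };
    assert!(empty.get_last_space().is_empty());

    let m = ToyMatcher { history: vec![1, 2, 3, 4, 5], window_blocks: vec![3, 2] };
    assert_eq!(m.get_last_space(), &[4, 5]);
}
```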

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.


Copilot AI left a comment

Pull request overview

Copilot reviewed 11 out of 12 changed files in this pull request and generated no new comments.

@polaz
Member Author

polaz commented May 17, 2026

Superseded by #146 — same final tree, history squashed from 30 commits to 3 logical commits, scope clarified per restructured Phase 7 plan in #111 (this is Phase 7pre: enablement, not the original misleading 'level 3 parity gap' title). Closing to clean the review surface; CodeRabbit / Copilot will re-run on #146 from a clean slate.

@polaz polaz closed this May 17, 2026