Skip to content

perf(7pre): enablement — donor outer/inner dfast Level 3 + cross-level FSE cache#146

Merged
polaz merged 10 commits into
mainfrom
perf/7pre-enablement
May 17, 2026
Merged

perf(7pre): enablement — donor outer/inner dfast Level 3 + cross-level FSE cache#146
polaz merged 10 commits into
mainfrom
perf/7pre-enablement

Conversation

@polaz
Copy link
Copy Markdown
Member

@polaz polaz commented May 17, 2026

Summary

Phase 7pre — cross-level enablement / groundwork (see #111 for the restructured Phase 7 plan, default-first per-strategy ordering).

This PR replaces #145 (closed) — same final code state, history squashed to 3 logical commits, scope clarified. The original PR was titled "close level 3 parity gap" but grew well beyond Level 3 into cross-level work; under the restructured taxonomy this is Phase 7pre: enablement, prerequisite for the subsequent per-strategy / per-axis sub-phases.

Honest framing on the "Compressed size" column below: our output is 7% smaller than upstream on the decodecorpus-z000033 corpus. That is NOT confirmation that we beat upstream — it is more likely a signal of algorithmic divergence (we are doing extra match-finding work the upstream fast path skips), which is consistent with us still being 5.5× slower. The deviation may turn out to be the cause of the speed gap, not a separate win. To confirm what upstream would have emitted on the same input we still need the sequence-stream comparator (Phase 7-tooling-seq-cmp in the restructured #111).

Cumulative deltas (compress/level_3_dfast/decodecorpus-z000033, 1 022 035 B input)

Axis Before this PR After this PR FFI 1.5.7 Status
Compress time 32.07 ms (11.0×) 16.11 ms (5.5×) 2.94 ms −50% closed
Compressed size 556 121 B (+5.5%) 489 846 B (−7.07%) 527 148 B diverged; likely a cost source, not a win — see note above
Decompress time 4.43 ms (5.3×) 4.43 ms 0.84 ms Phase 7-decompress lane in #111
Memory (small ≤4 KiB) 378 KiB 249 KiB (−34%) 178 KiB partial close
Memory (medium 1 MiB) 5.01 MiB unchanged 3.80 MiB Phase 7-memory lane in #111
Memory (large 100 MiB stream) 9.92 MiB unchanged 3.80 MiB Phase 7-memory lane in #111

Small-input speedup (FSE cache, all levels)

FrameCompressor::new used to rebuild the three predefined LL/ML/OF FSE tables on every call (~12 µs × 3 = ~38 µs). Two-stage fix in commit 65d4cbdb:

  1. Cache tables in process-local AtomicPtr<FSETable>, clone on each ctor — drops the build cost but pays ~4-5 µs per clone of three 256-element SymbolStates arrays.
  2. Switch the cache return type to &'static FSETable and store the static reference inside FseTables directly — the clone disappears, the field becomes a pointer-sized copy.

Effect on small-1k-random/matrix/pure_rust at level 2 dfast (criterion, darwin/aarch64):

Initial baseline After cache (clone) After static refactor FFI 1.5.7
compress_slice_to_vec 44.25 us (22.07 MiB/s, 13.4x FFI) 12.67 us (77.11 MiB/s, 3.9x FFI) 6.08 us (160.75 MiB/s, 1.83x FFI) 3.31 us (298.24 MiB/s)

7.3x total compress-time speedup on the small-input scenario. Applies to every level. The residual ~2.8 us gap to FFI is per-frame setup not covered by this change; covered by Phase 7-compress-default in #111.

Per-scenario sanity at level_3_dfast (no regressions)

Scenario Rust (this PR) FFI 1.5.7 Delta
small-1k-random 1 038 B 1 038 B =
small-10k-random 10 254 B 10 254 B =
small-4k-log-lines 150 B 156 B -4%
decodecorpus-z000033 489 846 B 527 148 B -7%
high-entropy-1m 1 048 616 B 1 048 613 B +3 B (noise)
low-entropy-1m 150 B 159 B -6%

Commits

This PR is squashed from 30 commits in #145 into 3 logical commits — same final tree, cleaner review surface:

  1. feat(dfast): donor outer/inner loop rewrite for Level 3 baseline (8c5e7ae)
  2. perf(fse,encoding): cache FSE default tables as &static + tighten ctor (65d4cbd)
  3. test(zstd): interop 7-byte regression + bench-API alignment + gitignore scratch dirs (fd22dd3)

Phase 7 sub-phase plan (per #111)

This PR is the entry point for the restructured Phase 7 sub-phase taxonomy. After merge, the remaining sub-phases are:

Lane A — Compress (default-first): 7-compress-default (Dfast Level 3 residual) -> 7-compress-better (Lazy Level 7) -> 7-compress-best (BtOpt/BtUltra 16-19) -> 7-compress-fastest -> 7-compress-dfast-2 -> 7-compress-greedy -> 7-compress-lazy-lower -> 7-compress-lazy-upper -> 7-compress-btultra2.

Lane B — Decompress (NEW): 7-decompress-baseline (block_content_buffer 0-fill skip + NEON-overshoot match copy) -> 7-decompress-fse (state machine register-resident + BMI2 pdep/pext + NEON bit-extract) -> 7-decompress-huffman.

Lane C — Memory (NEW): 7-memory-tracker (per-alloc-site backtrace tracker, prerequisite) -> 7-memory-medium -> 7-memory-large.

Lane D — Tooling (NEW): 7-tooling-seq-cmp (sequence-stream comparator, prerequisite for 7-compress-default audit).

Verification gates

  • cargo nextest run --lib --features hash,std,dict_builder — 481 passed, 0 failed (includes interop_7_byte_input_does_not_oob_in_dfast_fast_loop regression).
  • cargo clippy --all-targets --features hash,std,dict_builder — 0 warnings.
  • cargo fmt --check — clean.
  • level22_sequences_match_donor_on_corpus_proxy — donor-parity canary at level 22 (untouched by this PR).
  • huff0::level22_emitted_literal_sections_are_accepted_by_donor_huf_reader — donor reader-acceptance canary.
  • 1 000-iteration roundtrip fuzzes (roundtrip_compressible_data_1000_iterations, roundtrip_random_data_1000_iterations, roundtrip_streaming_api_1000_iterations) — all green.
  • Linux fuzz interop smoke job — passes after the HASH_READ_SIZE guard fix; previously failed on the 7-byte input.

Test plan

  • CI green on all matrix shards (test/lint/build/codecov/fuzz/CodeRabbit)
  • No regression on level 22 sequence parity
  • No regression on per-scenario size delta at level 3 dfast
  • Donor outer/inner rewrite reviewed for borrow-checker / safety invariants (raw-pointer aliasing invariant block in start_matching_fast_loop)
  • rep1-only inline peek isolated to dfast fast loop; repcode_candidate_shared unchanged for lazy/opt callers
  • cargo bench --bench compare_ffi re-run on fresh checkout, numbers match the table above
  • Phase 7 plan in feat: encoder architecture rewrite (split monolith, const-generic Strategy, arena allocator) #111 updated to reflect this PR as 7pre and reordered sub-phases default-first

Summary by CodeRabbit

  • New Features

    • Added an opt-in feature to support constrained targets for table caching.
  • Bug Fixes

    • Fixed an out-of-bounds crash in the fast compression path.
  • Performance Improvements

    • Reduced peak memory during compression via smarter buffer sizing.
    • Cached entropy tables to avoid repeated recomputation.
    • Added a benchmark build profile with LTO for more representative results.
  • Tests

    • Added regression tests to guard compressor stability.
  • Chores

    • Updated ignore rules to skip local fuzz artifacts and temp dirs.

Review Change Stack

polaz added 3 commits May 17, 2026 13:50
Port `start_matching_fast_loop` to upstream zstd 1.5.7's outer/inner
structure (`zstd_double_fast.c`). The outer `while` runs once per
match-found-and-stored; the inner loop carries `hl0`/`idxl0`/`hl1`/
`idxl1` between iterations with two-position lookahead so the long
hash is computed once per two positions scanned, not per position.
Hash tables are updated at `curr` BEFORE the match check (upstream
parity at `zstd_double_fast.c:187`); the short check is a 4-byte
gate only with forward `common_prefix_len` running only on hit; the
`_search_next_long` retry with precomputed `idxl1` follows the
preferred-longer-match policy. Match-finder probes inline into the
hot loop via raw-pointer arithmetic; `probe_slot_match` is preserved
for the cold callers (`hash_candidate`, dictionary prime, retry/
seed). The inner-loop exit shape is an explicit
`enum InnerExit { Committed(MatchCandidate), Tail(usize) }`; every
`break 'inner` picks a variant so the post-loop match is exhaustive,
removing the implicit pairing the previous draft had between
`Option<MatchCandidate>` and a sibling `tail_seed_anchor`.

Supporting changes inside the same fast-loop region:

- DFAST_HASH_BITS ceiling lowered from 20 to 17 per upstream
  clevels.h:31 Level 3 sizing; slot storage moves to single u32 +
  `position_base` rebase scheme (upstream `ZSTD_window_reduce`
  parity).
- DFAST_MIN_MATCH_LEN drops from 6 to 5 per upstream `clevels.h:31`
  default Level 3 mls=5 — we were silently dropping every 5-byte
  match. DFAST_REP_MIN_MATCH_LEN stays at 4 (rep extensions inherit
  the 4-byte upstream floor since rep encoding carries no on-wire
  offset cost).
- Hash kernel switches to upstream's scalar multiply
  (`v.wrapping_mul(prime8bytes)` then shift), dropping the CRC32d
  kernel dispatch on the dfast hot path; the CRC32d kernel was 3-4
  instructions of dispatch overhead per insert before the inline,
  upstream's one-mul-shift is strictly cheaper.
- Repcode peek inlined at ip+1 with a 4-byte gate (upstream
  zstd_double_fast.c:190); only `offset_1` is probed inline, the
  full rep0/rep1/rep2 walk lives in the lazy/btopt strategies.
  Removing the rep2/rep3 walk from the dfast inline saved both
  speed (16% on level_3_dfast/decodecorpus-z000033) and ratio
  (−494 B; fewer rep2/rep3 false-positive picks displacing better
  hash matches).
- Skip-step now grows by scan distance rather than consecutive-miss
  threshold (upstream zstd_double_fast.c:224-228 parity): step
  increases every `DFAST_SKIP_STEP_GROWTH_INTERVAL = 64` positions
  traveled (upstream uses 256 — see "Donor-deviation audit" in
  PR body), so blocks the strict `block_looks_incompressible_strict`
  gate used to bail out of early now scan through the standard
  ramp. The strict helper itself stays for `levels/fastest.rs` and
  `incompressible.rs`.
- Fast-loop boundary tail-position probe (`probe_tail_ip0_only`)
  covers the case where `ip0` is still hashable but `ip1` falls
  past the end (`pos == current_len - HASH_READ_SIZE`). Mirrors the
  inner loop's long+short probe at ip0; the rep peek and
  `_search_next_long` retry are skipped (both depend on ip1),
  matching upstream's iend-boundary tradeoff.
- Short-hash hit enforces `DFAST_MIN_MATCH_LEN` floor before
  committing. The 4-byte equality gate plus forward extension can
  produce `match_len < 5`; the `_search_next_long` retry can still
  upgrade to a valid long match, but with no upgrade the
  below-floor hit is now discarded and the loop continues. Matches
  the floor `probe_slot_match` already enforces on the long-hash
  main path.
- Fast-loop guards tightened from `+ DFAST_MIN_MATCH_LEN (5)` to
  `+ HASH_READ_SIZE (8)`. Previous guard let the unconditional
  8-byte `u64::read_unaligned` for the long-hash probe read up to 3
  bytes past the live history end; Linux fuzz ASan caught it on a
  7-byte input. Regression test added in `tests/fuzz_regressions.rs`.
- Fast-loop entry calls `ensure_room_for` to rebase position_base
  before any inner-loop slot pack. The inner loop's `packed_curr`
  cast `(abs_ip0 - position_base) as u32` could truncate after a
  long stream of all-miss / non-hashable blocks (fast-loop bypasses
  `insert_position`'s rebase).
- `trim_to_window` now drops the dead prefix via `split_off`
  instead of `compact_history`'s `drain` (which leaves the original
  Vec's capacity pinned); the trim boundary pays one realloc in
  exchange for actually shedding the prefix to the system
  allocator. Signature also drops the unused `_reuse_space` closure
  (Dfast's history-only storage has no per-block Vec to recycle —
  unlike HC / Row which need the callback).
- `skip_matching` / `skip_matching_dense` / `start_matching` early-
  return on `current_len == 0`. `add_data` short-circuits empty
  input without pushing to `window_blocks`; without the guard these
  methods would re-seed the previous block's retained tail on
  every empty flush.
- `get_last_space` (trait dispatch entry point) returns `&[]` on
  empty `window_blocks` instead of panicking, mirroring the inline
  gate pattern used by `skip_matching` / `start_matching` /
  `emit_candidate` / `emit_trailing_literals`.
- `probe_slot_match` takes a `gate_len` parameter (8 for long-hash
  callers, 4 for short-hash). Replaces the prior 1-byte precheck
  which let 8-byte-collision short matches slip past the gate.

`Cargo.toml` adds `[profile.bench]` overrides (`lto = "fat"`,
`codegen-units = 1`) scoped to bench builds only — full cross-crate
inlining for benches without forcing 5-10× release build-time on
`cli` and downstream consumers.

Verification on `compress/level_3_dfast/decodecorpus-z000033`
(1 022 035 B input):
  Compress time: 32.07 ms (11.0× FFI) -> 16.11 ms (5.5× FFI)
  Compressed size: 556 121 B (+5.5%) -> 489 846 B (−7.07%)
  Memory (small ≤4 KiB): 378 KiB -> 249 KiB (−34%)

Compressed-size delta is honestly framed: smaller-than-donor does
NOT necessarily mean we win — extra match-finding work the
upstream fast path skips can produce smaller output AND slower
encode time simultaneously. The sequence-stream comparator (#99,
deferred to Lane D of restructured Phase 7 in #111) is the audit
tooling needed to triage.
`FrameCompressor::new` used to rebuild the three predefined LL/ML/OF
FSE tables on every call (~12 µs per `build_table_from_probabilities`
× 3 = ~38 µs). On a 1 KiB frame the actual compression work is
~1-5 µs, so the predefined-table build dominated per-frame cost for
tiny inputs.

Two-stage fix:

1. Each `default_*_table()` helper owns a private
   `static AtomicPtr<FSETable>` initialized via lock-free
   compare-exchange (`core::sync::atomic`, works in std and no_std
   alike — only `alloc` is required). First call leaks a
   `Box<FSETable>` to the cache; subsequent calls return the same
   pointer. The slot is selected at compile time by the static
   identity of each helper, not by pointer comparison on the
   distribution slice — `LL_DIST`/`ML_DIST`/`OF_DIST` are `const`,
   not `static`, and rustc does not guarantee a single materialized
   address for const slice references (a future codegen change that
   duplicates the const per use site would silently bypass a
   pointer-keyed cache). The helper now returns `&'static FSETable`.

2. `FseTables` stores the static reference directly
   (`ll_default: &'static FSETable`) rather than an owned
   `FSETable`. `FrameCompressor::new`'s FSE setup becomes three
   pointer-sized copies instead of three deep clones of 256-element
   `SymbolStates` arrays. `clone_fse_tables` (block-split estimator
   path) becomes a pointer copy on the three default fields too;
   only the `*_previous` per-frame tables still need a deep clone.

`get_or_init_cached_table` uses `Acquire` on the load and `AcqRel`
on the compare-exchange success path. The success-path ordering is
stronger than the minimal `Release` would suggest; matching the
failure `Acquire` keeps the success/failure branches symmetric on
rustc's atomic surface, and the path runs at most three times per
process so any cost is in the noise.

Verification on `compress/level_2_dfast/small-1k-random/matrix/
pure_rust` (criterion, darwin/aarch64, fresh checkout):
  Pre-cache baseline:      44.25 µs (22.07 MiB/s, 13.4× FFI)
  After AtomicPtr cache:   12.67 µs (77.11 MiB/s,  3.9× FFI)
  After &'static refactor:  6.08 µs (160.75 MiB/s, 1.83× FFI)
  FFI 1.5.7:                3.31 µs (298.24 MiB/s)

Cumulative: 7.3× compress-time speedup on the small-input scenario.
Applies to every level since `FseTables::new()` runs inside
`FrameCompressor::new` regardless of strategy.

Adjacent encoding tightening in the same commit:

- `compress_to_vec` is split — slice-based work moves into a new
  public `compress_slice_to_vec` that accepts `&[u8]` directly,
  while the `Read`-based variant remains a thin wrapper. Both seed
  the output `Vec` with `min(compress_bound(src), 128 KiB)` instead
  of the prior `compress_bound(src)` upfront pin — saves ~100 MiB
  peak on `large-log-stream` (100 MiB highly-compressible input).
  `compress_slice_to_vec` rustdoc enumerates the RSS shape change
  and the `# Panics` modes (encoder error, OOM); `compress_to_vec`
  rustdoc explicitly flags that it's NOT a streaming API — the
  source is materialized fully before any compression runs, peak
  bounded by `source.len()`. Both surfaces point at
  [`StreamingEncoder`] for unbounded inputs.
- `FrameCompressor::compress`'s `all_blocks` accumulator pre-
  reserves capacity scaled by the source-size hint instead of a
  flat 128 KiB. Saves ~130 KiB peak for sub-4 KiB inputs without
  regression on medium / large.

The `start_matching_fast_loop` preamble grows a raw-pointer
aliasing invariant SAFETY block enumerating which fields the inner
loop touches through cached `as_mut_ptr` raw pointers
(`short_hash`, `long_hash`, `history`) vs which it accesses through
`&self` shared reads (`offset_hist`, slice views of `history`).
The pattern is sound under Stacked Borrows / Tree Borrows because
the fields are physically disjoint allocations; the comment locks
the invariant in so a future refactor adding a `&mut self` call
inside the loop has to either hoist the call or reload the cached
raw pointers.

needless-borrow clippy autofixes in
`encoding/blocks/compressed.rs`, `fse/mod.rs`, and
`encoding/match_generator.rs` (consumers of the now-`&'static`
default tables no longer need an outer `&`).
…re scratch dirs

- `tests/fuzz_regressions.rs`: new
  `interop_7_byte_input_does_not_oob_in_dfast_fast_loop` regression
  for the Linux-fuzz `interop` artifact that exposed the dfast
  fast-loop OOB read. Bytes inline; pins to
  `CompressionLevel::Level(3)` so a future retune of the `Default`
  alias cannot accidentally route the test off the dfast path.
  Roundtrips through the in-tree decoder. Docstring calls out the
  sanitizer caveat — a plain `cargo nextest run` may complete
  successfully against pre-fix code because the OOB
  `read_unaligned` usually lands inside the live `Vec`'s spare
  capacity; the regression reliably fires only under ASan (Linux
  fuzz CI) or `cargo +nightly miri test`. Treat green local test
  as a smoke check, the authoritative signal is the CI fuzz job.

- `benches/compare_ffi.rs` + `compare_ffi_memory.rs`: minor
  alignment with the public-API split (the new
  `compress_slice_to_vec` entry point); no behaviour changes to
  the bench harness, just keeping the call sites in sync.

- `.gitignore`: ignore `.local/` (per-branch scratch /
  helper-doc directory used for `PHASE7C_FINDINGS.md` etc.),
  cargo-fuzz artifacts under `zstd/fuzz/artifacts/**` (libfuzzer
  content-hash filenames change on each run; in-repo regression
  vectors live inline in `tests/fuzz_regressions.rs` instead).
Copilot AI review requested due to automatic review settings May 17, 2026 10:54
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 17, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 8ac47496-61d5-4a97-a127-d317245b4eb1

📥 Commits

Reviewing files that changed from the base of the PR and between 256c28b and b86f51e.

📒 Files selected for processing (2)
  • zstd/src/encoding/match_generator.rs
  • zstd/src/tests/fuzz_regressions.rs

📝 Walkthrough

Walkthrough

Caches predefined FSE tables with cfg-specific backings, exposes ref-based accessors, pre-allocates compression outputs via a compress_bound heuristic, refactors Dfast to a contiguous history and single-slot hash tables with rewritten matching, and updates benches, tests, Cargo features, and .gitignore.

Changes

FSE Caching, Encoder Integration, and Dfast Matcher Refactor

Layer / File(s) Summary
FSE table caching and default-table API
zstd/src/fse/fse_encoder.rs, zstd/src/fse/mod.rs
Add cached/static and boxed fallback paths for default LL/ML/OF FSE tables and expose FseDefaultTable alias; update test callsites to deref when needed.
Encoder & blocks: use default refs and clone behavior
zstd/src/encoding/frame_compressor.rs, zstd/src/encoding/blocks/compressed.rs
Change FseTables default fields to FseDefaultTable, add *_default_ref() accessors, update clone_fse_tables, and pass default refs into choose_table/previous_table and tests.
compress_bound and preallocation
zstd/src/encoding/mod.rs
Add compress_bound(src_size) const fn; make compress_to_vec delegate to compress_slice_to_vec and pre-size output Vec with min(compress_bound(len), OUTPUT_BLOCK_CAP).
FrameCompressor hint capture and all_blocks tiering
zstd/src/encoding/frame_compressor.rs
Capture source_size_hint locally and compute tiered initial capacity for all_blocks from tiny/small/default thresholds.
Dfast state: history and single-slot hash tables
zstd/src/encoding/dfast/mod.rs
Replace per-block retained VecDeque<Vec<u8>> with contiguous history + window_blocks; switch multi-slot buckets to single-slot Vec<u32> tables and add short/long bit sizing and index helpers.
Dfast matching loops, probe, rep-extension, and probing
zstd/src/encoding/dfast/mod.rs
Rewrite start_matching_fast_loop, add probe_tail_ip0_only, centralize probe_slot_match, enforce rep-extension floor, and update insertion/probing semantics for single-slot tables.
Dfast constants, driver API, and match-generator integration
zstd/src/encoding/match_generator.rs, zstd/src/encoding/dfast/mod.rs
Lower DFAST_MIN_MATCH_LEN 6→5, remove DFAST_SEARCH_DEPTH, add DFAST_SHORT_HASH_BITS_DELTA, change reset/trim_to_window contract to no-callback form, and fix eviction accounting to use window_size deltas.
Dfast tests and regression coverage
zstd/src/encoding/match_generator.rs, zstd/src/encoding/dfast/mod.rs, zstd/src/tests/fuzz_regressions.rs
Update tests for single-slot equality/indexing and window_blocks state, remove callback-based eviction assertions, and add exact-5-byte and 7-byte regressions.

Benchmark Updates, Config, and Regression Tests

Layer / File(s) Summary
Benchmark compression API migration
zstd/benches/compare_ffi.rs, zstd/benches/compare_ffi_memory.rs
Benches switch Rust-side compression calls from compress_to_vec to compress_slice_to_vec.
Dfast fast-loop OOB regression test
zstd/src/tests/fuzz_regressions.rs
Add regression test compressing a fixed 7-byte input and asserting successful roundtrip decode.
Cargo bench profile, zstd feature, and gitignore
Cargo.toml, zstd/Cargo.toml, .gitignore
Add [profile.bench] (lto="fat", codegen-units=1), add optional critical-section dep and public feature, and ignore zstd/fuzz/artifacts/** and .local/ in .gitignore.
Gate internal tests module
zstd/src/lib.rs
Add #[cfg(test)] to mod tests; so internal tests compile only during test builds.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related issues

Possibly related PRs

Poem

🐰 I cached the tables beneath the moonlit tree,

I sized my vec to fit what soon would be,
From bucketed nests to single-slot delight,
Five-byte matches hop and keep the loop polite,
A rabbit hums: regressions fixed—goodnight.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'perf(7pre): enablement — donor outer/inner dfast Level 3 + cross-level FSE cache' directly and accurately describes the primary changes: performance improvements through donor loop rewrite for dfast Level 3 and FSE table caching across compression levels.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch perf/7pre-enablement

Comment @coderabbitai help to get the list of available commands and usage tips.

@codecov
Copy link
Copy Markdown

codecov Bot commented May 17, 2026

Codecov Report

❌ Patch coverage is 94.46640% with 42 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
zstd/src/encoding/dfast/mod.rs 92.61% 41 Missing ⚠️
zstd/src/encoding/match_generator.rs 99.07% 1 Missing ⚠️

📢 Thoughts on this report? Let us know!

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Phase 7pre enablement work to bring dfast Level 3 closer to donor behavior/perf and lay groundwork for cross-level improvements, including reducing per-frame setup costs and tightening memory behavior.

Changes:

  • Reworked dfast internals (min match length, table layout, fast-loop structure, history storage) to align more closely with donor semantics and reduce memory footprint.
  • Added a process-local cache for predefined LL/ML/OF FSE tables and refactored encoder state to store &'static FSETable references to avoid per-frame rebuild/clone cost.
  • Added a fuzz regression test for the 7-byte interop crash and aligned benches to use the new slice-to-Vec compression entry point.

Reviewed changes

Copilot reviewed 11 out of 12 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
zstd/src/tests/fuzz_regressions.rs Adds 7-byte regression test covering the prior dfast fast-loop OOB/UB scenario.
zstd/src/fse/mod.rs Updates tests/call sites for predefined FSE helpers now returning &'static FSETable.
zstd/src/fse/fse_encoder.rs Implements predefined FSE table caching and changes default-table helpers to return &'static references.
zstd/src/encoding/mod.rs Adds compress_bound() and new compress_slice_to_vec(); routes compress_to_vec() through it and seeds output capacity.
zstd/src/encoding/match_generator.rs Updates dfast constants and driver lifecycle logic to match dfast backend refactors and default-table signature changes.
zstd/src/encoding/frame_compressor.rs Stores default FSE tables by reference; tweaks block accumulation capacity based on source size hint.
zstd/src/encoding/dfast/mod.rs Major dfast matcher refactor: single-slot hash tables, history-only storage, fast-loop safety/perf changes, and guard tightening.
zstd/src/encoding/blocks/compressed.rs Adapts FSE table selection/clone logic to &'static default tables.
zstd/benches/compare_ffi.rs Uses compress_slice_to_vec() to avoid benchmarking input-buffering overhead.
zstd/benches/compare_ffi_memory.rs Uses compress_slice_to_vec() for more representative peak-memory measurement.
Cargo.toml Sets [profile.bench] to fat LTO + codegen-units=1 for more realistic bench results.
.gitignore Ignores new local fuzz artifacts and .local/ scratch output.

Comment thread zstd/src/fse/fse_encoder.rs
Comment thread zstd/src/fse/fse_encoder.rs Outdated
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@zstd/src/encoding/dfast/mod.rs`:
- Around line 1438-1445: Add a debug assertion to ensure the structural
precondition that the trailing-block length is positive before slicing: in the
block where you compute last_len from
self.window_blocks.back().copied().unwrap_or(0) (used by emit_candidate), insert
a debug_assert!(last_len > 0) (or equivalent) immediately before creating
current = &self.history[self.history.len() - last_len..] so tests fail at the
source of the invariant violation rather than later when indexing current.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: fe52333b-a095-4426-94af-042976806f2e

📥 Commits

Reviewing files that changed from the base of the PR and between 9d6a4fc and fd22dd3.

📒 Files selected for processing (12)
  • .gitignore
  • Cargo.toml
  • zstd/benches/compare_ffi.rs
  • zstd/benches/compare_ffi_memory.rs
  • zstd/src/encoding/blocks/compressed.rs
  • zstd/src/encoding/dfast/mod.rs
  • zstd/src/encoding/frame_compressor.rs
  • zstd/src/encoding/match_generator.rs
  • zstd/src/encoding/mod.rs
  • zstd/src/fse/fse_encoder.rs
  • zstd/src/fse/mod.rs
  • zstd/src/tests/fuzz_regressions.rs

Comment thread zstd/src/encoding/dfast/mod.rs
…debug_assert

- `fse_encoder`: gate the `AtomicPtr<FSETable>` lock-free cache
  behind `#[cfg(target_has_atomic = "ptr")]`. Targets that lack
  atomic pointer support (e.g. `thumbv6m-none-eabi`, AVR) were
  failing to compile against the unconditional `core::sync::atomic`
  import. Added a `#[cfg(not(target_has_atomic = "ptr"))]` fallback
  that uses a `static mut` raw pointer with single-threaded serial
  access — targets without atomic pointer support are by
  construction single-threaded embedded environments, so a
  non-atomic mutable static is sound. Same `Box::leak` lifetime,
  same `&'static FSETable` return shape as the atomic path, so
  consumers don't see a feature-flag-shaped API. Mirrors the
  pattern the crate already uses for `decoding::dictionary::
  DictionaryHandle` (`Arc` vs `Rc`). Verified with
  `cargo check --target thumbv6m-none-eabi --no-default-features
  --features hash`.
- `fse_encoder`: header comment block rewritten to describe both
  the atomic and no-atomic paths and to drop the stale "hand
  callers a clone" wording — the API now returns `&'static
  FSETable`, callers no longer clone.
- `dfast::emit_candidate`: added `debug_assert!(last_len > 0)`
  immediately after the `window_blocks.back().copied().unwrap_or(0)`
  read. The function's structural precondition (it runs only after
  a successful match in the active block) was previously documented
  in a comment but not enforced in debug builds — the assert
  surfaces a violation at the source rather than via a confusing
  out-of-range panic on the `current[*literals_start..start]`
  subslice below.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 12 changed files in this pull request and generated 1 comment.

Comment thread zstd/src/fse/fse_encoder.rs Outdated
…uild

Replaces the prior static-mut no-atomic fallback (which assumed
non-concurrency on no-atomic targets — wrong, interrupt /
preemption reentrancy on bare-metal can still cause concurrent
first-call initialization, making the static-mut slot a data race
surface and UB) with a three-way split driven by
target_has_atomic = "ptr" and the new optional critical-section
feature:

1. target_has_atomic = "ptr" (every modern desktop / server /
   mobile / Cortex-M3+ / RISC-V-A target) — unchanged lock-free
   AtomicPtr cache. Sound under multi-threaded access (rayon
   worker threads racing on first call all converge through the
   compare_exchange; subsequent reads of &static FSETable are
   immutable shared references which are Sync because FSETable
   has no interior mutability).
2. not(target_has_atomic = "ptr") + critical-section feature —
   static CACHE: CsCachedTablePtr (newtype wrapping
   UnsafeCell<*mut FSETable> with manual unsafe impl Sync),
   accessed exclusively under critical_section::with. CS impl
   supplied by the downstream embedded runtime (cortex-m-rt,
   riscv-rt, embassy-executor, esp-hal, ...). First call
   takes the CS, builds + Box::into_raws, publishes; subsequent
   calls return the cached pointer. CS held for one
   build_table_from_probabilities (~12 us) on the cold path,
   zero overhead afterwards.
3. not(target_has_atomic = "ptr") without critical-section —
   per-call build + Box::leak. No cache, accepts the memory
   accumulation tradeoff against the UB risk of a synchronization-
   free static-mut slot. Same as the pre-PR status quo on
   no-atomic, with the addition of intentional leak (which embedded
   users who repeatedly construct FrameCompressor should mitigate
   by enabling the critical-section feature).

Cargo.toml gains the optional dep critical-section = "1.2" gated
by a new feature of the same name. On atomic targets the dep is
dead code and never instantiated.

Header comment block rewritten to document all three paths,
including the explicit "lack of atomic instructions is not a
non-concurrency guarantee" rationale that prompted this revision.
The previous draft's safety argument was wrong; the new comment
makes the trade-offs explicit per target family.

Verified host atomic path (cargo clippy / nextest 481 passed) and
both no-atomic sub-paths (cargo check --target thumbv6m-none-eabi
with and without critical-section feature) build clean.
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@zstd/src/fse/fse_encoder.rs`:
- Around line 819-830: The current no-atomic fallback permanently leaks memory
by Box::leak in build_and_leak_table; replace the leaking behavior with an
owned, non-'static return so callers hold responsibility for dropping the table:
change build_and_leak_table to construct the table with
build_table_from_probabilities and return either Box<FSETable> or FSETable
(instead of &'static FSETable), update all call sites to accept and drop the
owned table (or call .into_inner() where appropriate), and remove the Box::leak
to avoid unbounded memory growth on non-atomic targets.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: c533324b-2991-4c7b-bb9f-7e078b21bc3e

📥 Commits

Reviewing files that changed from the base of the PR and between 9001027 and bb70fbd.

📒 Files selected for processing (2)
  • zstd/Cargo.toml
  • zstd/src/fse/fse_encoder.rs

Comment thread zstd/src/fse/fse_encoder.rs Outdated
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 12 out of 13 changed files in this pull request and generated 1 comment.

Comment thread zstd/src/fse/fse_encoder.rs
Removes the per-call Box::leak from the no-atomic-without-
critical-section fallback. The leak accumulated memory unboundedly
across multiple FrameCompressor::new calls on the smallest
supported targets (Cortex-M0/M0+, AVR, MSP430) — CodeRabbit and
Copilot both flagged the unbounded-growth concern.

Type-level fix via a new fse_encoder::FseDefaultTable cfg alias:

- target_has_atomic = "ptr" OR feature "critical-section":
  &'static FSETable (cached singleton, zero-cost subsequent calls)
- otherwise: alloc::boxed::Box<FSETable> (owned per-frame
  allocation, dropped with the owning FrameCompressor — same
  memory shape as the pre-PR status quo on no-atomic targets)

FseTables struct fields switch from a hard-coded &'static FSETable
to the type alias; both arms Deref to FSETable. Consumers in
encoding/blocks/compressed.rs borrow uniformly through three new
FseTables accessor methods (ll_default_ref / ml_default_ref /
of_default_ref) so the per-target divergence stays inside the
struct surface and downstream call sites do not need cfg.

clone_fse_tables (block-split estimator path, levels 11+) uses an
explicit per-arm clone via cfg-conditional struct-literal fields:
on atomic / CS the field is Copy (zero-cost field-access copy);
on no-atomic-no-CS the field is Box<FSETable> and needs
Clone::clone for a deep copy. The naive uniform .clone() resolves
via auto-deref to FSETable::clone (returns owned FSETable) which
is the WRONG return type for the atomic arm, hence the explicit
split.

Verified: cargo clippy --lib --tests clean on host (atomic);
cargo nextest run --lib passes 481/481.

Pre-existing thumbv6m build failure in tests/roundtrip_integrity.rs
(unconditional extern crate std at line 6 vs the cfg-gated mod
declaration in tests/mod.rs:582) remains — same failure mode on
main, not introduced by this commit. Tracked separately.
Copy link
Copy Markdown

@sw-release-bot sw-release-bot Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'structured-zstd vs C FFI (x86_64-gnu)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.30.

Benchmark suite Current: b86f51e Previous: 1e0f695 Ratio
compress/level_22_btultra2/small-4k-log-lines/matrix/c_ffi 0.113 ms 0.067 ms 1.69
compress/level_22_btultra2/decodecorpus-z000033/matrix/c_ffi 248.653 ms 171.961 ms 1.45

This comment was automatically generated by workflow using github-action-benchmark.

CC: @polaz

The `tests/` directory contains test-only modules with unconditional
`extern crate std;` items (e.g. `roundtrip_integrity.rs:6`). With
`mod tests;` declared unconditionally in `lib.rs`, those modules
were pulled into every lib build — including no_std targets that
do not have `std` available — producing E0463 "can't find crate
for `std`" on `thumbv6m-none-eabi`, AVR, MSP430, etc.

Gating the module declaration on `cfg(test)` is the canonical Rust
pattern: the test code is part of the lib only when the harness is
being built; the production lib stays clean. No other module in
the crate references `crate::tests::*` from non-test code, so the
gate is behaviour-preserving for the test harness path.

Verified:
- cargo check --lib --no-default-features --features hash --target
  thumbv6m-none-eabi — passes (previously failed with 3 E0463 /
  E0432 errors)
- cargo check --lib --no-default-features --features
  hash,critical-section --target thumbv6m-none-eabi — passes
- cargo nextest run --lib on host — 481/481 tests pass, unchanged
- cargo build --lib on host (default features) — clean
Copilot AI review requested due to automatic review settings May 17, 2026 11:53
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 13 out of 14 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (4)

zstd/src/encoding/match_generator.rs:4235

  • Same FseDefaultTable cfg-typing issue as above: Some(ll) only works when default_*_table() returns &'static FSETable. To keep these tests compiling on no-atomic/no-critical-section targets (where the alias is Box<FSETable>), borrow through Deref (e.g., Some(&*ll)).
    let ll = crate::fse::fse_encoder::default_ll_table();
    let ml = crate::fse::fse_encoder::default_ml_table();
    let of = crate::fse::fse_encoder::default_of_table();
    hc.seed_dictionary_entropy(None, Some(ll), Some(ml), Some(of));

zstd/src/encoding/match_generator.rs:4278

  • Same FseDefaultTable cfg-typing issue: Some(ll) will fail to compile when default_*_table() returns Box<FSETable> (no-atomic/no-critical-section). Borrow through Deref (e.g., Some(&*ll) / Some(&*ml) / Some(&*of)) to keep this test cfg-agnostic.
    let ll = crate::fse::fse_encoder::default_ll_table();
    let ml = crate::fse::fse_encoder::default_ml_table();
    let of = crate::fse::fse_encoder::default_of_table();
    hc.seed_dictionary_entropy(None, Some(ll), Some(ml), Some(of));

zstd/src/encoding/match_generator.rs:4338

  • Same FseDefaultTable cfg-typing issue: Some(ll) assumes default_*_table() returns a reference. On cfgs where it returns Box<FSETable>, this call won’t type-check against Option<&FSETable>. Borrow through Deref so the test remains portable.
    let ll = crate::fse::fse_encoder::default_ll_table();
    let ml = crate::fse::fse_encoder::default_ml_table();
    let of = crate::fse::fse_encoder::default_of_table();
    hc.seed_dictionary_entropy(None, Some(ll), Some(ml), Some(of));

zstd/src/encoding/match_generator.rs:4484

  • Same FseDefaultTable cfg-typing issue: default_*_table() can return Box<FSETable> on no-atomic/no-critical-section builds, so Some(ll) won’t match Option<&FSETable>. Borrow through Deref (e.g., Some(&*ll)) to keep this test compiling across cfgs.
    let ll = crate::fse::fse_encoder::default_ll_table();
    let ml = crate::fse::fse_encoder::default_ml_table();
    let of = crate::fse::fse_encoder::default_of_table();
    stats.seed_dictionary_entropy(Some(&huff), Some(ll), Some(ml), Some(of));

Comment thread zstd/src/fse/mod.rs Outdated
Comment thread zstd/src/encoding/match_generator.rs Outdated
Comment thread zstd/Cargo.toml Outdated
`default_ll_table()` / `default_ml_table()` / `default_of_table()`
now return `FseDefaultTable`, the cfg-typed alias added in the
previous round. On atomic / `critical-section` targets it resolves
to `&'static FSETable` (passes directly anywhere `&FSETable` is
expected); on no-atomic / no-CS it resolves to `Box<FSETable>`
(needs `&*` reborrow to satisfy a `&FSETable` parameter).

Six test sites were assuming the reference shape and would not
compile on the no-atomic / no-CS arm:

- `fse/mod.rs::table_header_bits_exact` (3 `check(...)` calls)
- `match_generator.rs::dictionary_entropy_seed_initializes_opt_state_from_tables`
- `match_generator.rs::dictionary_fse_seed_applies_without_huffman_seed`
- `match_generator.rs::dictionary_seed_overrides_predef_price_mode_on_tiny_input`
- `match_generator.rs::btultra2_seed_pass_disabled_when_dictionary_entropy_seed_present`
- `match_generator.rs::dictionary_huffman_seed_ignored_when_literals_uncompressed`

Wrapped the 15 call-site occurrences in `&*` so the reborrow is
canonical across both cfg arms. On the atomic arm clippy flags
`&*` as needless (the value is already `&FSETable`); each affected
test function carries a localized `#[allow(clippy::borrow_deref_ref)]`
next to its `#[test]` attribute.

`Cargo.toml` `critical-section` feature comment refreshed to match
the actual no-atomic / no-CS behaviour: per-call build of an owned
`Box<FSETable>` stored in `FseTables` and dropped with the owning
`FrameCompressor` — no Box::leak, no unbounded growth (the prior
draft of the comment described leaky behaviour that has since been
fixed).
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 13 out of 14 changed files in this pull request and generated no new comments.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@zstd/src/encoding/match_generator.rs`:
- Around line 681-692: The Dfast branch in commit_space() is incorrectly
counting the input buffer as evicted; update the Dfast handling (use
dfast_matcher_mut()/Dfast matcher) to compute eviction by capturing pre =
dfast.window_size, calling dfast.add_data(space, |...| { ... }) (the existing
callback), and then incrementing evicted_bytes by pre - dfast.window_size
instead of using the input buffer size; this mirrors the
trim_to_window()/window_size-delta pattern used in trim_after_budget_retire and
prevents premature shrinking of dictionary_retained_budget/max_window_size.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: d667e539-adf7-4ab0-9ff3-0744dbb78890

📥 Commits

Reviewing files that changed from the base of the PR and between 3f5a6a9 and 64e0fcc.

📒 Files selected for processing (3)
  • zstd/Cargo.toml
  • zstd/src/encoding/match_generator.rs
  • zstd/src/fse/mod.rs

Comment thread zstd/src/encoding/match_generator.rs
…mit_space

The `commit_space` Dfast branch was counting the INPUT buffer length
as `evicted_bytes`. Dfast's `add_data` callback receives the input
`Vec<u8>` for pool recycling (Dfast stores bytes in the contiguous
`history` buffer; there is no per-block `Vec<u8>` to pop and hand
back, unlike Row / HashChain which the callback contract is shaped
around). Counting `data.len()` therefore conflated "new bytes
ingested" with "old bytes evicted from the window".

The two happen to coincide when the prior window was exactly
saturated and the new input fills it 1:1 — which is the only case
the existing `dfast_add_data_callback_reports_evicted_len_not_capacity`
fixture exercised (max=8, prev=8, new=8 → both numbers are 8). The
bug surfaces on the divergent path: max=10, window=[4,4] (size 8),
new input 5 bytes → pop one 4-byte block (evicted=4), push 5
(window=[4,5], size=9). The old accounting reported `evicted=5`,
which then drives `retire_dictionary_budget` to reclaim 1 byte too
many and shrinks `max_window_size` prematurely — degrading
dictionary retention.

Fix: derive the actual eviction from `window_size` before/after,
plus the input length. The pop loop inside `add_data` decrements
`window_size` by each evicted block length and the final
`extend_from_slice + push_back` adds `space_len`, so
`evicted = pre + space_len - post` is the exact reverse-engineering
of what the pop loop did. Same `window_size`-delta pattern already
in use by `trim_after_budget_retire` for the Dfast variant. The
`pre.saturating_add(space_len).saturating_sub(m.window_size)` form
keeps the arithmetic robust against pathological pre+space_len
overflow on i686 (where `usize::MAX` is reachable in tests of
extreme inputs).

Regression test `dfast_commit_space_eviction_uses_window_size_delta`
forces the divergent case (4 + 4 + 5 = 13 bytes vs max_window = 10)
and asserts the formula yields 4 (the actual eviction), not 5 (the
input size). The prior `dfast_add_data_callback_reports_evicted_len_not_capacity`
fixture is unchanged — its concern (callback data is not the raw
backing capacity) is orthogonal and still holds.
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@zstd/src/encoding/match_generator.rs`:
- Around line 6990-7048: The test
dfast_commit_space_eviction_uses_window_size_delta currently exercises
DfastMatchGenerator.add_data directly; instead exercise the driver path by
constructing a MatchGeneratorDriver (wrapping the DfastMatchGenerator) and feed
the three blocks [4,4,5] through the driver's public input API so that
MatchGeneratorDriver.commit_space / its driver-backed commit logic runs the same
code path as the production driver; after the third input assert the driver's
observable dictionary_retained_budget (or the driver's window-related
observable) equals 96 (start 100 → evicted 4) and/or that the driver's window
size delta matches the pre+space_len-post calculation to ensure commit_space was
applied by the driver rather than only by DfastMatchGenerator.add_data.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 1fb99246-f4d5-4f12-8543-5968a67bd059

📥 Commits

Reviewing files that changed from the base of the PR and between 64e0fcc and 256c28b.

📒 Files selected for processing (1)
  • zstd/src/encoding/match_generator.rs

Comment thread zstd/src/encoding/match_generator.rs
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 13 out of 14 changed files in this pull request and generated 1 comment.

Comment thread zstd/src/tests/fuzz_regressions.rs
… std

- dfast_commit_space_eviction_uses_window_size_delta now drives the
  `[4, 4, 5]` fixture through `MatchGeneratorDriver::commit_space`
  rather than calling `DfastMatchGenerator::add_data` directly and
  re-deriving the formula on the matcher side. The previous draft
  would have passed even if `commit_space`'s Dfast branch regressed
  back to `evicted_bytes += data.len()` — the matcher-only path
  doesn't run `commit_space` at all. Asserting on the driver's
  observable `dictionary_retained_budget` and the post-retire
  `max_window_size` proves the driver-side accounting is the one
  that fires.
- Trace through the assertion targets (budget = 100, max_window = 10,
  commits [4, 4, 5]):
    1st commit: ws 0 → 4, evicted = 0, no retire. budget 100.
    2nd commit: ws 4 → 8, evicted = 0, no retire. budget 100.
    3rd commit: 8 + 5 > 10 → pop 4-byte block, push 5. ws 9.
      evicted = pre + space_len - post = 8 + 5 - 9 = 4.
      retire(4) → budget 96, max_window 6, returns true.
      trim_after_budget_retire cascade:
        iter1: ws 9 > 6, pop [4] → ws 5, evicted 4.
               retire(4) → budget 92, max_window 2.
        iter2: ws 5 > 2, pop [5] → ws 0, evicted 5.
               retire(5) → budget 87, max_window 0.
        iter3: ws 0, exit.
  Final budget = 87, max_window = 0.
- If the Dfast branch in commit_space regresses to `data.len()` the
  3rd commit would compute evicted = 5; retire(5) would reclaim one
  more byte than the fix on each cascade pass. On the 2nd commit
  (also evicted = 4 under the bug) the cascade alone shrinks
  max_window_size to 0 prematurely, and the 3rd commit then trips
  `assert!(data.len() <= self.max_window_size)` in
  `DfastMatchGenerator::add_data`. Either failure mode surfaces as
  a test failure rather than silently passing.
- `fuzz_regressions.rs::interop_7_byte_input_does_not_oob_in_dfast_fast_loop`
  drops a stray `extern crate std;` line. The test body does not
  reference `std::` anywhere (only `crate::decoding::*` /
  `crate::encoding::*` / `alloc::vec::Vec` via `Vec` shorthand). The
  Rust 2024 edition rules around unused `extern crate` are stricter
  than 2021 and the upcoming clippy `legacy_numeric_constants` /
  `unused_extern_crates` lints would flag this on future toolchain
  bumps. Test still runs identically.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 13 out of 14 changed files in this pull request and generated 1 comment.

Comment thread zstd/src/encoding/blocks/compressed.rs
@polaz polaz merged commit e148c1c into main May 17, 2026
27 checks passed
@polaz polaz deleted the perf/7pre-enablement branch May 17, 2026 13:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants