perf(7pre): enablement — donor outer/inner dfast Level 3 + cross-level FSE cache #146
Conversation
Port `start_matching_fast_loop` to upstream zstd 1.5.7's outer/inner
structure (`zstd_double_fast.c`). The outer `while` runs once per
match-found-and-stored; the inner loop carries `hl0`/`idxl0`/`hl1`/
`idxl1` between iterations with two-position lookahead so the long
hash is computed once per two positions scanned, not per position.
Hash tables are updated at `curr` BEFORE the match check (upstream
parity at `zstd_double_fast.c:187`); the short check is a 4-byte
gate only with forward `common_prefix_len` running only on hit; the
`_search_next_long` retry with precomputed `idxl1` follows the
preferred-longer-match policy. Match-finder probes inline into the
hot loop via raw-pointer arithmetic; `probe_slot_match` is preserved
for the cold callers (`hash_candidate`, dictionary prime,
retry/seed). The inner-loop exit shape is an explicit
`enum InnerExit { Committed(MatchCandidate), Tail(usize) }`; every
`break 'inner` picks a variant so the post-loop match is exhaustive,
removing the implicit pairing the previous draft had between
`Option<MatchCandidate>` and a sibling `tail_seed_anchor`.
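The exit shape described above can be sketched as follows (a toy model; the `MatchCandidate` fields and the scan driver are invented for illustration, only the `InnerExit` variants come from the text):

```rust
// Hypothetical minimal fields; the real candidate carries more state.
#[derive(Debug, Clone, Copy, PartialEq)]
struct MatchCandidate {
    offset: u32,
    len: usize,
}

#[derive(Debug, PartialEq)]
enum InnerExit {
    Committed(MatchCandidate), // a match was found and stored
    Tail(usize),               // anchor position for tail seeding
}

// Toy stand-in for the inner scan: commit on the first hit, otherwise
// fall off the end and report the tail anchor explicitly.
fn inner_scan(hits: &[Option<MatchCandidate>]) -> InnerExit {
    for hit in hits {
        if let Some(c) = hit {
            // Every `break 'inner` in the real loop picks a variant.
            return InnerExit::Committed(*c);
        }
    }
    InnerExit::Tail(hits.len())
}

fn main() {
    let hit = MatchCandidate { offset: 7, len: 6 };
    // Exhaustive post-loop match: no implicit Option/anchor pairing.
    match inner_scan(&[None, Some(hit)]) {
        InnerExit::Committed(c) => assert_eq!(c.len, 6),
        InnerExit::Tail(_) => panic!("expected a commit"),
    }
    assert_eq!(inner_scan(&[None, None]), InnerExit::Tail(2));
}
```

The point of the enum over `Option<MatchCandidate>` plus a sibling anchor is that the compiler forces every exit site to choose a variant, so the pairing cannot silently drift.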
Supporting changes inside the same fast-loop region:
- DFAST_HASH_BITS ceiling lowered from 20 to 17 per upstream
clevels.h:31 Level 3 sizing; slot storage moves to single u32 +
`position_base` rebase scheme (upstream `ZSTD_window_reduce`
parity).
- DFAST_MIN_MATCH_LEN drops from 6 to 5 per upstream `clevels.h:31`
default Level 3 mls=5 — we were silently dropping every 5-byte
match. DFAST_REP_MIN_MATCH_LEN stays at 4 (rep extensions inherit
the 4-byte upstream floor since rep encoding carries no on-wire
offset cost).
- Hash kernel switches to upstream's scalar multiply
(`v.wrapping_mul(prime8bytes)` then shift), dropping the CRC32d
kernel dispatch on the dfast hot path; the CRC32d kernel was 3-4
instructions of dispatch overhead per insert before the inline,
upstream's one-mul-shift is strictly cheaper.
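The kernel can be sketched as below; the constant is upstream zstd's `prime8bytes`, while the function name and the table sizing are illustrative assumptions:

```rust
// Upstream zstd's 8-byte hash prime (zstd_internal / hash constants).
const PRIME8BYTES: u64 = 0xCF1B_BCDC_B7A5_6463;

// One wrapping multiply, one shift: the high `hash_bits` bits of the
// product index the table. No kernel dispatch, no CRC lanes.
fn hash8(v: u64, hash_bits: u32) -> usize {
    (v.wrapping_mul(PRIME8BYTES) >> (64 - hash_bits)) as usize
}

fn main() {
    let bits = 17; // DFAST_HASH_BITS ceiling after this PR
    let h = hash8(0x0123_4567_89AB_CDEF, bits);
    assert!(h < (1usize << bits)); // always fits the 2^17-slot table
    assert_eq!(h, hash8(0x0123_4567_89AB_CDEF, bits)); // deterministic
}
```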
- Repcode peek inlined at ip+1 with a 4-byte gate (upstream
zstd_double_fast.c:190); only `offset_1` is probed inline, the
full rep0/rep1/rep2 walk lives in the lazy/btopt strategies.
Removing the rep2/rep3 walk from the dfast inline saved both
speed (16% on level_3_dfast/decodecorpus-z000033) and ratio
(−494 B; fewer rep2/rep3 false-positive picks displacing better
hash matches).
- Skip-step now grows by scan distance rather than a consecutive-miss
  threshold (upstream zstd_double_fast.c:224-228 parity): the step
  increases every `DFAST_SKIP_STEP_GROWTH_INTERVAL = 64` positions
  traveled (upstream uses 256 — see "Donor-deviation audit" in the
  PR body), so blocks that the strict
  `block_looks_incompressible_strict` gate used to bail out of early
  now scan through the standard ramp. The strict helper itself stays
  for `levels/fastest.rs` and `incompressible.rs`.
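The ramp can be sketched as follows; only the 64-position interval comes from the text above, the exact division shape is an assumption:

```rust
const DFAST_SKIP_STEP_GROWTH_INTERVAL: usize = 64;

// Step 1 for the first 64 positions past the anchor, 2 for the next
// 64, and so on: growth tracks distance traveled, not miss streaks.
fn step_for(distance_from_anchor: usize) -> usize {
    1 + distance_from_anchor / DFAST_SKIP_STEP_GROWTH_INTERVAL
}

fn main() {
    assert_eq!(step_for(0), 1);
    assert_eq!(step_for(63), 1);
    assert_eq!(step_for(64), 2);
    assert_eq!(step_for(256), 5);
}
```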
- Fast-loop boundary tail-position probe (`probe_tail_ip0_only`)
covers the case where `ip0` is still hashable but `ip1` falls
past the end (`pos == current_len - HASH_READ_SIZE`). Mirrors the
inner loop's long+short probe at ip0; the rep peek and
`_search_next_long` retry are skipped (both depend on ip1),
matching upstream's iend-boundary tradeoff.
- Short-hash hit enforces `DFAST_MIN_MATCH_LEN` floor before
committing. The 4-byte equality gate plus forward extension can
produce `match_len < 5`; the `_search_next_long` retry can still
upgrade to a valid long match, but with no upgrade the
below-floor hit is now discarded and the loop continues. Matches
the floor `probe_slot_match` already enforces on the long-hash
main path.
- Fast-loop guards tightened from `+ DFAST_MIN_MATCH_LEN (5)` to
`+ HASH_READ_SIZE (8)`. Previous guard let the unconditional
8-byte `u64::read_unaligned` for the long-hash probe read up to 3
bytes past the live history end; Linux fuzz ASan caught it on a
7-byte input. Regression test added in `tests/fuzz_regressions.rs`.
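A sketch of the tightened bound (names assumed from the text; the real guard sits inline in the loop condition rather than in a helper):

```rust
const HASH_READ_SIZE: usize = 8;

// Highest position at which an unconditional 8-byte unaligned read
// stays inside the live history; None means the fast loop cannot run.
fn fast_loop_end(history_len: usize) -> Option<usize> {
    history_len.checked_sub(HASH_READ_SIZE)
}

fn main() {
    // The 7-byte fuzz input: no position admits an 8-byte read.
    assert_eq!(fast_loop_end(7), None);
    // With the old `+ DFAST_MIN_MATCH_LEN (5)` guard, pos 2 on a
    // 7-byte buffer was allowed, reading bytes 2..10 (3 past the end).
    assert_eq!(fast_loop_end(16), Some(8)); // pos 8 reads bytes 8..16
}
```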
- Fast-loop entry calls `ensure_room_for` to rebase position_base
before any inner-loop slot pack. The inner loop's `packed_curr`
cast `(abs_ip0 - position_base) as u32` could truncate after a
long stream of all-miss / non-hashable blocks (fast-loop bypasses
`insert_position`'s rebase).
- `trim_to_window` now drops the dead prefix via `split_off`
instead of `compact_history`'s `drain` (which leaves the original
Vec's capacity pinned); the trim boundary pays one realloc in
exchange for actually shedding the prefix to the system
allocator. Signature also drops the unused `_reuse_space` closure
(Dfast's history-only storage has no per-block Vec to recycle —
unlike HC / Row which need the callback).
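The `split_off` trade can be sketched as below, assuming a plain `Vec<u8>` history (the real signature differs):

```rust
fn trim_history(history: &mut Vec<u8>, dead_prefix: usize) {
    // `history.drain(..dead_prefix)` would keep the original Vec's
    // capacity pinned. `split_off` moves the live tail into a fresh,
    // right-sized allocation; dropping the old Vec returns the prefix's
    // memory to the system allocator.
    let tail = history.split_off(dead_prefix);
    *history = tail; // old allocation (prefix + pinned capacity) dropped here
}

fn main() {
    let mut history: Vec<u8> = (0u8..=255).cycle().take(1 << 20).collect();
    trim_history(&mut history, (1 << 20) - 4096);
    assert_eq!(history.len(), 4096);
    // The retained tail's capacity is sized to the tail, not to the
    // original 1 MiB buffer.
    assert!(history.capacity() < 1 << 20);
}
```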
- `skip_matching` / `skip_matching_dense` / `start_matching`
  early-return on `current_len == 0`. `add_data` short-circuits empty
input without pushing to `window_blocks`; without the guard these
methods would re-seed the previous block's retained tail on
every empty flush.
- `get_last_space` (trait dispatch entry point) returns `&[]` on
empty `window_blocks` instead of panicking, mirroring the inline
gate pattern used by `skip_matching` / `start_matching` /
`emit_candidate` / `emit_trailing_literals`.
- `probe_slot_match` takes a `gate_len` parameter (8 for long-hash
callers, 4 for short-hash). Replaces the prior 1-byte precheck
which let 8-byte-collision short matches slip past the gate.
`Cargo.toml` adds `[profile.bench]` overrides (`lto = "fat"`,
`codegen-units = 1`) scoped to bench builds only — full cross-crate
inlining for benches without forcing 5-10× release build-time on
`cli` and downstream consumers.
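The override as described, sketched in Cargo.toml form (exact placement in the workspace manifest is assumed):

```toml
# Bench-only: full cross-crate inlining without slowing release builds
# of `cli` and downstream consumers.
[profile.bench]
lto = "fat"
codegen-units = 1
```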
Verification on `compress/level_3_dfast/decodecorpus-z000033`
(1 022 035 B input):
Compress time: 32.07 ms (11.0× FFI) -> 16.11 ms (5.5× FFI)
Compressed size: 556 121 B (+5.5%) -> 489 846 B (−7.07%)
Memory (small ≤4 KiB): 378 KiB -> 249 KiB (−34%)
Compressed-size delta is honestly framed: smaller-than-donor does
NOT necessarily mean we win — extra match-finding work the
upstream fast path skips can produce smaller output AND slower
encode time simultaneously. The sequence-stream comparator (#99,
deferred to Lane D of restructured Phase 7 in #111) is the audit
tooling needed to triage.
`FrameCompressor::new` used to rebuild the three predefined LL/ML/OF FSE tables on every call (~12 µs per `build_table_from_probabilities`, ~38 µs total for the three). On a 1 KiB frame the actual compression work is ~1-5 µs, so the predefined-table build dominated per-frame cost for tiny inputs. Two-stage fix:

1. Each `default_*_table()` helper owns a private `static AtomicPtr<FSETable>` initialized via a lock-free compare-exchange (`core::sync::atomic`, which works in std and no_std alike — only `alloc` is required). The first call leaks a `Box<FSETable>` into the cache; subsequent calls return the same pointer. The slot is selected at compile time by the static identity of each helper, not by pointer comparison on the distribution slice — `LL_DIST`/`ML_DIST`/`OF_DIST` are `const`, not `static`, and rustc does not guarantee a single materialized address for const slice references (a future codegen change that duplicates the const per use site would silently bypass a pointer-keyed cache). The helper now returns `&'static FSETable`.
2. `FseTables` stores the static reference directly (`ll_default: &'static FSETable`) rather than an owned `FSETable`. `FrameCompressor::new`'s FSE setup becomes three pointer-sized copies instead of three deep clones of 256-element `SymbolStates` arrays. `clone_fse_tables` (block-split estimator path) becomes a pointer copy on the three default fields too; only the `*_previous` per-frame tables still need a deep clone.

`get_or_init_cached_table` uses `Acquire` on the load and `AcqRel` on the compare-exchange success path. The success-path ordering is stronger than the minimal `Release` would suggest; matching the failure `Acquire` keeps the success/failure branches symmetric on rustc's atomic surface, and the path runs at most three times per process, so any cost is in the noise.
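Stage 1 can be sketched as follows (a minimal model: the `FseTable` contents and `build_table` are stand-ins; only the per-helper static slot and the `AtomicPtr` + `Box::into_raw` publish shape come from the description above):

```rust
use std::sync::atomic::{AtomicPtr, Ordering};

// Stand-in; the real table holds 256-element SymbolStates arrays.
struct FseTable {
    accuracy_log: u8,
}

fn build_table() -> FseTable {
    FseTable { accuracy_log: 6 } // pretend this is the expensive build
}

// One static slot per helper, selected by static identity, not by
// pointer-comparing the const distribution slices.
fn default_ll_table() -> &'static FseTable {
    static SLOT: AtomicPtr<FseTable> = AtomicPtr::new(std::ptr::null_mut());
    get_or_init(&SLOT, build_table)
}

fn get_or_init(slot: &AtomicPtr<FseTable>, build: fn() -> FseTable) -> &'static FseTable {
    let cached = slot.load(Ordering::Acquire);
    if !cached.is_null() {
        return unsafe { &*cached }; // fast path after the first call
    }
    let fresh = Box::into_raw(Box::new(build()));
    match slot.compare_exchange(std::ptr::null_mut(), fresh, Ordering::AcqRel, Ordering::Acquire) {
        Ok(_) => unsafe { &*fresh }, // we published; the leak is intentional
        Err(winner) => {
            // Another thread won the race: drop our copy, use theirs.
            unsafe { drop(Box::from_raw(fresh)) };
            unsafe { &*winner }
        }
    }
}

fn main() {
    let a = default_ll_table();
    let b = default_ll_table();
    assert!(std::ptr::eq(a, b)); // same leaked singleton on every call
    assert_eq!(a.accuracy_log, 6);
}
```

Racing first-callers all converge on one pointer through the compare-exchange; the loser frees its own copy, so at most one table per helper is ever leaked.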
Verification on `compress/level_2_dfast/small-1k-random/matrix/pure_rust` (criterion, darwin/aarch64, fresh checkout):

Pre-cache baseline: 44.25 µs (22.07 MiB/s, 13.4× FFI)
After AtomicPtr cache: 12.67 µs (77.11 MiB/s, 3.9× FFI)
After &'static refactor: 6.08 µs (160.75 MiB/s, 1.83× FFI)
FFI 1.5.7: 3.31 µs (298.24 MiB/s)

Cumulative: 7.3× compress-time speedup on the small-input scenario. Applies to every level, since `FseTables::new()` runs inside `FrameCompressor::new` regardless of strategy.

Adjacent encoding tightening in the same commit:
- `compress_to_vec` is split — slice-based work moves into a new public `compress_slice_to_vec` that accepts `&[u8]` directly, while the `Read`-based variant remains a thin wrapper. Both seed the output `Vec` with `min(compress_bound(src), 128 KiB)` instead of the prior `compress_bound(src)` upfront pin — saving ~100 MiB peak on `large-log-stream` (100 MiB of highly compressible input). The `compress_slice_to_vec` rustdoc enumerates the RSS shape change and the `# Panics` modes (encoder error, OOM); the `compress_to_vec` rustdoc explicitly flags that it is NOT a streaming API — the source is materialized fully before any compression runs, with peak memory bounded by `source.len()`. Both surfaces point at [`StreamingEncoder`] for unbounded inputs.
- `FrameCompressor::compress`'s `all_blocks` accumulator pre-reserves capacity scaled by the source-size hint instead of a flat 128 KiB. Saves ~130 KiB peak for sub-4 KiB inputs without regression on medium/large.
- The `start_matching_fast_loop` preamble grows a raw-pointer aliasing-invariant SAFETY block enumerating which fields the inner loop touches through cached `as_mut_ptr` raw pointers (`short_hash`, `long_hash`, `history`) versus which it accesses through `&self` shared reads (`offset_hist`, slice views of `history`). The pattern is sound under Stacked Borrows / Tree Borrows because the fields are physically disjoint allocations; the comment locks the invariant in so a future refactor adding a `&mut self` call inside the loop has to either hoist the call or reload the cached raw pointers.
- needless-borrow clippy autofixes in `encoding/blocks/compressed.rs`, `fse/mod.rs`, and `encoding/match_generator.rs` (consumers of the now-`&'static` default tables no longer need an outer `&`).
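The capacity seeding for `compress_slice_to_vec` can be sketched as below; the `compress_bound` formula here is an assumed approximation of the real helper:

```rust
// Assumed shape of the worst-case bound: src + src/256 + small constant.
fn compress_bound(src_len: usize) -> usize {
    src_len + (src_len >> 8) + 64
}

// Cap the upfront reservation: tiny inputs get a tight bound, while
// huge highly-compressible inputs no longer pin ~compress_bound(src)
// of RSS before a byte is written.
fn initial_capacity(src_len: usize) -> usize {
    compress_bound(src_len).min(128 * 1024)
}

fn main() {
    assert_eq!(initial_capacity(1024), 1024 + 4 + 64); // small: exact bound
    assert_eq!(initial_capacity(100 << 20), 128 * 1024); // 100 MiB: capped
}
```

The `Vec` still grows on demand past the cap; only the speculative upfront reservation is bounded.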
…re scratch dirs

- `tests/fuzz_regressions.rs`: new `interop_7_byte_input_does_not_oob_in_dfast_fast_loop` regression for the Linux-fuzz `interop` artifact that exposed the dfast fast-loop OOB read. The bytes are inline; the test pins to `CompressionLevel::Level(3)` so a future retune of the `Default` alias cannot accidentally route it off the dfast path. Roundtrips through the in-tree decoder. The docstring calls out the sanitizer caveat — a plain `cargo nextest run` may complete successfully against pre-fix code because the OOB `read_unaligned` usually lands inside the live `Vec`'s spare capacity; the regression reliably fires only under ASan (Linux fuzz CI) or `cargo +nightly miri test`. Treat a green local run as a smoke check; the authoritative signal is the CI fuzz job.
- `benches/compare_ffi.rs` + `compare_ffi_memory.rs`: minor alignment with the public-API split (the new `compress_slice_to_vec` entry point); no behaviour changes to the bench harness, just keeping the call sites in sync.
- `.gitignore`: ignore `.local/` (per-branch scratch / helper-doc directory used for `PHASE7C_FINDINGS.md` etc.) and cargo-fuzz artifacts under `zstd/fuzz/artifacts/**` (libfuzzer content-hash filenames change on each run; in-repo regression vectors live inline in `tests/fuzz_regressions.rs` instead).
No actionable comments were generated in the recent review.
📝 Walkthrough

Caches predefined FSE tables with cfg-specific backings, exposes ref-based accessors, pre-allocates compression outputs via a compress_bound heuristic, refactors Dfast to a contiguous history and single-slot hash tables with rewritten matching, and updates benches, tests, Cargo features, and .gitignore.
Pull request overview
Phase 7pre enablement work to bring dfast Level 3 closer to donor behavior/perf and lay groundwork for cross-level improvements, including reducing per-frame setup costs and tightening memory behavior.
Changes:
- Reworked dfast internals (min match length, table layout, fast-loop structure, history storage) to align more closely with donor semantics and reduce memory footprint.
- Added a process-local cache for predefined LL/ML/OF FSE tables and refactored encoder state to store `&'static FSETable` references to avoid per-frame rebuild/clone cost.
- Added a fuzz regression test for the 7-byte interop crash and aligned benches to use the new slice-to-Vec compression entry point.
Reviewed changes
Copilot reviewed 11 out of 12 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| zstd/src/tests/fuzz_regressions.rs | Adds 7-byte regression test covering the prior dfast fast-loop OOB/UB scenario. |
| zstd/src/fse/mod.rs | Updates tests/call sites for predefined FSE helpers now returning &'static FSETable. |
| zstd/src/fse/fse_encoder.rs | Implements predefined FSE table caching and changes default-table helpers to return &'static references. |
| zstd/src/encoding/mod.rs | Adds compress_bound() and new compress_slice_to_vec(); routes compress_to_vec() through it and seeds output capacity. |
| zstd/src/encoding/match_generator.rs | Updates dfast constants and driver lifecycle logic to match dfast backend refactors and default-table signature changes. |
| zstd/src/encoding/frame_compressor.rs | Stores default FSE tables by reference; tweaks block accumulation capacity based on source size hint. |
| zstd/src/encoding/dfast/mod.rs | Major dfast matcher refactor: single-slot hash tables, history-only storage, fast-loop safety/perf changes, and guard tightening. |
| zstd/src/encoding/blocks/compressed.rs | Adapts FSE table selection/clone logic to &'static default tables. |
| zstd/benches/compare_ffi.rs | Uses compress_slice_to_vec() to avoid benchmarking input-buffering overhead. |
| zstd/benches/compare_ffi_memory.rs | Uses compress_slice_to_vec() for more representative peak-memory measurement. |
| Cargo.toml | Sets [profile.bench] to fat LTO + codegen-units=1 for more realistic bench results. |
| .gitignore | Ignores new local fuzz artifacts and .local/ scratch output. |
Actionable comments posted: 1
Inline comments:
In `@zstd/src/encoding/dfast/mod.rs`:
- Around line 1438-1445: Add a debug assertion to ensure the structural
precondition that the trailing-block length is positive before slicing: in the
block where you compute last_len from
self.window_blocks.back().copied().unwrap_or(0) (used by emit_candidate), insert
a debug_assert!(last_len > 0) (or equivalent) immediately before creating
current = &self.history[self.history.len() - last_len..] so tests fail at the
source of the invariant violation rather than later when indexing current.
…debug_assert

- `fse_encoder`: gate the `AtomicPtr<FSETable>` lock-free cache behind `#[cfg(target_has_atomic = "ptr")]`. Targets that lack atomic pointer support (e.g. `thumbv6m-none-eabi`, AVR) were failing to compile against the unconditional `core::sync::atomic` import. Added a `#[cfg(not(target_has_atomic = "ptr"))]` fallback that uses a `static mut` raw pointer with single-threaded serial access — targets without atomic pointer support are by construction single-threaded embedded environments, so a non-atomic mutable static is sound. Same `Box::leak` lifetime, same `&'static FSETable` return shape as the atomic path, so consumers don't see a feature-flag-shaped API. Mirrors the pattern the crate already uses for `decoding::dictionary::DictionaryHandle` (`Arc` vs `Rc`). Verified with `cargo check --target thumbv6m-none-eabi --no-default-features --features hash`.
- `fse_encoder`: header comment block rewritten to describe both the atomic and no-atomic paths and to drop the stale "hand callers a clone" wording — the API now returns `&'static FSETable`, so callers no longer clone.
- `dfast::emit_candidate`: added `debug_assert!(last_len > 0)` immediately after the `window_blocks.back().copied().unwrap_or(0)` read. The function's structural precondition (it runs only after a successful match in the active block) was previously documented in a comment but not enforced in debug builds — the assert surfaces a violation at the source rather than via a confusing out-of-range panic on the `current[*literals_start..start]` subslice below.
…uild

Replaces the prior static-mut no-atomic fallback (which assumed non-concurrency on no-atomic targets — wrong: interrupt / preemption reentrancy on bare metal can still cause concurrent first-call initialization, making the static-mut slot a data-race surface and UB) with a three-way split driven by `target_has_atomic = "ptr"` and the new optional `critical-section` feature:

1. `target_has_atomic = "ptr"` (every modern desktop / server / mobile / Cortex-M3+ / RISC-V-A target) — unchanged lock-free `AtomicPtr` cache. Sound under multi-threaded access: rayon worker threads racing on first call all converge through the `compare_exchange`, and subsequent reads of `&'static FSETable` are immutable shared references, which are `Sync` because `FSETable` has no interior mutability.
2. `not(target_has_atomic = "ptr")` + `critical-section` feature — `static CACHE: CsCachedTablePtr` (a newtype wrapping `UnsafeCell<*mut FSETable>` with a manual `unsafe impl Sync`), accessed exclusively under `critical_section::with`. The CS impl is supplied by the downstream embedded runtime (cortex-m-rt, riscv-rt, embassy-executor, esp-hal, ...). The first call takes the critical section, builds the table, `Box::into_raw`s it, and publishes; subsequent calls return the cached pointer. The CS is held for one `build_table_from_probabilities` (~12 µs) on the cold path, with zero overhead afterwards.
3. `not(target_has_atomic = "ptr")` without `critical-section` — per-call build + `Box::leak`. No cache; accepts the memory-accumulation tradeoff against the UB risk of a synchronization-free static-mut slot. Same as the pre-PR status quo on no-atomic targets, with the addition of an intentional leak (which embedded users who repeatedly construct `FrameCompressor` should mitigate by enabling the `critical-section` feature).

`Cargo.toml` gains the optional dep `critical-section = "1.2"` gated by a new feature of the same name. On atomic targets the dep is dead code and never instantiated.

Header comment block rewritten to document all three paths, including the explicit "lack of atomic instructions is not a non-concurrency guarantee" rationale that prompted this revision. The previous draft's safety argument was wrong; the new comment makes the trade-offs explicit per target family. Verified: the host atomic path (cargo clippy / nextest, 481 passed) and both no-atomic sub-paths (`cargo check --target thumbv6m-none-eabi` with and without the `critical-section` feature) build clean.
Actionable comments posted: 1
Inline comments:
In `@zstd/src/fse/fse_encoder.rs`:
- Around line 819-830: The current no-atomic fallback permanently leaks memory
by Box::leak in build_and_leak_table; replace the leaking behavior with an
owned, non-'static return so callers hold responsibility for dropping the table:
change build_and_leak_table to construct the table with
build_table_from_probabilities and return either Box<FSETable> or FSETable
(instead of &'static FSETable), update all call sites to accept and drop the
owned table (or call .into_inner() where appropriate), and remove the Box::leak
to avoid unbounded memory growth on non-atomic targets.
Removes the per-call `Box::leak` from the no-atomic-without-`critical-section` fallback. The leak accumulated memory unboundedly across multiple `FrameCompressor::new` calls on the smallest supported targets (Cortex-M0/M0+, AVR, MSP430) — CodeRabbit and Copilot both flagged the unbounded-growth concern.

Type-level fix via a new `fse_encoder::FseDefaultTable` cfg alias:
- `target_has_atomic = "ptr"` OR feature `critical-section`: `&'static FSETable` (cached singleton, zero-cost subsequent calls)
- otherwise: `alloc::boxed::Box<FSETable>` (owned per-frame allocation, dropped with the owning `FrameCompressor` — the same memory shape as the pre-PR status quo on no-atomic targets)

`FseTables` struct fields switch from a hard-coded `&'static FSETable` to the type alias; both arms `Deref` to `FSETable`. Consumers in `encoding/blocks/compressed.rs` borrow uniformly through three new `FseTables` accessor methods (`ll_default_ref` / `ml_default_ref` / `of_default_ref`) so the per-target divergence stays inside the struct surface and downstream call sites do not need cfg.

`clone_fse_tables` (block-split estimator path, levels 11+) uses an explicit per-arm clone via cfg-conditional struct-literal fields: on atomic / CS the field is `Copy` (zero-cost field-access copy); on no-atomic-no-CS the field is `Box<FSETable>` and needs `Clone::clone` for a deep copy. The naive uniform `.clone()` resolves via auto-deref to `FSETable::clone` (returning an owned `FSETable`), which is the WRONG return type for the atomic arm — hence the explicit split.

Verified: `cargo clippy --lib --tests` clean on host (atomic); `cargo nextest run --lib` passes 481/481. The pre-existing thumbv6m build failure in `tests/roundtrip_integrity.rs` (unconditional `extern crate std` at line 6 vs the cfg-gated mod declaration in `tests/mod.rs:582`) remains — same failure mode on main, not introduced by this commit. Tracked separately.
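The alias and uniform accessor can be sketched as follows (a stand-in `FseTable`; the cfg keys and field/method names come from the commit text):

```rust
struct FseTable {
    accuracy_log: u8,
}

// Cached singleton on targets with atomic pointers or a critical-section
// impl; owned per-frame allocation otherwise.
#[cfg(any(target_has_atomic = "ptr", feature = "critical-section"))]
type FseDefaultTable = &'static FseTable;

#[cfg(not(any(target_has_atomic = "ptr", feature = "critical-section")))]
type FseDefaultTable = Box<FseTable>;

struct FseTables {
    ll_default: FseDefaultTable,
}

impl FseTables {
    // Both arms coerce to a plain shared reference here, so downstream
    // call sites never need cfg.
    fn ll_default_ref(&self) -> &FseTable {
        &self.ll_default
    }
}

fn main() {
    #[cfg(any(target_has_atomic = "ptr", feature = "critical-section"))]
    let tables = {
        static LL: FseTable = FseTable { accuracy_log: 6 };
        FseTables { ll_default: &LL }
    };
    #[cfg(not(any(target_has_atomic = "ptr", feature = "critical-section")))]
    let tables = FseTables { ll_default: Box::new(FseTable { accuracy_log: 6 }) };

    assert_eq!(tables.ll_default_ref().accuracy_log, 6);
}
```

Deref coercion does the unification: `&&'static FseTable` and `&Box<FseTable>` both coerce to `&FseTable` at the accessor's return.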
⚠️ Performance Alert ⚠️
Possible performance regression was detected for benchmark 'structured-zstd vs C FFI (x86_64-gnu)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.30.
| Benchmark suite | Current: b86f51e | Previous: 1e0f695 | Ratio |
|---|---|---|---|
| compress/level_22_btultra2/small-4k-log-lines/matrix/c_ffi | 0.113 ms | 0.067 ms | 1.69 |
| compress/level_22_btultra2/decodecorpus-z000033/matrix/c_ffi | 248.653 ms | 171.961 ms | 1.45 |
This comment was automatically generated by workflow using github-action-benchmark.
CC: @polaz
The `tests/` directory contains test-only modules with unconditional `extern crate std;` items (e.g. `roundtrip_integrity.rs:6`). With `mod tests;` declared unconditionally in `lib.rs`, those modules were pulled into every lib build — including no_std targets that do not have `std` available — producing E0463 "can't find crate for `std`" on `thumbv6m-none-eabi`, AVR, MSP430, etc.

Gating the module declaration on `cfg(test)` is the canonical Rust pattern: the test code is part of the lib only when the harness is being built, and the production lib stays clean. No other module in the crate references `crate::tests::*` from non-test code, so the gate is behaviour-preserving for the test-harness path.

Verified:
- `cargo check --lib --no-default-features --features hash --target thumbv6m-none-eabi` — passes (previously failed with 3 E0463 / E0432 errors)
- `cargo check --lib --no-default-features --features hash,critical-section --target thumbv6m-none-eabi` — passes
- `cargo nextest run --lib` on host — 481/481 tests pass, unchanged
- `cargo build --lib` on host (default features) — clean
Pull request overview
Copilot reviewed 13 out of 14 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (4)
zstd/src/encoding/match_generator.rs:4235
- Same `FseDefaultTable` cfg-typing issue as above: `Some(ll)` only works when `default_*_table()` returns `&'static FSETable`. To keep these tests compiling on no-atomic / no-`critical-section` targets (where the alias is `Box<FSETable>`), borrow through `Deref` (e.g., `Some(&*ll)`).
let ll = crate::fse::fse_encoder::default_ll_table();
let ml = crate::fse::fse_encoder::default_ml_table();
let of = crate::fse::fse_encoder::default_of_table();
hc.seed_dictionary_entropy(None, Some(ll), Some(ml), Some(of));
zstd/src/encoding/match_generator.rs:4278
- Same `FseDefaultTable` cfg-typing issue: `Some(ll)` will fail to compile when `default_*_table()` returns `Box<FSETable>` (no-atomic / no-`critical-section`). Borrow through `Deref` (e.g., `Some(&*ll)` / `Some(&*ml)` / `Some(&*of)`) to keep this test cfg-agnostic.
let ll = crate::fse::fse_encoder::default_ll_table();
let ml = crate::fse::fse_encoder::default_ml_table();
let of = crate::fse::fse_encoder::default_of_table();
hc.seed_dictionary_entropy(None, Some(ll), Some(ml), Some(of));
zstd/src/encoding/match_generator.rs:4338
- Same `FseDefaultTable` cfg-typing issue: `Some(ll)` assumes `default_*_table()` returns a reference. On cfgs where it returns `Box<FSETable>`, this call won't type-check against `Option<&FSETable>`. Borrow through `Deref` so the test remains portable.
let ll = crate::fse::fse_encoder::default_ll_table();
let ml = crate::fse::fse_encoder::default_ml_table();
let of = crate::fse::fse_encoder::default_of_table();
hc.seed_dictionary_entropy(None, Some(ll), Some(ml), Some(of));
zstd/src/encoding/match_generator.rs:4484
- Same `FseDefaultTable` cfg-typing issue: `default_*_table()` can return `Box<FSETable>` on no-atomic / no-`critical-section` builds, so `Some(ll)` won't match `Option<&FSETable>`. Borrow through `Deref` (e.g., `Some(&*ll)`) to keep this test compiling across cfgs.
let ll = crate::fse::fse_encoder::default_ll_table();
let ml = crate::fse::fse_encoder::default_ml_table();
let of = crate::fse::fse_encoder::default_of_table();
stats.seed_dictionary_entropy(Some(&huff), Some(ll), Some(ml), Some(of));
`default_ll_table()` / `default_ml_table()` / `default_of_table()` now return `FseDefaultTable`, the cfg-typed alias added in the previous round. On atomic / `critical-section` targets it resolves to `&'static FSETable` (and passes directly anywhere `&FSETable` is expected); on no-atomic / no-CS it resolves to `Box<FSETable>` (which needs a `&*` reborrow to satisfy a `&FSETable` parameter). Six test sites were assuming the reference shape and would not compile on the no-atomic / no-CS arm:

- `fse/mod.rs::table_header_bits_exact` (3 `check(...)` calls)
- `match_generator.rs::dictionary_entropy_seed_initializes_opt_state_from_tables`
- `match_generator.rs::dictionary_fse_seed_applies_without_huffman_seed`
- `match_generator.rs::dictionary_seed_overrides_predef_price_mode_on_tiny_input`
- `match_generator.rs::btultra2_seed_pass_disabled_when_dictionary_entropy_seed_present`
- `match_generator.rs::dictionary_huffman_seed_ignored_when_literals_uncompressed`

Wrapped the 15 call-site occurrences in `&*` so the reborrow is canonical across both cfg arms. On the atomic arm clippy flags `&*` as needless (the value is already `&FSETable`), so each affected test function carries a localized `#[allow(clippy::borrow_deref_ref)]` next to its `#[test]` attribute.

The `Cargo.toml` `critical-section` feature comment is refreshed to match the actual no-atomic / no-CS behaviour: per-call build of an owned `Box<FSETable>` stored in `FseTables` and dropped with the owning `FrameCompressor` — no `Box::leak`, no unbounded growth (the prior draft of the comment described leaky behaviour that has since been fixed).
Actionable comments posted: 1
Inline comments:
In `@zstd/src/encoding/match_generator.rs`:
- Around line 681-692: The Dfast branch in commit_space() is incorrectly
counting the input buffer as evicted; update the Dfast handling (use
dfast_matcher_mut()/Dfast matcher) to compute eviction by capturing pre =
dfast.window_size, calling dfast.add_data(space, |...| { ... }) (the existing
callback), and then incrementing evicted_bytes by pre - dfast.window_size
instead of using the input buffer size; this mirrors the
trim_to_window()/window_size-delta pattern used in trim_after_budget_retire and
prevents premature shrinking of dictionary_retained_budget/max_window_size.
…mit_space The `commit_space` Dfast branch was counting the INPUT buffer length as `evicted_bytes`. Dfast's `add_data` callback receives the input `Vec<u8>` for pool recycling (Dfast stores bytes in the contiguous `history` buffer; there is no per-block `Vec<u8>` to pop and hand back, unlike Row / HashChain which the callback contract is shaped around). Counting `data.len()` therefore conflated "new bytes ingested" with "old bytes evicted from the window". The two happen to coincide when the prior window was exactly saturated and the new input fills it 1:1 — which is the only case the existing `dfast_add_data_callback_reports_evicted_len_not_capacity` fixture exercised (max=8, prev=8, new=8 → both numbers are 8). The bug surfaces on the divergent path: max=10, window=[4,4] (size 8), new input 5 bytes → pop one 4-byte block (evicted=4), push 5 (window=[4,5], size=9). The old accounting reported `evicted=5`, which then drives `retire_dictionary_budget` to reclaim 1 byte too many and shrinks `max_window_size` prematurely — degrading dictionary retention. Fix: derive the actual eviction from `window_size` before/after, plus the input length. The pop loop inside `add_data` decrements `window_size` by each evicted block length and the final `extend_from_slice + push_back` adds `space_len`, so `evicted = pre + space_len - post` is the exact reverse-engineering of what the pop loop did. Same `window_size`-delta pattern already in use by `trim_after_budget_retire` for the Dfast variant. The `pre.saturating_add(space_len).saturating_sub(m.window_size)` form keeps the arithmetic robust against pathological pre+space_len overflow on i686 (where `usize::MAX` is reachable in tests of extreme inputs). Regression test `dfast_commit_space_eviction_uses_window_size_delta` forces the divergent case (4 + 4 + 5 = 13 bytes vs max_window = 10) and asserts the formula yields 4 (the actual eviction), not 5 (the input size). 
The prior `dfast_add_data_callback_reports_evicted_len_not_capacity` fixture is unchanged — its concern (callback data is not the raw backing capacity) is orthogonal and still holds.
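The window-size-delta accounting above can be sketched as follows. This is an illustration only; `DfastWindow` and its fields are hypothetical stand-ins for the matcher's real state, and the real `add_data` appends into a contiguous `history` buffer rather than a deque of lengths.

```rust
use std::collections::VecDeque;

// Hypothetical model of the Dfast window for the eviction-accounting fix.
struct DfastWindow {
    blocks: VecDeque<usize>, // lengths of retained blocks (model only)
    window_size: usize,
    max_window_size: usize,
}

impl DfastWindow {
    /// Ingest `space_len` new bytes, popping old blocks while the window
    /// would overflow. Returns the number of bytes actually evicted,
    /// derived from the window-size delta rather than the input length.
    fn add_data(&mut self, space_len: usize) -> usize {
        let pre = self.window_size;
        while self.window_size + space_len > self.max_window_size {
            let evicted_block = self.blocks.pop_front().expect("window not empty");
            self.window_size -= evicted_block;
        }
        self.blocks.push_back(space_len);
        self.window_size += space_len;
        // evicted = pre + space_len - post, written with saturating ops so a
        // pathological pre + space_len cannot overflow on 32-bit targets.
        pre.saturating_add(space_len).saturating_sub(self.window_size)
    }
}
```

On the divergent fixture (max=10, window=[4,4], input 5) this returns 4, the actual eviction, where the old `data.len()` accounting would have reported 5.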
Actionable comments posted: 1
Inline comments:
In `zstd/src/encoding/match_generator.rs`, around lines 6990-7048: the test `dfast_commit_space_eviction_uses_window_size_delta` currently exercises `DfastMatchGenerator::add_data` directly. Instead, exercise the driver path: construct a `MatchGeneratorDriver` wrapping the `DfastMatchGenerator` and feed the three blocks [4, 4, 5] through the driver's public input API, so that `MatchGeneratorDriver::commit_space` (the driver-backed commit logic) runs the same code path as the production driver. After the third input, assert that the driver's observable `dictionary_retained_budget` (or another window-related observable) equals 96 (start 100 → evicted 4), and/or that the driver's window-size delta matches the `pre + space_len - post` calculation, to ensure `commit_space` was applied by the driver rather than only by `DfastMatchGenerator::add_data`.
… std
- dfast_commit_space_eviction_uses_window_size_delta now drives the
`[4, 4, 5]` fixture through `MatchGeneratorDriver::commit_space`
rather than calling `DfastMatchGenerator::add_data` directly and
re-deriving the formula on the matcher side. The previous draft
would have passed even if `commit_space`'s Dfast branch regressed
back to `evicted_bytes += data.len()` — the matcher-only path
doesn't run `commit_space` at all. Asserting on the driver's
observable `dictionary_retained_budget` and the post-retire
`max_window_size` proves the driver-side accounting is the one
that fires.
- Trace through the assertion targets (budget = 100, max_window = 10,
commits [4, 4, 5]):
1st commit: ws 0 → 4, evicted = 0, no retire. budget 100.
2nd commit: ws 4 → 8, evicted = 0, no retire. budget 100.
3rd commit: 8 + 5 > 10 → pop 4-byte block, push 5. ws 9.
evicted = pre + space_len - post = 8 + 5 - 9 = 4.
retire(4) → budget 96, max_window 6, returns true.
trim_after_budget_retire cascade:
iter1: ws 9 > 6, pop [4] → ws 5, evicted 4.
retire(4) → budget 92, max_window 2.
iter2: ws 5 > 2, pop [5] → ws 0, evicted 5.
retire(5) → budget 87, max_window 0.
iter3: ws 0, exit.
Final budget = 87, max_window = 0.
- If the Dfast branch in commit_space regresses to `data.len()` the
3rd commit would compute evicted = 5; retire(5) would reclaim one
more byte than the fix on each cascade pass. On the 2nd commit
(also evicted = 4 under the bug) the cascade alone shrinks
max_window_size to 0 prematurely, and the 3rd commit then trips
`assert!(data.len() <= self.max_window_size)` in
`DfastMatchGenerator::add_data`. Either failure mode surfaces as
a test failure rather than silently passing.
- `fuzz_regressions.rs::interop_7_byte_input_does_not_oob_in_dfast_fast_loop`
drops a stray `extern crate std;` line. The test body does not
reference `std::` anywhere (only `crate::decoding::*` /
`crate::encoding::*` / `alloc::vec::Vec` via `Vec` shorthand). The
Rust 2024 edition rules around unused `extern crate` are stricter
than 2021, and the clippy `legacy_numeric_constants` and rustc
`unused_extern_crates` lints would flag it on future toolchain
bumps. The test still runs identically.
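The commit/retire cascade traced above reduces to a small arithmetic model. The names below (`Model`, `retire`, `commit`) are hypothetical; the real code couples this logic to `MatchGeneratorDriver` and `DfastMatchGenerator` state.

```rust
use std::collections::VecDeque;

// Hypothetical arithmetic model of commit_space + retire_dictionary_budget
// + trim_after_budget_retire, used to check the traced numbers.
struct Model {
    blocks: VecDeque<usize>,
    window_size: usize,
    max_window_size: usize,
    budget: usize,
}

impl Model {
    fn retire(&mut self, n: usize) {
        self.budget -= n;
        self.max_window_size = self.max_window_size.saturating_sub(n);
    }

    fn commit(&mut self, len: usize) {
        let pre = self.window_size;
        // Ingest: pop old blocks while the window would overflow.
        while self.window_size + len > self.max_window_size {
            let b = self.blocks.pop_front().expect("window not empty");
            self.window_size -= b;
        }
        self.blocks.push_back(len);
        self.window_size += len;
        let evicted = pre + len - self.window_size;
        if evicted > 0 {
            self.retire(evicted);
            // Trim cascade: retiring shrank max_window_size, so the
            // window may now exceed it again.
            while self.window_size > self.max_window_size {
                let b = self.blocks.pop_front().expect("window not empty");
                self.window_size -= b;
                self.retire(b);
            }
        }
    }
}
```

Running the `[4, 4, 5]` fixture through this model reproduces the traced endpoint: budget 87, max_window 0.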
Summary
Phase 7pre — cross-level enablement / groundwork (see #111 for the restructured Phase 7 plan, default-first per-strategy ordering).
This PR replaces #145 (closed) — same final code state, history squashed to 3 logical commits, scope clarified. The original PR was titled "close level 3 parity gap" but grew well beyond Level 3 into cross-level work; under the restructured taxonomy this is Phase 7pre: enablement, prerequisite for the subsequent per-strategy / per-axis sub-phases.
Honest framing on the "Compressed size" column below: our output is 7% smaller than upstream on the decodecorpus-z000033 corpus. That is NOT confirmation that we beat upstream — it is more likely a signal of algorithmic divergence (we are doing extra match-finding work the upstream fast path skips), which is consistent with us still being 5.5× slower. The deviation may turn out to be the cause of the speed gap, not a separate win. To confirm what upstream would have emitted on the same input we still need the sequence-stream comparator (Phase 7-tooling-seq-cmp in the restructured #111).

Cumulative deltas (compress/level_3_dfast/decodecorpus-z000033, 1 022 035 B input)
Small-input speedup (FSE cache, all levels)
`FrameCompressor::new` used to rebuild the three predefined LL/ML/OF FSE tables on every call (~12 µs per table, ~36 µs total). Two-stage fix in commit 65d4cbdb:
- Stage 1: `AtomicPtr<FSETable>`, clone on each ctor — drops the build cost but pays ~4-5 µs per clone of three 256-element `SymbolStates` arrays.
- Stage 2: `&'static FSETable`, storing the static reference inside `FseTables` directly — the clone disappears, the field becomes a pointer-sized copy.

Effect on `small-1k-random/matrix/pure_rust` at level 2 dfast (criterion, darwin/aarch64): `compress_slice_to_vec` sees a 7.3× total compress-time speedup on the small-input scenario. Applies to every level. The residual ~2.8 µs gap to FFI is per-frame setup not covered by this change; covered by Phase 7-compress-default in #111.
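A minimal sketch of the second stage: build each predefined table once and hand out a `'static` reference thereafter. Shown here with `OnceLock` for brevity; the PR's actual mechanism is an `AtomicPtr<FSETable>` swap plus a `&'static FSETable` field in `FseTables`, and `FseTable`/`build_predefined_ll` below are stand-ins for the crate's real types.

```rust
use std::sync::OnceLock;

// Stand-in for the crate's FSE table type.
struct FseTable {
    states: Vec<u16>, // stand-in for the 256-entry SymbolStates array
}

// Stand-in for the expensive one-time table build (~12 µs in the
// PR's measurements).
fn build_predefined_ll() -> FseTable {
    FseTable { states: vec![0; 256] }
}

/// Build the predefined LL table on first use; every later call returns
/// the same &'static reference, so the ctor stores a pointer-sized copy
/// instead of cloning the table.
fn predefined_ll() -> &'static FseTable {
    static TABLE: OnceLock<FseTable> = OnceLock::new();
    TABLE.get_or_init(build_predefined_ll)
}
```

With this shape, `FrameCompressor::new` pays the build cost once per process instead of once per frame, and no clone at all on the hot path.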
Per-scenario sanity at level_3_dfast (no regressions)
Commits
This PR is squashed from 30 commits in #145 into 3 logical commits — same final tree, cleaner review surface:
Phase 7 sub-phase plan (per #111)
This PR is the entry point for the restructured Phase 7 sub-phase taxonomy. After merge, the remaining sub-phases are:
Lane A — Compress (default-first): 7-compress-default (Dfast Level 3 residual) -> 7-compress-better (Lazy Level 7) -> 7-compress-best (BtOpt/BtUltra 16-19) -> 7-compress-fastest -> 7-compress-dfast-2 -> 7-compress-greedy -> 7-compress-lazy-lower -> 7-compress-lazy-upper -> 7-compress-btultra2.
Lane B — Decompress (NEW): 7-decompress-baseline (block_content_buffer 0-fill skip + NEON-overshoot match copy) -> 7-decompress-fse (state machine register-resident + BMI2 pdep/pext + NEON bit-extract) -> 7-decompress-huffman.
Lane C — Memory (NEW): 7-memory-tracker (per-alloc-site backtrace tracker, prerequisite) -> 7-memory-medium -> 7-memory-large.
Lane D — Tooling (NEW): 7-tooling-seq-cmp (sequence-stream comparator, prerequisite for 7-compress-default audit).
Verification gates
Test plan