feat(backend): async read-repair batching + unconditional ForwardSet-only repair #130
Merged
Introduce a `WithDistReadRepairBatch(interval, maxBatchSize)` option that routes quorum-read repair fan-out through an async coalescing queue (`repairQueue` in pkg/backend/dist_read_repair.go).

Repairs are queued by destination peer + key, retaining only the highest-version entry per (peer, key) — last-write-wins by version, tie-broken by origin. Concurrent reads of the same hot key produce one repair, not N; each collapsed duplicate bumps the new `dist.read.repair.coalesced` metric. The background flusher dispatches per-peer batches on the configured interval, or as soon as a peer's pending count hits `maxBatchSize`, using errgroup for parallel ForwardSet calls. `Stop()` drains the queue before returning; a crash exit loses queued repairs by design, with merkle anti-entropy as the convergence safety net.

Drop the defensive ForwardGet probe from `repairRemoteReplica`: every repair is now a single unconditional ForwardSet. The receiver's `applySet` already version-compares and noops downgrades, making the probe pure duplication (~50% wire-call reduction per repair, independent of whether batching is enabled).

New OTel metrics:

- `dist.read.repair.batched` — ForwardSet calls dispatched by the flusher
- `dist.read.repair.coalesced` — duplicate (peer, key) enqueues collapsed

Eight unit tests (pkg/backend/dist_read_repair_test.go) cover coalesce semantics, distinct-peer independence, parallel per-peer flush, nil-transport noop, size-threshold inline flush, stop-drain guarantee, isHigherVersion tie-break, and concurrent-enqueue race safety. Three integration tests (tests/hypercache_distmemory_readrepair_batch_test.go) drive a 3-node RF=3 Quorum cluster end-to-end.
Also:

- Refactor `Stop()` stop-channel teardown into `closeBackgroundLoops()`
- Fix Makefile pre-commit target: guard pyenv activation with `command -v`
- Add golang.org/x/sync v0.20.0 (errgroup); bump shamaton/msgpack to v3.1.1
- Add cspell words: amortisation, coalescer, distmemory, errgroup, readrepair
- Document the batching option in docs/operations.md under "Tuning — read-repair batching"
Previously, `handleForwardPrimary` only promoted to a replica owner when the forward error matched the in-process transport's `ErrBackendNotFound` sentinel. HTTP/gRPC transports against a stopped node surface `net.OpError`, `io.EOF`, or `context.DeadlineExceeded` instead — causing writes to fail silently for keys whose primary had just been killed, rather than falling through to a replica.

Promotion now triggers on any non-nil forward error when the local node is listed in `owners[1:]` and still owns the key locally (defensive against a stale ring snapshot). Spurious promotion on a transient blip is benign: `applySet` version-compares on the receiver, and `chooseNewer` / merkle anti-entropy reconcile divergent `(version, origin)` pairs via the existing LWW rule.

Add `TestDistSet_PromotesOnGenericForwardError`, which uses chaos hooks at `DropRate=1.0` to deterministically force a generic forward error and asserts the Set succeeds via promotion; `TestDistFailureRecovery` continues to pass unchanged (the change widens the gate, it doesn't narrow it).

Also add a `check_command_exists` Makefile macro and a CHANGELOG entry.