feat(backend): async read-repair batching + unconditional ForwardSet-only repair #130
Merged
Introduce a `WithDistReadRepairBatch(interval, maxBatchSize)` option that routes quorum-read repair fan-out through an async coalescing queue (`repairQueue` in pkg/backend/dist_read_repair.go).

Repairs are queued by destination peer + key, retaining only the highest-version entry per (peer, key) — last-write-wins by version, tie-broken by origin. Concurrent reads of the same hot key produce one repair, not N; each collapsed duplicate bumps the new `dist.read.repair.coalesced` metric. The background flusher dispatches per-peer batches on the configured interval, or as soon as a peer's pending count hits `maxBatchSize`, using errgroup for parallel ForwardSet calls. `Stop()` drains the queue before returning; a crash exit loses queued repairs by design, with merkle anti-entropy as the convergence safety net.

Drop the defensive ForwardGet probe from `repairRemoteReplica`: every repair is now a single unconditional ForwardSet. The receiver's `applySet` already version-compares and noops downgrades, making the probe pure duplication (~50% wire-call reduction per repair, independent of whether batching is enabled).

New OTel metrics:

- `dist.read.repair.batched` — ForwardSet calls dispatched by the flusher
- `dist.read.repair.coalesced` — duplicate (peer, key) enqueues collapsed

Eight unit tests (pkg/backend/dist_read_repair_test.go) cover coalesce semantics, distinct-peer independence, parallel per-peer flush, nil-transport noop, size-threshold inline flush, stop-drain guarantee, isHigherVersion tie-break, and concurrent-enqueue race safety. Three integration tests (tests/hypercache_distmemory_readrepair_batch_test.go) drive a 3-node RF=3 Quorum cluster end-to-end.
Also:

- Refactor `Stop()` stop-channel teardown into `closeBackgroundLoops()`
- Fix Makefile pre-commit target: guard pyenv activation with `command -v`
- Add golang.org/x/sync v0.20.0 (errgroup); bump shamaton/msgpack to v3.1.1
- Add cspell words: amortisation, coalescer, distmemory, errgroup, readrepair
- Document the batching option in docs/operations.md under "Tuning — read-repair batching"
Previously, `handleForwardPrimary` only promoted to a replica owner when the forward error matched the in-process transport's `ErrBackendNotFound` sentinel. HTTP/gRPC transports against a stopped node surface `net.OpError`, `io.EOF`, or `context.DeadlineExceeded` instead — causing writes to fail silently for keys whose primary had just been killed, rather than falling through to a replica.

Promotion now triggers on any non-nil forward error when the local node is listed in `owners[1:]` and still owns the key locally (defensive against a stale ring snapshot). Spurious promotion on a transient blip is benign: `applySet` version-compares on the receiver, and `chooseNewer` / merkle anti-entropy reconcile divergent `(version, origin)` pairs via the existing LWW rule.

Add `TestDistSet_PromotesOnGenericForwardError`, which uses chaos hooks at `DropRate=1.0` to deterministically force a generic forward error and asserts the Set succeeds via promotion; `TestDistFailureRecovery` continues to pass unchanged (the change widens the gate, it doesn't narrow it).

Also add a `check_command_exists` Makefile macro and a CHANGELOG entry.