fix(dist): plug Remove silent-swallow and hint-replay abandon on transport errors#132
Merged
Conversation
…sport errors Two symmetric audit fixes for the distributed memory backend: Remove forward promotion (removeImpl): `removeImpl` previously blackholed the ForwardRemove result with `_ = transport.ForwardRemove(...)`, so a Remove against a downed primary silently succeeded while the stale value lingered on every owner. The new `forwardOrPromoteRemove` helper mirrors the `handleForwardPrimary` promotion contract: on any non-nil transport error, if the local node is a replica owner it applies the remove locally and fans out to peer replicas via `applyRemove(replicate=true)`; otherwise it returns the error. Promotions bump the shared `dist.write.forward_promotion` counter so Remove and Set promotions are observable on the same instrument. Hint replay retention (processHint): `processHint` previously retained a hint only when the error matched the in-process `ErrBackendNotFound` sentinel. Production HTTP/gRPC transports surface `net.OpError`, `io.EOF`, and `context.DeadlineExceeded` for a briefly-unreachable peer, causing the hint to be abandoned on its very first replay attempt instead of being retained through the outage (`recovery on :8083 timed out after 60s`). The hint TTL (`WithDistHintTTL`) still bounds total retry time, so a permanently-broken target still drains. The deprecated `hintedDropped` / `migrationHintDropped` OTel counters remain registered for stability but now only bump on queue-capacity overflow. Tests: - Add TestDistRemove_PromotesOnGenericForwardError: chaos at DropRate=1.0, asserts promoted Remove returns nil, clears local copy, bumps counter. - Add TestDistHintReplay_RetainsOnGenericReplayError: 150ms chaos window, heals, asserts hint replays onto recovered peer. - Rename TestMigrationHint_TransportErrorBumpsDroppedCounter -> TestMigrationHint_TransportErrorKeepsEntry to pin the new contract.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Two symmetric audit fixes for the distributed memory backend:
Remove forward promotion (removeImpl):
removeImplpreviously blackholed the ForwardRemove result with_ = transport.ForwardRemove(...), so a Remove against a downed primary silently succeeded while the stale value lingered on every owner. The newforwardOrPromoteRemovehelper mirrors thehandleForwardPrimarypromotion contract: on any non-nil transport error, if the local node is a replica owner it applies the remove locally and fans out to peer replicas viaapplyRemove(replicate=true); otherwise it returns the error. Promotions bump the shareddist.write.forward_promotioncounter so Remove and Set promotions are observable on the same instrument.Hint replay retention (processHint):
processHintpreviously retained a hint only when the error matched the in-processErrBackendNotFoundsentinel. Production HTTP/gRPC transports surfacenet.OpError,io.EOF, andcontext.DeadlineExceededfor a briefly-unreachable peer, causing the hint to be abandoned on its very first replay attempt instead of being retained through the outage (recovery on :8083 timed out after 60s). The hint TTL (WithDistHintTTL) still bounds total retry time, so a permanently-broken target still drains. The deprecatedhintedDropped/migrationHintDroppedOTel counters remain registered for stability but now only bump on queue-capacity overflow.Tests: