fix(dist): plug Remove silent-swallow and hint-replay abandon on transport errors #132

Merged: hyp3rd merged 1 commit into main from feat/dist-cluster on May 12, 2026
@hyp3rd hyp3rd commented May 12, 2026

Two symmetric audit fixes for the distributed memory backend:

Remove forward promotion (removeImpl):
`removeImpl` previously blackholed the `ForwardRemove` result with `_ = transport.ForwardRemove(...)`, so a `Remove` against a downed primary silently succeeded while the stale value lingered on every owner. The new `forwardOrPromoteRemove` helper mirrors the `handleForwardPrimary` promotion contract: on any non-nil transport error, if the local node is a replica owner it applies the remove locally and fans out to peer replicas via `applyRemove(replicate=true)`; otherwise it returns the error. Promotions bump the shared `dist.write.forward_promotion` counter so `Remove` and `Set` promotions are observable on the same instrument.

Hint replay retention (processHint):
`processHint` previously retained a hint only when the error matched the in-process `ErrBackendNotFound` sentinel. Production HTTP/gRPC transports surface `net.OpError`, `io.EOF`, and `context.DeadlineExceeded` for a briefly-unreachable peer, causing the hint to be abandoned on its very first replay attempt instead of being retained through the outage (`recovery on :8083 timed out after 60s`). The hint TTL (`WithDistHintTTL`) still bounds total retry time, so a permanently-broken target still drains. The deprecated `hintedDropped` / `migrationHintDropped` OTel counters remain registered for stability but now only bump on queue-capacity overflow.

Tests:

- Add `TestDistRemove_PromotesOnGenericForwardError`: chaos at `DropRate=1.0`, asserts promoted Remove returns nil, clears local copy, bumps counter.
- Add `TestDistHintReplay_RetainsOnGenericReplayError`: 150ms chaos window, heals, asserts hint replays onto recovered peer.
- Rename `TestMigrationHint_TransportErrorBumpsDroppedCounter` -> `TestMigrationHint_TransportErrorKeepsEntry` to pin the new contract.

@hyp3rd hyp3rd merged commit 606e824 into main May 12, 2026
14 checks passed