fix(dist): widen replica fan-out on promotion and add forward_promotion metric#131

Merged: hyp3rd merged 1 commit into main from feat/dist-cluster on May 12, 2026
@hyp3rd (Owner) commented on May 12, 2026

When `setImpl` promotes a local replica because the primary is unreachable, pass the full `owners` list to `replicateTo` instead of `owners[1:]`. The dead primary's slot is then included in best-effort replication, so `replicateTo`'s existing failure path queues a hinted handoff for it. Post-restart convergence is now bounded by `WithDistHintReplayInterval` (~200ms default) rather than by the next Merkle tick.
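The effect of widening the fan-out can be illustrated with a toy model. This is a minimal sketch, not the project's code: the `cluster` type, `newCluster`, and `send` are hypothetical stand-ins, and `replicateTo` here only mimics the behaviour described above (skip the local node, queue a hint for each target that fails).

```go
package main

import (
	"errors"
	"fmt"
)

// errUnreachable stands in for any transport error returned by a peer.
var errUnreachable = errors.New("peer unreachable")

// cluster is a hypothetical toy model of one node's view of its peers.
type cluster struct {
	self  string
	down  map[string]bool     // peers that currently reject writes
	hints map[string][]string // peer -> keys queued for hinted handoff
}

// newCluster builds a model for node self with the given peers marked down.
func newCluster(self string, down ...string) *cluster {
	c := &cluster{self: self, down: map[string]bool{}, hints: map[string][]string{}}
	for _, n := range down {
		c.down[n] = true
	}
	return c
}

// send pretends to ship a key to a peer and fails if the peer is down.
func (c *cluster) send(node, key string) error {
	if c.down[node] {
		return errUnreachable
	}
	return nil
}

// replicateTo fans a write out to every target except the local node and
// queues a hint for each target whose send fails.
func (c *cluster) replicateTo(targets []string, key string) {
	for _, node := range targets {
		if node == c.self {
			continue
		}
		if err := c.send(node, key); err != nil {
			c.hints[node] = append(c.hints[node], key)
		}
	}
}

func main() {
	owners := []string{"A", "B", "C"} // A is the dead primary, B is the promoted local replica

	// Narrow fan-out: owners[1:] never touches A, so no hint is queued for it.
	narrow := newCluster("B", "A")
	narrow.replicateTo(owners[1:], "k1")
	fmt.Println("hints for A with owners[1:]:", len(narrow.hints["A"]))

	// Wide fan-out: the send to A fails and the failure path queues a hint.
	wide := newCluster("B", "A")
	wide.replicateTo(owners, "k1")
	fmt.Println("hints for A with owners:", len(wide.hints["A"]))
}
```

In this model the only difference between the two calls is the slice passed in; the hint for the dead primary appears purely as a side effect of the failure path that already existed.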

Add a new OTel counter `dist.write.forward_promotion` (backed by the internal atomic `writeForwardPromotion`) that increments each time promotion fires. A steadily rising counter surfaces a flapping primary well before any spike in read/write errors.

Expand `TestDistSet_PromotesOnGenericForwardError` to:

- assert `WriteForwardPromotion` increments on every promotion
- assert `HintedQueued` increments (proving the hint was enqueued)
- heal chaos and confirm the original primary receives the write via natural hint-replay, using a `waitForLocalContains` polling helper to absorb scheduling jitter
- configure a 20ms hint-replay interval for fast, deterministic recovery assertions

Also updates `CHANGELOG.md` with a detailed description of the defense-in-depth approach and the extended test coverage.

@hyp3rd merged commit b64c273 into main on May 12, 2026
14 of 15 checks passed