fix(dist): widen replica fan-out on promotion and add forward_promotion metric #131
Merged
When `setImpl` promotes a local replica due to an unreachable primary, pass the full `owners` list to `replicateTo` (instead of `owners[1:]`). This ensures the dead primary's slot is included in best-effort replication, so `replicateTo`'s existing failure path queues a hinted handoff for it. Post-restart convergence is now bounded by `WithDistHintReplayInterval` (~200ms default) rather than the next merkle tick.

Add a new OTel counter `dist.write.forward_promotion` (internal atomic `writeForwardPromotion`) that increments each time promotion fires. A steadily rising counter surfaces a flapping primary well before any read/write error spikes.

Expand `TestDistSet_PromotesOnGenericForwardError` to:

- assert `WriteForwardPromotion` increments on every promotion
- assert `HintedQueued` increments (proving the hint was enqueued)
- heal chaos and confirm the original primary receives the write via natural hint-replay, using a `waitForLocalContains` polling helper to absorb scheduling jitter
- configure a 20ms hint-replay interval for fast, deterministic recovery assertions

Also updates CHANGELOG.md with a detailed description of the defense-in-depth approach and the extended test coverage.
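The promotion path described above can be illustrated with a small, self-contained model. This is a sketch under stated assumptions, not the project's code: the `cluster`, `node`, and `hint` types, the `send` and `replayHints` functions, and the shapes of `set` and `replicateTo` are all hypothetical stand-ins. Only the idea comes from the description: on promotion, fan out to the full `owners` list so that the failed write to the dead primary queues a hint.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// All types below are hypothetical; they only model the behaviour described in the PR.
type node struct {
	up   bool
	data map[string]string
}

type hint struct{ key, value string }

type cluster struct {
	self             string
	nodes            map[string]*node
	hints            map[string][]hint // node ID -> writes queued for replay
	forwardPromotion atomic.Int64      // models dist.write.forward_promotion
	hintedQueued     atomic.Int64
}

var errUnreachable = errors.New("node unreachable")

func newCluster(self string, ids ...string) *cluster {
	c := &cluster{self: self, nodes: map[string]*node{}, hints: map[string][]hint{}}
	for _, id := range ids {
		c.nodes[id] = &node{up: true, data: map[string]string{}}
	}
	return c
}

func (c *cluster) send(id, key, value string) error {
	n := c.nodes[id]
	if !n.up {
		return errUnreachable
	}
	n.data[key] = value
	return nil
}

// replicateTo writes best-effort to every target except self and queues a hint on failure.
func (c *cluster) replicateTo(targets []string, key, value string) {
	for _, id := range targets {
		if id == c.self {
			continue
		}
		if err := c.send(id, key, value); err != nil {
			c.hints[id] = append(c.hints[id], hint{key, value})
			c.hintedQueued.Add(1)
		}
	}
}

// set treats owners[0] as the primary. If forwarding to the primary fails and
// this node is a replica, it promotes itself and fans out to the full owners list.
func (c *cluster) set(owners []string, key, value string) error {
	if owners[0] == c.self {
		c.nodes[c.self].data[key] = value
		c.replicateTo(owners[1:], key, value)
		return nil
	}
	err := c.send(owners[0], key, value)
	if err == nil {
		return nil
	}
	for _, id := range owners[1:] {
		if id == c.self {
			c.forwardPromotion.Add(1)
			c.nodes[c.self].data[key] = value
			c.replicateTo(owners, key, value) // full list: the dead primary gets a hint
			return nil
		}
	}
	return err
}

// replayHints retries queued writes and keeps the ones that still fail.
func (c *cluster) replayHints() {
	for id, queued := range c.hints {
		var remaining []hint
		for _, h := range queued {
			if c.send(id, h.key, h.value) != nil {
				remaining = append(remaining, h)
			}
		}
		c.hints[id] = remaining
	}
}

func main() {
	c := newCluster("B", "A", "B", "C")
	c.nodes["A"].up = false // primary is unreachable
	_ = c.set([]string{"A", "B", "C"}, "k", "v")
	fmt.Println("promotions:", c.forwardPromotion.Load(), "hints:", c.hintedQueued.Load())
	c.nodes["A"].up = true // primary comes back
	c.replayHints()
	fmt.Println("primary has key:", c.nodes["A"].data["k"] == "v")
}
```

In this model, passing `owners[1:]` on the promotion path would skip the primary entirely, so no hint would be queued and the primary would have to wait for some other repair mechanism to catch up.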