feat(backend): add adaptive Merkle anti-entropy backoff scheduling #123

Merged: hyp3rd merged 4 commits into main from feat/dist-mem-cache on May 11, 2026

hyp3rd (Owner) commented May 11, 2026

Introduce WithDistMerkleAdaptiveBackoff(maxFactor) to let the
auto-sync loop progressively back off when all peers are in sync.

Behaviour:

  • Each clean tick (zero divergence across all peers) doubles the sleep
    interval, capped at maxFactor × base interval.
  • Any dirty tick or sync error snaps the factor back to 1× immediately,
    so recovery is never lazy.
  • Disabled by default (maxFactor ≤ 1), preserving existing behaviour
    for all current deployments.

Implementation details:

  • Replace fixed time.Ticker in autoSyncLoop with a reset time.Timer
    driven by nextAutoSyncDelay.
  • Refactor SyncWith into syncWithStatus (returns clean/dirty signal)
    and a thin public SyncWith wrapper to keep the API unchanged.
  • runAutoSyncTick now returns a clean bool consumed by
    updateAutoSyncBackoff.

Observability:

  • New OTel gauge dist.auto_sync.backoff_factor (current multiplier).
  • New OTel counter dist.auto_sync.clean_ticks (cumulative clean ticks).
  • Factor changes are logged once at Info level; no per-tick spam.
  • DistMetrics exposes AutoSyncBackoffFactor and AutoSyncCleanTicks.

Tests (pkg/backend/dist_adaptive_backoff_test.go):

  • TestAdaptiveBackoff_DisabledIsNoop — back-compat guarantee.
  • TestAdaptiveBackoff_RampsAndCaps — doubling, cap enforcement, dirty reset.
  • TestAdaptiveBackoff_NextDelayMultiplies — delay calculation contract.
  • TestAdaptiveBackoff_MaxFactorOneStaysDisabled — edge case: maxFactor=1.
  • TestAdaptiveBackoff_OptionNormalisesNegatives — option validation.

hyp3rd added 3 commits May 11, 2026 13:47
- github.com/gofiber/utils/v2: v2.0.4 → v2.0.5
- github.com/fxamacker/cbor/v2: v2.9.1 → v2.9.2

Routine patch-level updates to indirect dependencies; no API changes expected.

hyp3rd self-assigned this May 11, 2026

Tag hints at queue time with their origin (replication fan-out vs
rebalance migration) and track five new per-source OTel counters:

- dist.migration.queued      – migration hints enqueued
- dist.migration.replayed    – migration hints successfully delivered
- dist.migration.expired     – migration hints aged past TTL
- dist.migration.dropped     – migration hints discarded (transport error or global cap)
- dist.migration.last_age_ns – queue residency of the most-recently replayed
                               migration hint; direct signal of new-primary
                               reachability during rolling deploys

Existing dist.hinted.* counters continue to aggregate across both
sources; replication-only counts are derivable as (aggregate - migration).

No second queue or drain loop is introduced. The implementation extends
the existing hinted-handoff infrastructure with a lightweight hintSource
tag on hintedEntry and matching per-source counter branches on every
terminal path in queueHint and processHint (global-cap drop, queue
success, expiry, replay success, and transport-error drop).

Adds pkg/backend/dist_migration_hint_test.go with six focused tests
covering source-tag preservation through queue → replay, per-source
counter increments on every terminal path, and the not-found keep path.

hyp3rd merged commit c8406d0 into main on May 11, 2026
13 of 15 checks passed