Merged
21 changes: 21 additions & 0 deletions CHANGELOG.md
@@ -8,6 +8,27 @@ All notable changes to HyperCache are recorded here. The format follows

### Added

- **Migration-source observability for the hint queue.** Hints produced by rebalance migrations are now
tagged at queue time and tracked in a dedicated set of counters alongside the existing aggregate
metrics. Five new OTel metrics are added: `dist.migration.queued`, `dist.migration.replayed`,
`dist.migration.expired`, `dist.migration.dropped`, and `dist.migration.last_age_ns` (queue residency of
the most-recently-replayed migration hint — direct signal of new-primary reachability during rolling
deploys). Existing `dist.hinted.*` counters keep their meaning as the aggregate across both sources, so
operators can derive replication-only as `aggregate - migration`. Implementation reuses the proven hint
queue infrastructure (TTL, caps, replay, drop logic) — no second queue, no second drain loop.
Tests in [`pkg/backend/dist_migration_hint_test.go`](pkg/backend/dist_migration_hint_test.go) cover
source-tag preservation through queue→replay, per-source counter increments on every terminal path
(replay success, expired, transport drop, global-cap drop), and the not-found keep-in-queue path.
- **Adaptive Merkle anti-entropy scheduling.** New
[`backend.WithDistMerkleAdaptiveBackoff(maxFactor)`](pkg/backend/dist_memory.go) option lets the auto-sync
loop double its sleep interval after each tick that finds zero divergence across every peer, capped at
`maxFactor`. Any tick with at least one dirty peer snaps the factor back to 1× immediately — recovery is
never lazy. Disabled by default (factor=0 or 1) so existing deployments see no behavior change. Two new
OTel metrics expose the state: `dist.auto_sync.backoff_factor` (gauge) and `dist.auto_sync.clean_ticks`
(counter). Each factor change is logged once at Info (`merkle auto-sync backoff factor changed`) — no
per-tick log spam. Unit tests in
[`pkg/backend/dist_adaptive_backoff_test.go`](pkg/backend/dist_adaptive_backoff_test.go) cover the ramp,
the cap, the dirty-tick reset, and the disabled-by-default back-compat invariant.
- **Structured logging for background loops and cluster lifecycle.** HyperCache gained a
`WithLogger(*slog.Logger)` option ([config.go](config.go)) that wires a structured logger through the
wrapper. Previously the eviction loop, expiration loop, and HyperCache lifecycle ran fully silent —
2 changes: 1 addition & 1 deletion go.mod
@@ -27,7 +27,7 @@ require (
github.com/go-logr/logr v1.4.3 // indirect
github.com/go-logr/stdr v1.2.2 // indirect
github.com/gofiber/schema v1.7.1 // indirect
-github.com/gofiber/utils/v2 v2.0.4 // indirect
+github.com/gofiber/utils/v2 v2.0.5 // indirect
github.com/google/uuid v1.6.0 // indirect
github.com/klauspost/compress v1.18.6 // indirect
github.com/mattn/go-colorable v0.1.14 // indirect
8 changes: 4 additions & 4 deletions go.sum
@@ -10,8 +10,8 @@ github.com/coreos/go-oidc/v3 v3.18.0 h1:V9orjXynvu5wiC9SemFTWnG4F45v403aIcjWo0d4
github.com/coreos/go-oidc/v3 v3.18.0/go.mod h1:DYCf24+ncYi+XkIH97GY1+dqoRlbaSI26KVTCI9SrY4=
github.com/davecgh/go-spew v1.1.1 h1:vj9j/u1bqnvCEfJOwUhtlOARqs3+rkHYY13jYWTU97c=
github.com/davecgh/go-spew v1.1.1/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=
-github.com/fxamacker/cbor/v2 v2.9.1 h1:2rWm8B193Ll4VdjsJY28jxs70IdDsHRWgQYAI80+rMQ=
-github.com/fxamacker/cbor/v2 v2.9.1/go.mod h1:vM4b+DJCtHn+zz7h3FFp/hDAI9WNWCsZj23V5ytsSxQ=
+github.com/fxamacker/cbor/v2 v2.9.2 h1:X4Ksno9+x3cz0TZv69ec1hxP/+tymuR8PXQJyDwfh78=
+github.com/fxamacker/cbor/v2 v2.9.2/go.mod h1:vM4b+DJCtHn+zz7h3FFp/hDAI9WNWCsZj23V5ytsSxQ=
github.com/go-jose/go-jose/v4 v4.1.4 h1:moDMcTHmvE6Groj34emNPLs/qtYXRVcd6S7NHbHz3kA=
github.com/go-jose/go-jose/v4 v4.1.4/go.mod h1:x4oUasVrzR7071A4TnHLGSPpNOm2a21K9Kf04k1rs08=
github.com/go-logr/logr v1.2.2/go.mod h1:jdQByPbusPIv2/zmleS9BjJVeZ6kBagPoEUsqbVz/1A=
@@ -25,8 +25,8 @@ github.com/gofiber/fiber/v3 v3.2.0 h1:g9+09D320foINPpCnR3ibQ5oBEFHjAWRRfDG1te54u
github.com/gofiber/fiber/v3 v3.2.0/go.mod h1:FHOsc2Db7HhHpsE62QAaJlXVV1pNkbZEptZ4jtti7m4=
github.com/gofiber/schema v1.7.1 h1:oSJBKdgP8JeIME4TQSAqlNKTU2iBB+2RNmKi8Nsc+TI=
github.com/gofiber/schema v1.7.1/go.mod h1:A/X5Ffyru4p9eBdp99qu+nzviHzQiZ7odLT+TwxWhbk=
-github.com/gofiber/utils/v2 v2.0.4 h1:WwAxUA7L4MW2DjdEHF234lfqvBqd2vYYuBtA9TJq2ec=
-github.com/gofiber/utils/v2 v2.0.4/go.mod h1:GGERKU3Vhj5z6hS8YKvxL99A54DjOvTFZ0cjZnG4Lj4=
+github.com/gofiber/utils/v2 v2.0.5 h1:IMXoI2A5Dao/aMMBURTNxnhbtQO4kUwUFOgcwFSIjLU=
+github.com/gofiber/utils/v2 v2.0.5/go.mod h1:FwwopfzwAQsoXLCHhOT24eH2jQfBgrrra9S5p0+luxg=
github.com/google/go-cmp v0.7.0 h1:wk8382ETsv4JYUZwIsn6YpYiWiBsYLSJiTsyBybVuN8=
github.com/google/go-cmp v0.7.0/go.mod h1:pXiqmnSA92OHEEa9HXL2W4E7lf9JzCmGVUdgjX3N/iU=
github.com/google/uuid v1.6.0 h1:NIvaJDMOsjHA8n1jAhLSgzrAzy1Hgr+hNrb57e+94F0=
171 changes: 171 additions & 0 deletions pkg/backend/dist_adaptive_backoff_test.go
@@ -0,0 +1,171 @@
package backend

import (
"log/slog"
"testing"
"time"
)

// newBackoffTestDM constructs the bare-minimum DistMemory the backoff
// logic needs: a logger (the real constructor always wires one;
// updateAutoSyncBackoff dereferences it on factor change). Real
// production paths go through NewDistMemory; this fixture exists only
// so we can exercise the pure scheduling logic without a transport.
func newBackoffTestDM(maxFactor int) *DistMemory {
return &DistMemory{
logger: slog.New(slog.DiscardHandler),
autoSyncMaxBackoffFactor: maxFactor,
}
}

// TestAdaptiveBackoff_DisabledIsNoop confirms the default behavior is
// unchanged: with no WithDistMerkleAdaptiveBackoff option set, neither
// the factor nor the clean-tick counter ever moves regardless of tick
// outcome, and nextAutoSyncDelay always returns the raw interval. This
// is the back-compat guarantee — existing deployments see no behavioral
// shift on upgrade.
func TestAdaptiveBackoff_DisabledIsNoop(t *testing.T) {
t.Parallel()

dm := newBackoffTestDM(0)

// Several clean ticks — counter and factor must stay flat.
for range 5 {
dm.updateAutoSyncBackoff(true)
}

if got := dm.autoSyncBackoffFactor.Load(); got != 0 {
t.Errorf("backoff factor: want 0 (uninitialized, disabled), got %d", got)
}

if got := dm.autoSyncCleanTicks.Load(); got != 0 {
t.Errorf("clean ticks: want 0 (disabled), got %d", got)
}

if got := dm.nextAutoSyncDelay(30 * time.Second); got != 30*time.Second {
t.Errorf("nextAutoSyncDelay (disabled): want 30s, got %v", got)
}
}

// TestAdaptiveBackoff_RampsAndCaps walks the factor through its
// doubling progression up to the configured maximum, verifies the cap
// holds across additional clean ticks, and confirms a single dirty
// tick snaps the factor back to 1.
func TestAdaptiveBackoff_RampsAndCaps(t *testing.T) {
t.Parallel()

dm := newBackoffTestDM(8)
dm.autoSyncBackoffFactor.Store(1) // mirrors what the loop does at start

// 1 -> 2 -> 4 -> 8 -> 8 (capped) -> 8 (still capped).
want := []int64{2, 4, 8, 8, 8}
for i, w := range want {
dm.updateAutoSyncBackoff(true)

got := dm.autoSyncBackoffFactor.Load()
if got != w {
t.Errorf("clean tick %d: want factor %d, got %d", i+1, w, got)
}
}

if got := dm.autoSyncCleanTicks.Load(); got != int64(len(want)) {
t.Errorf("clean-ticks counter: want %d, got %d", len(want), got)
}

// Dirty tick: factor must immediately reset to 1, counter must not
// increment (clean-ticks only tracks clean ticks).
dm.updateAutoSyncBackoff(false)

if got := dm.autoSyncBackoffFactor.Load(); got != 1 {
t.Errorf("after dirty tick: want factor 1 (reset), got %d", got)
}

if got := dm.autoSyncCleanTicks.Load(); got != int64(len(want)) {
t.Errorf("clean-ticks counter must not bump on dirty tick: want %d, got %d", len(want), got)
}

// Ramp begins again from 1.
dm.updateAutoSyncBackoff(true)

if got := dm.autoSyncBackoffFactor.Load(); got != 2 {
t.Errorf("clean tick after reset: want factor 2, got %d", got)
}
}

// TestAdaptiveBackoff_NextDelayMultiplies confirms the timer-driven
// scheduler multiplies the base interval by the current factor. This
// is the contract the autoSyncLoop relies on to actually sleep longer
// — without this, the metric would move but the loop would still wake
// every base interval.
func TestAdaptiveBackoff_NextDelayMultiplies(t *testing.T) {
t.Parallel()

dm := newBackoffTestDM(16)
dm.autoSyncBackoffFactor.Store(4)

base := 30 * time.Second

got := dm.nextAutoSyncDelay(base)
if got != 4*base {
t.Errorf("nextAutoSyncDelay at factor 4: want %v, got %v", 4*base, got)
}

// Factor < 1 (shouldn't happen under normal operation, but guard
// against a future zero-store race): clamp to 1.
dm.autoSyncBackoffFactor.Store(0)

if got := dm.nextAutoSyncDelay(base); got != base {
t.Errorf("nextAutoSyncDelay with factor 0 should clamp to 1×: want %v, got %v", base, got)
}
}

// TestAdaptiveBackoff_MaxFactorOneStaysDisabled pins the edge case
// where an operator passes maxFactor=1: that's semantically "disabled"
// (no multiplication), and we treat it as such. The option helper
// already normalizes sub-1 values, but a literal 1 is still valid
// configuration and must behave like the disabled default.
func TestAdaptiveBackoff_MaxFactorOneStaysDisabled(t *testing.T) {
t.Parallel()

dm := newBackoffTestDM(1)
dm.autoSyncBackoffFactor.Store(1)

for range 5 {
dm.updateAutoSyncBackoff(true)
}

if got := dm.autoSyncBackoffFactor.Load(); got != 1 {
t.Errorf("maxFactor=1 must keep factor at 1, got %d", got)
}

if got := dm.autoSyncCleanTicks.Load(); got != 0 {
t.Errorf("maxFactor=1 must not count clean ticks, got %d", got)
}
}

// TestAdaptiveBackoff_OptionNormalizesNegatives covers the option
// helper itself: negative or zero values are coerced to 0 (the
// "disabled" sentinel), preventing accidental enablement with a
// surprising factor.
func TestAdaptiveBackoff_OptionNormalizesNegatives(t *testing.T) {
t.Parallel()

cases := []struct {
in int
want int
}{
{in: -5, want: 0},
{in: 0, want: 0},
{in: 1, want: 1},
{in: 8, want: 8},
}

for _, tc := range cases {
dm := newBackoffTestDM(0)
WithDistMerkleAdaptiveBackoff(tc.in)(dm)

if dm.autoSyncMaxBackoffFactor != tc.want {
t.Errorf("WithDistMerkleAdaptiveBackoff(%d): want %d, got %d", tc.in, tc.want, dm.autoSyncMaxBackoffFactor)
}
}
}