diff --git a/cspell.config.yaml b/cspell.config.yaml
index cdf2e9a..122ac19 100644
--- a/cspell.config.yaml
+++ b/cspell.config.yaml
@@ -59,6 +59,7 @@ words:
- Cbor
- cespare
- chans
+ - cheatsheet
- cmap
- Cmder
- codacy
@@ -142,6 +143,7 @@ words:
- ints
- ireturn
- Itemm
+ - journalctl
- keyf
- keypair
- lamport
diff --git a/docs/index.md b/docs/index.md
index 9f74b3c..11e2c24 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -6,9 +6,8 @@ hide:
# HyperCache
-Distributed in-memory cache for Go. Sharded for concurrency, replicated for
-durability under partial failure, observable from the start, and shipped as
-both a library and a single-binary HTTP service.
+Distributed in-memory cache for Go. Sharded for concurrency, replicated for durability under partial failure,
+observable from the start, and shipped as both a library and a single-binary HTTP service.
@@ -16,19 +15,21 @@ both a library and a single-binary HTTP service.
- :material-server-network: **[5-Node Cluster](cluster.md)** — boot a real cluster with `docker compose`.
- :fontawesome-brands-kubernetes: **[Helm Chart](helm.md)** — deploy on Kubernetes with stable identities.
- :material-tools: **[Operations Runbook](operations.md)** — split-brain, hint queues, drain, capacity.
+- :material-bell-alert: **[On-call Cheatsheet](oncall.md)** — symptom → log grep → metric → action for paged
+ operators.
## Why HyperCache
-| | What you get | Why it matters |
-|---|---|---|
-| **Sharded by default** | 32 per-shard mutexes routed by xxhash | Write throughput scales with cores, no global lock. |
-| **Distributed backend** | Consistent hashing, configurable replication, quorum reads/writes | A single failed node does not lose keys. |
-| **Hinted handoff** | Failed forwards queue with TTL, replay on the dist HTTP transport | Transient peer outages don't drop replicas. |
-| **SWIM heartbeat** | Direct + indirect probes; self-refute via incarnation gossip | Filters caller-side network blips, recovers from false suspicion. |
-| **Observable** | `slog` logger + OpenTelemetry tracing + OpenTelemetry metrics, all opt-in | Plug into your existing pipeline, no extra deps. |
-| **Operator-friendly** | `Drain` endpoint, cursor-paged key enumeration, JSON error envelopes | Designed for rolling deploys and on-call clarity. |
+| | What you get | Why it matters |
+| ----------------------- | ------------------------------------------------------------------------- | ----------------------------------------------------------------- |
+| **Sharded by default** | 32 per-shard mutexes routed by xxhash | Write throughput scales with cores, no global lock. |
+| **Distributed backend** | Consistent hashing, configurable replication, quorum reads/writes | A single failed node does not lose keys. |
+| **Hinted handoff** | Failed forwards queue with TTL, replay on the dist HTTP transport | Transient peer outages don't drop replicas. |
+| **SWIM heartbeat** | Direct + indirect probes; self-refute via incarnation gossip | Filters caller-side network blips, recovers from false suspicion. |
+| **Observable** | `slog` logger + OpenTelemetry tracing + OpenTelemetry metrics, all opt-in | Plug into your existing pipeline, no extra deps. |
+| **Operator-friendly** | `Drain` endpoint, cursor-paged key enumeration, JSON error envelopes | Designed for rolling deploys and on-call clarity. |
## How it fits together
@@ -57,26 +58,20 @@ flowchart LR
Shard1 <-.HTTP replicate.-> Peer1
```
-The `HyperCache` wrapper is a thin facade you embed in your application.
-The `DistMemory` backend handles sharding, replication, and the cluster
-plane. Two HTTP listeners run per process: a peer-to-peer one for
-replication and gossip, and a separate management one for admin and
-observability.
+The `HyperCache` wrapper is a thin facade you embed in your application. The `DistMemory` backend handles
+sharding, replication, and the cluster plane. Two HTTP listeners run per process: a peer-to-peer one for
+replication and gossip, and a separate management one for admin and observability.
## Two ways to use it
-**As a library** — embed `HyperCache` directly in your Go application; it
-uses the in-memory or distributed backend in-process. See
-[Quickstart](quickstart.md).
+**As a library** — embed `HyperCache` directly in your Go application; it uses the in-memory or distributed
+backend in-process. See [Quickstart](quickstart.md).
-**As a service** — run the [`hypercache-server`](server.md) binary; clients
-talk to it over a REST API. See [5-Node Cluster](cluster.md) for the
-docker-compose recipe and [Helm Chart](helm.md) for Kubernetes.
+**As a service** — run the [`hypercache-server`](server.md) binary; clients talk to it over a REST API. See
+[5-Node Cluster](cluster.md) for the docker-compose recipe and [Helm Chart](helm.md) for Kubernetes.
## Project status
-The distributed backend is production-ready as of v0.6.0 — see the
-[changelog](changelog.md) for the full list of features and fixes that
-landed during the productionization push (Phases A through E in the
-upstream history). Operations procedures live in the
-[runbook](operations.md).
+The distributed backend is production-ready as of v0.6.0 — see the [changelog](changelog.md) for the full list
+of features and fixes that landed during the productionization push (Phases A through E in the upstream
+history). Operations procedures live in the [runbook](operations.md).
diff --git a/docs/oncall.md b/docs/oncall.md
new file mode 100644
index 0000000..37dcea4
--- /dev/null
+++ b/docs/oncall.md
@@ -0,0 +1,346 @@
+---
+title: On-call cheatsheet
+description: Symptom → log grep → metric → action map for HyperCache operators.
+---
+
+# On-call cheatsheet
+
+You got paged. This page exists to take you from a symptom to a diagnosis in under sixty seconds. Each section
+is a single failure shape: what you'll see, where to look, what to do next. Deeper operating procedures live
+in the [operations runbook](operations.md); start here, descend there.
+
+Every log line quoted below is a real string the binary emits — copy it into `grep -F` directly. Every metric
+name is from `DistMemory.Metrics()` and its OTel mirror (`dist.*`) or from the wrapper-level `StatsCollector`.
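+
+Minimal examples of both, with assumed specifics: the systemd unit name `hypercache` (as in the
+[quick filters](#quick-filters) below), a `<host>` placeholder, and the assumption that the management
+listener (`:8081`) serves `/dist/metrics`:
+
+```sh
+# Grep one of the quoted log strings verbatim:
+journalctl -u hypercache -o cat | grep -F 'msg="peer marked suspect (timeout)"'
+
+# Pull the current metrics snapshot as JSON:
+curl -s "http://<host>:8081/dist/metrics" | jq .
+```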
+
+## Triage matrix
+
+| Symptom | Likely cause | Jump to |
+| ------------------------------------------------ | ------------------------------------------------- | --------------------------------------------------------- |
+| Node won't start / never appears in cluster | bind failure, bad config, OIDC issuer unreachable | [Node startup](#node-startup) |
+| Cluster has the right members but cache is empty | new node still rebalancing in | [Cold replica](#cold-replica) |
+| Peers flapping in `/cluster/members` | network jitter, indirect probes failing | [Heartbeat flapping](#heartbeat-flapping) |
+| Hints building up faster than they drain | one peer unreachable or rejecting writes | [Hint queue](#hint-queue-building) |
+| 401 / 403 on requests that should work | misconfigured token, missing scope, OIDC expired | [Auth failures](#auth-failures) |
+| Eviction running hot, latency spiking on Set | cache at capacity, eviction can't keep up | [Eviction pressure](#eviction-pressure) |
+| Replicas diverging | partition healed, version conflicts | [Split-brain reconciliation](#split-brain-reconciliation) |
+| Drain stuck / load balancer still routing | `/health` not flipping or LB caching | [Drain not draining](#drain-not-draining) |
+
+## Node startup
+
+**What you'll see (good).** Each of these lines exactly once per node, in order, on every boot:
+
+```text
+msg="hypercache-server starting" api_addr=:8080 mgmt_addr=:8081 dist_addr=:7946 oidc_enabled=true
+msg="cluster join: node starting" node_id=cache-0 replication=3 virtual_nodes=128 peers_known=4
+msg="dist HTTP listener started" addr=:7946
+msg="heartbeat loop started" interval=1s
+msg="rebalance loop started" interval=30s
+msg="hint replay loop started" interval=15s
+```
+
+If you see all six lines, the node has bound its ports, advertised itself to peers, and started its background
+loops. Everything after this point is steady-state.
+
+**What you'll see (bad).**
+
+- `msg="dist HTTP listener bind failed"` — another process is already bound to `HYPERCACHE_DIST_ADDR`. Check
+ for a stale pod / process on the host.
+- `msg="oidc verifier construction failed"` — IdP discovery URL unreachable from the pod. Check
+ `HYPERCACHE_OIDC_ISSUER`, DNS, and egress firewall rules. The process exits with code 1 (so the orchestrator
+ will restart it; check `kubectl describe pod` for the loop).
+- No `cluster join` line at all — the binary crashed before `buildHyperCache` returned. Look earlier in the
+ log for `hypercache construction failed` with `err=...`.
+
+**Metrics to check.** `dist.members.alive` (gauge) on every other node should tick up by one within
+`WithDistHeartbeat`'s `aliveAfter` window. `dist.membership.version` increments on each membership change, so
+it also bumps once per peer that learns about the new node.
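+
+To confirm a boot made it through all six lines (the unit name and the `-b` current-boot filter are
+assumptions; adjust to your log pipeline):
+
+```sh
+# Expect six matches, in the order listed above, once per boot:
+journalctl -u hypercache -b -o cat | \
+  grep -E 'hypercache-server starting|cluster join: node starting|dist HTTP listener started|(heartbeat|rebalance|hint replay) loop started'
+```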
+
+## Cold replica
+
+**What's happening.** A new replica is in the membership but its shards haven't been hydrated yet. Reads
+against keys whose primary is elsewhere succeed (replica forward), but reads against keys this node should own
+return misses until rebalance migrates them in.
+
+**What you'll see in logs.** No errors — this is normal. After the first `rebalance loop started` line, expect
+periodic rebalance ticks; each one bumps `dist.rebalance.batches`, visible at `/dist/metrics`.
+
+**Metrics to check.**
+
+- `dist.rebalance.batches` (counter) — incrementing means migration is happening.
+- `dist.rebalance.keys` (counter) — total keys migrated this process-lifetime.
+- `dist.rebalance.last_ns` (gauge) — duration of the last full scan. Compare to `WithDistRebalanceInterval` —
+ if scan duration exceeds the interval, you have a sustained backlog.
+
+**What to do.** Usually wait. If wait is unbounded, see
+[Rebalance under load](operations.md#failure-mode--rebalance-under-load).
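+
+To watch the migration converge without guessing the JSON field names (they come from the `DistMetrics`
+struct and may differ from the OTel `dist.*` names), filter the snapshot textually:
+
+```sh
+# Re-poll every 5s and eyeball the rebalance counters climbing:
+watch -n 5 'curl -s "http://<host>:8081/dist/metrics" | jq . | grep -i rebalance'
+```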
+
+## Heartbeat flapping
+
+**What's happening.** Peers cycle alive → suspect → alive every few ticks. Caller-side network jitter, an
+overloaded probe path, or a mis-tuned `WithDistHeartbeat` are the usual causes.
+
+**What you'll see in logs.**
+
+```text
+msg="peer marked suspect (timeout)" peer_id=cache-2 ...
+msg="peer probe refuted by indirect probe" peer_id=cache-2 ...
+msg="self-refuted suspect/dead claim from peer" ...
+```
+
+The third one is the recovery path — the suspected node observed itself being slandered and bumped its
+incarnation to refute. If you see it landing, the SWIM dance is working as designed.
+
+```text
+msg="peer pruned (dead)" peer_id=cache-2 ...
+msg="peer removed from membership" peer_addr=:7946 members_after=3
+```
+
+These two together mean a peer has been ejected — distinguish them from manual `RemovePeer` calls (which only
+emit the second line, with no preceding `pruned (dead)`).
+
+**Metrics to check.**
+
+- `dist.heartbeat.failure` (counter) climbing — direct probes are failing.
+- `dist.heartbeat.indirect_probe.refuted` (counter) — indirect probes are saving you from spurious flap.
+ Healthy if non-zero.
+- `dist.heartbeat.indirect_probe.failure` (counter) — indirect probes are failing too. The peer is genuinely
+ unreachable.
+- `dist.nodes.suspect` / `dist.nodes.dead` (gauges) — current cluster state.
+
+**What to do.** If `refuted` is climbing in step with `failure`, the system is self-correcting — extend
+`WithDistHeartbeat`'s `suspectAfter` / `deadAfter` if the flap is noisy. If `indirect_probe.failure` is also
+climbing, the peer is genuinely unreachable — see [replica loss](operations.md#failure-mode--replica-loss).
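+
+To see one peer's flap history in order (the peer id is taken from the example lines above), a grep like this
+keeps only the suspect/refute/prune transitions:
+
+```sh
+# One peer's membership transitions, oldest first:
+journalctl -u hypercache -o cat | grep -F 'peer_id=cache-2' | \
+  grep -E 'marked suspect|refuted|pruned'
+```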
+
+## Hint queue building
+
+**What's happening.** A peer is unreachable. Every replicated write to it gets queued as a hint, waiting for
+the peer to come back. The queue is bounded — see `WithDistHintMaxPerNode` / `WithDistHintMaxBytes`.
+
+**What you'll see in logs.**
+
+```text
+msg="rebalance migration forward failed; queued for hint replay" target_addr=... err=...
+msg="hint dropped after replay error" target_node=... err=...
+```
+
+The first is benign during a peer outage. The second means the peer came back but rejected the hint — auth
+mismatch, schema drift, or a truly bad value.
+
+**Metrics to check.**
+
+- `dist.hinted.bytes` (gauge) — climbing steadily, no drain → peer still down.
+- `dist.hinted.queued` (counter) — total ever queued; rising rate is the canary.
+- `dist.hinted.replayed` (counter) — climbs when the peer is reachable and the queue is draining.
+- `dist.hinted.global_dropped` (counter) — caps exceeded; hints are being silently dropped. Hard limit hit.
+- `dist.hinted.expired` (counter) — hints aged past `WithDistHintTTL`.
+
+**What to do.** See [Hint queue overflow](operations.md#failure-mode--hint-queue-overflow) for the full
+playbook. Short version: restore the peer, or remove it from membership and let hints expire.
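+
+A rough tally of hint-queue trouble over the last hour, using only the two log strings quoted above:
+
+```sh
+# Count occurrences of the two hint-trouble lines (grouped per distinct target/error):
+journalctl -u hypercache --since "1 hour ago" -o cat | \
+  grep -E 'queued for hint replay|hint dropped after replay error' | sort | uniq -c | sort -rn | head
+```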
+
+## Auth failures
+
+**What's happening.** A request hit the API or management port without an identity that satisfies the policy.
+
+**What you'll see in logs.** Auth failures are deliberately quiet (no "request denied" log per call — that
+would be a log-spam amplifier). Look for the `audit` line emitted by the management HTTP layer on denied
+access, and at `/dist/metrics` for `auth.*` counters if your build has them.
+
+**What to check first.**
+
+- `curl http://<host>:8081/v1/me -H 'Authorization: Bearer <token>'` → returns the resolved identity + scopes.
+ If this returns 401, the token itself is wrong; if it returns 200 with empty scopes, the token resolves but
+ lacks the scope the endpoint requires.
+- For OIDC tokens: `aud` and `iss` must match `HYPERCACHE_OIDC_AUDIENCE` / `HYPERCACHE_OIDC_ISSUER`. The
+ verifier rejects mismatches before any policy check runs.
+- For static bearers: the token must appear in the policy YAML (`HYPERCACHE_AUTH_CONFIG`) — confirm with
+  `curl http://<host>:8081/v1/me` using that exact token.
+
+**What to do.**
+
+1. Reproduce with `curl /v1/me` (definitive truth — same chain that the failing endpoint runs).
+1. If `/v1/me` returns 401: the token is rejected before reaching the scope check. Bearer mismatch, OIDC
+ expiry, or revoked cert.
+1. If `/v1/me` returns 200 but the original endpoint still 403s: the identity resolved but lacks the required
+ scope. Check the route's `Scopes` mapping (`management_http.go`); cross-reference against the identity's
+ `scopes` field in the `/v1/me` response.
+1. For OIDC token expiry specifically — `exp` is in the JWT payload;
+ `cut -d. -f2 | base64 -d | jq .exp` decodes it client-side.
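+
+The last step, fleshed out. This assumes the raw token sits in `$TOKEN`; the payload is base64url without
+padding, so it is re-padded before decoding:
+
+```sh
+# Decode the JWT payload client-side and show expiry, audience, issuer:
+printf '%s' "$TOKEN" | cut -d. -f2 \
+  | awk '{ n = length($0) % 4; if (n) $0 = $0 substr("====", 1, 4 - n); print }' \
+  | tr '_-' '/+' | base64 -d | jq '{exp, aud, iss}'
+```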
+
+## Eviction pressure
+
+**What's happening.** The cache is at or above capacity. Eviction is running on every tick, every `Set`
+triggers an immediate evict, and `Set` latency reflects the eviction cost.
+
+**What you'll see in logs.** With Info-level logging on, every tick that does work emits:
+
+```text
+msg="eviction tick" evicted=42 items_remaining=10000 elapsed=3.2ms
+```
+
+A sustained sequence of these (non-zero `evicted` on every tick) is the symptom. If you also see
+`eviction triggered source=manual`, something is calling `TriggerEviction` from application code.
+
+**Metrics to check.**
+
+- `eviction_loop_count` (counter) — how often the loop ran.
+- `item_evicted_count` (counter) — total items evicted.
+- `evicted_item_count` (gauge) — items evicted in the **last** tick. Sustained non-zero = under pressure.
+- `eviction_loop_duration` (timing) — tick latency. Climbing → eviction itself is the bottleneck.
+
+**What to do.**
+
+1. Raise capacity (`WithMaxCacheSize` or per-backend equivalent).
+1. Audit `Set` callers — is something setting keys with no TTL and no key reuse? Eviction is doing the work
+ TTL should.
+1. Switch eviction algorithm — `WithEvictionAlgorithm("lru")` vs `"lfu"` vs `"cawolfu"` have very different
+ working-set fit.
+1. Increase `WithEvictionShardCount` (default 32) — eviction contention is per-shard.
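+
+To gauge how hot eviction is before touching configuration, count the non-idle ticks (the exact
+`eviction tick` line quoted above; the idle variant logs at Debug with a different message):
+
+```sh
+# Non-idle eviction ticks in the last 10 minutes; a steadily non-zero count means sustained pressure:
+journalctl -u hypercache --since "10 minutes ago" -o cat | grep -Fc 'msg="eviction tick"'
+```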
+
+## Split-brain reconciliation
+
+**What's happening.** A partition healed. Both sides have writes the other doesn't.
+
+**What you'll see in logs.** During the partition: heartbeat failure logs (see
+[Heartbeat flapping](#heartbeat-flapping)). After healing: the merkle anti-entropy loop reconciles. No
+specific log line is emitted per resolved conflict — version-and-origin ordering is silent by design (it would
+log-spam under load).
+
+**Metrics to check.**
+
+- `dist.version.conflicts` (counter) — increments per detected divergence. Climbs after a heal, then
+ stabilizes.
+- `dist.merkle.last_diff_ns` (gauge) — duration of the last sync.
+- `dist.merkle.syncs` (counter) — successful merkle pulls.
+- `dist.merkle.keys_pulled` (counter) — keys reconciled.
+
+**What to do.** Usually wait. Auto-sync drains on its `WithDistMerkleAutoSync` interval. To force-trigger:
+
+```go
+err := dm.SyncWith(ctx, "peer-node-id")
+```
+
+The full discussion is in [Split-brain](operations.md#failure-mode--split-brain).
+
+## Drain not draining
+
+**What's happening.** You posted to `/dist/drain`, but `/health` still returns 200, or the load balancer is
+still routing.
+
+**What you'll see in logs.**
+
+```text
+msg="dist node draining"
+```
+
+If you see this line exactly once after the `POST /dist/drain`, the drain registered cache-side. From here:
+
+- `/health` returns **503** on every subsequent request.
+- New `Set` / `Remove` return `sentinel.ErrDraining` (HTTP 503).
+- `Get` continues to serve from cache.
+
+**What to do.**
+
+1. Confirm the drain line in the logs first. If absent, the request never reached the node — check the
+ management address you're POSTing to (`HYPERCACHE_MGMT_ADDR`, not the client API port).
+1. If the drain logged but `/health` still returns 200, you're probably hitting the wrong listener — `/health`
+ lives on both the client API and management ports, and only the latter respects drain. Confirm via
+   `curl -v http://<host>:8081/health` (mgmt) vs `:8080` (api).
+1. If `/health` correctly returns 503 but the LB still routes, that's a load-balancer problem, not a cache
+ problem. Check the LB's health-check cache TTL.
+
+Drain is one-way per process; restart to clear.
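+
+The whole check as one sequence; the host, the port split, and the bearer token are assumptions (drain
+typically sits behind the management listener's auth policy):
+
+```sh
+# Drain via the management port, then verify the health flip on both listeners:
+curl -sS -X POST "http://<host>:8081/dist/drain" -H "Authorization: Bearer $TOKEN"
+curl -s -o /dev/null -w 'mgmt /health -> %{http_code}\n' "http://<host>:8081/health"   # expect 503 after drain
+curl -s -o /dev/null -w 'api  /health -> %{http_code}\n' "http://<host>:8080/health"   # client API listener
+```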
+
+## Structured-logging reference
+
+Every log line the cache emits as of this writing, grouped by source. Use this as a grep dictionary.
+
+### Lifecycle (`hypercache-server`)
+
+| Message | When | Level |
+| ------------------------------------------------------------------- | -------------------------------- | ----- |
+| `hypercache-server starting` | binary boot, once | Info |
+| `hypercache-server running with no client API auth configured; ...` | misconfigured auth | Warn |
+| `shutdown signal received` | SIGINT/SIGTERM received | Info |
+| `hypercache-server stopped cleanly` | shutdown complete | Info |
+| `oidc verifier construction failed` | IdP unreachable at boot | Error |
+| `client API listener exited` | API port goroutine died | Error |
+| `hypercache construction failed` | wrapper init error | Error |
+| `client API construction failed` | server init error | Error |
+| `drain returned error` | drain attempt on shutdown failed | Warn |
+| `client API shutdown returned error` | graceful shutdown failed | Warn |
+| `hypercache stop returned error` | wrapper stop failed | Warn |
+
+### Wrapper loops (`HyperCache`)
+
+| Message | When | Level |
+| ---------------------------------------------- | --------------------------------------------- | ----- |
+| `eviction loop starting` | wrapper start, once if `evictionInterval > 0` | Info |
+| `eviction loop stopped` | context canceled or stop signal | Info |
+| `eviction tick` | tick did work (evicted > 0) | Info |
+| `eviction tick (idle)` | tick ran with nothing to evict | Debug |
+| `eviction triggered` | `TriggerEviction()` accepted | Info |
+| `eviction trigger coalesced (already pending)` | trigger arrived while one in-flight | Debug |
+| `expiration loop starting` | wrapper start, once | Info |
+| `expiration loop stopped` | context canceled or stop signal | Info |
+| `expiration tick` | tick removed expired items | Info |
+| `expiration tick (idle)` | tick ran with nothing expired | Debug |
+
+### DistMemory backend
+
+| Message | When | Level |
+| ------------------------------------------------------------ | ------------------------------------ | ----- |
+| `cluster join: node starting` | DistMemory constructor, once | Info |
+| `dist HTTP listener started` | peer transport bound | Info |
+| `dist HTTP listener bind failed` | port in use / permission denied | Error |
+| `dist HTTP serve goroutine exited` | transport listener stopped | Info |
+| `heartbeat loop started` | SWIM probe loop start | Info |
+| `gossip loop started` | gossip push loop start | Info |
+| `hint replay loop started` | hint drain loop start | Info |
+| `rebalance loop started` | ownership-migration loop start | Info |
+| `merkle auto-sync loop started` | anti-entropy loop start | Info |
+| `peer added to membership` | `AddPeer` accepted | Info |
+| `peer removed from membership` | `RemovePeer` or `peer pruned (dead)` | Info |
+| `peer marked suspect (timeout)` | direct probe failed | Warn |
+| `peer marked suspect (probe failed)` | probe error during SWIM | Info |
+| `peer probe refuted by indirect probe` | indirect probe rescued the peer | Warn |
+| `peer pruned (dead)` | suspect window exceeded; ejected | Warn |
+| `self-refuted suspect/dead claim from peer` | local incarnation bump | Info |
+| `gossip push failed` | gossip dispatch error | Warn |
+| `merkle sync fetch failed` | anti-entropy pull error | Warn |
+| `rebalance migration forward failed; queued for hint replay` | replication during rebalance failed | Warn |
+| `hint dropped after replay error` | hint replayed but peer rejected | Info |
+| `dist node draining` | `POST /dist/drain` accepted | Info |
+
+### Telemetry registration
+
+| Message | When | Level |
+| ------------------------------------------ | ------------------------------------ | ----- |
+| `dist meter: counter registration failed` | OTel meter binding error | Error |
+| `dist meter: gauge registration failed` | OTel meter binding error | Error |
+| `dist meter: callback registration failed` | OTel observable callback bind failed | Error |
+| `dist meter: callback unregister failed` | OTel meter teardown error | Error |
+
+## Quick filters
+
+```sh
+# All cluster-membership events for this node:
+journalctl -u hypercache -o cat | grep -E 'peer (added|removed|marked|pruned|probe)'
+
+# Background-loop health (every loop emits exactly one starting line per process):
+journalctl -u hypercache -o cat | grep -E 'loop start(ing|ed)'
+
+# Hint-queue trouble (replay errors, drops):
+journalctl -u hypercache -o cat | grep -F 'hint '
+
+# All Warns and Errors only:
+journalctl -u hypercache -o cat -p warning..err
+```
+
+## Going deeper
+
+For the design background:
+
+- [Distributed backend](distributed.md) — replication, hashing, membership.
+- [Operations runbook](operations.md) — long-form failure-mode playbooks. Each `#failure-mode-*` anchor
+ matches a symptom above.
+- [API reference](api.md) — REST surface served by the binary.
diff --git a/docs/operations.md b/docs/operations.md
index 1c4b933..e18bdf2 100644
--- a/docs/operations.md
+++ b/docs/operations.md
@@ -1,156 +1,127 @@
# Operations runbook — DistMemory
-This document is for operators running the `pkg/backend.DistMemory`
-distributed backend in production. It assumes the design background in
-[distributed.md](distributed.md). Sections are deliberately short — each
-one stands on its own and links to code.
+This document is for operators running the `pkg/backend.DistMemory` distributed backend in production. It
+assumes the design background in [distributed.md](distributed.md). Sections are deliberately short — each one
+stands on its own and links to code.
+
+
+!!! tip "Paged right now?"
+ Start at the [on-call cheatsheet](oncall.md). It maps a symptom (heartbeat flap, hint queue building,
+ auth failure, drain stuck) to the exact log lines and metrics to grep for, then back-links to the
+ relevant section here.
## At a glance
-| Concern | First place to look |
-|---|---|
-| Node not receiving traffic | `dist.members.alive`, `/health` |
-| Writes failing | `dist.write.quorum_failures`, `sentinel.ErrDraining`, `sentinel.ErrQuorumFailed` |
-| Replicas falling behind | `dist.hinted.queued`, `dist.hinted.replayed`, `dist.hinted.dropped` |
-| Bandwidth pressure | `DistHTTPLimits.CompressionThreshold` |
-| Spurious peer flapping | `dist.heartbeat.indirect_probe.refuted`, `WithDistIndirectProbes` |
-| Slow rebalance | `dist.rebalance.throttle`, `dist.rebalance.last_ns` |
-| Anti-entropy backlog | `dist.merkle.last_diff_ns`, `dist.auto_sync.last_ns` |
-
-Live metric values come from `DistMemory.Metrics()` (Go struct),
-`/dist/metrics` (JSON, when wrapped in `hypercache.HyperCache`), or
-the OpenTelemetry pipeline you wired via `WithDistMeterProvider`.
-The OTel names use the `dist.` prefix.
+| Concern | First place to look |
+| -------------------------- | -------------------------------------------------------------------------------- |
+| Node not receiving traffic | `dist.members.alive`, `/health` |
+| Writes failing | `dist.write.quorum_failures`, `sentinel.ErrDraining`, `sentinel.ErrQuorumFailed` |
+| Replicas falling behind | `dist.hinted.queued`, `dist.hinted.replayed`, `dist.hinted.dropped` |
+| Bandwidth pressure | `DistHTTPLimits.CompressionThreshold` |
+| Spurious peer flapping | `dist.heartbeat.indirect_probe.refuted`, `WithDistIndirectProbes` |
+| Slow rebalance | `dist.rebalance.throttle`, `dist.rebalance.last_ns` |
+| Anti-entropy backlog | `dist.merkle.last_diff_ns`, `dist.auto_sync.last_ns` |
+
+Live metric values come from `DistMemory.Metrics()` (Go struct), `/dist/metrics` (JSON, when wrapped in
+`hypercache.HyperCache`), or the OpenTelemetry pipeline you wired via `WithDistMeterProvider`. The OTel names
+use the `dist.` prefix.
## Wiring observability
Three opt-in entry points, all defaulting to no-op:
-- **Logging** — `backend.WithDistLogger(*slog.Logger)` routes background
- loops (heartbeat, hint replay, rebalance, merkle sync) and operational
- errors into your logger. Records are pre-bound with
+- **Logging** — `backend.WithDistLogger(*slog.Logger)` routes background loops (heartbeat, hint replay,
+ rebalance, merkle sync) and operational errors into your logger. Records are pre-bound with
`component=dist_memory` and `node_id=`.
-- **Tracing** — `backend.WithDistTracerProvider(trace.TracerProvider)`
- opens spans on `Get`/`Set`/`Remove` plus per-peer
- `dist.replicate.*` child spans. Cache key *values* are never put on
- spans (they can be PII); only `cache.key.length`.
-- **Metrics** — `backend.WithDistMeterProvider(metric.MeterProvider)`
- exposes every field on `DistMetrics` as an observable instrument.
+- **Tracing** — `backend.WithDistTracerProvider(trace.TracerProvider)` opens spans on `Get`/`Set`/`Remove`
+ plus per-peer `dist.replicate.*` child spans. Cache key _values_ are never put on spans (they can be PII);
+ only `cache.key.length`.
+- **Metrics** — `backend.WithDistMeterProvider(metric.MeterProvider)` exposes every field on `DistMetrics` as
+ an observable instrument.
-Wire all three to the same `otel.SetTracerProvider` /
-`otel.SetMeterProvider` your application uses; the logger inherits via
-`slog.Default()` if you want a one-liner.
+Wire all three to the same `otel.SetTracerProvider` / `otel.SetMeterProvider` your application uses; the
+logger inherits via `slog.Default()` if you want a one-liner.
## Failure mode — split-brain
-**Symptom.** Two subsets of the cluster lose connectivity to each
-other. Each subset elects local primaries for the keys it owns.
-Writes from clients on subset A land on A-side primaries; writes from
-B-side clients land on B-side primaries. When the partition heals, the
-versions diverge.
+**Symptom.** Two subsets of the cluster lose connectivity to each other. Each subset elects local primaries
+for the keys it owns. Writes from clients on subset A land on A-side primaries; writes from B-side clients
+land on B-side primaries. When the partition heals, the versions diverge.
-**Detection.** `dist.heartbeat.failure` rises on both sides during the
-partition. After healing, `dist.version.conflicts` increments as
-anti-entropy reconciles.
+**Detection.** `dist.heartbeat.failure` rises on both sides during the partition. After healing,
+`dist.version.conflicts` increments as anti-entropy reconciles.
-**Resolution.** DistMemory uses last-write-wins by `(version, origin)`
-ordering — the higher version wins, ties broken by origin string. This
-is automatic. Anti-entropy via `SyncWith` (manual) or
-`WithDistMerkleAutoSync` (background) closes the gap. There is no
-manual reconciliation step today.
+**Resolution.** DistMemory uses last-write-wins by `(version, origin)` ordering — the higher version wins,
+ties broken by origin string. This is automatic. Anti-entropy via `SyncWith` (manual) or
+`WithDistMerkleAutoSync` (background) closes the gap. There is no manual reconciliation step today.
-**Mitigation.** Run an odd number of nodes with quorum writes
-(`WithDistWriteConsistency(ConsistencyQuorum)`); a partition that
-isolates a minority leaves only the majority side accepting writes
-because the minority cannot reach quorum. The minority returns
-`ErrQuorumFailed` (`sentinel.ErrQuorumFailed`) on Set.
+**Mitigation.** Run an odd number of nodes with quorum writes (`WithDistWriteConsistency(ConsistencyQuorum)`);
+a partition that isolates a minority leaves only the majority side accepting writes because the minority
+cannot reach quorum. The minority returns `ErrQuorumFailed` (`sentinel.ErrQuorumFailed`) on Set.
## Failure mode — hint queue overflow
-**Symptom.** A peer is unreachable for a long time. Every replicated
-write to that peer turns into a queued hint. Eventually the queue
-hits `WithDistHintMaxPerNode` or `WithDistHintMaxBytes` and new hints
-get dropped.
+**Symptom.** A peer is unreachable for a long time. Every replicated write to that peer turns into a queued
+hint. Eventually the queue hits `WithDistHintMaxPerNode` or `WithDistHintMaxBytes` and new hints get dropped.
-**Detection.** `dist.hinted.bytes` (gauge) climbs steadily.
-`dist.hinted.global_dropped` increments when caps are exceeded.
-`dist.hinted.dropped` (a different metric — replay errors) also rises
-if the peer is reachable but rejecting writes (auth, schema mismatch).
+**Detection.** `dist.hinted.bytes` (gauge) climbs steadily. `dist.hinted.global_dropped` increments when caps
+are exceeded. `dist.hinted.dropped` (a different metric — replay errors) also rises if the peer is reachable
+but rejecting writes (auth, schema mismatch).
**Resolution.**
-1. Restore the unreachable peer; the replay loop drains automatically
- (`dist.hinted.replayed` rises).
-1. If the peer is permanently gone, remove it from membership
- (`DistMemory.RemovePeer(addr)`); queued hints expire on the
- `WithDistHintTTL` timer.
-1. If hints are dropping faster than they replay, raise
- `WithDistHintMaxPerNode` / `WithDistHintMaxBytes` — but understand
- that the cap exists to bound process memory under sustained
- failure. Raising it without fixing the underlying peer just delays
- the bound.
-
-**Phase B note.** Migration failures during rebalance now also funnel
-through the hint queue (Phase B.2). A surge in `dist.hinted.queued`
-during a rolling deploy is expected; it should drain as the new node
-becomes reachable.
+1. Restore the unreachable peer; the replay loop drains automatically (`dist.hinted.replayed` rises).
+1. If the peer is permanently gone, remove it from membership (`DistMemory.RemovePeer(addr)`); queued hints
+ expire on the `WithDistHintTTL` timer.
+1. If hints are dropping faster than they replay, raise `WithDistHintMaxPerNode` / `WithDistHintMaxBytes` —
+ but understand that the cap exists to bound process memory under sustained failure. Raising it without
+ fixing the underlying peer just delays the bound.
+
+**Phase B note.** Migration failures during rebalance now also funnel through the hint queue (Phase B.2). A
+surge in `dist.hinted.queued` during a rolling deploy is expected; it should drain as the new node becomes
+reachable.
## Failure mode — rebalance under load
-**Symptom.** Adding a node triggers a rebalance scan that migrates
-keys to their new primary. Under sustained write load the migration
-saturates and `dist.rebalance.throttle` increments — batches queue
-behind the configured concurrency cap.
+**Symptom.** Adding a node triggers a rebalance scan that migrates keys to their new primary. Under sustained
+write load the migration saturates and `dist.rebalance.throttle` increments — batches queue behind the
+configured concurrency cap.
-**Detection.** `dist.rebalance.last_ns` (gauge — last full scan
-duration) climbs. `dist.rebalance.throttle` (counter) increments when
-the concurrency limit blocks a batch dispatch. `dist.rebalance.batches`
-should still climb steadily.
+**Detection.** `dist.rebalance.last_ns` (gauge — last full scan duration) climbs. `dist.rebalance.throttle`
+(counter) increments when the concurrency limit blocks a batch dispatch. `dist.rebalance.batches` should still
+climb steadily.
**Resolution.**
-1. Raise `WithDistRebalanceMaxConcurrent` (default 1) if CPU and
- network headroom allow.
-1. Lower `WithDistRebalanceBatchSize` (default 64) so individual
- batches finish faster and concurrency slots cycle more often —
- counter-intuitively, smaller batches sometimes throughput-win.
-1. Pause writes (drain a subset of clients via your LB) until the
- scan finishes. The dist backend has no built-in
- write-throttling — that's the application's job.
-
-**Phase C note.** Drain (`POST /dist/drain`) does *not* trigger an
-expedited rebalance today; the next scheduled
-`WithDistRebalanceInterval` tick does the work. If you need to force
-a faster ownership transfer, call `Stop` after Drain to cancel
-in-flight work and let restart-time rebalance handle migration.
+1. Raise `WithDistRebalanceMaxConcurrent` (default 1) if CPU and network headroom allow.
+1. Lower `WithDistRebalanceBatchSize` (default 64) so individual batches finish faster and concurrency slots
+   cycle more often — counter-intuitively, smaller batches sometimes win on throughput.
+1. Pause writes (drain a subset of clients via your LB) until the scan finishes. The dist backend has no
+ built-in write-throttling — that's the application's job.
+
+**Phase C note.** Drain (`POST /dist/drain`) does _not_ trigger an expedited rebalance today; the next
+scheduled `WithDistRebalanceInterval` tick does the work. If you need to force a faster ownership transfer,
+call `Stop` after Drain to cancel in-flight work and let restart-time rebalance handle migration.
## Failure mode — replica loss
-**Symptom.** A replica node dies hard (kernel panic, hardware
-failure). Its keys still have other replicas (when `replication >= 2`),
-but until membership notices, writes try to fan out to it and
-silently retry via the hint queue.
+**Symptom.** A replica node dies hard (kernel panic, hardware failure). Its keys still have other replicas
+(when `replication >= 2`), but until membership notices, writes try to fan out to it and silently retry via
+the hint queue.
-**Detection.** `dist.heartbeat.failure` increments steadily for the
-lost peer. After `WithDistHeartbeat`'s `deadAfter` window, the peer
-is pruned (`dist.nodes.removed` increments) and ring lookups stop
-including it.
+**Detection.** `dist.heartbeat.failure` increments steadily for the lost peer. After `WithDistHeartbeat`'s
+`deadAfter` window, the peer is pruned (`dist.nodes.removed` increments) and ring lookups stop including it.
**Resolution.**
-1. Wait for the heartbeat to detect the dead peer. With default
- timing, this is on the order of seconds.
-1. Spin up a replacement node with the same membership (or let
- gossip discover it).
-1. The new node's rebalance scan pulls its assigned keys from
- surviving replicas via Merkle anti-entropy.
+1. Wait for the heartbeat to detect the dead peer. With default timing, this is on the order of seconds.
+1. Spin up a replacement node with the same membership (or let gossip discover it).
+1. The new node's rebalance scan pulls its assigned keys from surviving replicas via Merkle anti-entropy.
-**Indirect probes.** `WithDistIndirectProbes(k, timeout)` filters
-caller-side network blips that would otherwise mark a healthy peer
-suspect. `dist.heartbeat.indirect_probe.refuted` rising indicates
-indirect probes are saving you from spurious flapping; rising
-`dist.heartbeat.indirect_probe.failure` indicates the peer is
-genuinely unreachable from multiple vantage points.
+**Indirect probes.** `WithDistIndirectProbes(k, timeout)` filters caller-side network blips that would
+otherwise mark a healthy peer suspect. `dist.heartbeat.indirect_probe.refuted` rising indicates indirect
+probes are saving you from spurious flapping; rising `dist.heartbeat.indirect_probe.failure` indicates the
+peer is genuinely unreachable from multiple vantage points.
## Operational tasks
@@ -187,28 +158,24 @@ curl 'http://node-A:8080/internal/keys?cursor=1'
err := dm.SyncWith(ctx, "node-B")
```
-`WithDistMerkleAutoSync(interval)` runs this on a timer; manual calls
-are useful for debugging.
+`WithDistMerkleAutoSync(interval)` runs this on a timer; manual calls are useful for debugging.
## Capacity planning notes
-- Each shard mutex is independent — write throughput scales with
- shard count up to CPU saturation.
-- Hint queue memory is approximately `HintedBytes` + 64 bytes of
- bookkeeping per queued hint. Cap via `WithDistHintMaxBytes` to
- bound total process memory under partition.
-- Merkle tree storage scales O(N/chunk) for N keys at
- `WithDistMerkleChunkSize` (default 128). For a million keys, the
- default chunk gives ~8K leaf hashes per node — negligible.
-- Replication factor 3 with quorum reads/writes tolerates 1 failure;
- raise to 5 for tolerating 2 failures, at 5× the storage cost.
+- Each shard mutex is independent — write throughput scales with shard count up to CPU saturation.
+- Hint queue memory is approximately `HintedBytes` + 64 bytes of bookkeeping per queued hint. Cap via
+ `WithDistHintMaxBytes` to bound total process memory under partition.
+- Merkle tree storage scales O(N/chunk) for N keys at `WithDistMerkleChunkSize` (default 128). For a million
+ keys, the default chunk gives ~8K leaf hashes per node — negligible.
+- Replication factor 3 with quorum reads/writes tolerates 1 failure; raise to 5 for tolerating 2 failures, at
+ 5× the storage cost.
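+
+A back-of-envelope for the merkle bullet, using the defaults quoted above:
+
+```sh
+# leaf hashes per node ≈ total keys / WithDistMerkleChunkSize (default 128)
+echo $((1000000 / 128))   # 7812, the ~8K figure quoted above
+```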
## Where things are
-| Concern | File |
-|---|---|
-| Public surface | [pkg/backend/dist_memory.go](../pkg/backend/dist_memory.go) |
-| Transport interface | [pkg/backend/dist_transport.go](../pkg/backend/dist_transport.go) |
-| HTTP transport | [pkg/backend/dist_http_transport.go](../pkg/backend/dist_http_transport.go) |
-| HTTP server | [pkg/backend/dist_http_server.go](../pkg/backend/dist_http_server.go) |
-| Membership / ring | [internal/cluster/](../internal/cluster) |
+| Concern | File |
+| ------------------- | --------------------------------------------------------------------------- |
+| Public surface | [pkg/backend/dist_memory.go](../pkg/backend/dist_memory.go) |
+| Transport interface | [pkg/backend/dist_transport.go](../pkg/backend/dist_transport.go) |
+| HTTP transport | [pkg/backend/dist_http_transport.go](../pkg/backend/dist_http_transport.go) |
+| HTTP server | [pkg/backend/dist_http_server.go](../pkg/backend/dist_http_server.go) |
+| Membership / ring | [internal/cluster/](../internal/cluster) |