From 3d8f410a0581887057436b2208f9c971858a3650 Mon Sep 17 00:00:00 2001 From: "F." Date: Mon, 11 May 2026 13:04:49 +0200 Subject: [PATCH] docs: add on-call cheatsheet and reformat operations docs MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Add docs/oncall.md — a symptom-to-action triage guide for paged operators covering node startup, cold replicas, heartbeat flapping, hint queue buildup, auth failures, eviction pressure, split-brain reconciliation, and stuck drains. Includes a structured-logging grep dictionary and quick journalctl filters. Cross-link the cheatsheet from the docs index (new card) and the operations runbook (tip callout). Reformat both index.md and operations.md for consistent line wrapping and aligned markdown tables. Update cspell dictionary with new terms. --- cspell.config.yaml | 2 + docs/index.md | 49 +++---- docs/oncall.md | 346 +++++++++++++++++++++++++++++++++++++++++++++ docs/operations.md | 231 +++++++++++++----------------- 4 files changed, 469 insertions(+), 159 deletions(-) create mode 100644 docs/oncall.md diff --git a/cspell.config.yaml b/cspell.config.yaml index cdf2e9a..122ac19 100644 --- a/cspell.config.yaml +++ b/cspell.config.yaml @@ -59,6 +59,7 @@ words: - Cbor - cespare - chans + - cheatsheet - cmap - Cmder - codacy @@ -142,6 +143,7 @@ words: - ints - ireturn - Itemm + - journalctl - keyf - keypair - lamport diff --git a/docs/index.md b/docs/index.md index 9f74b3c..11e2c24 100644 --- a/docs/index.md +++ b/docs/index.md @@ -6,9 +6,8 @@ hide: # HyperCache -Distributed in-memory cache for Go. Sharded for concurrency, replicated for -durability under partial failure, observable from the start, and shipped as -both a library and a single-binary HTTP service. +Distributed in-memory cache for Go. Sharded for concurrency, replicated for durability under partial failure, +observable from the start, and shipped as both a library and a single-binary HTTP service.
@@ -16,19 +15,21 @@ both a library and a single-binary HTTP service. - :material-server-network: **[5-Node Cluster](cluster.md)** — boot a real cluster with `docker compose`. - :fontawesome-brands-kubernetes: **[Helm Chart](helm.md)** — deploy on Kubernetes with stable identities. - :material-tools: **[Operations Runbook](operations.md)** — split-brain, hint queues, drain, capacity. +- :material-bell-alert: **[On-call Cheatsheet](oncall.md)** — symptom → log grep → metric → action for paged + operators.
## Why HyperCache -| | What you get | Why it matters | -|---|---|---| -| **Sharded by default** | 32 per-shard mutexes routed by xxhash | Write throughput scales with cores, no global lock. | -| **Distributed backend** | Consistent hashing, configurable replication, quorum reads/writes | A single failed node does not lose keys. | -| **Hinted handoff** | Failed forwards queue with TTL, replay on the dist HTTP transport | Transient peer outages don't drop replicas. | -| **SWIM heartbeat** | Direct + indirect probes; self-refute via incarnation gossip | Filters caller-side network blips, recovers from false suspicion. | -| **Observable** | `slog` logger + OpenTelemetry tracing + OpenTelemetry metrics, all opt-in | Plug into your existing pipeline, no extra deps. | -| **Operator-friendly** | `Drain` endpoint, cursor-paged key enumeration, JSON error envelopes | Designed for rolling deploys and on-call clarity. | +| | What you get | Why it matters | +| ----------------------- | ------------------------------------------------------------------------- | ----------------------------------------------------------------- | +| **Sharded by default** | 32 per-shard mutexes routed by xxhash | Write throughput scales with cores, no global lock. | +| **Distributed backend** | Consistent hashing, configurable replication, quorum reads/writes | A single failed node does not lose keys. | +| **Hinted handoff** | Failed forwards queue with TTL, replay on the dist HTTP transport | Transient peer outages don't drop replicas. | +| **SWIM heartbeat** | Direct + indirect probes; self-refute via incarnation gossip | Filters caller-side network blips, recovers from false suspicion. | +| **Observable** | `slog` logger + OpenTelemetry tracing + OpenTelemetry metrics, all opt-in | Plug into your existing pipeline, no extra deps. | +| **Operator-friendly** | `Drain` endpoint, cursor-paged key enumeration, JSON error envelopes | Designed for rolling deploys and on-call clarity. | ## How it fits together @@ -57,26 +58,20 @@ flowchart LR Shard1 <-.HTTP replicate.-> Peer1 ``` -The `HyperCache` wrapper is a thin facade you embed in your application. -The `DistMemory` backend handles sharding, replication, and the cluster -plane. Two HTTP listeners run per process: a peer-to-peer one for -replication and gossip, and a separate management one for admin and -observability. +The `HyperCache` wrapper is a thin facade you embed in your application. The `DistMemory` backend handles +sharding, replication, and the cluster plane. Two HTTP listeners run per process: a peer-to-peer one for +replication and gossip, and a separate management one for admin and observability. ## Two ways to use it -**As a library** — embed `HyperCache` directly in your Go application; it -uses the in-memory or distributed backend in-process. See -[Quickstart](quickstart.md). +**As a library** — embed `HyperCache` directly in your Go application; it uses the in-memory or distributed +backend in-process. See [Quickstart](quickstart.md). -**As a service** — run the [`hypercache-server`](server.md) binary; clients -talk to it over a REST API. See [5-Node Cluster](cluster.md) for the -docker-compose recipe and [Helm Chart](helm.md) for Kubernetes. +**As a service** — run the [`hypercache-server`](server.md) binary; clients talk to it over a REST API. See +[5-Node Cluster](cluster.md) for the docker-compose recipe and [Helm Chart](helm.md) for Kubernetes. 
## Project status

-The distributed backend is production-ready as of v0.6.0 — see the
-[changelog](changelog.md) for the full list of features and fixes that
-landed during the productionization push (Phases A through E in the
-upstream history). Operations procedures live in the
-[runbook](operations.md).
+The distributed backend is production-ready as of v0.6.0 — see the [changelog](changelog.md) for the full list
+of features and fixes that landed during the productionization push (Phases A through E in the upstream
+history). Operations procedures live in the [runbook](operations.md).
diff --git a/docs/oncall.md b/docs/oncall.md
new file mode 100644
index 0000000..37dcea4
--- /dev/null
+++ b/docs/oncall.md
@@ -0,0 +1,346 @@
---
title: On-call cheatsheet
description: Symptom → log grep → metric → action map for HyperCache operators.
---

# On-call cheatsheet

You got paged. This page exists to take you from a symptom to a diagnosis in under sixty seconds. Each section
is a single failure shape: what you'll see, where to look, what to do next. Deeper operating procedures live
in the [operations runbook](operations.md); start here, descend there.

Every log line quoted below is a real string the binary emits — copy them into `grep -F` directly. Every metric
name is from `DistMemory.Metrics()` and its OTel mirror (`dist.*`) or from the wrapper-level `StatsCollector`.

## Triage matrix

| Symptom | Likely cause | Jump to |
| ------------------------------------------------ | -------------------------------------------------- | ---------------------------------------------------------- |
| Node won't start / never appears in cluster | bind failure, bad config, OIDC issuer unreachable | [Node startup](#node-startup) |
| Cluster has the right members but cache is empty | new node still rebalancing in | [Cold replica](#cold-replica) |
| Peers flapping in `/cluster/members` | network jitter, indirect probes failing | [Heartbeat flapping](#heartbeat-flapping) |
| Hints building up faster than they drain | one peer unreachable or rejecting writes | [Hint queue](#hint-queue-building) |
| 401 / 403 on requests that should work | misconfigured token, missing scope, OIDC expired | [Auth failures](#auth-failures) |
| Eviction running hot, latency spiking on Set | cache at capacity, eviction can't keep up | [Eviction pressure](#eviction-pressure) |
| Replicas diverging | partition healed, version conflicts | [Split-brain reconciliation](#split-brain-reconciliation) |
| Drain stuck / load balancer still routing | `/health` not flipping or LB caching | [Drain not draining](#drain-not-draining) |

## Node startup

**What you'll see (good).** Each of these lines appears exactly once per node, in order, on every boot:

```text
msg="hypercache-server starting" api_addr=:8080 mgmt_addr=:8081 dist_addr=:7946 oidc_enabled=true
msg="cluster join: node starting" node_id=cache-0 replication=3 virtual_nodes=128 peers_known=4
msg="dist HTTP listener started" addr=:7946
msg="heartbeat loop started" interval=1s
msg="rebalance loop started" interval=30s
msg="hint replay loop started" interval=15s
```

If you see all six lines, the node has bound its ports, advertised itself to peers, and started its background
loops. Everything after this point is steady-state.

**What you'll see (bad).**

- `msg="dist HTTP listener bind failed"` — another process is already bound to `HYPERCACHE_DIST_ADDR`. Check
  for a stale pod / process on the host.
+- `msg="oidc verifier construction failed"` — IdP discovery URL unreachable from the pod. Check + `HYPERCACHE_OIDC_ISSUER`, DNS, and egress firewall rules. The process exits with code 1 (so the orchestrator + will restart it; check `kubectl describe pod` for the loop). +- No `cluster join` line at all — the binary crashed before `buildHyperCache` returned. Look earlier in the + log for `hypercache construction failed` with `err=...`. + +**Metrics to check.** `dist.members.alive` (gauge) on every other node should tick up by one within +`WithDistHeartbeat`'s `aliveAfter` window. `dist.membership.version` increments on each membership change, so +it also bumps once per peer that learns about the new node. + +## Cold replica + +**What's happening.** A new replica is in the membership but its shards haven't been hydrated yet. Reads +against keys whose primary is elsewhere succeed (replica forward), but reads against keys this node should own +return misses until rebalance migrates them in. + +**What you'll see in logs.** No errors — this is normal. After the first `rebalance loop started` line, expect +periodic ticks `rebalance.batches` increments visible at `/dist/metrics`. + +**Metrics to check.** + +- `dist.rebalance.batches` (counter) — incrementing means migration is happening. +- `dist.rebalance.keys` (counter) — total keys migrated this process-lifetime. +- `dist.rebalance.last_ns` (gauge) — duration of the last full scan. Compare to `WithDistRebalanceInterval` — + if scan duration exceeds the interval, you have a sustained backlog. + +**What to do.** Usually wait. If wait is unbounded, see +[Rebalance under load](operations.md#failure-mode--rebalance-under-load). + +## Heartbeat flapping + +**What's happening.** Peers cycle alive → suspect → alive every few ticks. Caller-side network jitter, an +overloaded probe path, or a mis-tuned `WithDistHeartbeat` are the usual causes. + +**What you'll see in logs.** + +```text +msg="peer marked suspect (timeout)" peer_id=cache-2 ... +msg="peer probe refuted by indirect probe" peer_id=cache-2 ... +msg="self-refuted suspect/dead claim from peer" ... +``` + +The third one is the recovery path — the suspected node observed itself being slandered and bumped its +incarnation to refute. If you see it landing, the SWIM dance is working as designed. + +```text +msg="peer pruned (dead)" peer_id=cache-2 ... +msg="peer removed from membership" peer_addr=:7946 members_after=3 +``` + +These two together mean a peer has been ejected — distinguish them from manual `RemovePeer` calls (which only +emit the second line, with no preceding `pruned (dead)`). + +**Metrics to check.** + +- `dist.heartbeat.failure` (counter) climbing — direct probes are failing. +- `dist.heartbeat.indirect_probe.refuted` (counter) — indirect probes are saving you from spurious flap. + Healthy if non-zero. +- `dist.heartbeat.indirect_probe.failure` (counter) — indirect probes also fail. The peer is genuinely + unreachable. +- `dist.nodes.suspect` / `dist.nodes.dead` (gauges) — current cluster state. + +**What to do.** If `refuted` is climbing in step with `failure`, the system is self-correcting — extend +`WithDistHeartbeat`'s `suspectAfter` / `deadAfter` if the flap is noisy. If `indirect_probe.failure` is also +climbing, the peer is genuinely unreachable — see [replica loss](operations.md#failure-mode--replica-loss). + +## Hint queue building + +**What's happening.** A peer is unreachable. Every replicated write to it gets queued as a hint, waiting for +the peer to come back. 
The queue is bounded — see `WithDistHintMaxPerNode` / `WithDistHintMaxBytes`.

**What you'll see in logs.**

```text
msg="rebalance migration forward failed; queued for hint replay" target_addr=... err=...
msg="hint dropped after replay error" target_node=... err=...
```

The first is benign during a peer outage. The second means the peer came back but rejected the hint — auth
mismatch, schema drift, or a truly bad value.

**Metrics to check.**

- `dist.hinted.bytes` (gauge) — climbing steadily, no drain → peer still down.
- `dist.hinted.queued` (counter) — total ever queued; rising rate is the canary.
- `dist.hinted.replayed` (counter) — climbs when the peer is reachable and the queue is draining.
- `dist.hinted.global_dropped` (counter) — caps exceeded; hints are being silently dropped. Hard limit hit.
- `dist.hinted.expired` (counter) — hints aged past `WithDistHintTTL`.

**What to do.** See [Hint queue overflow](operations.md#failure-mode--hint-queue-overflow) for the full
playbook. Short version: restore the peer, or remove it from membership and let hints expire.

## Auth failures

**What's happening.** A request hit the API or management port without an identity that satisfies the policy.

**What you'll see in logs.** Auth failures are deliberately quiet (no "request denied" log per call — that
would be a log-spam amplifier). Look for the `audit` line emitted by the management HTTP layer on denied
access, and at `/dist/metrics` for `auth.*` counters if your build has them.

**What to check first.**

- `curl http://<node>:8081/v1/me -H 'Authorization: Bearer <token>'` → returns the resolved identity and its
  scopes. If this returns 401, the token itself is wrong; if it returns 200 with empty scopes, the token
  resolves but lacks the scope the endpoint requires.
- For OIDC tokens: `aud` and `iss` must match `HYPERCACHE_OIDC_AUDIENCE` / `HYPERCACHE_OIDC_ISSUER`. The
  verifier rejects mismatches before any policy check runs.
- For static bearers: the token must appear in the policy YAML (`HYPERCACHE_AUTH_CONFIG`) — confirm with
  `curl http://<node>:8081/v1/me` using that exact token.

**What to do.**

1. Reproduce with `curl /v1/me` (definitive truth — same chain that the failing endpoint runs).
1. If `/v1/me` returns 401: the token is rejected before reaching the scope check. Bearer mismatch, OIDC
   expiry, or revoked cert.
1. If `/v1/me` returns 200 but the original endpoint still 403s: the identity resolved but lacks the required
   scope. Check the route's `Scopes` mapping (`management_http.go`); cross-reference against the identity's
   `scopes` field in the `/v1/me` response.
1. For OIDC token expiry specifically — `exp` is in the JWT payload;
   `cut -d. -f2 | base64 -d | jq .exp` decodes it client-side.

## Eviction pressure

**What's happening.** The cache is at or above capacity. Eviction is running on every tick, every `Set`
triggers an immediate evict, and `Set` latency reflects the eviction cost.

**What you'll see in logs.** With Info-level logging on, every tick that does work emits:

```text
msg="eviction tick" evicted=42 items_remaining=10000 elapsed=3.2ms
```

A sustained sequence of these (non-zero `evicted` on every tick) is the symptom. If you also see
`eviction triggered source=manual`, something is calling `TriggerEviction` from application code.

**Metrics to check.**

- `eviction_loop_count` (counter) — how often the loop ran.
- `item_evicted_count` (counter) — total items evicted.
- `evicted_item_count` (gauge) — items evicted in the **last** tick. Sustained non-zero = under pressure.
- `eviction_loop_duration` (timing) — tick latency. Climbing → eviction itself is the bottleneck.

**What to do.**

1. Raise capacity (`WithMaxCacheSize` or per-backend equivalent).
1. Audit `Set` callers — is something setting keys with no TTL and no key reuse? Eviction is doing the work
   TTL should.
1. Switch eviction algorithm — `WithEvictionAlgorithm("lru")` vs `"lfu"` vs `"cawolfu"` have very different
   working-set fit.
1. Increase `WithEvictionShardCount` (default 32) — eviction contention is per-shard.

## Split-brain reconciliation

**What's happening.** A partition healed. Both sides have writes the other doesn't.

**What you'll see in logs.** During the partition: heartbeat failure logs (see
[Heartbeat flapping](#heartbeat-flapping)). After healing: the merkle anti-entropy loop reconciles. No
specific log line is emitted per resolved conflict — version-and-origin ordering is silent by design (it would
log-spam under load).

**Metrics to check.**

- `dist.version.conflicts` (counter) — increments per detected divergence. Climbs after a heal, then
  stabilizes.
- `dist.merkle.last_diff_ns` (gauge) — duration of the last sync.
- `dist.merkle.syncs` (counter) — successful merkle pulls.
- `dist.merkle.keys_pulled` (counter) — keys reconciled.

**What to do.** Usually wait. Auto-sync drains on its `WithDistMerkleAutoSync` interval. To force-trigger:

```go
err := dm.SyncWith(ctx, "peer-node-id")
```

The full discussion is in [Split-brain](operations.md#failure-mode--split-brain).

## Drain not draining

**What's happening.** You posted to `/dist/drain`, but `/health` still returns 200, or the load balancer is
still routing.

**What you'll see in logs.**

```text
msg="dist node draining"
```

If you see this line exactly once after the `POST /dist/drain`, the drain registered cache-side. From here:

- `/health` returns **503** on every subsequent request.
- New `Set` / `Remove` return `sentinel.ErrDraining` (HTTP 503).
- `Get` continues to serve from cache.

**What to do.**

1. Confirm the drain line in the logs first. If absent, the request never reached the node — check the
   management address you're POSTing to (`HYPERCACHE_MGMT_ADDR`, not the client API port).
1. If the drain logged but `/health` still returns 200, you're probably hitting the wrong listener — `/health`
   lives on both the client API and management ports, and only the latter respects drain. Confirm via
   `curl -v http://<node>:8081/health` (mgmt) vs `:8080` (api).
1. If `/health` correctly returns 503 but the LB still routes, that's a load-balancer problem, not a cache
   problem. Check the LB's health-check cache TTL.

Drain is one-way per process; restart to clear.

## Structured-logging reference

Every log line the cache emits as of this writing, grouped by source. Use this as a grep dictionary.
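The `msg="..." key=value` shape quoted throughout this page is the form `slog`'s text handler produces. Below is a minimal sketch of that shape, useful for predicting exactly what `grep -F` will match; the handler choice and the sample attribute values are illustrative, and the `component` / `node_id` pre-binding on DistMemory records is the one described under "Wiring observability" in the [operations runbook](operations.md).

```go
package main

import (
	"log/slog"
	"os"
)

func main() {
	// slog's text handler emits the msg="..." key=value form quoted in the tables below.
	logger := slog.New(slog.NewTextHandler(os.Stdout, nil))

	// DistMemory pre-binds component=dist_memory and node_id on every record
	// (see "Wiring observability" in the runbook), so a suspect-probe line looks like:
	logger.With("component", "dist_memory", "node_id", "cache-0").
		Warn("peer marked suspect (timeout)", "peer_id", "cache-2")
	// time=... level=WARN msg="peer marked suspect (timeout)" component=dist_memory node_id=cache-0 peer_id=cache-2
}
```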
+ +### Lifecycle (`hypercache-server`) + +| Message | When | Level | +| ------------------------------------------------------------------- | -------------------------------- | ----- | +| `hypercache-server starting` | binary boot, once | Info | +| `hypercache-server running with no client API auth configured; ...` | misconfigured auth | Warn | +| `shutdown signal received` | SIGINT/SIGTERM received | Info | +| `hypercache-server stopped cleanly` | shutdown complete | Info | +| `oidc verifier construction failed` | IdP unreachable at boot | Error | +| `client API listener exited` | API port goroutine died | Error | +| `hypercache construction failed` | wrapper init error | Error | +| `client API construction failed` | server init error | Error | +| `drain returned error` | drain attempt on shutdown failed | Warn | +| `client API shutdown returned error` | graceful shutdown failed | Warn | +| `hypercache stop returned error` | wrapper stop failed | Warn | + +### Wrapper loops (`HyperCache`) + +| Message | When | Level | +| ---------------------------------------------- | --------------------------------------------- | ----- | +| `eviction loop starting` | wrapper start, once if `evictionInterval > 0` | Info | +| `eviction loop stopped` | context canceled or stop signal | Info | +| `eviction tick` | tick did work (evicted > 0) | Info | +| `eviction tick (idle)` | tick ran with nothing to evict | Debug | +| `eviction triggered` | `TriggerEviction()` accepted | Info | +| `eviction trigger coalesced (already pending)` | trigger arrived while one in-flight | Debug | +| `expiration loop starting` | wrapper start, once | Info | +| `expiration loop stopped` | context canceled or stop signal | Info | +| `expiration tick` | tick removed expired items | Info | +| `expiration tick (idle)` | tick ran with nothing expired | Debug | + +### DistMemory backend + +| Message | When | Level | +| ------------------------------------------------------------ | ------------------------------------ | ----- | +| `cluster join: node starting` | DistMemory constructor, once | Info | +| `dist HTTP listener started` | peer transport bound | Info | +| `dist HTTP listener bind failed` | port in use / permission denied | Error | +| `dist HTTP serve goroutine exited` | transport listener stopped | Info | +| `heartbeat loop started` | SWIM probe loop start | Info | +| `gossip loop started` | gossip push loop start | Info | +| `hint replay loop started` | hint drain loop start | Info | +| `rebalance loop started` | ownership-migration loop start | Info | +| `merkle auto-sync loop started` | anti-entropy loop start | Info | +| `peer added to membership` | `AddPeer` accepted | Info | +| `peer removed from membership` | `RemovePeer` or `peer pruned (dead)` | Info | +| `peer marked suspect (timeout)` | direct probe failed | Warn | +| `peer marked suspect (probe failed)` | probe error during SWIM | Info | +| `peer probe refuted by indirect probe` | indirect probe rescued the peer | Warn | +| `peer pruned (dead)` | suspect window exceeded; ejected | Warn | +| `self-refuted suspect/dead claim from peer` | local incarnation bump | Info | +| `gossip push failed` | gossip dispatch error | Warn | +| `merkle sync fetch failed` | anti-entropy pull error | Warn | +| `rebalance migration forward failed; queued for hint replay` | replication during rebalance failed | Warn | +| `hint dropped after replay error` | hint replayed but peer rejected | Info | +| `dist node draining` | `POST /dist/drain` accepted | Info | + +### Telemetry 
registration

| Message | When | Level |
| ------------------------------------------ | ------------------------------------ | ----- |
| `dist meter: counter registration failed` | OTel meter binding error | Error |
| `dist meter: gauge registration failed` | OTel meter binding error | Error |
| `dist meter: callback registration failed` | OTel observable callback bind failed | Error |
| `dist meter: callback unregister failed` | OTel meter teardown error | Error |

## Quick filters

```sh
# All cluster-membership events for this node:
journalctl -u hypercache -o cat | grep -E 'peer (added|removed|marked|pruned|probe)'

# Background-loop health (every loop emits exactly one starting line per process):
journalctl -u hypercache -o cat | grep -E 'loop start(ing|ed)'

# Hint-queue trouble (replay errors, drops):
journalctl -u hypercache -o cat | grep -F 'hint '

# All Warns and Errors only:
journalctl -u hypercache -o cat -p warning..err
```

## Going deeper

For the design background:

- [Distributed backend](distributed.md) — replication, hashing, membership.
- [Operations runbook](operations.md) — long-form failure-mode playbooks. Each `#failure-mode-*` anchor
  matches a symptom above.
- [API reference](api.md) — REST surface served by the binary.
diff --git a/docs/operations.md b/docs/operations.md
index 1c4b933..e18bdf2 100644
--- a/docs/operations.md
+++ b/docs/operations.md
@@ -1,156 +1,127 @@
# Operations runbook — DistMemory

-This document is for operators running the `pkg/backend.DistMemory`
-distributed backend in production. It assumes the design background in
-[distributed.md](distributed.md). Sections are deliberately short — each
-one stands on its own and links to code.
+This document is for operators running the `pkg/backend.DistMemory` distributed backend in production. It
+assumes the design background in [distributed.md](distributed.md). Sections are deliberately short — each one
+stands on its own and links to code.
+
+
+!!! tip "Paged right now?"
+    Start at the [on-call cheatsheet](oncall.md). It maps a symptom (heartbeat flap, hint queue building,
+    auth failure, drain stuck) to the exact log lines and metrics to grep for, then back-links to the
+    relevant section here.

## At a glance

-| Concern | First place to look |
-|---|---|
-| Node not receiving traffic | `dist.members.alive`, `/health` |
-| Writes failing | `dist.write.quorum_failures`, `sentinel.ErrDraining`, `sentinel.ErrQuorumFailed` |
-| Replicas falling behind | `dist.hinted.queued`, `dist.hinted.replayed`, `dist.hinted.dropped` |
-| Bandwidth pressure | `DistHTTPLimits.CompressionThreshold` |
-| Spurious peer flapping | `dist.heartbeat.indirect_probe.refuted`, `WithDistIndirectProbes` |
-| Slow rebalance | `dist.rebalance.throttle`, `dist.rebalance.last_ns` |
-| Anti-entropy backlog | `dist.merkle.last_diff_ns`, `dist.auto_sync.last_ns` |
-
-Live metric values come from `DistMemory.Metrics()` (Go struct),
-`/dist/metrics` (JSON, when wrapped in `hypercache.HyperCache`), or
-the OpenTelemetry pipeline you wired via `WithDistMeterProvider`.
-The OTel names use the `dist.` prefix.
+| Concern | First place to look | +| -------------------------- | -------------------------------------------------------------------------------- | +| Node not receiving traffic | `dist.members.alive`, `/health` | +| Writes failing | `dist.write.quorum_failures`, `sentinel.ErrDraining`, `sentinel.ErrQuorumFailed` | +| Replicas falling behind | `dist.hinted.queued`, `dist.hinted.replayed`, `dist.hinted.dropped` | +| Bandwidth pressure | `DistHTTPLimits.CompressionThreshold` | +| Spurious peer flapping | `dist.heartbeat.indirect_probe.refuted`, `WithDistIndirectProbes` | +| Slow rebalance | `dist.rebalance.throttle`, `dist.rebalance.last_ns` | +| Anti-entropy backlog | `dist.merkle.last_diff_ns`, `dist.auto_sync.last_ns` | + +Live metric values come from `DistMemory.Metrics()` (Go struct), `/dist/metrics` (JSON, when wrapped in +`hypercache.HyperCache`), or the OpenTelemetry pipeline you wired via `WithDistMeterProvider`. The OTel names +use the `dist.` prefix. ## Wiring observability Three opt-in entry points, all defaulting to no-op: -- **Logging** — `backend.WithDistLogger(*slog.Logger)` routes background - loops (heartbeat, hint replay, rebalance, merkle sync) and operational - errors into your logger. Records are pre-bound with +- **Logging** — `backend.WithDistLogger(*slog.Logger)` routes background loops (heartbeat, hint replay, + rebalance, merkle sync) and operational errors into your logger. Records are pre-bound with `component=dist_memory` and `node_id=`. -- **Tracing** — `backend.WithDistTracerProvider(trace.TracerProvider)` - opens spans on `Get`/`Set`/`Remove` plus per-peer - `dist.replicate.*` child spans. Cache key *values* are never put on - spans (they can be PII); only `cache.key.length`. -- **Metrics** — `backend.WithDistMeterProvider(metric.MeterProvider)` - exposes every field on `DistMetrics` as an observable instrument. +- **Tracing** — `backend.WithDistTracerProvider(trace.TracerProvider)` opens spans on `Get`/`Set`/`Remove` + plus per-peer `dist.replicate.*` child spans. Cache key _values_ are never put on spans (they can be PII); + only `cache.key.length`. +- **Metrics** — `backend.WithDistMeterProvider(metric.MeterProvider)` exposes every field on `DistMetrics` as + an observable instrument. -Wire all three to the same `otel.SetTracerProvider` / -`otel.SetMeterProvider` your application uses; the logger inherits via -`slog.Default()` if you want a one-liner. +Wire all three to the same `otel.SetTracerProvider` / `otel.SetMeterProvider` your application uses; the +logger inherits via `slog.Default()` if you want a one-liner. ## Failure mode — split-brain -**Symptom.** Two subsets of the cluster lose connectivity to each -other. Each subset elects local primaries for the keys it owns. -Writes from clients on subset A land on A-side primaries; writes from -B-side clients land on B-side primaries. When the partition heals, the -versions diverge. +**Symptom.** Two subsets of the cluster lose connectivity to each other. Each subset elects local primaries +for the keys it owns. Writes from clients on subset A land on A-side primaries; writes from B-side clients +land on B-side primaries. When the partition heals, the versions diverge. -**Detection.** `dist.heartbeat.failure` rises on both sides during the -partition. After healing, `dist.version.conflicts` increments as -anti-entropy reconciles. +**Detection.** `dist.heartbeat.failure` rises on both sides during the partition. After healing, +`dist.version.conflicts` increments as anti-entropy reconciles. 
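The reconciliation rule described next is a total order over `(version, origin)` pairs. Here is a sketch of that comparison, with hypothetical field names; the real versioned item type lives in `pkg/backend` and may be shaped differently.

```go
// Illustrative sketch of last-write-wins ordering by (version, origin).
// Field names are hypothetical; see pkg/backend for the real item type.
type versionedItem struct {
	Version uint64 // bumped on every write
	Origin  string // ID of the node that accepted the write
}

// supersedes reports whether a wins over b: the higher version wins outright,
// and a version tie is broken deterministically by comparing origin strings.
func supersedes(a, b versionedItem) bool {
	if a.Version != b.Version {
		return a.Version > b.Version
	}
	return a.Origin > b.Origin
}
```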
-**Resolution.** DistMemory uses last-write-wins by `(version, origin)` -ordering — the higher version wins, ties broken by origin string. This -is automatic. Anti-entropy via `SyncWith` (manual) or -`WithDistMerkleAutoSync` (background) closes the gap. There is no -manual reconciliation step today. +**Resolution.** DistMemory uses last-write-wins by `(version, origin)` ordering — the higher version wins, +ties broken by origin string. This is automatic. Anti-entropy via `SyncWith` (manual) or +`WithDistMerkleAutoSync` (background) closes the gap. There is no manual reconciliation step today. -**Mitigation.** Run an odd number of nodes with quorum writes -(`WithDistWriteConsistency(ConsistencyQuorum)`); a partition that -isolates a minority leaves only the majority side accepting writes -because the minority cannot reach quorum. The minority returns -`ErrQuorumFailed` (`sentinel.ErrQuorumFailed`) on Set. +**Mitigation.** Run an odd number of nodes with quorum writes (`WithDistWriteConsistency(ConsistencyQuorum)`); +a partition that isolates a minority leaves only the majority side accepting writes because the minority +cannot reach quorum. The minority returns `ErrQuorumFailed` (`sentinel.ErrQuorumFailed`) on Set. ## Failure mode — hint queue overflow -**Symptom.** A peer is unreachable for a long time. Every replicated -write to that peer turns into a queued hint. Eventually the queue -hits `WithDistHintMaxPerNode` or `WithDistHintMaxBytes` and new hints -get dropped. +**Symptom.** A peer is unreachable for a long time. Every replicated write to that peer turns into a queued +hint. Eventually the queue hits `WithDistHintMaxPerNode` or `WithDistHintMaxBytes` and new hints get dropped. -**Detection.** `dist.hinted.bytes` (gauge) climbs steadily. -`dist.hinted.global_dropped` increments when caps are exceeded. -`dist.hinted.dropped` (a different metric — replay errors) also rises -if the peer is reachable but rejecting writes (auth, schema mismatch). +**Detection.** `dist.hinted.bytes` (gauge) climbs steadily. `dist.hinted.global_dropped` increments when caps +are exceeded. `dist.hinted.dropped` (a different metric — replay errors) also rises if the peer is reachable +but rejecting writes (auth, schema mismatch). **Resolution.** -1. Restore the unreachable peer; the replay loop drains automatically - (`dist.hinted.replayed` rises). -1. If the peer is permanently gone, remove it from membership - (`DistMemory.RemovePeer(addr)`); queued hints expire on the - `WithDistHintTTL` timer. -1. If hints are dropping faster than they replay, raise - `WithDistHintMaxPerNode` / `WithDistHintMaxBytes` — but understand - that the cap exists to bound process memory under sustained - failure. Raising it without fixing the underlying peer just delays - the bound. - -**Phase B note.** Migration failures during rebalance now also funnel -through the hint queue (Phase B.2). A surge in `dist.hinted.queued` -during a rolling deploy is expected; it should drain as the new node -becomes reachable. +1. Restore the unreachable peer; the replay loop drains automatically (`dist.hinted.replayed` rises). +1. If the peer is permanently gone, remove it from membership (`DistMemory.RemovePeer(addr)`); queued hints + expire on the `WithDistHintTTL` timer. +1. If hints are dropping faster than they replay, raise `WithDistHintMaxPerNode` / `WithDistHintMaxBytes` — + but understand that the cap exists to bound process memory under sustained failure. Raising it without + fixing the underlying peer just delays the bound. 
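To size the caps in step 3, a back-of-envelope bound is usually enough. A sketch of the arithmetic, assuming an average hinted value of 4 KiB (substitute your own) and the 64 bytes of per-hint bookkeeping quoted in the capacity planning notes at the end of this runbook:

```go
// Rough memory bound for a hint queue sitting at its byte cap.
// The 4 KiB average value size is an assumption; the 64-byte bookkeeping
// figure comes from the capacity planning notes below.
const (
	maxBytes        = 256 << 20 // WithDistHintMaxBytes: 256 MiB of queued values
	avgHintSize     = 4 << 10   // assumed average size of one hinted value
	perHintOverhead = 64        // bookkeeping per queued hint

	maxHints    = maxBytes / avgHintSize     // 65,536 hints before the cap bites
	bookkeeping = maxHints * perHintOverhead // ≈ 4 MiB on top of the cap
	worstCase   = maxBytes + bookkeeping     // ≈ 260 MiB resident for a full queue
)
```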
+ +**Phase B note.** Migration failures during rebalance now also funnel through the hint queue (Phase B.2). A +surge in `dist.hinted.queued` during a rolling deploy is expected; it should drain as the new node becomes +reachable. ## Failure mode — rebalance under load -**Symptom.** Adding a node triggers a rebalance scan that migrates -keys to their new primary. Under sustained write load the migration -saturates and `dist.rebalance.throttle` increments — batches queue -behind the configured concurrency cap. +**Symptom.** Adding a node triggers a rebalance scan that migrates keys to their new primary. Under sustained +write load the migration saturates and `dist.rebalance.throttle` increments — batches queue behind the +configured concurrency cap. -**Detection.** `dist.rebalance.last_ns` (gauge — last full scan -duration) climbs. `dist.rebalance.throttle` (counter) increments when -the concurrency limit blocks a batch dispatch. `dist.rebalance.batches` -should still climb steadily. +**Detection.** `dist.rebalance.last_ns` (gauge — last full scan duration) climbs. `dist.rebalance.throttle` +(counter) increments when the concurrency limit blocks a batch dispatch. `dist.rebalance.batches` should still +climb steadily. **Resolution.** -1. Raise `WithDistRebalanceMaxConcurrent` (default 1) if CPU and - network headroom allow. -1. Lower `WithDistRebalanceBatchSize` (default 64) so individual - batches finish faster and concurrency slots cycle more often — - counter-intuitively, smaller batches sometimes throughput-win. -1. Pause writes (drain a subset of clients via your LB) until the - scan finishes. The dist backend has no built-in - write-throttling — that's the application's job. - -**Phase C note.** Drain (`POST /dist/drain`) does *not* trigger an -expedited rebalance today; the next scheduled -`WithDistRebalanceInterval` tick does the work. If you need to force -a faster ownership transfer, call `Stop` after Drain to cancel -in-flight work and let restart-time rebalance handle migration. +1. Raise `WithDistRebalanceMaxConcurrent` (default 1) if CPU and network headroom allow. +1. Lower `WithDistRebalanceBatchSize` (default 64) so individual batches finish faster and concurrency slots + cycle more often — counter-intuitively, smaller batches sometimes throughput-win. +1. Pause writes (drain a subset of clients via your LB) until the scan finishes. The dist backend has no + built-in write-throttling — that's the application's job. + +**Phase C note.** Drain (`POST /dist/drain`) does _not_ trigger an expedited rebalance today; the next +scheduled `WithDistRebalanceInterval` tick does the work. If you need to force a faster ownership transfer, +call `Stop` after Drain to cancel in-flight work and let restart-time rebalance handle migration. ## Failure mode — replica loss -**Symptom.** A replica node dies hard (kernel panic, hardware -failure). Its keys still have other replicas (when `replication >= 2`), -but until membership notices, writes try to fan out to it and -silently retry via the hint queue. +**Symptom.** A replica node dies hard (kernel panic, hardware failure). Its keys still have other replicas +(when `replication >= 2`), but until membership notices, writes try to fan out to it and silently retry via +the hint queue. -**Detection.** `dist.heartbeat.failure` increments steadily for the -lost peer. After `WithDistHeartbeat`'s `deadAfter` window, the peer -is pruned (`dist.nodes.removed` increments) and ring lookups stop -including it. 
+**Detection.** `dist.heartbeat.failure` increments steadily for the lost peer. After `WithDistHeartbeat`'s +`deadAfter` window, the peer is pruned (`dist.nodes.removed` increments) and ring lookups stop including it. **Resolution.** -1. Wait for the heartbeat to detect the dead peer. With default - timing, this is on the order of seconds. -1. Spin up a replacement node with the same membership (or let - gossip discover it). -1. The new node's rebalance scan pulls its assigned keys from - surviving replicas via Merkle anti-entropy. +1. Wait for the heartbeat to detect the dead peer. With default timing, this is on the order of seconds. +1. Spin up a replacement node with the same membership (or let gossip discover it). +1. The new node's rebalance scan pulls its assigned keys from surviving replicas via Merkle anti-entropy. -**Indirect probes.** `WithDistIndirectProbes(k, timeout)` filters -caller-side network blips that would otherwise mark a healthy peer -suspect. `dist.heartbeat.indirect_probe.refuted` rising indicates -indirect probes are saving you from spurious flapping; rising -`dist.heartbeat.indirect_probe.failure` indicates the peer is -genuinely unreachable from multiple vantage points. +**Indirect probes.** `WithDistIndirectProbes(k, timeout)` filters caller-side network blips that would +otherwise mark a healthy peer suspect. `dist.heartbeat.indirect_probe.refuted` rising indicates indirect +probes are saving you from spurious flapping; rising `dist.heartbeat.indirect_probe.failure` indicates the +peer is genuinely unreachable from multiple vantage points. ## Operational tasks @@ -187,28 +158,24 @@ curl 'http://node-A:8080/internal/keys?cursor=1' err := dm.SyncWith(ctx, "node-B") ``` -`WithDistMerkleAutoSync(interval)` runs this on a timer; manual calls -are useful for debugging. +`WithDistMerkleAutoSync(interval)` runs this on a timer; manual calls are useful for debugging. ## Capacity planning notes -- Each shard mutex is independent — write throughput scales with - shard count up to CPU saturation. -- Hint queue memory is approximately `HintedBytes` + 64 bytes of - bookkeeping per queued hint. Cap via `WithDistHintMaxBytes` to - bound total process memory under partition. -- Merkle tree storage scales O(N/chunk) for N keys at - `WithDistMerkleChunkSize` (default 128). For a million keys, the - default chunk gives ~8K leaf hashes per node — negligible. -- Replication factor 3 with quorum reads/writes tolerates 1 failure; - raise to 5 for tolerating 2 failures, at 5× the storage cost. +- Each shard mutex is independent — write throughput scales with shard count up to CPU saturation. +- Hint queue memory is approximately `HintedBytes` + 64 bytes of bookkeeping per queued hint. Cap via + `WithDistHintMaxBytes` to bound total process memory under partition. +- Merkle tree storage scales O(N/chunk) for N keys at `WithDistMerkleChunkSize` (default 128). For a million + keys, the default chunk gives ~8K leaf hashes per node — negligible. +- Replication factor 3 with quorum reads/writes tolerates 1 failure; raise to 5 for tolerating 2 failures, at + 5× the storage cost. 
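The failure-tolerance numbers in the last bullet follow from quorum arithmetic. A small sketch of the rule, assuming the usual majority-quorum definition for the quorum reads/writes described above:

```go
// Majority-quorum arithmetic behind the replication bullet above.
// With replication factor r and quorum reads/writes, an operation needs
// floor(r/2)+1 acks, so the cluster tolerates r - (floor(r/2)+1) lost replicas.
func quorum(r int) int    { return r/2 + 1 }
func tolerated(r int) int { return r - quorum(r) }

// quorum(3) == 2, tolerated(3) == 1  (the replication factor 3 case above)
// quorum(5) == 3, tolerated(5) == 2  (at 5x the storage cost)
```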
## Where things are -| Concern | File | -|---|---| -| Public surface | [pkg/backend/dist_memory.go](../pkg/backend/dist_memory.go) | -| Transport interface | [pkg/backend/dist_transport.go](../pkg/backend/dist_transport.go) | -| HTTP transport | [pkg/backend/dist_http_transport.go](../pkg/backend/dist_http_transport.go) | -| HTTP server | [pkg/backend/dist_http_server.go](../pkg/backend/dist_http_server.go) | -| Membership / ring | [internal/cluster/](../internal/cluster) | +| Concern | File | +| ------------------- | --------------------------------------------------------------------------- | +| Public surface | [pkg/backend/dist_memory.go](../pkg/backend/dist_memory.go) | +| Transport interface | [pkg/backend/dist_transport.go](../pkg/backend/dist_transport.go) | +| HTTP transport | [pkg/backend/dist_http_transport.go](../pkg/backend/dist_http_transport.go) | +| HTTP server | [pkg/backend/dist_http_server.go](../pkg/backend/dist_http_server.go) | +| Membership / ring | [internal/cluster/](../internal/cluster) |