diff --git a/cspell.config.yaml b/cspell.config.yaml index cdf2e9a..122ac19 100644 --- a/cspell.config.yaml +++ b/cspell.config.yaml @@ -59,6 +59,7 @@ words: - Cbor - cespare - chans + - cheatsheet - cmap - Cmder - codacy @@ -142,6 +143,7 @@ words: - ints - ireturn - Itemm + - journalctl - keyf - keypair - lamport diff --git a/docs/index.md b/docs/index.md index 9f74b3c..11e2c24 100644 --- a/docs/index.md +++ b/docs/index.md @@ -6,9 +6,8 @@ hide: # HyperCache -Distributed in-memory cache for Go. Sharded for concurrency, replicated for -durability under partial failure, observable from the start, and shipped as -both a library and a single-binary HTTP service. +Distributed in-memory cache for Go. Sharded for concurrency, replicated for durability under partial failure, +observable from the start, and shipped as both a library and a single-binary HTTP service.
@@ -16,19 +15,21 @@ both a library and a single-binary HTTP service. - :material-server-network: **[5-Node Cluster](cluster.md)** — boot a real cluster with `docker compose`. - :fontawesome-brands-kubernetes: **[Helm Chart](helm.md)** — deploy on Kubernetes with stable identities. - :material-tools: **[Operations Runbook](operations.md)** — split-brain, hint queues, drain, capacity. +- :material-bell-alert: **[On-call Cheatsheet](oncall.md)** — symptom → log grep → metric → action for paged + operators.
## Why HyperCache -| | What you get | Why it matters | -|---|---|---| -| **Sharded by default** | 32 per-shard mutexes routed by xxhash | Write throughput scales with cores, no global lock. | -| **Distributed backend** | Consistent hashing, configurable replication, quorum reads/writes | A single failed node does not lose keys. | -| **Hinted handoff** | Failed forwards queue with TTL, replay on the dist HTTP transport | Transient peer outages don't drop replicas. | -| **SWIM heartbeat** | Direct + indirect probes; self-refute via incarnation gossip | Filters caller-side network blips, recovers from false suspicion. | -| **Observable** | `slog` logger + OpenTelemetry tracing + OpenTelemetry metrics, all opt-in | Plug into your existing pipeline, no extra deps. | -| **Operator-friendly** | `Drain` endpoint, cursor-paged key enumeration, JSON error envelopes | Designed for rolling deploys and on-call clarity. | +| | What you get | Why it matters | +| ----------------------- | ------------------------------------------------------------------------- | ----------------------------------------------------------------- | +| **Sharded by default** | 32 per-shard mutexes routed by xxhash | Write throughput scales with cores, no global lock. | +| **Distributed backend** | Consistent hashing, configurable replication, quorum reads/writes | A single failed node does not lose keys. | +| **Hinted handoff** | Failed forwards queue with TTL, replay on the dist HTTP transport | Transient peer outages don't drop replicas. | +| **SWIM heartbeat** | Direct + indirect probes; self-refute via incarnation gossip | Filters caller-side network blips, recovers from false suspicion. | +| **Observable** | `slog` logger + OpenTelemetry tracing + OpenTelemetry metrics, all opt-in | Plug into your existing pipeline, no extra deps. | +| **Operator-friendly** | `Drain` endpoint, cursor-paged key enumeration, JSON error envelopes | Designed for rolling deploys and on-call clarity. | ## How it fits together @@ -57,26 +58,20 @@ flowchart LR Shard1 <-.HTTP replicate.-> Peer1 ``` -The `HyperCache` wrapper is a thin facade you embed in your application. -The `DistMemory` backend handles sharding, replication, and the cluster -plane. Two HTTP listeners run per process: a peer-to-peer one for -replication and gossip, and a separate management one for admin and -observability. +The `HyperCache` wrapper is a thin facade you embed in your application. The `DistMemory` backend handles +sharding, replication, and the cluster plane. Two HTTP listeners run per process: a peer-to-peer one for +replication and gossip, and a separate management one for admin and observability. ## Two ways to use it -**As a library** — embed `HyperCache` directly in your Go application; it -uses the in-memory or distributed backend in-process. See -[Quickstart](quickstart.md). +**As a library** — embed `HyperCache` directly in your Go application; it uses the in-memory or distributed +backend in-process. See [Quickstart](quickstart.md). -**As a service** — run the [`hypercache-server`](server.md) binary; clients -talk to it over a REST API. See [5-Node Cluster](cluster.md) for the -docker-compose recipe and [Helm Chart](helm.md) for Kubernetes. +**As a service** — run the [`hypercache-server`](server.md) binary; clients talk to it over a REST API. See +[5-Node Cluster](cluster.md) for the docker-compose recipe and [Helm Chart](helm.md) for Kubernetes. 
## Project status -The distributed backend is production-ready as of v0.6.0 — see the -[changelog](changelog.md) for the full list of features and fixes that -landed during the productionization push (Phases A through E in the -upstream history). Operations procedures live in the -[runbook](operations.md). +The distributed backend is production-ready as of v0.6.0 — see the [changelog](changelog.md) for the full list +of features and fixes that landed during the productionization push (Phases A through E in the upstream +history). Operations procedures live in the [runbook](operations.md). diff --git a/docs/oncall.md b/docs/oncall.md new file mode 100644 index 0000000..37dcea4 --- /dev/null +++ b/docs/oncall.md @@ -0,0 +1,346 @@ +--- +title: On-call cheatsheet +description: Symptom → log grep → metric → action map for HyperCache operators. +--- + +# On-call cheatsheet + +You got paged. This page exists to take you from a symptom to a diagnosis in under sixty seconds. Each section +is a single failure shape: what you'll see, where to look, what to do next. Deeper operating procedures live +in the [operations runbook](operations.md); start here, descend there. + +Every log line quoted below is a real string the binary emits — copy into `grep -F` directly. Every metric +name is from `DistMemory.Metrics()` and its OTel mirror (`dist.*`) or from the wrapper-level `StatsCollector`. + +## Triage matrix + +| Symptom | Likely cause | Jump to | +| ------------------------------------------------ | ------------------------------------------------- | --------------------------------------------------------- | +| Node won't start / never appears in cluster | bind failure, bad config, OIDC issuer unreachable | [Node startup](#node-startup) | +| Cluster has the right members but cache is empty | new node still rebalancing in | [Cold replica](#cold-replica) | +| Peers flapping in `/cluster/members` | network jitter, indirect probes failing | [Heartbeat flapping](#heartbeat-flapping) | +| Hints building up faster than they drain | one peer unreachable or rejecting writes | [Hint queue](#hint-queue-building) | +| 401 / 403 on requests that should work | misconfigured token, missing scope, OIDC expired | [Auth failures](#auth-failures) | +| Eviction running hot, latency spiking on Set | cache at capacity, eviction can't keep up | [Eviction pressure](#eviction-pressure) | +| Replicas diverging | partition healed, version conflicts | [Split-brain reconciliation](#split-brain-reconciliation) | +| Drain stuck / load balancer still routing | `/health` not flipping or LB caching | [Drain not draining](#drain-not-draining) | + +## Node startup + +**What you'll see (good).** Exactly one of these on each node, in order, on every boot: + +```text +msg="hypercache-server starting" api_addr=:8080 mgmt_addr=:8081 dist_addr=:7946 oidc_enabled=true +msg="cluster join: node starting" node_id=cache-0 replication=3 virtual_nodes=128 peers_known=4 +msg="dist HTTP listener started" addr=:7946 +msg="heartbeat loop started" interval=1s +msg="rebalance loop started" interval=30s +msg="hint replay loop started" interval=15s +``` + +If you see all six lines, the node has bound its ports, advertised itself to peers, and started its background +loops. Everything after this point is steady-state. + +**What you'll see (bad).** + +- `msg="dist HTTP listener bind failed"` — another process is already bound to `HYPERCACHE_DIST_ADDR`. Check + for a stale pod / process on the host. 
+- `msg="oidc verifier construction failed"` — IdP discovery URL unreachable from the pod. Check
+  `HYPERCACHE_OIDC_ISSUER`, DNS, and egress firewall rules. The process exits with code 1 (so the orchestrator
+  will restart it; check `kubectl describe pod` for the loop).
+- No `cluster join` line at all — the binary crashed before `buildHyperCache` returned. Look earlier in the
+  log for `hypercache construction failed` with `err=...`.
+
+**Metrics to check.** `dist.members.alive` (gauge) on every other node should tick up by one within
+`WithDistHeartbeat`'s `aliveAfter` window. `dist.membership.version` increments on each membership change, so
+it also bumps once per peer that learns about the new node.
+
+## Cold replica
+
+**What's happening.** A new replica is in the membership but its shards haven't been hydrated yet. Reads
+against keys whose primary is elsewhere succeed (replica forward), but reads against keys this node should own
+return misses until rebalance migrates them in.
+
+**What you'll see in logs.** No errors — this is normal. After the first `rebalance loop started` line, expect
+periodic ticks; the `rebalance.batches` increments are visible at `/dist/metrics`.
+
+**Metrics to check.**
+
+- `dist.rebalance.batches` (counter) — incrementing means migration is happening.
+- `dist.rebalance.keys` (counter) — total keys migrated this process-lifetime.
+- `dist.rebalance.last_ns` (gauge) — duration of the last full scan. Compare to `WithDistRebalanceInterval` —
+  if scan duration exceeds the interval, you have a sustained backlog.
+
+**What to do.** Usually wait. If the wait is unbounded, see
+[Rebalance under load](operations.md#failure-mode--rebalance-under-load).
+
+## Heartbeat flapping
+
+**What's happening.** Peers cycle alive → suspect → alive every few ticks. Caller-side network jitter, an
+overloaded probe path, or a mis-tuned `WithDistHeartbeat` are the usual causes.
+
+**What you'll see in logs.**
+
+```text
+msg="peer marked suspect (timeout)" peer_id=cache-2 ...
+msg="peer probe refuted by indirect probe" peer_id=cache-2 ...
+msg="self-refuted suspect/dead claim from peer" ...
+```
+
+The third one is the recovery path — the suspected node observed itself being slandered and bumped its
+incarnation to refute. If you see it landing, the SWIM dance is working as designed.
+
+```text
+msg="peer pruned (dead)" peer_id=cache-2 ...
+msg="peer removed from membership" peer_addr=:7946 members_after=3
+```
+
+These two together mean a peer has been ejected — distinguish them from manual `RemovePeer` calls (which only
+emit the second line, with no preceding `pruned (dead)`).
+
+**Metrics to check.**
+
+- `dist.heartbeat.failure` (counter) climbing — direct probes are failing.
+- `dist.heartbeat.indirect_probe.refuted` (counter) — indirect probes are saving you from spurious flap.
+  Healthy if non-zero.
+- `dist.heartbeat.indirect_probe.failure` (counter) — indirect probes are failing too. The peer is genuinely
+  unreachable.
+- `dist.nodes.suspect` / `dist.nodes.dead` (gauges) — current cluster state.
+
+**What to do.** If `refuted` is climbing in step with `failure`, the system is self-correcting — extend
+`WithDistHeartbeat`'s `suspectAfter` / `deadAfter` if the flap is noisy. If `indirect_probe.failure` is also
+climbing, the peer is genuinely unreachable — see [replica loss](operations.md#failure-mode--replica-loss).
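+
+If the windows need retuning, they are construction-time options. The option names below are the ones used on
+this page, but the constructor name and exact parameter lists are assumptions; check
+`pkg/backend/dist_memory.go` for the real signatures:
+
+```go
+// Sketch only: widen the suspect/dead windows so a couple of missed probes
+// don't eject a healthy peer. Constructor and argument lists are illustrative.
+be, err := backend.NewDistMemory(
+	backend.WithDistHeartbeat(1*time.Second, 10*time.Second, 30*time.Second), // probe interval, suspectAfter, deadAfter
+	backend.WithDistIndirectProbes(3, 2*time.Second),                         // k helpers, per-probe timeout
+)
+```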
+
+## Hint queue building
+
+**What's happening.** A peer is unreachable. Every replicated write to it gets queued as a hint, waiting for
+the peer to come back. The queue is bounded — see `WithDistHintMaxPerNode` / `WithDistHintMaxBytes`.
+
+**What you'll see in logs.**
+
+```text
+msg="rebalance migration forward failed; queued for hint replay" target_addr=... err=...
+msg="hint dropped after replay error" target_node=... err=...
+```
+
+The first is benign during a peer outage. The second means the peer came back but rejected the hint — auth
+mismatch, schema drift, or a truly bad value.
+
+**Metrics to check.**
+
+- `dist.hinted.bytes` (gauge) — climbing steadily, no drain → peer still down.
+- `dist.hinted.queued` (counter) — total ever queued; rising rate is the canary.
+- `dist.hinted.replayed` (counter) — climbs when the peer is reachable and the queue is draining.
+- `dist.hinted.global_dropped` (counter) — caps exceeded; hints are being silently dropped. Hard limit hit.
+- `dist.hinted.expired` (counter) — hints aged past `WithDistHintTTL`.
+
+**What to do.** See [Hint queue overflow](operations.md#failure-mode--hint-queue-overflow) for the full
+playbook. Short version: restore the peer, or remove it from membership and let hints expire.
+
+## Auth failures
+
+**What's happening.** A request hit the API or management port without an identity that satisfies the policy.
+
+**What you'll see in logs.** Auth failures are deliberately quiet (no "request denied" log per call — that
+would be a log-spam amplifier). Look for the `audit` line emitted by the management HTTP layer on denied
+access, and at `/dist/metrics` for `auth.*` counters if your build has them.
+
+**What to check first.**
+
+- `curl http://<host>:8081/v1/me -H 'Authorization: Bearer <token>'` → returns the resolved identity + scopes.
+  If this returns 401, the token itself is wrong; if it returns 200 with empty scopes, the token resolves but
+  lacks the scope the endpoint requires.
+- For OIDC tokens: `aud` and `iss` must match `HYPERCACHE_OIDC_AUDIENCE` / `HYPERCACHE_OIDC_ISSUER`. The
+  verifier rejects mismatches before any policy check runs.
+- For static bearers: the token must appear in the policy YAML (`HYPERCACHE_AUTH_CONFIG`) — confirm with
+  `curl http://<host>:8081/v1/me` using that exact token.
+
+**What to do.**
+
+1. Reproduce with the `/v1/me` curl above (definitive truth — same chain that the failing endpoint runs).
+1. If `/v1/me` returns 401: the token is rejected before reaching the scope check. Bearer mismatch, OIDC
+   expiry, or revoked cert.
+1. If `/v1/me` returns 200 but the original endpoint still 403s: the identity resolved but lacks the required
+   scope. Check the route's `Scopes` mapping (`management_http.go`); cross-reference against the identity's
+   `scopes` field in the `/v1/me` response.
+1. For OIDC token expiry specifically — `exp` is in the JWT payload;
+   `cut -d. -f2 | base64 -d | jq .exp` decodes it client-side.
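+
+Item 4 expanded into a copy-pasteable check, using only standard `cut`, `tr`, `base64`, and `jq`; the token
+value is a placeholder:
+
+```sh
+# Decode the exp claim of the failing bearer token and compare it to the clock.
+# JWT payloads are unpadded base64url, so remap the alphabet and restore padding first.
+TOKEN='<paste the failing JWT here>'
+PAYLOAD=$(printf '%s' "$TOKEN" | cut -d. -f2 | tr '_-' '/+')
+while [ $(( ${#PAYLOAD} % 4 )) -ne 0 ]; do PAYLOAD="${PAYLOAD}="; done
+echo "token exp: $(printf '%s' "$PAYLOAD" | base64 -d | jq -r .exp)   now: $(date +%s)"
+```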
+
+## Eviction pressure
+
+**What's happening.** The cache is at or above capacity. Eviction is running on every tick, every `Set`
+triggers an immediate evict, and `Set` latency reflects the eviction cost.
+
+**What you'll see in logs.** With Info-level logging on, every tick that does work emits:
+
+```text
+msg="eviction tick" evicted=42 items_remaining=10000 elapsed=3.2ms
+```
+
+A sustained sequence of these (non-zero `evicted` on every tick) is the symptom. If you also see
+`eviction triggered source=manual`, something is calling `TriggerEviction` from application code.
+
+**Metrics to check.**
+
+- `eviction_loop_count` (counter) — how often the loop ran.
+- `item_evicted_count` (counter) — total items evicted.
+- `evicted_item_count` (gauge) — items evicted in the **last** tick. Sustained non-zero = under pressure.
+- `eviction_loop_duration` (timing) — tick latency. Climbing → eviction itself is the bottleneck.
+
+**What to do.**
+
+1. Raise capacity (`WithMaxCacheSize` or per-backend equivalent).
+1. Audit `Set` callers — is something setting keys with no TTL and no key reuse? Eviction is doing the work
+   TTL should.
+1. Switch eviction algorithm — `WithEvictionAlgorithm("lru")` vs `"lfu"` vs `"cawolfu"` suit very different
+   working sets.
+1. Increase `WithEvictionShardCount` (default 32) — eviction contention is per-shard.
+
+## Split-brain reconciliation
+
+**What's happening.** A partition healed. Both sides have writes the other doesn't.
+
+**What you'll see in logs.** During the partition: heartbeat failure logs (see
+[Heartbeat flapping](#heartbeat-flapping)). After healing: the merkle anti-entropy loop reconciles. No
+specific log line is emitted per resolved conflict — version-and-origin ordering is silent by design (it would
+log-spam under load).
+
+**Metrics to check.**
+
+- `dist.version.conflicts` (counter) — increments per detected divergence. Climbs after a heal, then
+  stabilizes.
+- `dist.merkle.last_diff_ns` (gauge) — duration of the last sync.
+- `dist.merkle.syncs` (counter) — successful merkle pulls.
+- `dist.merkle.keys_pulled` (counter) — keys reconciled.
+
+**What to do.** Usually wait. Auto-sync drains on its `WithDistMerkleAutoSync` interval. To force-trigger:
+
+```go
+err := dm.SyncWith(ctx, "peer-node-id")
+```
+
+The full discussion is in [Split-brain](operations.md#failure-mode--split-brain).
+
+## Drain not draining
+
+**What's happening.** You posted to `/dist/drain`, but `/health` still returns 200, or the load balancer is
+still routing.
+
+**What you'll see in logs.**
+
+```text
+msg="dist node draining"
+```
+
+If you see this line exactly once after the `POST /dist/drain`, the drain registered cache-side. From here:
+
+- `/health` returns **503** on every subsequent request.
+- New `Set` / `Remove` return `sentinel.ErrDraining` (HTTP 503).
+- `Get` continues to serve from cache.
+
+**What to do.**
+
+1. Confirm the drain line in the logs first. If absent, the request never reached the node — check the
+   management address you're POSTing to (`HYPERCACHE_MGMT_ADDR`, not the client API port).
+1. If the drain logged but `/health` still returns 200, you're probably hitting the wrong listener — `/health`
+   lives on both the client API and management ports, and only the latter respects drain. Confirm via
+   `curl -v http://<host>:8081/health` (mgmt) vs `:8080` (api).
+1. If `/health` correctly returns 503 but the LB still routes, that's a load-balancer problem, not a cache
+   problem. Check the LB's health-check cache TTL.
+
+Drain is one-way per process; restart to clear.
+
+## Structured-logging reference
+
+Every log line the cache emits as of this writing, grouped by source. Use this as a grep dictionary.
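+
+For example, to pull every hard peer ejection out of a node's journal:
+
+```sh
+journalctl -u hypercache -o cat | grep -F 'peer pruned (dead)'
+```
+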
+ +### Lifecycle (`hypercache-server`) + +| Message | When | Level | +| ------------------------------------------------------------------- | -------------------------------- | ----- | +| `hypercache-server starting` | binary boot, once | Info | +| `hypercache-server running with no client API auth configured; ...` | misconfigured auth | Warn | +| `shutdown signal received` | SIGINT/SIGTERM received | Info | +| `hypercache-server stopped cleanly` | shutdown complete | Info | +| `oidc verifier construction failed` | IdP unreachable at boot | Error | +| `client API listener exited` | API port goroutine died | Error | +| `hypercache construction failed` | wrapper init error | Error | +| `client API construction failed` | server init error | Error | +| `drain returned error` | drain attempt on shutdown failed | Warn | +| `client API shutdown returned error` | graceful shutdown failed | Warn | +| `hypercache stop returned error` | wrapper stop failed | Warn | + +### Wrapper loops (`HyperCache`) + +| Message | When | Level | +| ---------------------------------------------- | --------------------------------------------- | ----- | +| `eviction loop starting` | wrapper start, once if `evictionInterval > 0` | Info | +| `eviction loop stopped` | context canceled or stop signal | Info | +| `eviction tick` | tick did work (evicted > 0) | Info | +| `eviction tick (idle)` | tick ran with nothing to evict | Debug | +| `eviction triggered` | `TriggerEviction()` accepted | Info | +| `eviction trigger coalesced (already pending)` | trigger arrived while one in-flight | Debug | +| `expiration loop starting` | wrapper start, once | Info | +| `expiration loop stopped` | context canceled or stop signal | Info | +| `expiration tick` | tick removed expired items | Info | +| `expiration tick (idle)` | tick ran with nothing expired | Debug | + +### DistMemory backend + +| Message | When | Level | +| ------------------------------------------------------------ | ------------------------------------ | ----- | +| `cluster join: node starting` | DistMemory constructor, once | Info | +| `dist HTTP listener started` | peer transport bound | Info | +| `dist HTTP listener bind failed` | port in use / permission denied | Error | +| `dist HTTP serve goroutine exited` | transport listener stopped | Info | +| `heartbeat loop started` | SWIM probe loop start | Info | +| `gossip loop started` | gossip push loop start | Info | +| `hint replay loop started` | hint drain loop start | Info | +| `rebalance loop started` | ownership-migration loop start | Info | +| `merkle auto-sync loop started` | anti-entropy loop start | Info | +| `peer added to membership` | `AddPeer` accepted | Info | +| `peer removed from membership` | `RemovePeer` or `peer pruned (dead)` | Info | +| `peer marked suspect (timeout)` | direct probe failed | Warn | +| `peer marked suspect (probe failed)` | probe error during SWIM | Info | +| `peer probe refuted by indirect probe` | indirect probe rescued the peer | Warn | +| `peer pruned (dead)` | suspect window exceeded; ejected | Warn | +| `self-refuted suspect/dead claim from peer` | local incarnation bump | Info | +| `gossip push failed` | gossip dispatch error | Warn | +| `merkle sync fetch failed` | anti-entropy pull error | Warn | +| `rebalance migration forward failed; queued for hint replay` | replication during rebalance failed | Warn | +| `hint dropped after replay error` | hint replayed but peer rejected | Info | +| `dist node draining` | `POST /dist/drain` accepted | Info | + +### Telemetry 
registration
+
+| Message                                     | When                                 | Level |
+| ------------------------------------------- | ------------------------------------ | ----- |
+| `dist meter: counter registration failed`   | OTel meter binding error             | Error |
+| `dist meter: gauge registration failed`     | OTel meter binding error             | Error |
+| `dist meter: callback registration failed`  | OTel observable callback bind failed | Error |
+| `dist meter: callback unregister failed`    | OTel meter teardown error            | Error |
+
+## Quick filters
+
+```sh
+# All cluster-membership events for this node:
+journalctl -u hypercache -o cat | grep -E 'peer (added|removed|marked|pruned|probe)'
+
+# Background-loop health (every loop emits exactly one starting/started line per process):
+journalctl -u hypercache -o cat | grep -E 'loop start(ing|ed)'
+
+# Hint-queue trouble (replay errors, drops):
+journalctl -u hypercache -o cat | grep -F 'hint '
+
+# All Warns and Errors only:
+journalctl -u hypercache -o cat -p warning..err
+```
+
+## Going deeper
+
+For the design background:
+
+- [Distributed backend](distributed.md) — replication, hashing, membership.
+- [Operations runbook](operations.md) — long-form failure-mode playbooks. Each `#failure-mode-*` anchor
+  matches a symptom above.
+- [API reference](api.md) — REST surface served by the binary.
diff --git a/docs/operations.md b/docs/operations.md
index 1c4b933..e18bdf2 100644
--- a/docs/operations.md
+++ b/docs/operations.md
@@ -1,156 +1,127 @@
 # Operations runbook — DistMemory
-This document is for operators running the `pkg/backend.DistMemory`
-distributed backend in production. It assumes the design background in
-[distributed.md](distributed.md). Sections are deliberately short — each
-one stands on its own and links to code.
+This document is for operators running the `pkg/backend.DistMemory` distributed backend in production. It
+assumes the design background in [distributed.md](distributed.md). Sections are deliberately short — each one
+stands on its own and links to code.
+
+
+!!! tip "Paged right now?"
+    Start at the [on-call cheatsheet](oncall.md). It maps a symptom (heartbeat flap, hint queue building,
+    auth failure, drain stuck) to the exact log lines and metrics to grep for, then back-links to the
+    relevant section here.
 
 ## At a glance
 
-| Concern | First place to look |
-|---|---|
-| Node not receiving traffic | `dist.members.alive`, `/health` |
-| Writes failing | `dist.write.quorum_failures`, `sentinel.ErrDraining`, `sentinel.ErrQuorumFailed` |
-| Replicas falling behind | `dist.hinted.queued`, `dist.hinted.replayed`, `dist.hinted.dropped` |
-| Bandwidth pressure | `DistHTTPLimits.CompressionThreshold` |
-| Spurious peer flapping | `dist.heartbeat.indirect_probe.refuted`, `WithDistIndirectProbes` |
-| Slow rebalance | `dist.rebalance.throttle`, `dist.rebalance.last_ns` |
-| Anti-entropy backlog | `dist.merkle.last_diff_ns`, `dist.auto_sync.last_ns` |
-
-Live metric values come from `DistMemory.Metrics()` (Go struct),
-`/dist/metrics` (JSON, when wrapped in `hypercache.HyperCache`), or
-the OpenTelemetry pipeline you wired via `WithDistMeterProvider`.
-The OTel names use the `dist.` prefix.
+| Concern | First place to look | +| -------------------------- | -------------------------------------------------------------------------------- | +| Node not receiving traffic | `dist.members.alive`, `/health` | +| Writes failing | `dist.write.quorum_failures`, `sentinel.ErrDraining`, `sentinel.ErrQuorumFailed` | +| Replicas falling behind | `dist.hinted.queued`, `dist.hinted.replayed`, `dist.hinted.dropped` | +| Bandwidth pressure | `DistHTTPLimits.CompressionThreshold` | +| Spurious peer flapping | `dist.heartbeat.indirect_probe.refuted`, `WithDistIndirectProbes` | +| Slow rebalance | `dist.rebalance.throttle`, `dist.rebalance.last_ns` | +| Anti-entropy backlog | `dist.merkle.last_diff_ns`, `dist.auto_sync.last_ns` | + +Live metric values come from `DistMemory.Metrics()` (Go struct), `/dist/metrics` (JSON, when wrapped in +`hypercache.HyperCache`), or the OpenTelemetry pipeline you wired via `WithDistMeterProvider`. The OTel names +use the `dist.` prefix. ## Wiring observability Three opt-in entry points, all defaulting to no-op: -- **Logging** — `backend.WithDistLogger(*slog.Logger)` routes background - loops (heartbeat, hint replay, rebalance, merkle sync) and operational - errors into your logger. Records are pre-bound with +- **Logging** — `backend.WithDistLogger(*slog.Logger)` routes background loops (heartbeat, hint replay, + rebalance, merkle sync) and operational errors into your logger. Records are pre-bound with `component=dist_memory` and `node_id=`. -- **Tracing** — `backend.WithDistTracerProvider(trace.TracerProvider)` - opens spans on `Get`/`Set`/`Remove` plus per-peer - `dist.replicate.*` child spans. Cache key *values* are never put on - spans (they can be PII); only `cache.key.length`. -- **Metrics** — `backend.WithDistMeterProvider(metric.MeterProvider)` - exposes every field on `DistMetrics` as an observable instrument. +- **Tracing** — `backend.WithDistTracerProvider(trace.TracerProvider)` opens spans on `Get`/`Set`/`Remove` + plus per-peer `dist.replicate.*` child spans. Cache key _values_ are never put on spans (they can be PII); + only `cache.key.length`. +- **Metrics** — `backend.WithDistMeterProvider(metric.MeterProvider)` exposes every field on `DistMetrics` as + an observable instrument. -Wire all three to the same `otel.SetTracerProvider` / -`otel.SetMeterProvider` your application uses; the logger inherits via -`slog.Default()` if you want a one-liner. +Wire all three to the same `otel.SetTracerProvider` / `otel.SetMeterProvider` your application uses; the +logger inherits via `slog.Default()` if you want a one-liner. ## Failure mode — split-brain -**Symptom.** Two subsets of the cluster lose connectivity to each -other. Each subset elects local primaries for the keys it owns. -Writes from clients on subset A land on A-side primaries; writes from -B-side clients land on B-side primaries. When the partition heals, the -versions diverge. +**Symptom.** Two subsets of the cluster lose connectivity to each other. Each subset elects local primaries +for the keys it owns. Writes from clients on subset A land on A-side primaries; writes from B-side clients +land on B-side primaries. When the partition heals, the versions diverge. -**Detection.** `dist.heartbeat.failure` rises on both sides during the -partition. After healing, `dist.version.conflicts` increments as -anti-entropy reconciles. +**Detection.** `dist.heartbeat.failure` rises on both sides during the partition. After healing, +`dist.version.conflicts` increments as anti-entropy reconciles. 
-**Resolution.** DistMemory uses last-write-wins by `(version, origin)` -ordering — the higher version wins, ties broken by origin string. This -is automatic. Anti-entropy via `SyncWith` (manual) or -`WithDistMerkleAutoSync` (background) closes the gap. There is no -manual reconciliation step today. +**Resolution.** DistMemory uses last-write-wins by `(version, origin)` ordering — the higher version wins, +ties broken by origin string. This is automatic. Anti-entropy via `SyncWith` (manual) or +`WithDistMerkleAutoSync` (background) closes the gap. There is no manual reconciliation step today. -**Mitigation.** Run an odd number of nodes with quorum writes -(`WithDistWriteConsistency(ConsistencyQuorum)`); a partition that -isolates a minority leaves only the majority side accepting writes -because the minority cannot reach quorum. The minority returns -`ErrQuorumFailed` (`sentinel.ErrQuorumFailed`) on Set. +**Mitigation.** Run an odd number of nodes with quorum writes (`WithDistWriteConsistency(ConsistencyQuorum)`); +a partition that isolates a minority leaves only the majority side accepting writes because the minority +cannot reach quorum. The minority returns `ErrQuorumFailed` (`sentinel.ErrQuorumFailed`) on Set. ## Failure mode — hint queue overflow -**Symptom.** A peer is unreachable for a long time. Every replicated -write to that peer turns into a queued hint. Eventually the queue -hits `WithDistHintMaxPerNode` or `WithDistHintMaxBytes` and new hints -get dropped. +**Symptom.** A peer is unreachable for a long time. Every replicated write to that peer turns into a queued +hint. Eventually the queue hits `WithDistHintMaxPerNode` or `WithDistHintMaxBytes` and new hints get dropped. -**Detection.** `dist.hinted.bytes` (gauge) climbs steadily. -`dist.hinted.global_dropped` increments when caps are exceeded. -`dist.hinted.dropped` (a different metric — replay errors) also rises -if the peer is reachable but rejecting writes (auth, schema mismatch). +**Detection.** `dist.hinted.bytes` (gauge) climbs steadily. `dist.hinted.global_dropped` increments when caps +are exceeded. `dist.hinted.dropped` (a different metric — replay errors) also rises if the peer is reachable +but rejecting writes (auth, schema mismatch). **Resolution.** -1. Restore the unreachable peer; the replay loop drains automatically - (`dist.hinted.replayed` rises). -1. If the peer is permanently gone, remove it from membership - (`DistMemory.RemovePeer(addr)`); queued hints expire on the - `WithDistHintTTL` timer. -1. If hints are dropping faster than they replay, raise - `WithDistHintMaxPerNode` / `WithDistHintMaxBytes` — but understand - that the cap exists to bound process memory under sustained - failure. Raising it without fixing the underlying peer just delays - the bound. - -**Phase B note.** Migration failures during rebalance now also funnel -through the hint queue (Phase B.2). A surge in `dist.hinted.queued` -during a rolling deploy is expected; it should drain as the new node -becomes reachable. +1. Restore the unreachable peer; the replay loop drains automatically (`dist.hinted.replayed` rises). +1. If the peer is permanently gone, remove it from membership (`DistMemory.RemovePeer(addr)`); queued hints + expire on the `WithDistHintTTL` timer. +1. If hints are dropping faster than they replay, raise `WithDistHintMaxPerNode` / `WithDistHintMaxBytes` — + but understand that the cap exists to bound process memory under sustained failure. Raising it without + fixing the underlying peer just delays the bound. 
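+
+If you do raise the caps, the knobs are construction-time options. This is a sketch only: the option names are
+the ones above, but the constructor name and argument types are assumptions; check
+[pkg/backend/dist_memory.go](../pkg/backend/dist_memory.go) for the real signatures:
+
+```go
+// Illustrative values: ~10k hints or 64 MiB queued per downed peer, expiring after 15 minutes.
+be, err := backend.NewDistMemory(
+	backend.WithDistHintMaxPerNode(10_000),
+	backend.WithDistHintMaxBytes(64<<20),
+	backend.WithDistHintTTL(15*time.Minute),
+)
+```
+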
+ +**Phase B note.** Migration failures during rebalance now also funnel through the hint queue (Phase B.2). A +surge in `dist.hinted.queued` during a rolling deploy is expected; it should drain as the new node becomes +reachable. ## Failure mode — rebalance under load -**Symptom.** Adding a node triggers a rebalance scan that migrates -keys to their new primary. Under sustained write load the migration -saturates and `dist.rebalance.throttle` increments — batches queue -behind the configured concurrency cap. +**Symptom.** Adding a node triggers a rebalance scan that migrates keys to their new primary. Under sustained +write load the migration saturates and `dist.rebalance.throttle` increments — batches queue behind the +configured concurrency cap. -**Detection.** `dist.rebalance.last_ns` (gauge — last full scan -duration) climbs. `dist.rebalance.throttle` (counter) increments when -the concurrency limit blocks a batch dispatch. `dist.rebalance.batches` -should still climb steadily. +**Detection.** `dist.rebalance.last_ns` (gauge — last full scan duration) climbs. `dist.rebalance.throttle` +(counter) increments when the concurrency limit blocks a batch dispatch. `dist.rebalance.batches` should still +climb steadily. **Resolution.** -1. Raise `WithDistRebalanceMaxConcurrent` (default 1) if CPU and - network headroom allow. -1. Lower `WithDistRebalanceBatchSize` (default 64) so individual - batches finish faster and concurrency slots cycle more often — - counter-intuitively, smaller batches sometimes throughput-win. -1. Pause writes (drain a subset of clients via your LB) until the - scan finishes. The dist backend has no built-in - write-throttling — that's the application's job. - -**Phase C note.** Drain (`POST /dist/drain`) does *not* trigger an -expedited rebalance today; the next scheduled -`WithDistRebalanceInterval` tick does the work. If you need to force -a faster ownership transfer, call `Stop` after Drain to cancel -in-flight work and let restart-time rebalance handle migration. +1. Raise `WithDistRebalanceMaxConcurrent` (default 1) if CPU and network headroom allow. +1. Lower `WithDistRebalanceBatchSize` (default 64) so individual batches finish faster and concurrency slots + cycle more often — counter-intuitively, smaller batches sometimes throughput-win. +1. Pause writes (drain a subset of clients via your LB) until the scan finishes. The dist backend has no + built-in write-throttling — that's the application's job. + +**Phase C note.** Drain (`POST /dist/drain`) does _not_ trigger an expedited rebalance today; the next +scheduled `WithDistRebalanceInterval` tick does the work. If you need to force a faster ownership transfer, +call `Stop` after Drain to cancel in-flight work and let restart-time rebalance handle migration. ## Failure mode — replica loss -**Symptom.** A replica node dies hard (kernel panic, hardware -failure). Its keys still have other replicas (when `replication >= 2`), -but until membership notices, writes try to fan out to it and -silently retry via the hint queue. +**Symptom.** A replica node dies hard (kernel panic, hardware failure). Its keys still have other replicas +(when `replication >= 2`), but until membership notices, writes try to fan out to it and silently retry via +the hint queue. -**Detection.** `dist.heartbeat.failure` increments steadily for the -lost peer. After `WithDistHeartbeat`'s `deadAfter` window, the peer -is pruned (`dist.nodes.removed` increments) and ring lookups stop -including it. 
+**Detection.** `dist.heartbeat.failure` increments steadily for the lost peer. After `WithDistHeartbeat`'s +`deadAfter` window, the peer is pruned (`dist.nodes.removed` increments) and ring lookups stop including it. **Resolution.** -1. Wait for the heartbeat to detect the dead peer. With default - timing, this is on the order of seconds. -1. Spin up a replacement node with the same membership (or let - gossip discover it). -1. The new node's rebalance scan pulls its assigned keys from - surviving replicas via Merkle anti-entropy. +1. Wait for the heartbeat to detect the dead peer. With default timing, this is on the order of seconds. +1. Spin up a replacement node with the same membership (or let gossip discover it). +1. The new node's rebalance scan pulls its assigned keys from surviving replicas via Merkle anti-entropy. -**Indirect probes.** `WithDistIndirectProbes(k, timeout)` filters -caller-side network blips that would otherwise mark a healthy peer -suspect. `dist.heartbeat.indirect_probe.refuted` rising indicates -indirect probes are saving you from spurious flapping; rising -`dist.heartbeat.indirect_probe.failure` indicates the peer is -genuinely unreachable from multiple vantage points. +**Indirect probes.** `WithDistIndirectProbes(k, timeout)` filters caller-side network blips that would +otherwise mark a healthy peer suspect. `dist.heartbeat.indirect_probe.refuted` rising indicates indirect +probes are saving you from spurious flapping; rising `dist.heartbeat.indirect_probe.failure` indicates the +peer is genuinely unreachable from multiple vantage points. ## Operational tasks @@ -187,28 +158,24 @@ curl 'http://node-A:8080/internal/keys?cursor=1' err := dm.SyncWith(ctx, "node-B") ``` -`WithDistMerkleAutoSync(interval)` runs this on a timer; manual calls -are useful for debugging. +`WithDistMerkleAutoSync(interval)` runs this on a timer; manual calls are useful for debugging. ## Capacity planning notes -- Each shard mutex is independent — write throughput scales with - shard count up to CPU saturation. -- Hint queue memory is approximately `HintedBytes` + 64 bytes of - bookkeeping per queued hint. Cap via `WithDistHintMaxBytes` to - bound total process memory under partition. -- Merkle tree storage scales O(N/chunk) for N keys at - `WithDistMerkleChunkSize` (default 128). For a million keys, the - default chunk gives ~8K leaf hashes per node — negligible. -- Replication factor 3 with quorum reads/writes tolerates 1 failure; - raise to 5 for tolerating 2 failures, at 5× the storage cost. +- Each shard mutex is independent — write throughput scales with shard count up to CPU saturation. +- Hint queue memory is approximately `HintedBytes` + 64 bytes of bookkeeping per queued hint. Cap via + `WithDistHintMaxBytes` to bound total process memory under partition. +- Merkle tree storage scales O(N/chunk) for N keys at `WithDistMerkleChunkSize` (default 128). For a million + keys, the default chunk gives ~8K leaf hashes per node — negligible. +- Replication factor 3 with quorum reads/writes tolerates 1 failure; raise to 5 for tolerating 2 failures, at + 5× the storage cost. 
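+
+Two of these bounds, worked through with assumed sizes (the 32-byte hash width and 1 KiB average value are
+illustrative, not measured):
+
+```text
+Merkle state for 1M keys at the default chunk size of 128:
+  1,000,000 / 128 ≈ 7,813 leaf hashes × 32 B ≈ 250 KB per node
+
+Hint queue for one downed peer holding 10,000 hints:
+  10,000 × (1 KiB value + 64 B bookkeeping) ≈ 10.9 MB; bound it with WithDistHintMaxBytes
+```
+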
## Where things are -| Concern | File | -|---|---| -| Public surface | [pkg/backend/dist_memory.go](../pkg/backend/dist_memory.go) | -| Transport interface | [pkg/backend/dist_transport.go](../pkg/backend/dist_transport.go) | -| HTTP transport | [pkg/backend/dist_http_transport.go](../pkg/backend/dist_http_transport.go) | -| HTTP server | [pkg/backend/dist_http_server.go](../pkg/backend/dist_http_server.go) | -| Membership / ring | [internal/cluster/](../internal/cluster) | +| Concern | File | +| ------------------- | --------------------------------------------------------------------------- | +| Public surface | [pkg/backend/dist_memory.go](../pkg/backend/dist_memory.go) | +| Transport interface | [pkg/backend/dist_transport.go](../pkg/backend/dist_transport.go) | +| HTTP transport | [pkg/backend/dist_http_transport.go](../pkg/backend/dist_http_transport.go) | +| HTTP server | [pkg/backend/dist_http_server.go](../pkg/backend/dist_http_server.go) | +| Membership / ring | [internal/cluster/](../internal/cluster) |