Merged
20 changes: 20 additions & 0 deletions CHANGELOG.md
@@ -8,6 +8,26 @@ All notable changes to HyperCache are recorded here. The format follows

### Added

- **Chaos hooks for resilience testing (Phase 7).** New
[`backend.WithDistChaos(*Chaos)`](pkg/backend/dist_chaos.go) option transparently wraps the dist transport
with configurable fault injection — drop rate and latency injection, both gated by per-call probability
rolls drawn from a crypto-seeded math/rand source. The wrapper is automatic for both the explicit
`WithDistTransport` path and the auto-wired HTTP transport, so chaos covers every dist call uniformly.
Disabled by default (zero overhead) and opt-in by design — the doc comment is explicit that this is a
test-only surface with no production safety net. Atomic mutators (`SetDropRate`, `SetLatency`) let tests
enable chaos mid-run, drive the cluster, then heal — exactly the shape needed to surface the rebalance
flake we caught in May 2026 deterministically. Two new OTel metrics:
`dist.chaos.drops` (calls dropped) and `dist.chaos.latencies` (calls with latency injected). Eight unit
tests in [`pkg/backend/dist_chaos_test.go`](pkg/backend/dist_chaos_test.go) cover every branch
(DropRate=1 always drops, DropRate=0 never drops, latency injection fires + delays the call, nil-Chaos
passes through unchanged, the disabled-but-installed wrapper is a pass-through, concurrent calls are
race-free under -race, boundary clamping for out-of-range probabilities, nil-receiver safety on the
Metrics() snapshot path). Two integration tests in
[`tests/integration/dist_chaos_test.go`](tests/integration/dist_chaos_test.go) drive the canonical
resilience scenario — 80% drops force the hint queue to absorb replica fan-out failures; disabling chaos
lets the replay loop drain the queue. New "Chaos hooks (resilience testing)" section in
[`docs/operations.md`](docs/operations.md) with the usage shape and the "what this catches that CI flake
hunting won't" rationale.
- **Batch operations on the client SDK.** `BatchSet`, `BatchGet`, `BatchDelete` close the v1 SDK gap PR3's
stopping conditions called out — the raw OIDC example demonstrated batch round-trips but the SDK had no
equivalent. Each method takes a slice and returns per-item results so a single HTTP call can carry
10 changes: 9 additions & 1 deletion Makefile
@@ -81,7 +81,7 @@ test-cluster: stop-dev-cluster
exit $$rc

# ci aggregates the gates required before declaring a task done (see AGENTS.md).
ci: lint typecheck test-race sec build
ci: lint typecheck test-race pre-commit sec build
@echo "All CI gates passed."

# bench runs the benchmark tests in the benchmark subpackage of the tests package.
@@ -215,6 +215,14 @@ docs-publish: docs-build
docs-serve: docs-build
PYENV_VERSION=mkdocs mkdocs serve

pre-commit:
pre-commit run -a trailing-whitespace && \
pre-commit run -a end-of-file-fixer && \
pre-commit run -a markdownlint && \
pre-commit run -a yamllint && \
pre-commit run -a cspell

# check_command_exists is a helper function that checks if a command exists.
define check_command_exists
@which $(1) > /dev/null 2>&1 || (echo "$(1) command not found" && exit 1)
6 changes: 3 additions & 3 deletions __examples/distributed-oidc-client/main.go
@@ -13,14 +13,14 @@ package main
import (
"context"
"encoding/json"
"errors"
"fmt"
"net/http"
"net/url"
"os"
"strings"
"time"

"github.com/hyp3rd/ewrap"
"golang.org/x/oauth2/clientcredentials"

"github.com/hyp3rd/hypercache/pkg/client"
@@ -40,8 +40,8 @@ const (
// errors.Is against it; in the example, run() surfaces the
// wrapped error to stderr.
var (
errEnvMissing = errors.New("missing required env var")
errDiscoveryNoEndpoint = errors.New("OIDC discovery doc missing token_endpoint")
errEnvMissing = ewrap.New("missing required env var")
errDiscoveryNoEndpoint = ewrap.New("OIDC discovery doc missing token_endpoint")
)

func main() {
2 changes: 2 additions & 0 deletions cspell.config.yaml
@@ -76,6 +76,7 @@ words:
- contextcheck
- cpuprofile
- cret
- cryptorand
- cyclop
- daixiang
- Decr
@@ -112,6 +113,7 @@ words:
- Fprintln
- freqs
- frontmatter
- funcorder
- funlen
- geomean
- gerr
73 changes: 73 additions & 0 deletions docs/operations.md
@@ -160,6 +160,79 @@ err := dm.SyncWith(ctx, "node-B")

`WithDistMerkleAutoSync(interval)` runs this on a timer; manual calls are useful for debugging.

## Chaos hooks (resilience testing)

The dist backend exposes fault-injection hooks via the `Chaos` type
and `WithDistChaos` option. **Tests only** — there's no production
safety net; pointing a live cluster at a Chaos with `DropRate=1.0`
will drop every transport call.

What's covered today:

- **Drop**: with configurable probability, return `ErrChaosDrop`
instead of forwarding the call. Useful for "what if the peer's
down for N seconds?" scenarios — the hint queue should absorb
the drops and replay them once chaos is disabled.
- **Latency**: with configurable probability, sleep before
forwarding. Useful for "what if the peer's slow?" — exercises
the timeout + retry surface.

What's deliberately out of scope for v1:

- Per-peer partition simulation (block a specific peer ID).
Tracked as a follow-up; the current hooks treat every peer
uniformly. Workaround: spin a third node, configure chaos on
it only, and observe what happens when "node N misbehaves".

### Usage

```go
import (
	"fmt"
	"time"

	"github.com/hyp3rd/hypercache/pkg/backend"
)

chaos := backend.NewChaos()

bm, _ := backend.NewDistMemory(ctx,
backend.WithDistNode("A", addr),
backend.WithDistChaos(chaos),
// ... other options ...
)

// Mid-test: enable 50% drops + 10ms latency on every call.
chaos.SetDropRate(0.5)
chaos.SetLatency(10*time.Millisecond, 1.0)

// ... drive the cluster, observe behavior ...

// Heal: turn faults off.
chaos.SetDropRate(0)

// Verify the chaos counters fired AND the cluster recovered.
metrics := bm.(*backend.DistMemory).Metrics()
fmt.Println("drops:", metrics.ChaosDrops)
fmt.Println("hint replays after heal:", metrics.HintedReplayed)
```

### Metrics

- `dist.chaos.drops` (counter) — calls dropped since construction.
- `dist.chaos.latencies` (counter) — calls that had latency injected.

Both stay at zero when chaos isn't configured (the wrapper isn't
installed at all — `WithDistChaos` is opt-in).

### What this gives you

The rebalance flake we caught manually in May 2026
(`TestDistRebalanceThrottle` failing under `-shuffle` due to a
transient quorum miss) is exactly the class of bug the chaos hooks
exist to surface. Wire chaos at 5-10% drop rate against the test
suite, run under `-race`, and the timing-sensitive paths surface
deterministically rather than as 1-in-50 CI flakes.

See [`pkg/backend/dist_chaos_test.go`](../pkg/backend/dist_chaos_test.go)
and [`tests/integration/dist_chaos_test.go`](../tests/integration/dist_chaos_test.go)
for runnable examples.

## Capacity planning notes

- Each shard mutex is independent — write throughput scales with shard count up to CPU saturation.