
test(consensus_sim): deterministic multi-node harness with equivocation injection#110

Open
chiliec wants to merge 17 commits into master from test/consensus-harness



@chiliec chiliec commented May 20, 2026

test(consensus_sim): deterministic multi-node harness with equivocation injection

Summary

Adds an opt-in (-DBUILD_CONSENSUS_TESTS=ON) in-process consensus test
harness under tests/consensus_sim/. It boots N graphene::chain::database
instances on a virtual clock with a deterministic message bus, drives them
through a round-robin slot producer, and asserts invariants after every
event. Chain code itself is unchanged. The first scripted fault is
equivocation: same (witness, slot), two validly-signed blocks, asymmetric
delivery — chains_consistent fires.

What lands

  • harness/virtual_clock — monotonic-advance time wrapper. fc::time_point
    is never read; everything goes through this clock.
  • harness/genesis_factory — seed-driven snapshot. Same seed → byte-identical
    witness keys, supply, and chain id. Exposes initiator_name +
    initiator_key (CHAIN_INITIATOR — the only signable identity under
    single-witness genesis with CHAIN_NUM_INITIATORS=0).
  • harness/simulated_node — wraps one chain database. produce_block,
    receive_block, canonical_blocks_from(N) (full signed_block bodies),
    push_pending_transaction, chain_id, head accessors.
  • harness/message_bus — partition / heal / delay_link / drop_next,
    deterministic FIFO + time-ordered delivery.
  • harness/invariants: chains_consistent, lib_monotone_checker,
    no_double_signed_in_canonical. Returns a violation report
    ({invariant_name, block_num, ...}) rather than asserting.
  • harness/scenario_driver — drives the slot loop, fans invariants out
    after each event, exposes a set_slot_producer hook so faults can
    replace the default honest path.
  • harness/failure_log — on violation, writes
    tests/consensus_sim/failures/<scenario>-<seed>.log with config, full
    event log, per-node final state, and the triggering report.
  • harness/fault_injector — thin facade over the bus + slot-producer hook.
    Network faults (partition, heal, delay_link, drop_next) plus
    instruct_equivocation, which works in five steps: a fresh shadow
    simulated_node is caught up to height N-1 via
    canonical_blocks_from replay; a signed no-op
    account_metadata_operation pushed into the shadow's pool forces a
    different transaction_merkle_root; the shadow produces block_b at
    the same (when, witness) as prod's block_a; the bus is partitioned
    {prod} vs {others} with no heal; block_a and block_b are routed
    asymmetrically, so chains_consistent fires at the equivocation slot.
  • harness/tx_factory — builds the no-op account_metadata_operation tx
    used by the shadow.
  • 20 Boost.Test cases across 8 suites covering each component end-to-end.
  • Native-Linux dev shell: share/vizd/docker/Dockerfile-dev (mirrors the
    production builder; mounts the worktree at /workspace).
  • Opt-in coverage: -DWITH_COVERAGE=ON + make consensus_sim_coverage
    emits a filtered gcovr HTML report.
  • Chain CMakeLists changes are limited to propagating coverage flags when
    WITH_COVERAGE=ON; the production build is byte-identical.

Verification

Built and tested in the viz-dev container (Ubuntu noble, aarch64,
-O1 -g -fsanitize=address,undefined).

  • Build (default flags): clean, no new warnings.
  • Tests: 20/20 pass — *** No errors detected, exit 0.
    • seed_deadbeef_fires_chains_consistent: ~4.2 s. Fires
      chains_consistent at block 2 as expected.
    • seed_sweep_one_hundred_all_fire: ~390 s. All 100 seeds produce the
      same chains_consistent violation; none miss.
  • Determinism: test_determinism_replay passes — two runs of the same
    seed produce byte-identical event logs.
  • Sanitizers: pre-existing UBSan version-alignment noise in
    fc::static_variant is unchanged (documented in README, suppressed via
    ASAN_OPTIONS/UBSAN_OPTIONS env vars).
  • Coverage (filtered to chain + protocol + harness):
    • Harness only: 76.6 % lines, 82.0 % functions (406 lines).
      fault_injector 84.8 %; gaps are the
      partition/heal/delay_link/drop_next helpers, which the
      instruct_equivocation scenario doesn't exercise via the public API
      (it manipulates the bus directly inside the slot-producer closure).
    • Full report incl. exercised chain code: 26.8 % lines, 24.9 %
      functions, 7.6 % branches. That number is dominated by the chain's
      specialized evaluators which this PR doesn't target.

Known limitations

  • Slot producer signs every block with the genesis witness. Multi-witness
    key rotation is a follow-up — equivocation works without it because
    CHAIN_NUM_INITIATORS=0 genesis means CHAIN_COMMITTEE_ACCOUNT owns
    every slot.
  • No heal-and-reorg scenario yet. instruct_equivocation partitions the
    bus and never heals; reorg behavior under heal is the next fault to
    script.

Test plan

  • CI: make consensus_sim_tests -j$(nproc) && ./tests/consensus_sim/consensus_sim_tests
  • CI: verify BUILD_CONSENSUS_TESTS=OFF (default) produces the same
    artifacts as master
  • Verify share/vizd/docker/Dockerfile-dev builds locally
  • Spot-check tests/consensus_sim/failures/ is gitignored except .gitkeep

chiliec and others added 17 commits May 20, 2026 23:59
Adds a toolchain-only Docker image (same base + apt packages as the
production builder stage, plus gcovr and gdb) where the worktree is
mounted at /workspace. No source is baked in. Includes a README with
build/usage/ccache/cleanup instructions.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Wire BUILD_CONSENSUS_TESTS option (default OFF) and WITH_COVERAGE into
the top-level CMakeLists. When enabled, build a consensus_sim_harness
static library plus consensus_sim_tests Boost.Test executable with ASan,
UBSan, and -fno-omit-frame-pointer baked in. Harness sources and scenario
files are stubs to be filled in by Tasks 4-15; the placeholder test_main
proves the build wires correctly end-to-end.

Two collateral fixes needed to build inside a container that mounts the
worktree (whose .git gitfile points to a host-only path):
- libraries/utilities/CMakeLists.txt now sanitizes invalid
  GRAPHENE_GIT_REVISION_UNIX_TIMESTAMP/SHA values to 0/empty when
  get_git_unix_timestamp() returns "HEAD-HASH-NOTFOUND".
- The same fix lives in fc/CMakeLists.txt (saved as a recovery patch
  alongside the existing ARM64 portability patches in
  tests/consensus_sim/*.patch).

Boost link variant: target consensus_sim_tests links Boost via the
imported target Boost::unit_test_framework and explicitly does NOT
define BOOST_TEST_DYN_LINK, since the dev image ships static
libboost_unit_test_framework.a — DYN_LINK would expand to the old
unit_test_main(bool(*)(),int,char**) signature absent from that archive.

Verified:
- BUILD_CONSENSUS_TESTS=ON builds + executes "harness_compiles_and_links".
- Default (no flag) configure produces no consensus_sim_tests rule.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
First concrete primitive of the harness. virtual_clock owns the simulated
"now" that every node and the scenario_driver will read, and rejects any
attempt to go backward — guaranteeing deterministic ordering when the
scenario driver later replays a recorded event stream.

API surface:
- ctor takes an explicit fc::time_point_sec start (no implicit "now")
- now() is noexcept
- advance_to(t) is monotonic non-decreasing; throws std::logic_error
  on t < now(); t == now() is a no-op

Test divergence from plan: the plan example wrote
  fc::time_point_sec t0("2026-01-01T00:00:00");
but fc::time_point_sec only has explicit(uint32_t) and
explicit(const time_point&) ctors. Use
fc::time_point_sec::from_iso_string(...) instead, hoisted into a kEpoch
constant shared by all four cases.

Verified inside viz-dev image with BUILD_CONSENSUS_TESTS=ON:
  ./tests/consensus_sim/consensus_sim_tests --run_test=virtual_clock_suite
  → 4 cases pass, no errors detected
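
The API surface listed above can be sketched without the fc library. This is a minimal stand-in, not the harness source: the class name virtual_clock_sketch, the sim_seconds alias (plain seconds in place of fc::time_point_sec), and the rejects_advance helper are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

// ASSUMPTION: plain seconds stand in for fc::time_point_sec so the sketch
// compiles without the fc library.
using sim_seconds = std::uint32_t;

class virtual_clock_sketch {
public:
    // The start time is always explicit; there is no implicit "now".
    explicit virtual_clock_sketch(sim_seconds start) noexcept : now_(start) {}

    sim_seconds now() const noexcept { return now_; }

    // Monotonic non-decreasing: t == now() is a no-op, t < now() throws.
    void advance_to(sim_seconds t) {
        if (t < now_) {
            throw std::logic_error("virtual_clock: cannot move backward");
        }
        now_ = t;
    }

private:
    sim_seconds now_;
};

// Hypothetical helper: true when advance_to(t) is rejected with
// std::logic_error, leaving the clock untouched.
inline bool rejects_advance(virtual_clock_sketch& clock, sim_seconds t) {
    try {
        clock.advance_to(t);
    } catch (const std::logic_error&) {
        return true;
    }
    return false;
}
```

Because the check happens before the assignment, a rejected advance_to leaves now() unchanged, which is what lets a replayed event stream fail loudly instead of silently reordering.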
Produces the parameters that database::open and witness registration
will need in Task 6: initial_supply (CHAIN_INIT_SUPPLY), num_witnesses,
and a vector of (account_name_type, private_key) pairs.

Determinism is the load-bearing property here — every scenario run must
reproduce the same witness identities so a failure log can be replayed
bit-identically. Keys are derived via sha256(seed || idx). Account names
are "witness-NN" zero-padded.

Note: this only generates keys; it does NOT register witnesses on chain.
Witness key override happens in simulated_node post-open (Task 6).

Verified inside viz-dev image:
  ./tests/consensus_sim/consensus_sim_tests --run_test=genesis_factory_suite
  → 3 cases pass (same_seed_same_keys, different_seed_different_keys,
    witness_names_are_distinct).
Total enabled test cases: 8 (4 virtual_clock + 3 genesis_factory + 1 main).
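
The naming scheme and the sha256(seed || idx) preimage can be sketched as below. The hashing step itself is left out because the standard library has no SHA-256. The function names and the byte layout (big-endian 64-bit seed followed by big-endian 32-bit index) are assumptions; the commit message does not state the encoding.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// "witness-NN", zero-padded to two digits as the commit message describes.
inline std::string witness_name(unsigned idx) {
    char buf[32];
    std::snprintf(buf, sizeof(buf), "witness-%02u", idx);
    return std::string(buf);
}

// Bytes that would be fed to SHA-256 for witness `idx`.
// ASSUMPTION: big-endian u64 seed followed by big-endian u32 index; the
// real harness may serialize the pair differently.
inline std::vector<std::uint8_t> key_preimage(std::uint64_t seed, std::uint32_t idx) {
    std::vector<std::uint8_t> out;
    out.reserve(12);
    for (int shift = 56; shift >= 0; shift -= 8) {
        out.push_back(static_cast<std::uint8_t>(seed >> shift));
    }
    for (int shift = 24; shift >= 0; shift -= 8) {
        out.push_back(static_cast<std::uint8_t>(idx >> shift));
    }
    return out;
}
```

Since both functions are pure, the same (seed, idx) pair always yields the same name and preimage, which is the property the same_seed_same_keys case relies on.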
…abase

simulated_node owns a per-instance chainbase database in a temp dir,
exposes produce_block / receive_block with a typed block_outcome enum,
and runs Milestone 1's smoke test (one block, then 100 blocks).

Two chain-API quirks the plan didn't capture, both verified in this commit:

1. init_genesis only runs when database::open is called with
   chainbase::database::read_write (flag = 1). Passing 0 leaves the DB
   uninitialised and the first head-time query throws "unknown key".

2. With CHAIN_NUM_INITIATORS=0 (VIZ's compiled default), init_genesis
   does NOT register a witness for CHAIN_INITIATOR_NAME ("viz"). The
   only witness it creates and schedules in slot 0 is
   CHAIN_COMMITTEE_ACCOUNT, signed by CHAIN_COMMITTEE_PUBLIC_KEY. So
   the genesis_params identity fields were renamed initiator_* →
   genesis_witness_* and now carry the committee account name + the
   private key matching the hard-coded committee public key.

To run the suite under sanitizers, set
ASAN_OPTIONS=new_delete_type_mismatch=0 — there's a pre-existing
new/delete type mismatch in evaluator_registry::register_evaluator
(database.cpp:3669) that ASan flags on init. Filed as a follow-up; the
harness work doesn't touch evaluator registration.

UBSan also reports misaligned-address warnings from
protocol/version.hpp on ARM64. These are pre-existing in VIZ's
serialized struct layout and don't fail the test, but worth noting.
In-process bus carrying std::shared_ptr<void> payloads. Sorts by
scheduled deliver_at on pump, applies the active partition split, and
consumes drop_next markers per (from, to) link. delay_link adds extra
seconds on enqueue (rounded down, since fc::time_point_sec has 1-second resolution).

Suite covers: in-time-order delivery, partition blocks across the
split, heal restores delivery, drop_next skips exactly one message.

The plan's example used fc::time_point_sec("2026-..."), which doesn't
compile (the ctor takes uint32_t); the test uses
fc::time_point_sec::from_iso_string instead, same as Task 4.
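
The bus behavior described above (time-ordered FIFO delivery, partition split applied on pump, per-link drop_next and delay_link) can be sketched with standard containers. This is an illustration under assumptions, not the harness code: node ids are plain ints, payloads are strings rather than std::shared_ptr<void>, and a message that crosses the split is assumed to be discarded rather than held until heal.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

struct envelope {
    std::uint32_t deliver_at;  // whole seconds, like fc::time_point_sec
    int from;
    int to;
    std::string payload;
};

class message_bus_sketch {
public:
    void partition(std::set<int> side_a) { side_a_ = std::move(side_a); partitioned_ = true; }
    void heal() { partitioned_ = false; }
    void delay_link(int from, int to, std::uint32_t extra_seconds) {
        delay_[std::make_pair(from, to)] = extra_seconds;
    }
    void drop_next(int from, int to) { ++drops_[std::make_pair(from, to)]; }

    void send(int from, int to, std::uint32_t now, std::string payload) {
        std::uint32_t extra = 0;
        auto it = delay_.find(std::make_pair(from, to));
        if (it != delay_.end()) extra = it->second;
        envelope e{now + extra, from, to, std::move(payload)};
        queue_.push_back(std::move(e));
    }

    // Delivers everything due at or before `now`, ordered by deliver_at.
    // stable_sort keeps enqueue order among equal timestamps (FIFO).
    std::vector<envelope> pump(std::uint32_t now) {
        std::stable_sort(queue_.begin(), queue_.end(),
            [](const envelope& a, const envelope& b) { return a.deliver_at < b.deliver_at; });
        std::vector<envelope> delivered, pending;
        for (auto& e : queue_) {
            if (e.deliver_at > now) { pending.push_back(std::move(e)); continue; }
            // ASSUMPTION: a message crossing the active split is discarded.
            if (partitioned_ && side_a_.count(e.from) != side_a_.count(e.to)) continue;
            auto d = drops_.find(std::make_pair(e.from, e.to));
            if (d != drops_.end() && d->second > 0) { --d->second; continue; }
            delivered.push_back(std::move(e));
        }
        queue_ = std::move(pending);
        return delivered;
    }

private:
    std::vector<envelope> queue_;
    std::map<std::pair<int, int>, std::uint32_t> delay_;
    std::map<std::pair<int, int>, unsigned> drops_;
    std::set<int> side_a_;
    bool partitioned_ = false;
};
```

Ordered std::map and std::set are used on purpose: iteration order never depends on hashing or pointer values, so two runs with the same inputs deliver in the same order.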
invariants.hpp/.cpp expose four cross-node consensus checks returning
optional<violation_report>:

- chains_consistent: heads at the same num must have the same id
  (Milestone 2 coarse-graining; finer shared-prefix walk is deferred
  until simulated_node exposes a block enumerator).
- lib_monotone_checker: LIB never decreases per node, stateful via
  a label -> last-seen map.
- supply_conserved: stub; Milestone 2 floor check will land when a
  scenario actually consumes it.
- no_double_signed_in_canonical: stub; filled in by Task 13 once
  simulated_node grows the block-enumeration helper for the
  equivocation scenario.

std::optional and the structured bindings already in test_genesis_factory
need C++17, so the harness library + scenarios target now compile at
CXX_STANDARD 17. Chain code itself stays C++14 — this is scoped to the
test targets via set_target_properties.
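
The two non-stub checks can be sketched as follows, using std::optional as the commit describes (so C++17 is required). The node_view struct, the string block ids, and the detail field are assumptions; the real invariants read from simulated_node and the chain's block id type.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <optional>
#include <string>
#include <vector>

// ASSUMPTION: a flattened view of one node; ids are plain strings here.
struct node_view {
    std::string label;
    std::uint32_t head_num;
    std::string head_id;
    std::uint32_t lib_num;
};

struct violation_report {
    std::string invariant_name;
    std::uint32_t block_num;
    std::string detail;
};

// Coarse check: heads at the same block number must share the same id.
inline std::optional<violation_report> chains_consistent(const std::vector<node_view>& nodes) {
    std::map<std::uint32_t, std::string> id_at_height;
    for (const auto& n : nodes) {
        auto inserted = id_at_height.emplace(n.head_num, n.head_id);
        if (!inserted.second && inserted.first->second != n.head_id) {
            return violation_report{"chains_consistent", n.head_num,
                                    n.label + " disagrees on head id"};
        }
    }
    return std::nullopt;
}

// Stateful check: remembers the last LIB seen per node label and reports
// any decrease.
class lib_monotone_checker {
public:
    std::optional<violation_report> operator()(const std::vector<node_view>& nodes) {
        for (const auto& n : nodes) {
            auto it = last_seen_.find(n.label);
            if (it != last_seen_.end() && n.lib_num < it->second) {
                return violation_report{"lib_monotone", n.lib_num, n.label + " LIB decreased"};
            }
            last_seen_[n.label] = n.lib_num;
        }
        return std::nullopt;
    }

private:
    std::map<std::string, std::uint32_t> last_seen_;
};
```

Returning a report instead of asserting lets the caller decide what to do with a violation, for example writing a failure log before failing the test.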
scenario_driver owns the clock, the message bus, the per-witness
simulated_node set, the genesis_params, and the registered invariants.
run() steps slot-by-slot up to cfg.max_slots: advance clock, call the
slot producer, pump the bus, deliver to peers, and run each invariant
against the node set. First violation wins — driver stops and exposes
it via violation() alongside the event log.

The slot producer is swappable (set_slot_producer); the default
round-robins through params.witness_keys. fault_injector will override
this in a later task to inject equivocation.

Two adaptations from the plan:

- scenario_config::start_time defaults to fc::time_point_sec() (=0),
  not fc::time_point_sec("2026-..."). The ctor is explicit-uint32_t —
  same compile bug the test code hit in earlier tasks. Scenarios set
  an explicit time.
- The default round-robin producer assumes per-index witness keys are
  registered on chain. Milestone 1 genesis only registers
  CHAIN_COMMITTEE_ACCOUNT, so the default producer can't drive blocks
  yet. Documented in the implementation comment; Milestone 3 will
  either rotate keys via witness_update or seed a multi-witness
  genesis. No test exercises run() yet — Task 10 is the first.
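
The run() loop and the swappable producer can be sketched in isolation. Everything here is a simplified stand-in: sim_state, heads_equal, and the 3-second kSlotSeconds placeholder are assumptions, and the bus pump and peer delivery steps are folded into the producer callback.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <optional>
#include <string>
#include <utility>
#include <vector>

struct violation_report {
    std::string invariant_name;
    std::uint32_t slot;
};

// ASSUMPTION: toy shared state, one head height per node.
struct sim_state {
    std::uint32_t now = 0;
    std::vector<std::uint32_t> heads;
};

// Toy invariant: every node must sit at the same head height after a slot.
inline std::optional<violation_report> heads_equal(const sim_state& s, std::uint32_t slot) {
    for (std::uint32_t h : s.heads) {
        if (h != s.heads.front()) return violation_report{"heads_equal", slot};
    }
    return std::nullopt;
}

class scenario_driver_sketch {
public:
    using slot_producer = std::function<void(sim_state&, std::uint32_t)>;
    using invariant =
        std::function<std::optional<violation_report>(const sim_state&, std::uint32_t)>;

    scenario_driver_sketch(std::size_t node_count, std::uint32_t max_slots)
        : max_slots_(max_slots) {
        state_.heads.assign(node_count, 0);
        // Default honest path: the new block reaches every node.
        producer_ = [](sim_state& s, std::uint32_t) {
            for (auto& h : s.heads) ++h;
        };
    }

    void set_slot_producer(slot_producer p) { producer_ = std::move(p); }
    void add_invariant(invariant inv) { invariants_.push_back(std::move(inv)); }

    void run() {
        for (std::uint32_t slot = 1; slot <= max_slots_ && !violation_; ++slot) {
            state_.now += kSlotSeconds;            // advance the clock
            producer_(state_, slot);               // produce and deliver
            event_log_.push_back("slot " + std::to_string(slot));
            for (const auto& inv : invariants_) {  // first violation wins
                if (auto v = inv(state_, slot)) {
                    violation_ = std::move(v);
                    break;
                }
            }
        }
    }

    const std::optional<violation_report>& violation() const { return violation_; }
    const std::vector<std::string>& event_log() const { return event_log_; }

private:
    static constexpr std::uint32_t kSlotSeconds = 3;  // placeholder interval
    std::uint32_t max_slots_;
    sim_state state_;
    slot_producer producer_;
    std::vector<invariant> invariants_;
    std::vector<std::string> event_log_;
    std::optional<violation_report> violation_;
};
```

A fault is injected by calling set_slot_producer with a callback that deviates from the honest path at a chosen slot; the driver stops at the first slot where an invariant reports.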
The plan's default round-robin in scenario_driver indexed into
params_.witness_keys, assuming each per-witness identity was
registered on chain. With CHAIN_NUM_INITIATORS=0 only the committee
account exists, so the harness can't actually drive seven distinct
witness signatures at Milestone 2.

Adapted: the default producer still round-robins which node generates
the block (so message-flow + bus + convergence get exercised), but
every block is signed by params_.genesis_witness_*. Multi-witness
rotation is deferred to Milestone 3, when register_witness_keys_
gains a witness_update path.

Suite covers: 7 nodes, 100 slots, chains_consistent + lib_monotone
invariants checked every slot, all nodes converge to the same head.
Two independent driver runs with seed=0x12345 produce byte-identical
event logs across 50 slots × 7 nodes. This is the canary for
non-determinism leaks the foundation plan calls out — if it starts
failing, suspect (in order) an unordered container with a default
hasher in chain code, a stray fc::time_point::now() that affects
state, or pointer-address ordering in the harness.

The plan's second case (different_seed_diverges) is dropped from
Milestone 2: with the current producer signing every block as the
genesis witness, `seed` only feeds the unused per-index witness_keys,
so different seeds produce identical logs. It comes back in
Milestone 3 once register_witness_keys_ rotates per-witness keys via
witness_update.
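
The replay check itself is small: run twice with one seed and compare the logs entry by entry. The sketch below uses a toy scenario in place of a driver run; run_toy_scenario and first_divergence are hypothetical names, and the log format is invented for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <random>
#include <string>
#include <vector>

// Toy stand-in for one driver run. std::mt19937_64 has a fully specified
// output sequence for a given seed, so the log depends only on the inputs.
inline std::vector<std::string> run_toy_scenario(std::uint64_t seed, unsigned slots,
                                                 unsigned nodes) {
    std::mt19937_64 rng(seed);
    std::vector<std::string> log;
    log.reserve(slots);
    for (unsigned slot = 1; slot <= slots; ++slot) {
        const unsigned producer = static_cast<unsigned>(rng() % nodes);
        log.push_back("slot=" + std::to_string(slot) +
                      " producer=" + std::to_string(producer));
    }
    return log;
}

// Index of the first entry where two logs differ, or nullopt if identical.
// A length mismatch counts as a divergence at the shorter log's end.
inline std::optional<std::size_t> first_divergence(const std::vector<std::string>& a,
                                                   const std::vector<std::string>& b) {
    const std::size_t common = a.size() < b.size() ? a.size() : b.size();
    for (std::size_t i = 0; i < common; ++i) {
        if (a[i] != b[i]) return i;
    }
    if (a.size() != b.size()) return common;
    return std::nullopt;
}
```

Reporting the first diverging index, rather than a plain pass or fail, points directly at the event where a non-determinism leak first shows up.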
write_failure_log dumps seed, config, full event log, final per-node
head/lib, and the triggering violation into
<cwd>/failures/<seed>-<scenario>.log. Scenarios call it themselves
before BOOST_FAIL so the bad run is reproducible from the seed.

Wired into the 7-node smoke scenario; full test binary still passes
clean with no failure log written.

Milestone 2 ships here — multi-node deterministic harness with
invariants and failure capture.
Plumbs the equivocation-detection path end to end without yet driving
it from a scenario:

- simulated_node: expose recent_blocks(count) walking head backward via
  fetch_block_by_id; returns block_num + id + witness + timestamp so
  invariants can key on (witness, slot).
- invariants: replace the no_double_signed_in_canonical stub with the
  real check — for each node, build a map from (witness, slot) to id
  over the last 200 canonical blocks; report a violation on collision.
- fault_injector: new harness facade exposing partition/heal/delay_link/
  drop_next as forwarders to message_bus, plus instruct_equivocation()
  which overrides the slot producer to fire once for a chosen witness.
  Honest path matches the default driver behavior (every block signed
  by genesis witness, per Milestone 2's single-witness genesis).

The equivocation slot ships block_a only and flags the shadow-chain
reconstruction gap inline — full second-block production requires
returning signed_block bodies from recent_blocks (or in-place merkle
mutation + resigning), both deferred to Task 14 when a concrete
failure mode forces a choice.

All 16 existing tests still pass under ASan/UBSan.
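
The (witness, slot) collision check can be sketched for a single node as below. The block_summary fields mirror what recent_blocks is described as returning, but the struct, the string ids, and the name find_double_sign are assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// ASSUMPTION: ids and witness names are plain strings in this sketch.
struct block_summary {
    std::uint32_t block_num;
    std::string id;
    std::string witness;
    std::uint32_t slot_time;
};

struct double_sign_report {
    std::string witness;
    std::uint32_t slot_time;
    std::string first_id;
    std::string second_id;
};

// `recent` is ordered head-first, as if walking back from the head block.
// Only the first `window` entries are inspected (200 in the commit).
inline std::optional<double_sign_report> find_double_sign(
        const std::vector<block_summary>& recent, std::size_t window = 200) {
    std::map<std::pair<std::string, std::uint32_t>, std::string> seen;
    const std::size_t limit = recent.size() < window ? recent.size() : window;
    for (std::size_t i = 0; i < limit; ++i) {
        const block_summary& b = recent[i];
        auto result = seen.emplace(std::make_pair(b.witness, b.slot_time), b.id);
        // Same (witness, slot) already recorded with a different id: collision.
        if (!result.second && result.first->second != b.id) {
            return double_sign_report{b.witness, b.slot_time, result.first->second, b.id};
        }
    }
    return std::nullopt;
}
```

Running the check once per node keeps it a canonical-chain property: two nodes holding different blocks for one slot is a chains_consistent matter, not a double sign inside one chain.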
Adds equivocation_suite/seed_deadbeef_no_canonical_double_sign:
- 7 witnesses, 30 slots, seed 0xDEADBEEF.
- chains_consistent + no_double_signed_in_canonical + lib_monotone.
- fi.instruct_equivocation(params.genesis_witness_name) so the
  override actually fires; the plan's per-index witness_keys[i] target
  is parked until multi-witness key rotation lands.

Passes trivially today: Task 13's instruct_equivocation ships block_a
only and flags the shadow-chain reconstruction gap inline. Closing
that gap (sibling-state shadow or direct-mutation + resign) is a
focused follow-up captured in the inline comment.

Result: 3.0s, no invariant violations, exercises the fault_injector
facade end to end.
Adds equivocation_suite/seed_sweep_one_hundred: loops seeds 0..99,
varying genesis_params for each, runs the equivocation override
against the genesis witness, asserts no_double_signed_in_canonical.

Slot count dropped to 10 (from 30) for the sweep — each scenario
spins up 7 chainbase databases (~340ms each under ASan), so the
setup floor dominates. Bumping back when shadow-block construction
actually produces equivocations worth running long for.

Result: 100/100 pass, 10m31s wall time. No flakes — expected, since
all 100 runs are functionally identical at the chain level until
the shadow gap closes. The plumbing is exercised end to end.
Adds an opt-in -DWITH_COVERAGE flag that wires --coverage compile/link
flags into graphene_protocol, graphene_chain, and the harness target,
plus a consensus_sim_coverage make target that drives gcovr filtered to
those three trees. gcovr is looked up at configure time; missing tool
demotes to a configure-time warning, not an error.

The README covers build/run, the ASan/UBSan workaround needed to get
past pre-existing chain findings (evaluator_registry base-pointer delete,
version/asset alignment), seed-driven determinism, failure-log layout,
and the two M3 limitations still open: block_b production for real
equivocation, and multi-witness key rotation.

No source/runtime change; this is build-system and documentation only.
instruct_equivocation now produces two distinct, validly-signed blocks for
the same (witness, slot). A fresh shadow simulated_node is caught up to
canonical state at height N-1 via prod->canonical_blocks_from(1) replay,
a no-op account_metadata_operation tx (signed by the initiator key) is
pushed into the shadow's pending pool to force a different
transaction_merkle_root, and the shadow then produces block_b at the same
when and witness as prod's block_a. The bus is partitioned {prod} vs
{everyone else} with no heal, and block_a + block_b are routed
asymmetrically so prod keeps block_a while side B accepts block_b.
chains_consistent fires at the equivocation slot.

Adds:
- simulated_node::canonical_blocks_from(N) returning full signed_block bodies
- simulated_node::push_pending_transaction / chain_id accessors
- genesis_params::initiator_name + initiator_key (the only signable
  identity under CHAIN_NUM_INITIATORS=0 genesis)
- tx_factory::make_noop_metadata_tx builder
- test_equivocation rewritten to assert the violation rather than its
  absence; 100-seed sweep verifies the mechanism is robust across seeds
- canonical_blocks_from + initiator key are covered by unit tests

The equivocation defers to the first matching slot at height >= 2 because
the shadow's no-op tx needs a non-default reference_block.
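
The mechanism can be modeled in a few lines: two blocks share (witness, when) but differ in merkle root, node 0 keeps block_a, and every other node accepts block_b. This is a toy model under assumptions, not the harness: the block id is just the concatenated fields (where the chain would hash the signed header), there is no signing, and the witness name and merkle root strings are placeholders.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

struct toy_block {
    std::uint32_t num;
    std::uint32_t when;
    std::string witness;
    std::string merkle_root;
};

// ASSUMPTION: the id is the concatenated fields, so any field change
// yields a different id without needing a real hash.
inline std::string toy_block_id(const toy_block& b) {
    return std::to_string(b.num) + "|" + std::to_string(b.when) + "|" +
           b.witness + "|" + b.merkle_root;
}

struct toy_node {
    std::uint32_t head_num = 0;
    std::string head_id;
    void accept(const toy_block& b) {
        head_num = b.num;
        head_id = toy_block_id(b);
    }
};

// True when no two nodes at the same height hold different head ids.
inline bool heads_consistent(const std::vector<toy_node>& nodes) {
    for (std::size_t i = 0; i < nodes.size(); ++i) {
        for (std::size_t j = i + 1; j < nodes.size(); ++j) {
            if (nodes[i].head_num == nodes[j].head_num &&
                nodes[i].head_id != nodes[j].head_id) {
                return false;
            }
        }
    }
    return true;
}

// Returns the height where the consistency check first fires, or 0 if it
// never does. equivocation_height == 0 means an honest run.
inline std::uint32_t run_equivocation(std::size_t node_count, std::uint32_t max_height,
                                      std::uint32_t equivocation_height) {
    std::vector<toy_node> nodes(node_count);
    for (std::uint32_t num = 1; num <= max_height; ++num) {
        const toy_block block_a{num, num * 3, "genesis-witness", "root-empty"};
        toy_block block_b = block_a;            // same (witness, when)
        block_b.merkle_root = "root-with-noop-tx";  // the no-op tx changes the root
        const bool equivocate = (num == equivocation_height);
        for (std::size_t i = 0; i < nodes.size(); ++i) {
            // Node 0 plays prod and keeps block_a; the other side gets block_b.
            nodes[i].accept((equivocate && i != 0) ? block_b : block_a);
        }
        if (!heads_consistent(nodes)) return num;
    }
    return 0;
}
```

In this model, run_equivocation(7, 10, 2) reports height 2, matching the PR's note that the real scenario fires at the first eligible slot at height >= 2.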
@chiliec chiliec force-pushed the test/consensus-harness branch from 01c8554 to 0c3647f Compare May 20, 2026 16:06