test(consensus_sim): deterministic multi-node harness with equivocation injection by chiliec · Pull Request #110 · VIZ-Blockchain/viz-cpp-node

chiliec · 2026-05-20T15:10:51Z

test(consensus_sim): deterministic multi-node harness with equivocation injection

Summary

Adds an opt-in (-DBUILD_CONSENSUS_TESTS=ON) in-process consensus test
harness under tests/consensus_sim/. It boots N graphene::chain::database
instances on a virtual clock with a deterministic message bus, drives them
through a round-robin slot producer, and asserts invariants after every
event. Chain code itself is unchanged. The first scripted fault is
equivocation: same (witness, slot), two validly-signed blocks, asymmetric
delivery — chains_consistent fires.

What lands

- harness/virtual_clock: monotonic-advance time wrapper. Wall-clock time
  (fc::time_point) is never read; everything goes through this clock.
- harness/genesis_factory: seed-driven snapshot. Same seed → byte-identical
  witness keys, supply, and chain id. Exposes initiator_name +
  initiator_key (CHAIN_INITIATOR, the only signable identity under
  single-witness genesis with CHAIN_NUM_INITIATORS=0).
- harness/simulated_node: wraps one chain database. produce_block,
  receive_block, canonical_blocks_from(N) (full signed_block bodies),
  push_pending_transaction, chain_id, head accessors.
- harness/message_bus: partition / heal / delay_link / drop_next,
  deterministic FIFO + time-ordered delivery.
- harness/invariants: chains_consistent, lib_monotone_checker,
  no_double_signed_in_canonical. Returns a violation report
  ({invariant_name, block_num, ...}) rather than asserting.
- harness/scenario_driver: drives the slot loop, fans invariants out
  after each event, exposes a set_slot_producer hook so faults can
  replace the default honest path.
- harness/failure_log: on violation, writes
  tests/consensus_sim/failures/<scenario>-<seed>.log with config, full
  event log, per-node final state, and the triggering report.
- harness/fault_injector: thin facade over the bus + slot-producer hook.
  Network faults (partition, heal, delay_link, drop_next) plus
  instruct_equivocation. The latter is the centerpiece of the harness:
  - a fresh shadow simulated_node is caught up to height N-1 via
    canonical_blocks_from replay;
  - a signed no-op account_metadata_operation pushed into the shadow's
    pool forces a different transaction_merkle_root;
  - the shadow produces block_b at the same (when, witness) as prod's
    block_a;
  - the bus is partitioned {prod} vs {others} with no heal, and
    block_a/block_b are routed asymmetrically.
  chains_consistent fires at the equivocation slot.
- harness/tx_factory: builds the no-op account_metadata_operation tx
  used by the shadow.
- 20 Boost.Test cases across 8 suites covering each component end-to-end.
- Native-Linux dev shell: share/vizd/docker/Dockerfile-dev (mirrors the
  production builder; mounts the worktree at /workspace).
- Opt-in coverage: -DWITH_COVERAGE=ON + make consensus_sim_coverage
  emits a filtered gcovr HTML report.
- Chain CMakeLists changes are limited to propagating coverage flags when
  WITH_COVERAGE=ON; the production build is byte-identical.

Verification

Built and tested in the viz-dev container (Ubuntu noble, aarch64,
-O1 -g -fsanitize=address,undefined).

- Build (default flags): clean, no new warnings.
- Tests: 20/20 pass (*** No errors detected, exit 0).
  - seed_deadbeef_fires_chains_consistent: ~4.2 s. Fires
    chains_consistent at block 2 as expected.
  - seed_sweep_one_hundred_all_fire: ~390 s. All 100 seeds produce the
    same chains_consistent violation; none miss.
- Determinism: test_determinism_replay passes; two runs of the same
  seed produce byte-identical event logs.
- Sanitizers: pre-existing UBSan version-alignment noise in
  fc::static_variant is unchanged (documented in README, suppressed via
  ASAN_OPTIONS/UBSAN_OPTIONS env vars).
- Coverage (filtered to chain + protocol + harness):
  - Harness only: 76.6 % lines, 82.0 % functions (406 lines).
    fault_injector 84.8 %; gaps are the
    partition/heal/delay_link/drop_next helpers, which the
    instruct_equivocation scenario doesn't exercise via the public API
    (it manipulates the bus directly inside the slot-producer closure).
  - Full report incl. exercised chain code: 26.8 % lines, 24.9 %
    functions, 7.6 % branches. That number is dominated by the chain's
    specialized evaluators, which this PR doesn't target.

Known limitations

- Slot producer signs every block with the genesis witness. Multi-witness
  key rotation is a follow-up; equivocation works without it because
  CHAIN_NUM_INITIATORS=0 genesis means CHAIN_COMMITTEE_ACCOUNT owns
  every slot.
- No heal-and-reorg scenario yet. instruct_equivocation partitions the
  bus and never heals; reorg behavior under heal is the next fault to
  script.

Test plan

- CI: make consensus_sim_tests -j$(nproc) && ./tests/consensus_sim/consensus_sim_tests
- CI: verify BUILD_CONSENSUS_TESTS=OFF (default) produces the same
  artifacts as master
- Verify share/vizd/docker/Dockerfile-dev builds locally
- Spot-check tests/consensus_sim/failures/ is gitignored except .gitkeep

Adds a toolchain-only Docker image (same base + apt packages as the production builder stage, plus gcovr and gdb) where the worktree is mounted at /workspace. No source is baked in. Includes a README with build/usage/ccache/cleanup instructions.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Wire BUILD_CONSENSUS_TESTS option (default OFF) and WITH_COVERAGE into the top-level CMakeLists. When enabled, build a consensus_sim_harness static library plus consensus_sim_tests Boost.Test executable with ASan, UBSan, and -fno-omit-frame-pointer baked in. Harness sources and scenario files are stubs to be filled in by Tasks 4-15; the placeholder test_main proves the build wires correctly end-to-end.

Two collateral fixes needed to build inside a container that mounts the worktree (whose .git gitfile points to a host-only path):
- libraries/utilities/CMakeLists.txt now sanitizes invalid GRAPHENE_GIT_REVISION_UNIX_TIMESTAMP/SHA values to 0/empty when get_git_unix_timestamp() returns "HEAD-HASH-NOTFOUND".
- The same fix lives in fc/CMakeLists.txt (saved as a recovery patch alongside the existing ARM64 portability patches in tests/consensus_sim/*.patch).

Boost link variant: target consensus_sim_tests links Boost via the imported target Boost::unit_test_framework and explicitly does NOT define BOOST_TEST_DYN_LINK, since the dev image ships static libboost_unit_test_framework.a; DYN_LINK would expand to the old unit_test_main(bool(*)(),int,char**) signature absent from that archive.

Verified:
- BUILD_CONSENSUS_TESTS=ON builds + executes "harness_compiles_and_links".
- Default (no flag) configure produces no consensus_sim_tests rule.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

First concrete primitive of the harness. virtual_clock owns the simulated "now" that every node and the scenario_driver will read, and rejects any attempt to go backward, guaranteeing deterministic ordering when the scenario driver later replays a recorded event stream.

API surface:
- ctor takes an explicit fc::time_point_sec start (no implicit "now")
- now() is noexcept
- advance_to(t) is monotonic non-decreasing; throws std::logic_error on t < now(); t == now() is a no-op

Test divergence from plan: the plan example wrote fc::time_point_sec t0("2026-01-01T00:00:00"); but fc::time_point_sec only has explicit(uint32_t) and explicit(const time_point&) ctors. Use fc::time_point_sec::from_iso_string(...) instead, hoisted into a kEpoch constant shared by all four cases.

Verified inside viz-dev image with BUILD_CONSENSUS_TESTS=ON: ./tests/consensus_sim/consensus_sim_tests --run_test=virtual_clock_suite → 4 cases pass, no errors detected
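The contract described above can be sketched in a few lines. This is a minimal illustration, not the harness source: `sim_seconds` is a hypothetical stand-in for fc::time_point_sec so the example compiles without the fc library, and the usage function name is invented for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Hypothetical stand-in for fc::time_point_sec: whole seconds since the epoch.
using sim_seconds = std::uint32_t;

class virtual_clock {
public:
    // The start time is always explicit; there is no implicit "now".
    explicit virtual_clock(sim_seconds start) noexcept : now_(start) {}

    sim_seconds now() const noexcept { return now_; }

    // Monotonic non-decreasing: t < now() throws, t == now() is a no-op.
    void advance_to(sim_seconds t) {
        if (t < now_) {
            throw std::logic_error("virtual_clock: time cannot go backward");
        }
        now_ = t;
    }

private:
    sim_seconds now_;
};

// Usage: a forward move, a no-op move, and a rejected backward move.
inline bool backward_move_is_rejected() {
    virtual_clock clock(1000);
    clock.advance_to(1003);
    clock.advance_to(1003);          // same instant: no-op
    assert(clock.now() == 1003);
    try {
        clock.advance_to(1002);      // earlier instant: must throw
    } catch (const std::logic_error&) {
        return clock.now() == 1003;  // state is unchanged after the throw
    }
    return false;
}
```

Throwing rather than clamping means a scenario that schedules events out of order fails loudly instead of silently reordering them.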

Produces the parameters that database::open and witness registration will need in Task 6: initial_supply (CHAIN_INIT_SUPPLY), num_witnesses, and a vector of (account_name_type, private_key) pairs. Determinism is the load-bearing property here: every scenario run must reproduce the same witness identities so a failure log can be replayed bit-identically.

Keys are derived via sha256(seed || idx). Account names are "witness-NN" zero-padded.

Note: this only generates keys; it does NOT register witnesses on chain. Witness key override happens in simulated_node post-open (Task 6).

Verified inside viz-dev image: ./tests/consensus_sim/consensus_sim_tests --run_test=genesis_factory_suite → 3 cases pass (same_seed_same_keys, different_seed_different_keys, witness_names_are_distinct). Full enabled test case count: 8 (4 virtual_clock + 3 genesis_factory + 1 main).
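A rough sketch of the seed-to-identity mapping follows. To stay self-contained it swaps sha256 for a 64-bit FNV-1a hash, which is an assumption of this example and not cryptographic; the function names, the little-endian byte layout of `seed || idx`, and the 100-witness bound are likewise hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>
#include <utility>
#include <vector>

// Stand-in for sha256(seed || idx): 64-bit FNV-1a over the 12 input bytes.
// NOT cryptographic; it only illustrates "same input, same key material".
std::uint64_t derive_key_material(std::uint64_t seed, std::uint32_t idx) {
    std::array<std::uint8_t, 12> bytes{};
    for (int i = 0; i < 8; ++i) bytes[i] = static_cast<std::uint8_t>(seed >> (8 * i));
    for (int i = 0; i < 4; ++i) bytes[8 + i] = static_cast<std::uint8_t>(idx >> (8 * i));
    std::uint64_t h = 14695981039346656037ULL;  // FNV offset basis
    for (std::uint8_t b : bytes) {
        h ^= b;
        h *= 1099511628211ULL;                  // FNV prime
    }
    return h;
}

// "witness-NN", zero-padded to two digits.
std::string witness_name(std::uint32_t idx) {
    char buf[32];
    std::snprintf(buf, sizeof buf, "witness-%02u", static_cast<unsigned>(idx));
    return std::string(buf);
}

// Pure function of (seed, num_witnesses): no clock, no global RNG.
std::vector<std::pair<std::string, std::uint64_t>>
make_witness_keys(std::uint64_t seed, std::uint32_t num_witnesses) {
    assert(num_witnesses <= 100);  // hypothetical bound so names stay two digits
    std::vector<std::pair<std::string, std::uint64_t>> keys;
    keys.reserve(num_witnesses);
    for (std::uint32_t i = 0; i < num_witnesses; ++i) {
        keys.emplace_back(witness_name(i), derive_key_material(seed, i));
    }
    return keys;
}
```

Because the output depends only on the arguments, two calls with the same seed compare equal element by element, which is the property the same_seed_same_keys case checks.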

…abase

simulated_node owns a per-instance chainbase database in a temp dir, exposes produce_block / receive_block with a typed block_outcome enum, and runs Milestone 1's smoke test (one block, then 100 blocks).

Two chain-API quirks the plan didn't capture, both verified in this commit:
1. init_genesis only runs when database::open is called with chainbase::database::read_write (flag = 1). Passing 0 leaves the DB uninitialised and the first head-time query throws "unknown key".
2. With CHAIN_NUM_INITIATORS=0 (VIZ's compiled default), init_genesis does NOT register a witness for CHAIN_INITIATOR_NAME ("viz"). The only witness it creates and schedules in slot 0 is CHAIN_COMMITTEE_ACCOUNT, signed by CHAIN_COMMITTEE_PUBLIC_KEY. So the genesis_params identity fields were renamed initiator_* → genesis_witness_* and now carry the committee account name + the private key matching the hard-coded committee public key.

To run the suite under sanitizers, set ASAN_OPTIONS=new_delete_type_mismatch=0: there's a pre-existing new/delete type mismatch in evaluator_registry::register_evaluator (database.cpp:3669) that ASan flags on init. Filed as a follow-up; the harness work doesn't touch evaluator registration.

UBSan also reports misaligned-address warnings from protocol/version.hpp on ARM64. These are pre-existing in VIZ's serialized struct layout and don't fail the test, but are worth noting.

In-process bus carrying std::shared_ptr<void> payloads. Sorts by scheduled deliver_at on pump, applies the active partition split, and consumes drop_next markers per (from, to) link. delay_link adds extra seconds on enqueue (rounded down, since fc::time_point_sec has 1-second resolution).

Suite covers: in-time-order delivery, partition blocks across the split, heal restores delivery, drop_next skips exactly one message.

The plan's example used fc::time_point_sec("2026-..."), which doesn't compile (the ctor takes uint32_t); the test uses fc::time_point_sec::from_iso_string instead, same as Task 4.
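The bus behavior described above might look roughly like the following. Everything here is a sketch under assumptions: integer node ids, a `sim_seconds` stand-in for fc::time_point_sec, messages that cross an active partition being discarded rather than held, and drop markers being consumed before the partition check. The real harness may differ on each point.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <memory>
#include <set>
#include <utility>
#include <vector>

using node_id = int;                 // hypothetical node handle
using sim_seconds = std::uint32_t;   // stand-in for fc::time_point_sec

struct envelope {
    node_id from;
    node_id to;
    sim_seconds deliver_at;
    std::shared_ptr<void> payload;
};

class message_bus {
public:
    void partition(std::set<node_id> side_a) { side_a_ = std::move(side_a); split_ = true; }
    void heal() { split_ = false; }
    void delay_link(node_id from, node_id to, sim_seconds extra) { delay_[{from, to}] = extra; }
    void drop_next(node_id from, node_id to) { ++drops_[{from, to}]; }

    void send(node_id from, node_id to, sim_seconds now, std::shared_ptr<void> payload) {
        assert(from != to);  // a node does not message itself in this sketch
        sim_seconds extra = 0;
        auto it = delay_.find({from, to});
        if (it != delay_.end()) extra = it->second;
        queue_.push_back({from, to, now + extra, std::move(payload)});
    }

    // Returns every message due at or before `now`, ordered by deliver_at;
    // stable_sort keeps FIFO order among messages with the same deliver_at.
    std::vector<envelope> pump(sim_seconds now) {
        std::stable_sort(queue_.begin(), queue_.end(),
                         [](const envelope& a, const envelope& b) {
                             return a.deliver_at < b.deliver_at;
                         });
        std::vector<envelope> delivered, pending;
        for (auto& m : queue_) {
            if (m.deliver_at > now) { pending.push_back(std::move(m)); continue; }
            auto d = drops_.find({m.from, m.to});
            if (d != drops_.end() && d->second > 0) { --d->second; continue; }
            // Assumption: a message crossing the split is discarded.
            if (split_ && side_a_.count(m.from) != side_a_.count(m.to)) continue;
            delivered.push_back(std::move(m));
        }
        queue_ = std::move(pending);
        return delivered;
    }

private:
    std::vector<envelope> queue_;
    std::map<std::pair<node_id, node_id>, sim_seconds> delay_;
    std::map<std::pair<node_id, node_id>, int> drops_;
    std::set<node_id> side_a_;
    bool split_ = false;
};
```

Using ordered containers and a stable sort keeps delivery order independent of hashing or pointer values, which matters for byte-identical replays.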

invariants.hpp/.cpp expose four cross-node consensus checks returning optional<violation_report>:
- chains_consistent: heads at the same num must have the same id (Milestone 2 coarse-graining; finer shared-prefix walk is deferred until simulated_node exposes a block enumerator).
- lib_monotone_checker: LIB never decreases per node, stateful via a label -> last-seen map.
- supply_conserved: stub; Milestone 2 floor check will land when a scenario actually consumes it.
- no_double_signed_in_canonical: stub; filled in by Task 13 once simulated_node grows the block-enumeration helper for the equivocation scenario.

std::optional and the structured bindings already in test_genesis_factory need C++17, so the harness library + scenarios target now compile at CXX_STANDARD 17. Chain code itself stays C++14; this is scoped to the test targets via set_target_properties.
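As a rough illustration of the two implemented checks, the sketch below works on a hypothetical `node_view` snapshot instead of a real simulated_node, and uses strings for block ids. The field names and report layout are assumptions for the example; only the check semantics come from the description above.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <optional>
#include <string>
#include <vector>

// Hypothetical snapshot of what an invariant reads from one node.
struct node_view {
    std::string label;
    std::uint32_t head_num;
    std::string head_id;
    std::uint32_t lib_num;
};

struct violation_report {
    std::string invariant_name;
    std::uint32_t block_num;
    std::string detail;
};

// Coarse check: heads at the same block number must have the same id.
std::optional<violation_report> chains_consistent(const std::vector<node_view>& nodes) {
    std::map<std::uint32_t, std::string> id_at_height;
    for (const auto& n : nodes) {
        auto [it, inserted] = id_at_height.emplace(n.head_num, n.head_id);
        if (!inserted && it->second != n.head_id) {
            return violation_report{"chains_consistent", n.head_num,
                                    n.label + " disagrees on the head id"};
        }
    }
    return std::nullopt;
}

// Stateful check: each node's LIB may stay put or grow, never shrink.
class lib_monotone_checker {
public:
    std::optional<violation_report> check(const std::vector<node_view>& nodes) {
        for (const auto& n : nodes) {
            assert(!n.label.empty());  // labels key the last-seen map
            auto it = last_seen_.find(n.label);
            if (it != last_seen_.end() && n.lib_num < it->second) {
                return violation_report{"lib_monotone", n.lib_num,
                                        n.label + " LIB went backward"};
            }
            last_seen_[n.label] = n.lib_num;
        }
        return std::nullopt;
    }

private:
    std::map<std::string, std::uint32_t> last_seen_;
};
```

Returning an optional report instead of asserting lets the caller decide what to do with a violation, such as writing a failure log before failing the test.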

scenario_driver owns the clock, the message bus, the per-witness simulated_node set, the genesis_params, and the registered invariants. run() steps slot-by-slot up to cfg.max_slots: advance clock, call the slot producer, pump the bus, deliver to peers, and run each invariant against the node set. First violation wins: the driver stops and exposes it via violation() alongside the event log.

The slot producer is swappable (set_slot_producer); the default round-robins through params.witness_keys. fault_injector will override this in a later task to inject equivocation.

Two adaptations from the plan:
- scenario_config::start_time defaults to fc::time_point_sec() (=0), not fc::time_point_sec("2026-..."). The ctor is explicit-uint32_t, the same compile bug the test code hit in earlier tasks. Scenarios set an explicit time.
- The default round-robin producer assumes per-index witness keys are registered on chain. Milestone 1 genesis only registers CHAIN_COMMITTEE_ACCOUNT, so the default producer can't drive blocks yet. Documented in the implementation comment; Milestone 3 will either rotate keys via witness_update or seed a multi-witness genesis.

No test exercises run() yet; Task 10 is the first.
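The shape of that loop, with a swappable producer and first-violation-wins semantics, can be sketched as below. The sketch leaves out the nodes and the bus entirely; the callback signatures, the string-valued violation, and the 3-second slot length are assumptions made for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <optional>
#include <string>
#include <utility>
#include <vector>

struct scenario_config {
    std::uint32_t max_slots = 10;
    std::uint32_t start_time = 0;    // mirrors the fc::time_point_sec() default
    std::uint32_t slot_seconds = 3;  // hypothetical slot length
};

class scenario_driver {
public:
    using slot_producer = std::function<void(std::uint32_t slot, std::vector<std::string>& log)>;
    using invariant = std::function<std::optional<std::string>(std::uint32_t slot)>;

    explicit scenario_driver(scenario_config cfg) : cfg_(cfg), now_(cfg.start_time) {
        // Default honest path: just record that the slot was produced.
        producer_ = [](std::uint32_t slot, std::vector<std::string>& log) {
            log.push_back("produce slot " + std::to_string(slot));
        };
    }

    void set_slot_producer(slot_producer p) { producer_ = std::move(p); }
    void add_invariant(invariant inv) { invariants_.push_back(std::move(inv)); }

    void run() {
        assert(producer_);  // a producer must be installed
        for (std::uint32_t slot = 1; slot <= cfg_.max_slots; ++slot) {
            now_ += cfg_.slot_seconds;  // advance the virtual clock
            producer_(slot, log_);      // honest or fault-injected production
            // The real driver pumps the bus and delivers to peers here.
            for (auto& inv : invariants_) {
                if (auto v = inv(slot)) {  // first violation wins
                    violation_ = std::move(v);
                    return;
                }
            }
        }
    }

    const std::optional<std::string>& violation() const { return violation_; }
    const std::vector<std::string>& event_log() const { return log_; }
    std::uint32_t now() const { return now_; }

private:
    scenario_config cfg_;
    std::uint32_t now_;
    slot_producer producer_;
    std::vector<invariant> invariants_;
    std::vector<std::string> log_;
    std::optional<std::string> violation_;
};
```

A fault injector in this model only needs `set_slot_producer`: it wraps the honest callback and deviates on the slot it wants to corrupt.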

The plan's default round-robin in scenario_driver indexed into params_.witness_keys, assuming each per-witness identity was registered on chain. With CHAIN_NUM_INITIATORS=0 only the committee account exists, so the harness can't actually drive seven distinct witness signatures at Milestone 2. Adapted: the default producer still round-robins which node generates the block (so message-flow + bus + convergence get exercised), but every block is signed by params_.genesis_witness_*. Multi-witness rotation is deferred to Milestone 3, when register_witness_keys_ gains a witness_update path. Suite covers: 7 nodes, 100 slots, chains_consistent + lib_monotone invariants checked every slot, all nodes converge to the same head.

Two independent driver runs with seed=0x12345 produce byte-identical event logs across 50 slots × 7 nodes. This is the canary for non-determinism leaks the foundation plan calls out — if it starts failing, suspect (in order) an unordered container with a default hasher in chain code, a stray fc::time_point::now() that affects state, or pointer-address ordering in the harness. The plan's second case (different_seed_diverges) is dropped from Milestone 2: with the current producer signing every block as the genesis witness, `seed` only feeds the unused per-index witness_keys, so different seeds produce identical logs. It comes back in Milestone 3 once register_witness_keys_ rotates per-witness keys via witness_update.
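One way to make a replay failure actionable is to report where two event logs first diverge instead of only that they differ. The helper below is a hypothetical sketch that treats each log as a vector of strings; the real test's log type and comparison are not shown in this PR text.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Returns the index of the first entry where the two logs differ, or
// std::nullopt when they are identical. If one log is a strict prefix of
// the other, the result is the length of the shorter log.
std::optional<std::size_t> first_divergence(const std::vector<std::string>& a,
                                            const std::vector<std::string>& b) {
    auto [ia, ib] = std::mismatch(a.begin(), a.end(), b.begin(), b.end());
    if (ia == a.end() && ib == b.end()) {
        return std::nullopt;
    }
    return static_cast<std::size_t>(ia - a.begin());
}

// Usage: identical runs match; a changed entry is located by index.
inline void first_divergence_example() {
    std::vector<std::string> run1 = {"slot 1 n0", "slot 2 n1", "slot 3 n2"};
    std::vector<std::string> run2 = run1;
    assert(!first_divergence(run1, run2).has_value());
    run2[1] = "slot 2 n5";
    assert(first_divergence(run1, run2).value() == 1);
}
```

The index points at the first slot worth inspecting when hunting for one of the non-determinism sources listed above.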

write_failure_log dumps seed, config, full event log, final per-node head/lib, and the triggering violation into <cwd>/failures/<seed>-<scenario>.log. Scenarios call it themselves before BOOST_FAIL so the bad run is reproducible from the seed. Wired into the 7-node smoke scenario; full test binary still passes clean with no failure log written. Milestone 2 ships here — multi-node deterministic harness with invariants and failure capture.

Plumbs the equivocation-detection path end to end without yet driving it from a scenario:
- simulated_node: expose recent_blocks(count) walking head backward via fetch_block_by_id; returns block_num + id + witness + timestamp so invariants can key on (witness, slot).
- invariants: replace the no_double_signed_in_canonical stub with the real check: for each node, build a map from (witness, slot) to id over the last 200 canonical blocks; report a violation on collision.
- fault_injector: new harness facade exposing partition/heal/delay_link/drop_next as forwarders to message_bus, plus instruct_equivocation(), which overrides the slot producer to fire once for a chosen witness. Honest path matches the default driver behavior (every block signed by genesis witness, per Milestone 2's single-witness genesis).

The equivocation slot ships block_a only and flags the shadow-chain reconstruction gap inline. Full second-block production requires returning signed_block bodies from recent_blocks (or in-place merkle mutation + resigning), both deferred to Task 14 when a concrete failure mode forces a choice.

All 16 existing tests still pass under ASan/UBSan.
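The (witness, slot) collision check can be sketched as follows for a single node's canonical chain. The `block_summary` fields follow the recent_blocks description above, but the concrete types (string ids, the block timestamp used as the slot key) and the function signature are assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical shape of one recent_blocks() entry.
struct block_summary {
    std::uint32_t block_num;
    std::string id;
    std::string witness;
    std::uint32_t timestamp;  // used as the slot key in this sketch
};

struct violation_report {
    std::string invariant_name;
    std::uint32_t block_num;
    std::string detail;
};

// Scans the last `window` canonical blocks of one node and reports the first
// (witness, slot) pair that maps to two different block ids.
std::optional<violation_report> no_double_signed_in_canonical(
        const std::vector<block_summary>& canonical, std::size_t window = 200) {
    assert(window > 0);
    std::map<std::pair<std::string, std::uint32_t>, std::string> id_by_slot;
    const std::size_t first =
        canonical.size() > window ? canonical.size() - window : 0;
    for (std::size_t i = first; i < canonical.size(); ++i) {
        const block_summary& b = canonical[i];
        auto [it, inserted] =
            id_by_slot.emplace(std::make_pair(b.witness, b.timestamp), b.id);
        if (!inserted && it->second != b.id) {
            return violation_report{"no_double_signed_in_canonical", b.block_num,
                                    b.witness + " signed two blocks in one slot"};
        }
    }
    return std::nullopt;
}
```

Seeing the same id twice for one key is not a violation in this sketch; only two distinct ids for the same (witness, slot) count as a collision.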

Adds equivocation_suite/seed_deadbeef_no_canonical_double_sign:
- 7 witnesses, 30 slots, seed 0xDEADBEEF.
- chains_consistent + no_double_signed_in_canonical + lib_monotone.
- fi.instruct_equivocation(params.genesis_witness_name) so the override actually fires; the plan's per-index witness_keys[i] target is parked until multi-witness key rotation lands.

Passes trivially today: Task 13's instruct_equivocation ships block_a only and flags the shadow-chain reconstruction gap inline. Closing that gap (sibling-state shadow or direct-mutation + resign) is a focused follow-up captured in the inline comment.

Result: 3.0s, no invariant violations, exercises the fault_injector facade end to end.

Adds equivocation_suite/seed_sweep_one_hundred: loops seeds 0..99, varying genesis_params for each, runs the equivocation override against the genesis witness, asserts no_double_signed_in_canonical. Slot count dropped to 10 (from 30) for the sweep: each scenario spins up 7 chainbase databases (~340ms each under ASan), so the setup floor dominates. The slot count will go back up once shadow-block construction actually produces equivocations worth running long for. Result: 100/100 pass, 10m31s wall time. No flakes, which is expected, since all 100 runs are functionally identical at the chain level until the shadow gap closes. The plumbing is exercised end to end.

Adds an opt-in -DWITH_COVERAGE flag that wires --coverage compile/link flags into graphene_protocol, graphene_chain, and the harness target, plus a consensus_sim_coverage make target that drives gcovr filtered to those three trees. gcovr is looked up at configure time; missing tool demotes to a configure-time warning, not an error. The README covers build/run, the ASan/UBSan workaround needed to get past pre-existing chain findings (evaluator_registry base-pointer delete, version/asset alignment), seed-driven determinism, failure-log layout, and the two M3 limitations still open: block_b production for real equivocation, and multi-witness key rotation. No source/runtime change; this is build-system and documentation only.

instruct_equivocation now produces two distinct, validly-signed blocks for the same (witness, slot). A fresh shadow simulated_node is caught up to canonical state at height N-1 via prod->canonical_blocks_from(1) replay, a no-op account_metadata_operation tx (signed by the initiator key) is pushed into the shadow's pending pool to force a different transaction_merkle_root, and the shadow then produces block_b at the same when and witness as prod's block_a. The bus is partitioned {prod} vs {everyone else} with no heal, and block_a + block_b are routed asymmetrically so prod keeps block_a while side B accepts block_b. chains_consistent fires at the equivocation slot.

Adds:
- simulated_node::canonical_blocks_from(N) returning full signed_block bodies
- simulated_node::push_pending_transaction / chain_id accessors
- genesis_params::initiator_name + initiator_key (the only signable identity under CHAIN_NUM_INITIATORS=0 genesis)
- tx_factory::make_noop_metadata_tx builder
- test_equivocation rewritten to assert the violation rather than its absence; 100-seed sweep verifies the mechanism is robust across seeds
- canonical_blocks_from + initiator key are covered by unit tests

The equivocation defers to the first matching slot at height >= 2 because the shadow's no-op tx needs a non-default reference_block.
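The core of the mechanism is that one extra transaction changes the block id while the (witness, when) pair stays the same. The toy model below illustrates that idea only: `toy_block` and its FNV-1a based id are hypothetical stand-ins for signed_block and its real id, and there is no signing or merkle tree involved.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Toy stand-in for a signed_block.
struct toy_block {
    std::uint32_t num;
    std::string witness;
    std::uint32_t when;
    std::vector<std::string> txs;
};

// Toy id: FNV-1a over every field, so any change to the tx list changes the
// id, much as a different transaction_merkle_root changes a real block id.
std::uint64_t toy_block_id(const toy_block& b) {
    std::uint64_t h = 14695981039346656037ULL;
    auto mix = [&h](const std::string& s) {
        for (unsigned char c : s) { h ^= c; h *= 1099511628211ULL; }
        h ^= 0xFF; h *= 1099511628211ULL;  // field separator
    };
    mix(std::to_string(b.num));
    mix(b.witness);
    mix(std::to_string(b.when));
    for (const auto& tx : b.txs) mix(tx);
    return h;
}

// Same witness and slot time, different block: that is equivocation.
bool is_equivocation(const toy_block& a, const toy_block& b) {
    return a.witness == b.witness && a.when == b.when &&
           toy_block_id(a) != toy_block_id(b);
}

// What chains_consistent observes once the two sides hold different blocks.
bool heads_diverge(const toy_block& head_x, const toy_block& head_y) {
    return head_x.num == head_y.num &&
           toy_block_id(head_x) != toy_block_id(head_y);
}

// Usage: block_b is block_a plus one no-op transaction.
inline void equivocation_example() {
    toy_block block_a{2, "committee", 6, {}};
    toy_block block_b = block_a;
    block_b.txs.push_back("noop-account-metadata");
    assert(is_equivocation(block_a, block_b));
    assert(heads_diverge(block_a, block_b));  // prod holds a, side B holds b
}
```

With the bus partitioned and never healed, neither side sees the other block, so the divergence persists and the head comparison keeps failing from the equivocation slot onward.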

chiliec and others added 17 commits May 20, 2026 23:59

test(consensus_sim): audit chain API and clock dependencies

5a72216

chiliec force-pushed the test/consensus-harness branch from 01c8554 to 0c3647f Compare May 20, 2026 16:06

chiliec wants to merge 17 commits into master from test/consensus-harness
