test(consensus_sim): deterministic multi-node harness with equivocation injection #110
Open
chiliec wants to merge 17 commits into
Adds a toolchain-only Docker image (same base + apt packages as the production builder stage, plus gcovr and gdb) where the worktree is mounted at /workspace. No source is baked in. Includes a README with build/usage/ccache/cleanup instructions.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Wire BUILD_CONSENSUS_TESTS option (default OFF) and WITH_COVERAGE into the top-level CMakeLists. When enabled, build a consensus_sim_harness static library plus consensus_sim_tests Boost.Test executable with ASan, UBSan, and -fno-omit-frame-pointer baked in. Harness sources and scenario files are stubs to be filled in by Tasks 4-15; the placeholder test_main proves the build wires correctly end-to-end.
Two collateral fixes needed to build inside a container that mounts the worktree (whose .git gitfile points to a host-only path):
- libraries/utilities/CMakeLists.txt now sanitizes invalid GRAPHENE_GIT_REVISION_UNIX_TIMESTAMP/SHA values to 0/empty when get_git_unix_timestamp() returns "HEAD-HASH-NOTFOUND".
- The same fix lives in fc/CMakeLists.txt (saved as a recovery patch alongside the existing ARM64 portability patches in tests/consensus_sim/*.patch).
Boost link variant: target consensus_sim_tests links Boost via the imported target Boost::unit_test_framework and explicitly does NOT define BOOST_TEST_DYN_LINK, since the dev image ships static libboost_unit_test_framework.a. DYN_LINK would expand to the old unit_test_main(bool(*)(),int,char**) signature, which is absent from that archive.
Verified:
- BUILD_CONSENSUS_TESTS=ON builds + executes "harness_compiles_and_links".
- Default (no flag) configure produces no consensus_sim_tests rule.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
First concrete primitive of the harness. virtual_clock owns the simulated
"now" that every node and the scenario_driver will read, and rejects any
attempt to go backward — guaranteeing deterministic ordering when the
scenario driver later replays a recorded event stream.
API surface:
- ctor takes an explicit fc::time_point_sec start (no implicit "now")
- now() is noexcept
- advance_to(t) is monotonic non-decreasing; throws std::logic_error
on t < now(); t == now() is a no-op
Test divergence from plan: the plan example wrote
fc::time_point_sec t0("2026-01-01T00:00:00");
but fc::time_point_sec only has explicit(uint32_t) and
explicit(const time_point&) ctors. Use
fc::time_point_sec::from_iso_string(...) instead, hoisted into a kEpoch
constant shared by all four cases.
Verified inside viz-dev image with BUILD_CONSENSUS_TESTS=ON:
./tests/consensus_sim/consensus_sim_tests --run_test=virtual_clock_suite
→ 4 cases pass, no errors detected
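The API surface listed above can be sketched roughly as follows. This is a minimal illustration, not the harness source: sim_time_sec is a hypothetical stand-in for fc::time_point_sec (assumed to be a 1-second, 32-bit timestamp) so the snippet compiles without fc.

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical stand-in for fc::time_point_sec (assumption: seconds in a uint32_t).
struct sim_time_sec {
    std::uint32_t secs;
};

class virtual_clock {
public:
    // The start time is always explicit; there is no implicit "now".
    explicit virtual_clock(sim_time_sec start) noexcept : now_(start) {}

    sim_time_sec now() const noexcept { return now_; }

    // Monotonic non-decreasing: t == now() is a no-op, t < now() throws.
    void advance_to(sim_time_sec t) {
        if (t.secs < now_.secs) {
            throw std::logic_error("virtual_clock: cannot move backward");
        }
        now_ = t;
    }

private:
    sim_time_sec now_;
};
```

Rejecting backward moves at the clock itself means every consumer can treat now() as an ordering key without re-checking it.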
Produces the parameters that database::open and witness registration
will need in Task 6: initial_supply (CHAIN_INIT_SUPPLY), num_witnesses,
and a vector of (account_name_type, private_key) pairs.
Determinism is the load-bearing property here — every scenario run must
reproduce the same witness identities so a failure log can be replayed
bit-identically. Keys are derived via sha256(seed || idx). Account names
are "witness-NN" zero-padded.
Note: this only generates keys; it does NOT register witnesses on chain.
Witness key override happens in simulated_node post-open (Task 6).
Verified inside viz-dev image:
./tests/consensus_sim/consensus_sim_tests --run_test=genesis_factory_suite
→ 3 cases pass (same_seed_same_keys, different_seed_different_keys,
witness_names_are_distinct).
Full enabled suite count: 8 (4 virtual_clock + 3 genesis_factory + 1 main).
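A rough sketch of the seed-driven derivation described in this commit, under loud assumptions: FNV-1a stands in for sha256 so the snippet needs no crypto library (it is not secure and not what the harness uses), key_material is a placeholder for a real private key, the byte layout of seed || idx is a guess, and zero-based "witness-NN" numbering is assumed. make_witnesses and witness_identity are hypothetical names.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// FNV-1a 64-bit: a swapped-in, non-cryptographic stand-in for sha256.
inline std::uint64_t fnv1a64(const std::string& bytes) {
    std::uint64_t h = 14695981039346656037ULL;
    for (unsigned char c : bytes) {
        h ^= c;
        h *= 1099511628211ULL;
    }
    return h;
}

struct witness_identity {
    std::string name;            // "witness-NN", zero-padded
    std::uint64_t key_material;  // placeholder for a private key
};

// Same (seed, n) always yields the same identities: no clock, no RNG state.
inline std::vector<witness_identity> make_witnesses(std::uint64_t seed, std::uint32_t n) {
    std::vector<witness_identity> out;
    out.reserve(n);
    for (std::uint32_t idx = 0; idx < n; ++idx) {
        // Preimage is seed || idx, both big-endian (layout is an assumption).
        std::string preimage;
        for (int shift = 56; shift >= 0; shift -= 8)
            preimage.push_back(static_cast<char>((seed >> shift) & 0xFF));
        for (int shift = 24; shift >= 0; shift -= 8)
            preimage.push_back(static_cast<char>((idx >> shift) & 0xFF));

        char name[32];
        std::snprintf(name, sizeof name, "witness-%02u", static_cast<unsigned>(idx));
        out.push_back(witness_identity{name, fnv1a64(preimage)});
    }
    return out;
}
```

Because the derivation reads nothing but its arguments, a failing run can be reproduced from the seed alone.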
…abase
simulated_node owns a per-instance chainbase database in a temp dir,
exposes produce_block / receive_block with a typed block_outcome enum,
and runs Milestone 1's smoke test (one block, then 100 blocks).
Two chain-API quirks the plan didn't capture, both verified in this commit:
1. init_genesis only runs when database::open is called with
chainbase::database::read_write (flag = 1). Passing 0 leaves the DB
uninitialised and the first head-time query throws "unknown key".
2. With CHAIN_NUM_INITIATORS=0 (VIZ's compiled default), init_genesis
does NOT register a witness for CHAIN_INITIATOR_NAME ("viz"). The
only witness it creates and schedules in slot 0 is
CHAIN_COMMITTEE_ACCOUNT, signed by CHAIN_COMMITTEE_PUBLIC_KEY. So
the genesis_params identity fields were renamed initiator_* →
genesis_witness_* and now carry the committee account name + the
private key matching the hard-coded committee public key.
To run the suite under sanitizers, set
ASAN_OPTIONS=new_delete_type_mismatch=0 — there's a pre-existing
new/delete type mismatch in evaluator_registry::register_evaluator
(database.cpp:3669) that ASan flags on init. Filed as a follow-up; the
harness work doesn't touch evaluator registration.
UBSan also reports misaligned-address warnings from
protocol/version.hpp on ARM64. These are pre-existing in VIZ's
serialized struct layout and don't fail the test, but worth noting.
In-process bus carrying std::shared_ptr<void> payloads. Sorts by
scheduled deliver_at on pump, applies the active partition split, and
consumes drop_next markers per (from, to) link. delay_link adds extra
seconds on enqueue (rounded down — fc::time_point_sec is 1-second).
Suite covers: in-time-order delivery, partition blocks across the
split, heal restores delivery, drop_next skips exactly one message.
The plan's example used fc::time_point_sec("2026-..."), which doesn't
compile (the ctor takes uint32_t); the test uses
fc::time_point_sec::from_iso_string instead, same as Task 4.
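The bus behavior described in this commit might look roughly like the sketch below. It is an illustration under assumptions, not the harness code: node ids are plain integers, times are uint32_t seconds, a partition is modeled as one set versus everyone else, and a message that crosses the split when it falls due is assumed to be discarded rather than held. The method name send is hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <map>
#include <memory>
#include <set>
#include <utility>
#include <vector>

using node_id = std::size_t;
using link = std::pair<node_id, node_id>;  // (from, to)

struct message {
    node_id from;
    node_id to;
    std::uint32_t deliver_at;       // seconds, 1-second resolution
    std::shared_ptr<void> payload;  // type-erased body
};

class message_bus {
public:
    void partition(std::set<node_id> side_a) { side_a_ = std::move(side_a); split_ = true; }
    void heal() { side_a_.clear(); split_ = false; }
    void drop_next(node_id from, node_id to) { ++drops_[link{from, to}]; }
    void delay_link(node_id from, node_id to, std::uint32_t extra) { delays_[link{from, to}] = extra; }

    // Extra link delay is applied at enqueue time.
    void send(node_id from, node_id to, std::uint32_t now, std::shared_ptr<void> payload) {
        auto it = delays_.find(link{from, to});
        const std::uint32_t extra = (it == delays_.end()) ? 0u : it->second;
        queue_.push_back(message{from, to, now + extra, std::move(payload)});
    }

    // Returns the messages due at `now`, ordered by deliver_at (FIFO on ties).
    std::vector<message> pump(std::uint32_t now) {
        std::stable_sort(queue_.begin(), queue_.end(),
                         [](const message& a, const message& b) { return a.deliver_at < b.deliver_at; });
        std::vector<message> delivered, pending;
        for (auto& m : queue_) {
            if (m.deliver_at > now) { pending.push_back(std::move(m)); continue; }
            if (split_ && side_a_.count(m.from) != side_a_.count(m.to)) continue;  // crosses split
            auto d = drops_.find(link{m.from, m.to});
            if (d != drops_.end() && d->second > 0) { --d->second; continue; }  // consume one marker
            delivered.push_back(std::move(m));
        }
        queue_ = std::move(pending);
        return delivered;
    }

private:
    std::vector<message> queue_;
    std::map<link, std::uint32_t> delays_;
    std::map<link, std::uint32_t> drops_;
    std::set<node_id> side_a_;
    bool split_ = false;
};
```

A stable sort on deliver_at keeps enqueue order for messages scheduled at the same second, so delivery order never depends on container iteration order or pointer values.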
invariants.hpp/.cpp expose four cross-node consensus checks returning optional<violation_report>:
- chains_consistent: heads at the same num must have the same id (Milestone 2 coarse-graining; finer shared-prefix walk is deferred until simulated_node exposes a block enumerator).
- lib_monotone_checker: LIB never decreases per node, stateful via a label -> last-seen map.
- supply_conserved: stub; Milestone 2 floor check will land when a scenario actually consumes it.
- no_double_signed_in_canonical: stub; filled in by Task 13 once simulated_node grows the block-enumeration helper for the equivocation scenario.
std::optional and the structured bindings already in test_genesis_factory need C++17, so the harness library + scenarios target now compile at CXX_STANDARD 17. Chain code itself stays C++14; the bump is scoped to the test targets via set_target_properties.
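The two non-stub checks can be sketched as below (C++17). This is an assumption-laden illustration: node_head is a hypothetical snapshot struct standing in for whatever simulated_node exposes, ids are plain strings, and the detail field of violation_report is invented for the example.

```cpp
#include <cstdint>
#include <map>
#include <optional>
#include <string>
#include <vector>

struct violation_report {
    std::string invariant_name;
    std::uint32_t block_num;
    std::string detail;  // hypothetical free-form field
};

// Hypothetical per-node snapshot.
struct node_head {
    std::string label;
    std::uint32_t head_num;
    std::string head_id;
    std::uint32_t lib_num;
};

// Coarse check: any two heads at the same height must carry the same id.
inline std::optional<violation_report> chains_consistent(const std::vector<node_head>& nodes) {
    std::map<std::uint32_t, std::string> id_at_height;
    for (const auto& n : nodes) {
        auto [it, inserted] = id_at_height.emplace(n.head_num, n.head_id);
        if (!inserted && it->second != n.head_id) {
            return violation_report{"chains_consistent", n.head_num, n.label + " disagrees on head id"};
        }
    }
    return std::nullopt;
}

// Stateful check: remembers the last LIB seen per node label.
class lib_monotone_checker {
public:
    std::optional<violation_report> check(const std::vector<node_head>& nodes) {
        for (const auto& n : nodes) {
            auto it = last_lib_.find(n.label);
            if (it != last_lib_.end() && n.lib_num < it->second) {
                return violation_report{"lib_monotone", n.lib_num, n.label + " LIB decreased"};
            }
            last_lib_[n.label] = n.lib_num;
        }
        return std::nullopt;
    }

private:
    std::map<std::string, std::uint32_t> last_lib_;
};
```

Returning an optional report instead of asserting leaves the decision to stop, log, or continue with the caller.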
scenario_driver owns the clock, the message bus, the per-witness
simulated_node set, the genesis_params, and the registered invariants.
run() steps slot-by-slot up to cfg.max_slots: advance clock, call the
slot producer, pump the bus, deliver to peers, and run each invariant
against the node set. First violation wins — driver stops and exposes
it via violation() alongside the event log.
The slot producer is swappable (set_slot_producer); the default
round-robins through params.witness_keys. fault_injector will override
this in a later task to inject equivocation.
Two adaptations from the plan:
- scenario_config::start_time defaults to fc::time_point_sec() (=0),
not fc::time_point_sec("2026-..."). The ctor is explicit-uint32_t —
same compile bug the test code hit in earlier tasks. Scenarios set
an explicit time.
- The default round-robin producer assumes per-index witness keys are
registered on chain. Milestone 1 genesis only registers
CHAIN_COMMITTEE_ACCOUNT, so the default producer can't drive blocks
yet. Documented in the implementation comment; Milestone 3 will
either rotate keys via witness_update or seed a multi-witness
genesis. No test exercises run() yet — Task 10 is the first.
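The control flow of run() described in this commit can be outlined as follows. It is a deliberately reduced sketch: the clock, bus, and node set are collapsed into the callbacks, and the callback signatures, add_invariant, and the event-log format are assumptions made for illustration.

```cpp
#include <cstdint>
#include <functional>
#include <optional>
#include <string>
#include <utility>
#include <vector>

struct violation_report {
    std::string invariant_name;
    std::uint32_t block_num;
};

class scenario_driver {
public:
    using slot_producer = std::function<void(std::uint32_t slot)>;
    using invariant = std::function<std::optional<violation_report>(std::uint32_t slot)>;

    explicit scenario_driver(std::uint32_t max_slots) : max_slots_(max_slots) {}

    // Faults replace the honest producer through this hook.
    void set_slot_producer(slot_producer p) { producer_ = std::move(p); }
    void add_invariant(invariant inv) { invariants_.push_back(std::move(inv)); }

    void run() {
        for (std::uint32_t slot = 1; slot <= max_slots_; ++slot) {
            event_log_.push_back("slot " + std::to_string(slot));
            if (producer_) producer_(slot);  // produce, pump, deliver happen in here
            for (auto& inv : invariants_) {
                if (auto v = inv(slot)) {
                    violation_ = std::move(v);  // first violation wins
                    return;
                }
            }
        }
    }

    const std::optional<violation_report>& violation() const { return violation_; }
    const std::vector<std::string>& event_log() const { return event_log_; }

private:
    std::uint32_t max_slots_;
    slot_producer producer_;
    std::vector<invariant> invariants_;
    std::optional<violation_report> violation_;
    std::vector<std::string> event_log_;
};
```

Stopping at the first violation keeps the event log short and ends it exactly at the slot that needs to be replayed.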
The plan's default round-robin in scenario_driver indexed into params_.witness_keys, assuming each per-witness identity was registered on chain. With CHAIN_NUM_INITIATORS=0 only the committee account exists, so the harness can't actually drive seven distinct witness signatures at Milestone 2.
Adapted: the default producer still round-robins which node generates the block (so message-flow + bus + convergence get exercised), but every block is signed by params_.genesis_witness_*. Multi-witness rotation is deferred to Milestone 3, when register_witness_keys_ gains a witness_update path.
Suite covers: 7 nodes, 100 slots, chains_consistent + lib_monotone invariants checked every slot, all nodes converge to the same head.
Two independent driver runs with seed=0x12345 produce byte-identical event logs across 50 slots × 7 nodes. This is the canary for non-determinism leaks the foundation plan calls out — if it starts failing, suspect (in order) an unordered container with a default hasher in chain code, a stray fc::time_point::now() that affects state, or pointer-address ordering in the harness. The plan's second case (different_seed_diverges) is dropped from Milestone 2: with the current producer signing every block as the genesis witness, `seed` only feeds the unused per-index witness_keys, so different seeds produce identical logs. It comes back in Milestone 3 once register_witness_keys_ rotates per-witness keys via witness_update.
write_failure_log dumps seed, config, full event log, final per-node head/lib, and the triggering violation into <cwd>/failures/<seed>-<scenario>.log. Scenarios call it themselves before BOOST_FAIL so the bad run is reproducible from the seed. Wired into the 7-node smoke scenario; full test binary still passes clean with no failure log written. Milestone 2 ships here — multi-node deterministic harness with invariants and failure capture.
Plumbs the equivocation-detection path end to end without yet driving it from a scenario:
- simulated_node: expose recent_blocks(count) walking head backward via fetch_block_by_id; returns block_num + id + witness + timestamp so invariants can key on (witness, slot).
- invariants: replace the no_double_signed_in_canonical stub with the real check: for each node, build a map from (witness, slot) to id over the last 200 canonical blocks; report a violation on collision.
- fault_injector: new harness facade exposing partition/heal/delay_link/drop_next as forwarders to message_bus, plus instruct_equivocation(), which overrides the slot producer to fire once for a chosen witness.
Honest path matches the default driver behavior (every block signed by genesis witness, per Milestone 2's single-witness genesis). The equivocation slot ships block_a only and flags the shadow-chain reconstruction gap inline. Full second-block production requires returning signed_block bodies from recent_blocks (or in-place merkle mutation + resigning), both deferred to Task 14 when a concrete failure mode forces a choice.
All 16 existing tests still pass under ASan/UBSan.
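The (witness, slot) collision check can be sketched as below for a single node's recent blocks (C++17). block_summary mirrors the four fields the commit says recent_blocks returns; using the timestamp as the slot key, string ids, and the detail field are assumptions for illustration.

```cpp
#include <cstdint>
#include <map>
#include <optional>
#include <string>
#include <utility>
#include <vector>

struct violation_report {
    std::string invariant_name;
    std::uint32_t block_num;
    std::string detail;  // hypothetical free-form field
};

// Assumed shape of one entry returned by recent_blocks(count).
struct block_summary {
    std::uint32_t block_num;
    std::string id;
    std::string witness;
    std::uint32_t timestamp;  // used here as the slot key
};

// Reports a violation if one witness signed two different ids for the same slot.
inline std::optional<violation_report> no_double_signed_in_canonical(
        const std::vector<block_summary>& recent) {
    std::map<std::pair<std::string, std::uint32_t>, std::string> seen;
    for (const auto& b : recent) {
        auto [it, inserted] = seen.emplace(std::make_pair(b.witness, b.timestamp), b.id);
        if (!inserted && it->second != b.id) {
            return violation_report{"no_double_signed_in_canonical", b.block_num,
                                    b.witness + " signed two blocks in one slot"};
        }
    }
    return std::nullopt;
}
```

In the harness this would run once per node over its last 200 canonical blocks; the sketch leaves the per-node loop to the caller.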
Adds equivocation_suite/seed_deadbeef_no_canonical_double_sign:
- 7 witnesses, 30 slots, seed 0xDEADBEEF.
- chains_consistent + no_double_signed_in_canonical + lib_monotone.
- fi.instruct_equivocation(params.genesis_witness_name) so the override actually fires; the plan's per-index witness_keys[i] target is parked until multi-witness key rotation lands.
Passes trivially today: Task 13's instruct_equivocation ships block_a only and flags the shadow-chain reconstruction gap inline. Closing that gap (sibling-state shadow or direct-mutation + resign) is a focused follow-up captured in the inline comment.
Result: 3.0s, no invariant violations, exercises the fault_injector facade end to end.
Adds equivocation_suite/seed_sweep_one_hundred: loops seeds 0..99, varying genesis_params for each, runs the equivocation override against the genesis witness, and asserts no_double_signed_in_canonical.
Slot count dropped to 10 (from 30) for the sweep: each scenario spins up 7 chainbase databases (~340ms each under ASan), so the setup floor dominates. It will be bumped back once shadow-block construction actually produces equivocations worth running long for.
Result: 100/100 pass, 10m31s wall time. No flakes, which is expected, since all 100 runs are functionally identical at the chain level until the shadow gap closes. The plumbing is exercised end to end.
Adds an opt-in -DWITH_COVERAGE flag that wires --coverage compile/link flags into graphene_protocol, graphene_chain, and the harness target, plus a consensus_sim_coverage make target that drives gcovr filtered to those three trees. gcovr is looked up at configure time; a missing tool demotes to a configure-time warning, not an error.
The README covers build/run, the ASan/UBSan workaround needed to get past pre-existing chain findings (evaluator_registry base-pointer delete, version/asset alignment), seed-driven determinism, failure-log layout, and the two M3 limitations still open: block_b production for real equivocation, and multi-witness key rotation.
No source/runtime change; this is build-system and documentation only.
instruct_equivocation now produces two distinct, validly-signed blocks for
the same (witness, slot). A fresh shadow simulated_node is caught up to
canonical state at height N-1 via prod->canonical_blocks_from(1) replay,
a no-op account_metadata_operation tx (signed by the initiator key) is
pushed into the shadow's pending pool to force a different
transaction_merkle_root, and the shadow then produces block_b at the same
when and witness as prod's block_a. The bus is partitioned {prod} vs
{everyone else} with no heal, and block_a + block_b are routed
asymmetrically so prod keeps block_a while side B accepts block_b.
chains_consistent fires at the equivocation slot.
Adds:
- simulated_node::canonical_blocks_from(N) returning full signed_block bodies
- simulated_node::push_pending_transaction / chain_id accessors
- genesis_params::initiator_name + initiator_key (the only signable
identity under CHAIN_NUM_INITIATORS=0 genesis)
- tx_factory::make_noop_metadata_tx builder
- test_equivocation rewritten to assert the violation rather than its
absence; 100-seed sweep verifies the mechanism is robust across seeds
- canonical_blocks_from + initiator key are covered by unit tests
The equivocation defers to the first matching slot at height >= 2 because
the shadow's no-op tx needs a non-default reference_block.
01c8554 to
0c3647f
test(consensus_sim): deterministic multi-node harness with equivocation injection
Summary
Adds an opt-in (-DBUILD_CONSENSUS_TESTS=ON) in-process consensus test harness under tests/consensus_sim/. It boots N graphene::chain::database instances on a virtual clock with a deterministic message bus, drives them through a round-robin slot producer, and asserts invariants after every event. Chain code itself is unchanged. The first scripted fault is equivocation: same (witness, slot), two validly-signed blocks, asymmetric delivery; chains_consistent fires.
What lands
- harness/virtual_clock: monotonic-advance time wrapper. fc::time_point is never read; everything goes through this clock.
- harness/genesis_factory: seed-driven snapshot. Same seed → byte-identical witness keys, supply, and chain id. Exposes initiator_name + initiator_key (CHAIN_INITIATOR, the only signable identity under single-witness genesis with CHAIN_NUM_INITIATORS=0).
- harness/simulated_node: wraps one chain database. produce_block, receive_block, canonical_blocks_from(N) (full signed_block bodies), push_pending_transaction, chain_id, head accessors.
- harness/message_bus: partition / heal / delay_link / drop_next, deterministic FIFO + time-ordered delivery.
- harness/invariants: chains_consistent, lib_monotone_checker, no_double_signed_in_canonical. Returns a violation report ({invariant_name, block_num, ...}) rather than asserting.
- harness/scenario_driver: drives the slot loop, fans invariants out after each event, exposes a set_slot_producer hook so faults can replace the default honest path.
- harness/failure_log: on violation, writes tests/consensus_sim/failures/<scenario>-<seed>.log with config, full event log, per-node final state, and the triggering report.
- harness/fault_injector: thin facade over the bus + slot-producer hook. Network faults (partition, heal, delay_link, drop_next) plus instruct_equivocation. The latter caps off the harness's value: a fresh shadow simulated_node is caught up to height N-1 via canonical_blocks_from replay; a signed no-op account_metadata_operation pushed into the shadow's pool forces a different transaction_merkle_root; the shadow produces block_b at the same (when, witness) as prod's block_a; bus is partitioned {prod} vs {others} with no heal; block_a/block_b are routed asymmetrically. chains_consistent fires at the equivocation slot.
- harness/tx_factory: builds the no-op account_metadata_operation tx used by the shadow.
- share/vizd/docker/Dockerfile-dev (mirrors the production builder; mounts the worktree at /workspace).
- -DWITH_COVERAGE=ON + make consensus_sim_coverage emits a filtered gcovr HTML report.
- WITH_COVERAGE=ON; the production build is byte-identical.
Verification
Built and tested in the viz-dev container (Ubuntu noble, aarch64, -O1 -g -fsanitize=address,undefined).
- *** No errors detected, exit 0.
- seed_deadbeef_fires_chains_consistent: ~4.2 s. Fires chains_consistent at block 2 as expected.
- seed_sweep_one_hundred_all_fire: ~390 s. All 100 seeds produce the same chains_consistent violation; none miss.
- test_determinism_replay passes: two runs of the same seed produce byte-identical event logs.
- fc::static_variant is unchanged (documented in README, suppressed via ASAN_OPTIONS/UBSAN_OPTIONS env vars).
- fault_injector 84.8 %; gaps are the partition/heal/delay_link/drop_next helpers, which the instruct_equivocation scenario doesn't exercise via the public API (it manipulates the bus directly inside the slot-producer closure).
- functions, 7.6 % branches. That number is dominated by the chain's specialized evaluators, which this PR doesn't target.
Known limitations
- key rotation is a follow-up; equivocation works without it because CHAIN_NUM_INITIATORS=0 genesis means CHAIN_COMMITTEE_ACCOUNT owns every slot.
- instruct_equivocation partitions the bus and never heals; reorg behavior under heal is the next fault to script.
Test plan
- make consensus_sim_tests -j$(nproc) && ./tests/consensus_sim/consensus_sim_tests
- BUILD_CONSENSUS_TESTS=OFF (default) produces the same artifacts as master
- share/vizd/docker/Dockerfile-dev builds locally
- tests/consensus_sim/failures/ is gitignored except .gitkeep