[DX-4017] [DX-3716] Core Test Timeout Fix#22281
Conversation
…chainlink into coreTestTimeoutFix
|
I see you updated files related to
|
|
✅ No conflicts with other open PRs targeting |
There was a problem hiding this comment.
Pull request overview
Risk Rating: MEDIUM
This PR aims to reduce intermittent CI timeouts/flakiness by improving the tools/test diagnose harness (parallel iterations, smarter fail-fast, better output/logging metadata) and by hardening several flaky/timeout-prone Core tests (notably VRF v2 integration tests and LLO telemetry sampling).
Changes:
- Extend
tools/test diagnosewith parallel iteration execution, richer progress/output plumbing, fail-fast categories, and improved report/log metadata. - Add persistent test DB lifecycle commands (
setup-testdb,remove-testdb) and database pooling to support parallel diagnose safely. - Update multiple Core tests to avoid LogPoller indexing races and to use more robust timeouts/synchronization patterns.
Areas requiring scrupulous human review
tools/test/internal/runner/runner.go: parallel iteration scheduling/cancellation and progress rendering interplay.tools/test/internal/db/db.go: pool creation semantics (env propagation, global CL_DATABASE_URL behavior, cleanup on partial failure).- VRF v2 integration test changes around filtering/indexing boundaries (
commitRequestAndFilterIndexBlock,indexedFilterOpts) to ensure they’re correct under simulated backend semantics and don’t mask real regressions. tools/test/main.go --sync-skills: filesystem safety (symlink handling) givenRemoveAll+ recursive copy.
Reviewed changes
Copilot reviewed 49 out of 50 changed files in this pull request and generated 11 comments.
Show a summary per file
| File | Description |
|---|---|
| .gitignore | Ignore root .claude/* while allowing tools/test skills directory pattern. |
| README.md | Document new tools/test flow and link to the AI skill. |
| core/internal/testutils/testutils.go | Add WaitTimeoutCustom and change WaitTimeout to cap against requested duration. |
| core/internal/testutils/testutils_test.go | Add tests for new timeout helpers. |
| core/logger/internal/colortest/prettyconsole_test.go | Skip color assertions when ANSI is disabled. |
| core/services/llo/telem/sampling.go | Use wg.Go pattern for pruning loop. |
| core/services/llo/telem/sampling_test.go | Stabilize time-based behavior; use synctest for deterministic goroutine exit. |
| core/services/llo/telem/telemetry.go | Switch goroutine spawn to wg.Go for channel handlers. |
| core/services/llo/telem/telemetry_test.go | Update pointer construction and loops in telemetry tests. |
| core/services/vrf/v2/bhs_feeder_test.go | Make heartbeat assertions robust under parallel load by polling tip-256 dynamically. |
| core/services/vrf/v2/integration_helpers_test.go | Replace Gomega polling with require.Eventually*; improve receipt/log parsing patterns. |
| core/services/vrf/v2/integration_v2_plus_test.go | Parse SubscriptionCreated via tx receipt instead of log filters. |
| core/services/vrf/v2/integration_v2_reverted_txns_test.go | Increase timeout for force-fulfillment; use bounded filter opts. |
| core/services/vrf/v2/integration_v2_test.go | Reduce LogPoller race by retrying filters with bounded opts; add helper filter opts builders. |
| core/services/vrf/v2/listener_v2_log_processor.go | Switch goroutine management to wg.Go for request processing. |
| core/services/vrf/v2/reverted_txns.go | Avoid Fatalw on context cancellation for reverted-tx polling. |
| tools/test/AGENTS.md | Document harness output policy and skill location. |
| tools/test/.agents/skills/chainlink-test-diagnosis/SKILL.md | Add/define the new diagnose AI skill content and CLI reference. |
| tools/test/README.md | Document --fail-fast-on examples and update skill link. |
| tools/test/internal/cmd/diagnose.go | Add --parallel-iterations, --fail-fast-on, config validation, and DB pool wiring. |
| tools/test/internal/cmd/gotestsum.go | Route cleanup errors through output printer. |
| tools/test/internal/cmd/root.go | Skip DB pre-run for diagnose/setup/remove-testdb; wire output into DB ensure. |
| tools/test/internal/cmd/root_test.go | Update command path expectations; add validation tests for diagnose config. |
| tools/test/internal/cmd/run.go | Use output printer for cleanup error reporting. |
| tools/test/internal/cmd/testdb.go | Add setup-testdb / remove-testdb commands for persistent Postgres. |
| tools/test/internal/config/config.go | Add parallel iterations + fail-fast-on config and normalization. |
| tools/test/internal/config/config_test.go | Ensure new flags bind and normalize as expected. |
| tools/test/internal/db/db.go | Add DB pool + resource model; integrate output printer. |
| tools/test/internal/db/db_test.go | Update Ensure signature usage; add tests for pool behaviors and env. |
| tools/test/internal/db/persistent.go | Start/remove persistent Postgres containers; shell export helpers. |
| tools/test/internal/db/persistent_test.go | Test label filter and shell export quoting helpers. |
| tools/test/internal/output/output.go | Introduce output.Printer to centralize human vs AI output and TTY progress rules. |
| tools/test/internal/output/output_test.go | Validate printer behavior (AI/human gating, live inline detection). |
| tools/test/internal/runner/analyze.go | Add run metadata, per-problem log references, iteration digest, and log filename truncation logic. |
| tools/test/internal/runner/analyze_bench_test.go | Add benchmarks using real testdata JSONL payloads. |
| tools/test/internal/runner/analyze_files_test.go | Update log-file expectations for new ProblemLog model and “only problem iterations” behavior. |
| tools/test/internal/runner/analyze_test.go | Add digest tests; adjust expectations to ignore internal-only fields. |
| tools/test/internal/runner/diagnose_progress.go | Add parallel progress tracking and updated inline progress formatting. |
| tools/test/internal/runner/diagnose_progress_test.go | Update expectations for new inline format; test parallel progress rendering. |
| tools/test/internal/runner/diagnose_results_dir.go | Simplify results dir naming to slug+timestamp (full args moved into report metadata). |
| tools/test/internal/runner/diagnose_table.go | Add streaming iteration summary table formatting. |
| tools/test/internal/runner/diagnose_table_test.go | Add tests for table header/row formatting. |
| tools/test/internal/runner/runner.go | Major refactor: parallel iterations, live progress, fail-fast-on categories, run meta injection, result dir creation. |
| tools/test/internal/runner/runner_test.go | Update tests for new diagnose signature and new behaviors (parallelism, fail-fast-on, results dir uniqueness). |
| tools/test/internal/runner/testdata/iteration-0.log.jsonl | Add benchmark/analyze fixture log. |
| tools/test/internal/runner/testdata/iteration-2.log.jsonl | Add benchmark/analyze fixture log. |
| tools/test/main.go | Add go:generate sync to copy skills into .claude/skills; implement copy logic. |
| tools/test/main_test.go | Add tests for skill sync safety and mode preservation. |
|
|
||
| // WaitTimeoutCustom for longer timeouts, up to the test's Deadline, or the requested timeout, whichever is shorter. | ||
| // Use when DefaultWaitTimeout is too short for your specific test. | ||
| func WaitTimeoutCustom(t *testing.T, requested time.Duration) time.Duration { |
There was a problem hiding this comment.
Doesn't this change the logic of WaitTimeout? It used to alway prefer 90% no matter how that compared to the default. Now it would return the default instead if that were shorter, but that will artificially limit tests that we want to use 90%.
There was a problem hiding this comment.
Ya. My thought was "let's force people to use cleaner timeouts" but I'm realizing that's pretty heavy-handed and unrealistic in this case.
There was a problem hiding this comment.
I think it would re-introduce a lot of chaos. Switching from static, artificially short timeouts over to these dynamic extended ones was a big win originally. I think we've only increased concurrency and non-determinism since then.
There was a problem hiding this comment.
I see that logic is still in this version. Why is it better to quit early, when the suite is configured to run longer? Scaling to that limit is a feature - e.g. What if I have an older machine? Am I expected to edit the code to get tests to pass?
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 49 out of 51 changed files in this pull request and generated 7 comments.
Comments suppressed due to low confidence (1)
core/services/llo/telem/telemetry_test.go:363
new("test error")is not valid Go (the built-innewtakes a type, not a value). Use a pointer helper (e.g.,ptr("test error")) or take the address of a local string variable instead.
ch <- &LLOObservationTelemetry{
StreamId: 135,
StreamValueType: 1,
StreamValueBinary: []byte{0x01, 0x02, 0x03},
StreamValueText: "stream value text",
ObservationError: new("test error"),
ObservationTimestamp: time.Unix(1, 1).UnixNano(),
ObservationFinishedAt: time.Unix(2, 1).UnixNano(),
SeqNr: 42,
ConfigDigest: []byte{0x01, 0x02, 0x03},
}
|




chainlink-test-diagnosisSkillIntroduces an AI agent skill chainlink-test-diagnosis that wraps the
diagnosecommand to narrow down flakes, slow tests, and hangs. I used this skill to fix the many flakes in thegithub.com/smartcontractkit/chainlink/v2/core/services/vrf/v2package and included these in the PR. This includes a timeout that plagues CI Core and makes a normally 5-6 minute test run timeout after 15m+.To verify the flakes were fixed, I re-ran all tests in the package 200 times and encountered no issues! The first runs were showing flake rates of 10%+ for the package.
Other Improvements
tools/testharness, especially thediagnosecommand.gomegatotestifywhere possible.Extra Review Attention
Please pay attention to this change as it stops some race-conditions in test panics, but I fear it could be masking errors?