Skip to content

[DX-4017] [DX-3716] Core Test Timeout Fix#22281

Open
kalverra wants to merge 32 commits intodevelopfrom
coreTestTimeoutFix
Open

[DX-4017] [DX-3716] Core Test Timeout Fix#22281
kalverra wants to merge 32 commits intodevelopfrom
coreTestTimeoutFix

Conversation

@kalverra
Copy link
Copy Markdown
Collaborator

@kalverra kalverra commented May 1, 2026

chainlink-test-diagnosis Skill

Introduces an AI agent skill chainlink-test-diagnosis that wraps the diagnose command to narrow down flakes, slow tests, and hangs. I used this skill to fix the many flakes in the github.com/smartcontractkit/chainlink/v2/core/services/vrf/v2 package and included these in the PR. This includes a timeout that plagues CI Core and makes a normally 5-6 minute test run timeout after 15m+.

To verify the flakes were fixed, I re-ran all tests in the package 200 times and encountered no issues! The first runs were showing flake rates of 10%+ for the package.

Other Improvements

  • Improved the performance and UX of the tools/test harness, especially the diagnose command.
  • Some minor linting and modernization fixes.
  • Moving tests from gomega to testify where possible.

Extra Review Attention

Please pay attention to this change as it stops some race-conditions in test panics, but I fear it could be masking errors?

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 1, 2026

I see you updated files related to core. Please run make gocs in the root directory to add a changeset as well as in the text include at least one of the following tags:

  • #added For any new functionality added.
  • #breaking_change For any functionality that requires manual action for the node to boot.
  • #bugfix For bug fixes.
  • #changed For any change to the existing functionality.
  • #db_update For any feature that introduces updates to database schema.
  • #deprecation_notice For any upcoming deprecation functionality.
  • #internal For changesets that need to be excluded from the final changelog.
  • #nops For any feature that is NOP facing and needs to be in the official Release Notes for the release.
  • #removed For any functionality/config that is removed.
  • #updated For any functionality that is updated.
  • #wip For any change that is not ready yet and external communication about it should be held off till it is feature complete.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 1, 2026

✅ No conflicts with other open PRs targeting develop

@trunk-io
Copy link
Copy Markdown

trunk-io Bot commented May 1, 2026

Static BadgeStatic BadgeStatic BadgeStatic Badge

View Full Report ↗︎Docs

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Risk Rating: MEDIUM

This PR aims to reduce intermittent CI timeouts/flakiness by improving the tools/test diagnose harness (parallel iterations, smarter fail-fast, better output/logging metadata) and by hardening several flaky/timeout-prone Core tests (notably VRF v2 integration tests and LLO telemetry sampling).

Changes:

  • Extend tools/test diagnose with parallel iteration execution, richer progress/output plumbing, fail-fast categories, and improved report/log metadata.
  • Add persistent test DB lifecycle commands (setup-testdb, remove-testdb) and database pooling to support parallel diagnose safely.
  • Update multiple Core tests to avoid LogPoller indexing races and to use more robust timeouts/synchronization patterns.

Areas requiring scrupulous human review

  • tools/test/internal/runner/runner.go: parallel iteration scheduling/cancellation and progress rendering interplay.
  • tools/test/internal/db/db.go: pool creation semantics (env propagation, global CL_DATABASE_URL behavior, cleanup on partial failure).
  • VRF v2 integration test changes around filtering/indexing boundaries (commitRequestAndFilterIndexBlock, indexedFilterOpts) to ensure they’re correct under simulated backend semantics and don’t mask real regressions.
  • tools/test/main.go --sync-skills: filesystem safety (symlink handling) given RemoveAll + recursive copy.

Reviewed changes

Copilot reviewed 49 out of 50 changed files in this pull request and generated 11 comments.

Show a summary per file
File Description
.gitignore Ignore root .claude/* while allowing tools/test skills directory pattern.
README.md Document new tools/test flow and link to the AI skill.
core/internal/testutils/testutils.go Add WaitTimeoutCustom and change WaitTimeout to cap against requested duration.
core/internal/testutils/testutils_test.go Add tests for new timeout helpers.
core/logger/internal/colortest/prettyconsole_test.go Skip color assertions when ANSI is disabled.
core/services/llo/telem/sampling.go Use wg.Go pattern for pruning loop.
core/services/llo/telem/sampling_test.go Stabilize time-based behavior; use synctest for deterministic goroutine exit.
core/services/llo/telem/telemetry.go Switch goroutine spawn to wg.Go for channel handlers.
core/services/llo/telem/telemetry_test.go Update pointer construction and loops in telemetry tests.
core/services/vrf/v2/bhs_feeder_test.go Make heartbeat assertions robust under parallel load by polling tip-256 dynamically.
core/services/vrf/v2/integration_helpers_test.go Replace Gomega polling with require.Eventually*; improve receipt/log parsing patterns.
core/services/vrf/v2/integration_v2_plus_test.go Parse SubscriptionCreated via tx receipt instead of log filters.
core/services/vrf/v2/integration_v2_reverted_txns_test.go Increase timeout for force-fulfillment; use bounded filter opts.
core/services/vrf/v2/integration_v2_test.go Reduce LogPoller race by retrying filters with bounded opts; add helper filter opts builders.
core/services/vrf/v2/listener_v2_log_processor.go Switch goroutine management to wg.Go for request processing.
core/services/vrf/v2/reverted_txns.go Avoid Fatalw on context cancellation for reverted-tx polling.
tools/test/AGENTS.md Document harness output policy and skill location.
tools/test/.agents/skills/chainlink-test-diagnosis/SKILL.md Add/define the new diagnose AI skill content and CLI reference.
tools/test/README.md Document --fail-fast-on examples and update skill link.
tools/test/internal/cmd/diagnose.go Add --parallel-iterations, --fail-fast-on, config validation, and DB pool wiring.
tools/test/internal/cmd/gotestsum.go Route cleanup errors through output printer.
tools/test/internal/cmd/root.go Skip DB pre-run for diagnose/setup/remove-testdb; wire output into DB ensure.
tools/test/internal/cmd/root_test.go Update command path expectations; add validation tests for diagnose config.
tools/test/internal/cmd/run.go Use output printer for cleanup error reporting.
tools/test/internal/cmd/testdb.go Add setup-testdb / remove-testdb commands for persistent Postgres.
tools/test/internal/config/config.go Add parallel iterations + fail-fast-on config and normalization.
tools/test/internal/config/config_test.go Ensure new flags bind and normalize as expected.
tools/test/internal/db/db.go Add DB pool + resource model; integrate output printer.
tools/test/internal/db/db_test.go Update Ensure signature usage; add tests for pool behaviors and env.
tools/test/internal/db/persistent.go Start/remove persistent Postgres containers; shell export helpers.
tools/test/internal/db/persistent_test.go Test label filter and shell export quoting helpers.
tools/test/internal/output/output.go Introduce output.Printer to centralize human vs AI output and TTY progress rules.
tools/test/internal/output/output_test.go Validate printer behavior (AI/human gating, live inline detection).
tools/test/internal/runner/analyze.go Add run metadata, per-problem log references, iteration digest, and log filename truncation logic.
tools/test/internal/runner/analyze_bench_test.go Add benchmarks using real testdata JSONL payloads.
tools/test/internal/runner/analyze_files_test.go Update log-file expectations for new ProblemLog model and “only problem iterations” behavior.
tools/test/internal/runner/analyze_test.go Add digest tests; adjust expectations to ignore internal-only fields.
tools/test/internal/runner/diagnose_progress.go Add parallel progress tracking and updated inline progress formatting.
tools/test/internal/runner/diagnose_progress_test.go Update expectations for new inline format; test parallel progress rendering.
tools/test/internal/runner/diagnose_results_dir.go Simplify results dir naming to slug+timestamp (full args moved into report metadata).
tools/test/internal/runner/diagnose_table.go Add streaming iteration summary table formatting.
tools/test/internal/runner/diagnose_table_test.go Add tests for table header/row formatting.
tools/test/internal/runner/runner.go Major refactor: parallel iterations, live progress, fail-fast-on categories, run meta injection, result dir creation.
tools/test/internal/runner/runner_test.go Update tests for new diagnose signature and new behaviors (parallelism, fail-fast-on, results dir uniqueness).
tools/test/internal/runner/testdata/iteration-0.log.jsonl Add benchmark/analyze fixture log.
tools/test/internal/runner/testdata/iteration-2.log.jsonl Add benchmark/analyze fixture log.
tools/test/main.go Add go:generate sync to copy skills into .claude/skills; implement copy logic.
tools/test/main_test.go Add tests for skill sync safety and mode preservation.

Comment thread core/services/llo/telem/telemetry_test.go
Comment thread core/services/llo/telem/telemetry_test.go
Comment thread core/services/vrf/v2/integration_helpers_test.go
Comment thread core/services/vrf/v2/integration_helpers_test.go
Comment thread core/internal/testutils/testutils_test.go Outdated
Comment thread tools/test/internal/runner/analyze.go
Comment thread tools/test/internal/runner/analyze.go Outdated
Comment thread core/services/vrf/v2/integration_helpers_test.go Outdated
Comment thread core/services/vrf/v2/integration_helpers_test.go Outdated
Comment thread tools/test/main.go Outdated
Comment thread core/services/vrf/v2/reverted_txns.go Outdated

// WaitTimeoutCustom for longer timeouts, up to the test's Deadline, or the requested timeout, whichever is shorter.
// Use when DefaultWaitTimeout is too short for your specific test.
func WaitTimeoutCustom(t *testing.T, requested time.Duration) time.Duration {
Copy link
Copy Markdown
Contributor

@jmank88 jmank88 May 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this change the logic of WaitTimeout? It used to alway prefer 90% no matter how that compared to the default. Now it would return the default instead if that were shorter, but that will artificially limit tests that we want to use 90%.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ya. My thought was "let's force people to use cleaner timeouts" but I'm realizing that's pretty heavy-handed and unrealistic in this case.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would re-introduce a lot of chaos. Switching from static, artificially short timeouts over to these dynamic extended ones was a big win originally. I think we've only increased concurrency and non-determinism since then.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that logic is still in this version. Why is it better to quit early, when the suite is configured to run longer? Scaling to that limit is a feature - e.g. What if I have an older machine? Am I expected to edit the code to get tests to pass?

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 49 out of 51 changed files in this pull request and generated 7 comments.

Comments suppressed due to low confidence (1)

core/services/llo/telem/telemetry_test.go:363

  • new("test error") is not valid Go (the built-in new takes a type, not a value). Use a pointer helper (e.g., ptr("test error")) or take the address of a local string variable instead.
		ch <- &LLOObservationTelemetry{
			StreamId:              135,
			StreamValueType:       1,
			StreamValueBinary:     []byte{0x01, 0x02, 0x03},
			StreamValueText:       "stream value text",
			ObservationError:      new("test error"),
			ObservationTimestamp:  time.Unix(1, 1).UnixNano(),
			ObservationFinishedAt: time.Unix(2, 1).UnixNano(),
			SeqNr:                 42,
			ConfigDigest:          []byte{0x01, 0x02, 0x03},
		}

Comment thread core/services/llo/telem/telemetry_test.go
Comment thread core/services/vrf/v2/integration_helpers_test.go
Comment thread core/services/vrf/v2/integration_helpers_test.go
Comment thread tools/test/main.go Outdated
Comment thread tools/test/internal/runner/runner.go
Comment thread tools/test/internal/db/db.go
Comment thread tools/test/internal/db/persistent.go
Comment thread tools/test/main.go Outdated
@kalverra kalverra changed the title [DX-4017] Core Test Timeout Fix [DX-4017] [DX-3716] Core Test Timeout Fix May 5, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 49 out of 51 changed files in this pull request and generated 2 comments.

Comment thread core/services/vrf/v2/integration_v2_plus_test.go
Comment thread core/services/vrf/v2/integration_helpers_test.go
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 49 out of 51 changed files in this pull request and generated 2 comments.

Comment thread core/services/vrf/v2/integration_helpers_test.go
Comment thread core/services/vrf/v2/integration_helpers_test.go
@cl-sonarqube-production
Copy link
Copy Markdown

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants