
[DX-4035] shard go_core_*_tests#22286

Open
Tofel wants to merge 12 commits into develop from
dx-4035-experiment-with-go-core-test-sharding

Conversation

@Tofel
Contributor

@Tofel Tofel commented May 4, 2026

Approach

This PR shards go_core_tests, go_core_tests_integration, and go_core_deployment_tests in CI to reduce flakiness caused by running too many test packages in parallel in a single job.

The sharding logic is intentionally simple and deterministic. Each existing test runner first computes the exact package list it already intends to run today, then passes that package list to a small Go helper. The helper assigns each package to a shard by hashing the package path with FNV-1a and taking hash % shard_count. This gives a stable, zero-maintenance partitioning scheme: no manual package lists, no balancing tables, and new test packages are picked up automatically.
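The hashing rule described above can be sketched in a few lines of Go. This is an illustrative sketch only; the function and package names here are assumptions, not the actual tools/ci-testshard code:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor deterministically maps a package import path to a shard
// index in [0, shardCount) by hashing the path with FNV-1a.
func shardFor(pkgPath string, shardCount uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(pkgPath)) // Write on a hash.Hash never returns an error
	return h.Sum32() % shardCount
}

func main() {
	// Hypothetical package paths; the real input would come from the
	// package list each test runner already computes (e.g. go list ./...).
	for _, p := range []string{"core/services/job", "core/chains/evm", "deployment/ccip"} {
		fmt.Printf("%-20s -> shard %d\n", p, shardFor(p, 3))
	}
}
```

Because the shard index depends only on the package path and the shard count, re-running the matrix or adding new packages never reshuffles existing assignments unless the shard count itself changes.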

Testing

Correctness is guarded in two ways. First, a consistency check verifies that for a given package universe, the union of all shards covers every package exactly once. Second, I compared the executed test sets between unsharded and sharded runs by extracting package + test name from gotestsum --jsonfile output and diffing the sorted results. For go_core_tests, go_core_tests_integration, and go_core_deployment_tests, the executed test sets matched exactly between the non-sharded and sharded runs.
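The first check boils down to a simple invariant: every package in the universe maps to exactly one in-range shard. A minimal sketch of such a check, again illustrative rather than the helper's actual code:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor mirrors the FNV-1a assignment rule described in the PR.
func shardFor(pkgPath string, shardCount uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(pkgPath))
	return h.Sum32() % shardCount
}

// verifyCoverage returns an error unless the union of all shards
// covers every package in pkgs exactly once.
func verifyCoverage(pkgs []string, shardCount uint32) error {
	counts := make(map[string]int, len(pkgs))
	for shard := uint32(0); shard < shardCount; shard++ {
		for _, p := range pkgs {
			if shardFor(p, shardCount) == shard {
				counts[p]++
			}
		}
	}
	for _, p := range pkgs {
		if counts[p] != 1 {
			return fmt.Errorf("package %q assigned %d times, want 1", p, counts[p])
		}
	}
	return nil
}

func main() {
	pkgs := []string{"pkg/a", "pkg/b", "pkg/c", "pkg/d", "pkg/e"}
	if err := verifyCoverage(pkgs, 3); err != nil {
		panic(err)
	}
	fmt.Println("all shards cover the package universe exactly once")
}
```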

Results:

  • go_core_tests
  • go_core_deployment_tests
  • go_core_tests_integration

Rejected approaches

Sharding based on regex/test names

We considered naive sharding by test-name regex, but rejected it because Go executes tests at package granularity and the suite is not evenly distributed by test-name prefix. In practice, leading-letter buckets were highly imbalanced, and the dominant runtime/flakiness hotspots are concentrated in a small number of heavy packages. Package-path hashing gives a simpler, future-proof, zero-maintenance partitioning mechanism that aligns with go test execution semantics.


Follow up work

Speeding up outlier packages

To fully unlock the workflow speed-up, some of the deployment tests will need to be made faster by their respective teams. Once that is done, a plausible p50 for CI Core might land in the 10-11 minute range.

The current execution swimlane is shown below:
[image: CI Core execution swimlane]

Slowest packages that need speeding up:

ok  	github.com/smartcontractkit/chainlink/deployment/ccip/changeset/solana_v0_1_1	232.204s
ok  	github.com/smartcontractkit/chainlink/deployment/ccip/changeset/v1_6	383.140s
ok  	github.com/smartcontractkit/chainlink/deployment/ccip/changeset/aptos	683.586s

Sharding fuzz tests

As the graph above shows, the next big outlier is the fuzz tests. Once we shard them (2 shards should be enough), we should be able to cut workflow execution time down even further.

@github-actions
Contributor

github-actions Bot commented May 4, 2026

✅ No conflicts with other open PRs targeting develop

@trunk-io

trunk-io Bot commented May 4, 2026


Comment thread .github/workflows/ci-core.yml Fixed
Comment thread .github/workflows/ci-core.yml Fixed
Comment thread .github/workflows/ci-core.yml Fixed
Comment thread .github/workflows/ci-core.yml Fixed
Comment thread .github/workflows/ci-core.yml Fixed
Comment thread .github/workflows/ci-core.yml Fixed
Comment thread .github/workflows/ci-core.yml Fixed
Comment thread .github/workflows/ci-core.yml Fixed
@Tofel Tofel force-pushed the dx-4035-experiment-with-go-core-test-sharding branch 3 times, most recently from 72a9f6c to a9749e1 Compare May 4, 2026 16:15
@Tofel Tofel force-pushed the dx-4035-experiment-with-go-core-test-sharding branch from a9749e1 to cc0765d Compare May 4, 2026 17:27
Comment thread .github/workflows/ci-core.yml Fixed
Comment thread .github/workflows/ci-core.yml Fixed
@Tofel Tofel force-pushed the dx-4035-experiment-with-go-core-test-sharding branch from aca57a2 to 9e98a2f Compare May 5, 2026 06:13
@Tofel Tofel marked this pull request as ready for review May 5, 2026 07:20
@Tofel Tofel requested a review from a team as a code owner May 5, 2026 07:20
Copilot AI review requested due to automatic review settings May 5, 2026 07:20
@Tofel Tofel requested review from a team as code owners May 5, 2026 07:20
Contributor

Copilot AI left a comment


Pull request overview

Risk Rating: MEDIUM

This PR introduces deterministic package-level sharding for core Go test jobs in CI to reduce flakiness and improve overall workflow throughput by splitting large test suites across multiple parallel jobs.

Changes:

  • Added a small Go helper (tools/ci-testshard) to deterministically assign packages to shards (FNV-1a hash modulo shard count) and to verify shard coverage.
  • Updated go_core_tests, go_core_tests_integration, and go_core_deployment_tests runners to compute their package universe and run only the shard-assigned subset.
  • Refactored ci-core.yml to generate a shard matrix, add a shard-consistency check, aggregate shard results into suite-level “result” jobs, and update artifact naming / SonarQube path collection accordingly.

Areas for scrupulous human review

  • tools/bin/go_core_tests and tools/bin/go_core_deployment_tests: ensure the new “package universe” matches the previous CI intent (including whether compile-only/no-test packages must still be exercised).
  • .github/workflows/ci-core.yml: validate job dependency graph and result aggregation logic (matrix generation, gh api job selection by name prefix, and gating behavior in scan).

Reviewed changes

Copilot reviewed 3 out of 6 changed files in this pull request and generated no comments.

Summary per file:

  • tools/ci-testshard/main.go: New CLI utility for deterministic package sharding and shard coverage verification.
  • tools/ci-testshard/main_test.go: Unit tests covering package input parsing and sharding/verification behavior.
  • tools/bin/go_core_tests_integration: Computes the integration package list, shards it, and runs only shard-assigned packages.
  • tools/bin/go_core_tests: Computes the test package list, shards it, and runs only shard-assigned packages.
  • tools/bin/go_core_deployment_tests: Computes the deployment test package list, shards it, and runs only shard-assigned packages (within the deployment module).
  • .github/workflows/ci-core.yml: Adds shard matrix generation and a consistency check, fans out sharded jobs, aggregates results, and updates artifact/report handling.

@Tofel Tofel requested a review from jmank88 May 5, 2026 07:40
@Tofel Tofel changed the title shard some go_core tests [DX-4035] shard go_core tests May 5, 2026
@Tofel Tofel changed the title [DX-4035] shard go_core tests [DX-4035] shard go_core_*_tests May 5, 2026
@skudasov
Collaborator

skudasov commented May 5, 2026

If it solves resource contention and decreases flakiness, let's merge.
Ideally, later we should refactor and simplify the core test pipeline so it looks like a dead simple matrix:

  - display_name: "Test Go Core Common Shard #1 Tests"
    testcmd: "go test -v ./pkg_1 ./pkg_2"
    runner: "ubuntu-latest" # can be more beefy if it improves overall performance
  - display_name: "Test Go Core Common Shard #2 Tests"
    testcmd: "go test -v ./pkg_3 ./pkg_4"
    runner: "ubuntu-latest"
  - display_name: "Test Go Core Deployments Shard #1 Tests"
    testcmd: "go test -v ./pkg_5 ./pkg_6"
    runner: "ubuntu-latest"
  - display_name: "Flaky Shard #1 Common Tests"
    testcmd: "go test -v ./pkg_7"
    runner: "ubuntu-latest"

Example of a matrix job that's easy to read in my opinion.

I do understand we added Python and bash changes because the current pipeline is already overcomplicated, and it's good that we improved stats, but the long-term goal should be reducing the "go core" pipeline to a simple and very explicit matrix + adding automation in PRs to notify users that a new package is not yet included in CI.

To me, that's the simplest approach: easy to shard packages explicitly, easy to rebalance, easy to move tests between shards without reading any additional code. If that's not enough for some cases, we can also have a dynamic function which returns packages based on some criteria, as @Tofel already did: testcmd: "go test -v $(go run custom_shard.go $params)", which just returns packages.

skudasov
skudasov previously approved these changes May 5, 2026
@Tofel
Contributor Author

Tofel commented May 5, 2026

@skudasov explicit sharding is definitely the simplest, but I didn't go that route because it requires maintenance: new or modified package names have to be added explicitly, which from my POV increases the risk that some won't be executed if nobody remembers to add them.

That's why I went the way I did: no maintenance is required at all. The trade-off is higher complexity and less control over which bucket a test lands in.

Comment thread tools/ci-testshard/main.go Outdated
Comment thread tools/ci-testshard/main_test.go
mchain0
mchain0 previously approved these changes May 5, 2026
Comment on lines +616 to +620
gh api \
--paginate \
--jq '.jobs[] | {name, conclusion} | @json' \
"/repos/${GITHUB_REPOSITORY}/actions/runs/${GITHUB_RUN_ID}/jobs?per_page=100" \
> jobs.jsonl
Contributor

Is this the canonical way to get this information? It seems really odd to reach back outside and query yourself, and I would worry about whether it is actually atomically updated, or if it would be racing.

Contributor Author

I went that way so that we could keep a single job definition for all matrix jobs. Otherwise I'd need to either copy it or move it into a composite action, and I wanted to keep changes minimal.

That meant that I needed to find a way to aggregate sharded jobs results to satisfy branch protection rules. I couldn't use needs.core-tests.result, because it only gives the aggregate result of the whole matrix job, not per child. GitHub Actions does not expose matrix child conclusions individually through needs, expressions, or normal job outputs.

Without using the GH API I'd need to come up with a custom way of sharing results (e.g. via artifacts), which would make the whole thing even more complex, because I'd need to handle Trunk's rewriting of test results. So I went the API way, which returns the final result of each job.

What makes you think that races are possible if this job runs only after tests job has finished?

Contributor

What makes you think that races are possible if this job runs only after tests job has finished?

Many distributed systems do not naturally guarantee this, since global consistency and atomicity are costly and don't scale, so I think depending on it is speculative unless GitHub has documented that it is OK.

@jmank88
Contributor

jmank88 commented May 5, 2026

What is the overall before vs. after here?

"run-timeout-minutes": "15",
"touch-web-assets": "true",
"restore-build-cache-only": "true",
"increase-timeouts-on-schedule": "true",
Contributor

Why is there an additional qualifier for timeout increases?

Contributor Author

That's how it worked before: we increase the default timeouts for fuzz/race tests and pass them as env vars to the bash scripts that run them.

Contributor

There was not an extra qualifier before. We simply checked if we had a scheduled run, then increased timeouts.

"touch-web-assets": "true",
"restore-build-cache-only": "true",
"increase-timeouts-on-schedule": "true",
"print-races": "true",
Contributor

Why do we need an extra variable to ensure that races are printed?

Contributor Author

Contributor

But that code that you linked does not include an extra parameter to indicate if races should be printed. It checks if we are running races via matrix.type.cmd == 'go_core_race_tests', and if so it prints them. If we need a different means of checking whether we are in race mode now, then so be it, but I don't think we should add extra params for each action that is implied.

Suggested change
"print-races": "true",
"race": "true",

"logs-artifact-name": f"go_core_tests_integration_logs_shard_{i}",
})

def add_core_deployment_rows():
Contributor

Why don't we need touch-web-assets for this one? Isn't that necessary to compile the core binary?

Contributor Author

@Tofel Tofel May 5, 2026

Because these tests run in the deployment folder, which is a separate Go module that doesn't compile the core/web package.

Contributor

What is the rationale for including tests from that module here? This is otherwise a consistent set of variations on the core tests.

@Tofel
Contributor Author

Tofel commented May 5, 2026

What is the overall before vs. after here?

@jmank88 Roughly speaking it is the same, around 17 minutes, because the 3 packages mentioned in the PR create a long tail. If we do go the sharding route, I will go after the teams that own them and ask them to speed those packages up so that we can reap the benefits of sharding.

Until then the only benefit will be less flakiness.

@jmank88
Contributor

jmank88 commented May 5, 2026

What is the overall before vs. after here?

@jmank88 Roughly speaking it is the same, around 17 minutes, because 3 packages mentioned in the PR are resulting in a long tail. If we do go the sharding way then I will go after teams that own them and request they speed them up so that we can reap the benefits of sharding.

Until then the only benefit will be less flakiness.

Huh, I only expected a speedup. Why would there be less flakiness? I don't think it is true that:

flakiness caused by running too many test packages in parallel in a single job.

Too many packages is not inherently a problem because concurrency is already limited. There has to be more to it than that. Is the database contention the problem?

@Tofel
Contributor Author

Tofel commented May 5, 2026

Huh, I only expected speedup.

Hard to get a speedup when sharding at the package level if we have packages that run for 11 minutes. If those are sped up, then we will see a big speedup.

Too many packages is not inherently a problem because concurrency is already limited.

You mean limited by GOMAXPROCS?

Is the database contention the problem?

I think it is one of the problems. Since every job will now start its own DB, issues related to DB contention should also have a lower impact.

@Tofel Tofel dismissed stale reviews from mchain0 and skudasov via b0aae99 May 5, 2026 12:33
@Tofel Tofel force-pushed the dx-4035-experiment-with-go-core-test-sharding branch from b0aae99 to b1e1a14 Compare May 5, 2026 12:36
@jmank88
Contributor

jmank88 commented May 5, 2026

You mean limited by GOMAXPROCS?

Indirectly, yes. Both package and test level concurrency.

Is the database contention the problem?

I think it is one of the problems. Since now every job will start its own DB, issues that were related to DB contention should also have lower impact.

IMHO it is important to be certain about this. If avoiding DB contention is the main difference, then there are simpler approaches that we can try, and some of those will benefit local development as well rather than addressing only CI.

"pkg/c",
"pkg/d",
"pkg/e",
"pkg/f",
Contributor

not seeing a duplicate case

mchain0
mchain0 previously approved these changes May 5, 2026
