Skip to content

[ML] Add serverless MKI testing for ml-cpp PR builds#3027

Open
edsavage wants to merge 12 commits intoelastic:mainfrom
edsavage:feature/serverless-pr-testing
Open

[ML] Add serverless MKI testing for ml-cpp PR builds#3027
edsavage wants to merge 12 commits intoelastic:mainfrom
edsavage:feature/serverless-pr-testing

Conversation

@edsavage
Copy link
Copy Markdown
Contributor

@edsavage edsavage commented Apr 16, 2026

Summary

Adds two automated workflows for testing custom ml-cpp builds in the serverless MKI environment, replacing the manual process documented in elastic/ml-team#1243.

  1. Serverless E2E tests: builds the Docker image with custom ml-cpp and runs the full serverless test suite (functional tests, Docker build, E2E tests on MKI QA across AWS/Azure/GCP).
  2. Serverless QA deploy: builds the Docker image and deploys it to QA for interactive use (no tests). The developer gets a running serverless environment they can curl, deploy models to, run queries against, etc. See https://github.com/elastic/elasticsearch-serverless?tab=readme-ov-file#deploy-snapshot-into-in-qa-environment for more details.

Both workflows inject custom ml-cpp at Docker image build time via a local Ivy repo override (-Dbuild.ml_cpp.repo), using Buildkite's native cross-pipeline artifact download. No cross-repo git cloning, no AWS credentials, and no manual kubectl/scp steps.

How to trigger

Run serverless E2E tests

PR comment:

buildkite run_serverless_tests

PR label: ci:run-serverless-tests

Deploy to serverless QA for interactive use

PR comment:

buildkite deploy_serverless_qa

PR label: ci:deploy-serverless-qa

Optional parameters on buildkite … comments

When you trigger via a buildkite … PR comment (not via labels alone), you may append additional space-separated KEY=value tokens after the normal comment text (including optional clauses such as for ES_BRANCH=…, on linux x86_64, etc.). Those keys are applied to the generated Buildkite pipeline environment for the serverless upload steps.

Allowed keys (each value must be a single token—no spaces):

Key Notes
KEEP_DEPLOYMENT Only true or false (case-insensitive); any other value is ignored.
REGION_ID e.g. aws-eu-west-1, gcp-us-central-1
PROJECT_TYPE e.g. elasticsearch; downstream scripts apply defaults where documented.
ES_SERVERLESS_BRANCH Override the elasticsearch-serverless branch selection for this build.

Examples:

buildkite deploy_serverless_qa KEEP_DEPLOYMENT=true
buildkite run_serverless_tests on linux x86_64 REGION_ID=gcp-us-central-1 PROJECT_TYPE=elasticsearch

Labels (ci:run-serverless-tests, ci:deploy-serverless-qa) do not parse this tail. For label-only runs, set overrides as Buildkite environment variables on the build (or in the New Build form) if you need non-defaults.

The QA deployment stays up for 1 hour by default. Use KEEP_DEPLOYMENT=true in the comment tail or as a Buildkite env var to keep it longer. Build annotations show the URL and encrypted credentials.

How it works

  1. ml-cpp pipeline builds C++ for Linux x86_64 + aarch64 (in parallel)
  2. dev-tools/minimal.zip is uploaded as a Buildkite artifact
  3. The pipeline resolves:
  • elasticsearch-serverless branch: checks PR author's fork, then elastic/, falls back to main
  • ES submodule commit: checks PR author's elasticsearch fork for a matching branch, falls back to latest main
  • ES PR number: looks for a matching open ES PR, falls back to any recent open PR
  1. Triggers the serverless pipeline (es-pr-check for tests, deploy-qa for interactive deploy) with ML_CPP_BUILD_ID
  2. The serverless pipeline's publish-docker-artifacts.sh detects ML_CPP_BUILD_ID, downloads the ml-cpp artifacts from the triggering build via buildkite-agent artifact download --build, creates a local Ivy repo, and builds the Docker image with -Dbuild.ml_cpp.repo

Coordinated changes

  • ml-cpp + elasticsearch: if the developer has a matching branch on their elasticsearch fork, the serverless pipeline checks out that ES commit via the submodule
  • ml-cpp + elasticsearch-serverless: if a matching branch exists on elastic/elasticsearch-serverless, the pipeline uses it automatically

Files

New

  • .buildkite/pipelines/run_serverless_tests.yml.sh -- E2E test pipeline
  • .buildkite/pipelines/deploy_serverless_qa.yml.sh -- QA deploy pipeline

Modified

  • .buildkite/ml_pipeline/config.py -- run_serverless_tests and deploy_serverless_qa flags; parses optional comment tail into os.environ
  • .buildkite/pipeline.json.py -- wires both pipelines into the PR pipeline; forwards serverless-related env vars when those steps are selected
  • .buildkite/pull-requests.json -- adds both actions to the trigger comment regex and optional KEY=value tail capture

Required companion change

elastic/elasticsearch-serverless#6364 -- adds ML_CPP_BUILD_ID support to publish-docker-artifacts.sh

Relates elastic/ml-team#1243

@prodsecmachine
Copy link
Copy Markdown

prodsecmachine commented Apr 16, 2026

Snyk checks have passed. No issues have been found so far.

Status Scan Engine Critical High Medium Low Total (0)
Open Source Security 0 0 0 0 0 issues
Licenses 0 0 0 0 0 issues

💻 Catch issues earlier using the plugins for VS Code, JetBrains IDEs, Visual Studio, and Eclipse.

@edsavage edsavage force-pushed the feature/serverless-pr-testing branch 4 times, most recently from d089817 to 86a5bb1 Compare April 16, 2026 04:08
Add pipeline and scripts to build a serverless ES Docker image
with custom ml-cpp artifacts and run E2E tests against MKI QA.

Triggered via PR comment (`buildkite run_serverless_tests`) or
label (`ci:run-serverless-tests`). Builds Linux x86_64 ml-cpp,
sets up a local Ivy repo, clones elasticsearch-serverless, builds
the Docker image with -Dbuild.ml_cpp.repo override, pushes to the
CI registry, and triggers the E2E QA pipeline.

Relates elastic/ml-team#1243

Made-with: Cursor
@edsavage edsavage force-pushed the feature/serverless-pr-testing branch from 86a5bb1 to 0c5b4ea Compare April 16, 2026 04:11
@elasticsearchmachine
Copy link
Copy Markdown

Pinging @elastic/ml-core (Team:ML)

@edsavage edsavage force-pushed the feature/serverless-pr-testing branch 12 times, most recently from 6a83fb5 to c07be04 Compare April 17, 2026 04:35
Comment thread .buildkite/pipelines/run_serverless_tests.yml.sh
Comment thread .buildkite/pipelines/run_serverless_tests.yml.sh Outdated
Comment thread .buildkite/pipeline.json.py Outdated
Comment thread .buildkite/pipelines/run_serverless_tests.yml.sh Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the purpose of this file?

@edsavage edsavage marked this pull request as draft April 19, 2026 23:21
@edsavage edsavage force-pushed the feature/serverless-pr-testing branch from c07be04 to a32e98b Compare April 19, 2026 23:25
Instead of cloning elasticsearch-serverless from the ml-cpp
pipeline (which requires cross-repo GitHub token access), upload
the custom ml-cpp artifacts to an S3 staging path under the
existing prelert-artifacts bucket, then trigger the serverless
validation pipeline which runs in its own context with full
permissions.

The serverless pipeline picks up the custom artifacts via the
ML_CPP_REPO_OVERRIDE env var, which is passed through as
-Dbuild.ml_cpp.repo to the Gradle build. This requires a small
change in elasticsearch-serverless (documented in
docs/serverless-integration/).

Made-with: Cursor
@edsavage edsavage force-pushed the feature/serverless-pr-testing branch from a32e98b to 0b5d1db Compare April 19, 2026 23:44
Reduce duplication by refactoring shell script routines to extract
common functionality.
Allows for setting a different project type than the "elasticsearch"
default, e.g. "observability", "security".
@edsavage edsavage marked this pull request as ready for review April 21, 2026 01:47
Expose the ml-cpp repo commit to the triggered serverless build so
pipeline.pr.yml.sh can fold it into IMAGE_TAG_SUFFIX with ML_CPP_BUILD_ID.

Made-with: Cursor
@edsavage edsavage requested a review from valeriy42 April 23, 2026 22:17
Extend trigger_comment_regex with optional serverless_kv capture (space-
separated KEY=value tokens). Parse GITHUB_PR_COMMENT_VAR_SERVERLESS_KV in
config and whitelist KEEP_DEPLOYMENT, REGION_ID, PROJECT_TYPE,
ES_SERVERLESS_BRANCH into os.environ. Forward those to pipeline env when
serverless deploy or test steps are selected.

Made-with: Cursor
Copy link
Copy Markdown
Contributor

@valeriy42 valeriy42 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @edsavage ,

I noticed that you force-pushed a number of commits, but my comments particularly around the code duplication were neither addressed nor commented. Could it be that this was lost during the commits?

@edsavage
Copy link
Copy Markdown
Contributor Author

edsavage commented Apr 27, 2026

Hi @edsavage ,

I noticed that you force-pushed a number of commits, but my comments particularly around the code duplication were neither addressed nor commented. Could it be that this was lost during the commits?

:-( So sorry Valeriy, looks like I've messed up here a bit. I'll address the code duplication concerns shortly. I dealt with duplicated code/logic in pickCloneTarget but completely missed the duplication in the elasticsearch-serverless branch picker.

Extract duplicate elasticsearch-serverless branch resolution from
run_serverless_tests.yml.sh and deploy_serverless_qa.yml.sh into
pick_elasticsearch_serverless_branch.sh, alongside
pick_elasticsearch_clone_target.sh for the elasticsearch repo.

Addresses PR 3027 review feedback on duplicated fork/branch logic.

Made-with: Cursor
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds Buildkite-driven serverless MKI workflows to let ml-cpp PR builds (a) trigger full serverless E2E validation and (b) deploy a custom image to serverless QA for interactive testing, using ml-cpp artifacts and coordinated ES/serverless branch/commit selection.

Changes:

  • Introduces two new Buildkite pipeline-generator scripts to trigger serverless E2E tests and QA deploys using custom ml-cpp artifacts.
  • Refactors Elasticsearch fork/branch selection into a shared dev-tools/pick_elasticsearch_clone_target.sh helper and adds a new serverless branch picker.
  • Extends PR comment/label parsing and pipeline wiring to expose the new actions and forward optional serverless-related env overrides.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
dev-tools/run_es_tests_common.sh Reuses shared ES fork/branch selection logic via sourced helper.
dev-tools/pick_elasticsearch_clone_target.sh New shared ES fork/branch selection + remote SHA helper for CI/pipeline triggers.
dev-tools/pick_elasticsearch_serverless_branch.sh New helper to resolve elasticsearch-serverless branch for triggers (with optional override).
.buildkite/pull-requests.json Extends trigger comment regex to include new actions and optional serverless KEY=value tail.
.buildkite/pipelines/run_serverless_tests.yml.sh New pipeline generator to upload ml-cpp deps artifact and trigger serverless E2E validation pipeline.
.buildkite/pipelines/deploy_serverless_qa.yml.sh New pipeline generator to upload ml-cpp deps artifact and trigger serverless QA deployment.
.buildkite/pipeline.json.py Wires new serverless actions into PR pipeline generation and forwards serverless env keys.
.buildkite/ml_pipeline/config.py Adds flags for new actions and parses optional serverless KEY=value comment tail into env.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +39 to +42
SERVERLESS_BRANCH="${ES_SERVERLESS_BRANCH}"
echo "Using explicit ES_SERVERLESS_BRANCH override: $SERVERLESS_BRANCH" >&2
echo "Resolved elasticsearch-serverless branch: $SERVERLESS_BRANCH" >&2
return 0
Copy link

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When ES_SERVERLESS_BRANCH is provided, the script accepts it without verifying that the branch exists on elastic/elasticsearch-serverless. If the override is mistyped, the downstream trigger will fail later with a less actionable error. Consider validating the override with isElasticsearchServerlessBranchAtRemote "elastic" "$ES_SERVERLESS_BRANCH" and failing fast (or at least warning) if it doesn’t exist.

Suggested change
SERVERLESS_BRANCH="${ES_SERVERLESS_BRANCH}"
echo "Using explicit ES_SERVERLESS_BRANCH override: $SERVERLESS_BRANCH" >&2
echo "Resolved elasticsearch-serverless branch: $SERVERLESS_BRANCH" >&2
return 0
if isElasticsearchServerlessBranchAtRemote "elastic" "$ES_SERVERLESS_BRANCH"; then
SERVERLESS_BRANCH="${ES_SERVERLESS_BRANCH}"
echo "Using explicit ES_SERVERLESS_BRANCH override: $SERVERLESS_BRANCH" >&2
echo "Resolved elasticsearch-serverless branch: $SERVERLESS_BRANCH" >&2
return 0
fi
echo "ERROR: ES_SERVERLESS_BRANCH override '$ES_SERVERLESS_BRANCH' was not found on elastic/elasticsearch-serverless." >&2
echo "Set ES_SERVERLESS_BRANCH to an existing branch on elastic/elasticsearch-serverless." >&2
return 1

Copilot uses AI. Check for mistakes.
Comment on lines +30 to +40
function isCloneTargetValid {
FORK_TO_CHECK="$1"
BRANCH_TO_CHECK="$2"
# Diagnostics must go to stderr: callers (e.g. deploy_serverless_qa.yml.sh)
# pipe stdout to `buildkite-agent pipeline upload` and expect only YAML.
echo "Checking for '$BRANCH_TO_CHECK' branch at $FORK_TO_CHECK/elasticsearch" >&2
if [ -n "$(git ls-remote --heads "git@github.com:$FORK_TO_CHECK/elasticsearch.git" "$BRANCH_TO_CHECK" 2>/dev/null)" ]; then
echo "Will use '$BRANCH_TO_CHECK' branch at $FORK_TO_CHECK/elasticsearch for ES integration tests" >&2
return 0
fi
return 1
Copy link

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isCloneTargetValid treats an empty branch name as valid because git ls-remote --heads ... "" returns all heads. This can result in SELECTED_BRANCH being set to an empty string (e.g., on manual/non-PR builds where PR_SOURCE_BRANCH is empty), which then breaks downstream consumers (clone with -b "" or selecting an arbitrary SHA). Add an explicit non-empty check for both fork and branch (and ideally make the function return 1 when either is empty).

Copilot uses AI. Check for mistakes.
Comment on lines 77 to +88
if "GITHUB_PR_COMMENT_VAR_ARCH" in os.environ:
csv_arch = os.environ["GITHUB_PR_COMMENT_VAR_ARCH"]
for each in [ x.strip().lower() for x in csv_arch.split(",")]:
if each == "aarch64":
self.build_aarch64 = "--build-aarch64"
elif each == "x86_64":
self.build_x86_64 = "--build-x86_64"
elif self.run_qa_tests or self.run_pytorch_tests:
self.build_x86_64 = "--build-x86_64"
elif self.run_serverless_tests or self.deploy_serverless_qa:
self.build_aarch64 = "--build-aarch64"
self.build_x86_64 = "--build-x86_64"
Copy link

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For serverless actions, the scripts require both Linux x86_64 and aarch64 builds (the generated serverless pipelines depends_on both build step keys). However, if a PR comment includes an explicit x86_64/aarch64 tail (setting GITHUB_PR_COMMENT_VAR_ARCH), this block honors it and will not force both architectures, which can produce a pipeline that references missing build steps. Consider overriding any arch restriction when run_serverless_tests/deploy_serverless_qa is selected (or failing fast with a clear error).

Copilot uses AI. Check for mistakes.
Comment thread .buildkite/pipeline.json.py Outdated
Comment on lines +93 to +97
# Serverless tests require both x86_64 and aarch64 Linux builds.
if config.run_serverless_tests:
pipeline_steps.append(pipeline_steps.generate_step("Upload serverless tests runner pipeline",
".buildkite/pipelines/run_serverless_tests.yml.sh"))
if config.deploy_serverless_qa:
Copy link

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

run_serverless_tests/deploy_serverless_qa steps are appended unconditionally when the flags are set, but the generated serverless runner pipelines hard-depend on both Linux build step keys. If config parsing ends up with only one arch enabled (e.g. via a PR comment ... x86_64), this will create a pipeline with dangling depends_on keys. Gate these uploads on both config.build_aarch64 and config.build_x86_64 being set, or force-enable both when serverless actions are requested.

Suggested change
# Serverless tests require both x86_64 and aarch64 Linux builds.
if config.run_serverless_tests:
pipeline_steps.append(pipeline_steps.generate_step("Upload serverless tests runner pipeline",
".buildkite/pipelines/run_serverless_tests.yml.sh"))
if config.deploy_serverless_qa:
# Serverless tests and deploys require both x86_64 and aarch64 Linux builds.
if config.run_serverless_tests and config.build_aarch64 and config.build_x86_64:
pipeline_steps.append(pipeline_steps.generate_step("Upload serverless tests runner pipeline",
".buildkite/pipelines/run_serverless_tests.yml.sh"))
if config.deploy_serverless_qa and config.build_aarch64 and config.build_x86_64:

Copilot uses AI. Check for mistakes.
Comment on lines +93 to +101
message: "ml-cpp PR #${PR_NUM}: ${SAFE_MESSAGE}"
env:
UPDATE_SUBMODULE: "false"
ML_CPP_BUILD_ID: "${BUILDKITE_BUILD_ID}"
# ml-cpp repo commit at trigger time; serverless folds this into IMAGE_TAG
# with ML_CPP_BUILD_ID so Docker tags never collide with stock builds.
ML_CPP_COMMIT: "${BUILDKITE_COMMIT}"
ELASTICSEARCH_SUBMODULE_COMMIT: "${ES_COMMIT}"
ELASTICSEARCH_PR_NUMBER: "${ES_PR_NUM}"
Copy link

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR description documents optional KEEP_DEPLOYMENT/REGION_ID/PROJECT_TYPE/ES_SERVERLESS_BRANCH tail parameters for buildkite run_serverless_tests ..., but this trigger step only passes a fixed env set into the downstream serverless pipeline. If REGION_ID/PROJECT_TYPE (and any other supported serverless overrides) are meant to affect the serverless build/test run, they need to be forwarded in this build.env block (similar to deploy_serverless_qa.yml.sh) so the triggered pipeline can see them.

Copilot uses AI. Check for mistakes.
Comment on lines +30 to +37
function isCloneTargetValid {
FORK_TO_CHECK="$1"
BRANCH_TO_CHECK="$2"
# Diagnostics must go to stderr: callers (e.g. deploy_serverless_qa.yml.sh)
# pipe stdout to `buildkite-agent pipeline upload` and expect only YAML.
echo "Checking for '$BRANCH_TO_CHECK' branch at $FORK_TO_CHECK/elasticsearch" >&2
if [ -n "$(git ls-remote --heads "git@github.com:$FORK_TO_CHECK/elasticsearch.git" "$BRANCH_TO_CHECK" 2>/dev/null)" ]; then
echo "Will use '$BRANCH_TO_CHECK' branch at $FORK_TO_CHECK/elasticsearch for ES integration tests" >&2
Copy link

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isCloneTargetValid and pickCloneTarget are intended to be sourced, but isCloneTargetValid writes to FORK_TO_CHECK/BRANCH_TO_CHECK as globals. Because this file is sourced, those variables will leak into the caller’s shell after the function returns. Use local variables inside the function to avoid unintended side effects in scripts that source this helper.

Copilot uses AI. Check for mistakes.
Comment on lines +73 to +78
env:
ML_CPP_BUILD_ID: "${BUILDKITE_BUILD_ID}"
ELASTICSEARCH_SUBMODULE_COMMIT: "${ES_COMMIT}"
KEEP_DEPLOYMENT: "${KEEP_DEPLOYMENT:-false}"
REGION_ID: "${REGION_ID:-aws-eu-west-1}"
PROJECT_TYPE: "${PROJECT_TYPE:-elasticsearch}"
Copy link

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

REGION_ID and PROJECT_TYPE are taken from PR comment input and interpolated directly into double-quoted YAML here without escaping. Because the regex/parser allows any non-whitespace token, a value containing " or other YAML-significant characters can break the generated pipeline YAML (or potentially inject additional YAML). Either restrict these values to a safe character set when parsing the comment tail, or escape them before emitting YAML (similar to how SAFE_MESSAGE is handled).

Copilot uses AI. Check for mistakes.
- Validate ES_SERVERLESS_BRANCH exists on elastic before accepting override
- isCloneTargetValid: reject empty fork/branch; use local vars when sourced
- Force Linux + both CPU arches for serverless actions after parsing PR comment
- Gate serverless pipeline uploads on both Linux arch build flags
- Forward KEEP_DEPLOYMENT/REGION_ID/PROJECT_TYPE to serverless test trigger with YAML-safe escaping
- Escape deploy QA trigger env values for generated YAML; fail fast on bad serverless branch

Made-with: Cursor
Move shared SAFE_MESSAGE/PR metadata, pick_* wiring, ES commit resolution,
YAML-safe QA env vars, and the upload_ml_cpp_deps step fragment into
dev-tools/serverless_buildkite_trigger_prepare.sh so deploy and es-pr-check
pipelines stay in sync.

Made-with: Cursor
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread dev-tools/pick_elasticsearch_serverless_branch.sh Outdated
Resolve elasticsearch-serverless branch in ordered steps so a fork-only PR_SOURCE warning does not skip the elastic PR_TARGET fallback.

Made-with: Cursor
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants