[ML] Add serverless MKI testing for ml-cpp PR builds#3027
[ML] Add serverless MKI testing for ml-cpp PR builds#3027edsavage wants to merge 12 commits intoelastic:mainfrom
Conversation
✅ Snyk checks have passed. No issues have been found so far.
💻 Catch issues earlier using the plugins for VS Code, JetBrains IDEs, Visual Studio, and Eclipse. |
d089817 to
86a5bb1
Compare
Add pipeline and scripts to build a serverless ES Docker image with custom ml-cpp artifacts and run E2E tests against MKI QA. Triggered via PR comment (`buildkite run_serverless_tests`) or label (`ci:run-serverless-tests`). Builds Linux x86_64 ml-cpp, sets up a local Ivy repo, clones elasticsearch-serverless, builds the Docker image with -Dbuild.ml_cpp.repo override, pushes to the CI registry, and triggers the E2E QA pipeline. Relates elastic/ml-team#1243 Made-with: Cursor
86a5bb1 to
0c5b4ea
Compare
|
Pinging @elastic/ml-core (Team:ML) |
6a83fb5 to
c07be04
Compare
There was a problem hiding this comment.
what is the purpose of this file?
c07be04 to
a32e98b
Compare
Instead of cloning elasticsearch-serverless from the ml-cpp pipeline (which requires cross-repo GitHub token access), upload the custom ml-cpp artifacts to an S3 staging path under the existing prelert-artifacts bucket, then trigger the serverless validation pipeline which runs in its own context with full permissions. The serverless pipeline picks up the custom artifacts via the ML_CPP_REPO_OVERRIDE env var, which is passed through as -Dbuild.ml_cpp.repo to the Gradle build. This requires a small change in elasticsearch-serverless (documented in docs/serverless-integration/). Made-with: Cursor
a32e98b to
0b5d1db
Compare
Reduce duplication by refactoring shell script routines to extract common functionality.
Allows for setting a different project type than the "elasticsearch" default, e.g. "observability", "security".
Expose the ml-cpp repo commit to the triggered serverless build so pipeline.pr.yml.sh can fold it into IMAGE_TAG_SUFFIX with ML_CPP_BUILD_ID. Made-with: Cursor
Extend trigger_comment_regex with optional serverless_kv capture (space- separated KEY=value tokens). Parse GITHUB_PR_COMMENT_VAR_SERVERLESS_KV in config and whitelist KEEP_DEPLOYMENT, REGION_ID, PROJECT_TYPE, ES_SERVERLESS_BRANCH into os.environ. Forward those to pipeline env when serverless deploy or test steps are selected. Made-with: Cursor
:-( So sorry Valeriy, looks like I've messed up here a bit. I'll address the code duplication concerns shortly. I dealt with duplicated code/logic in |
Extract duplicate elasticsearch-serverless branch resolution from run_serverless_tests.yml.sh and deploy_serverless_qa.yml.sh into pick_elasticsearch_serverless_branch.sh, alongside pick_elasticsearch_clone_target.sh for the elasticsearch repo. Addresses PR 3027 review feedback on duplicated fork/branch logic. Made-with: Cursor
There was a problem hiding this comment.
Pull request overview
Adds Buildkite-driven serverless MKI workflows to let ml-cpp PR builds (a) trigger full serverless E2E validation and (b) deploy a custom image to serverless QA for interactive testing, using ml-cpp artifacts and coordinated ES/serverless branch/commit selection.
Changes:
- Introduces two new Buildkite pipeline-generator scripts to trigger serverless E2E tests and QA deploys using custom ml-cpp artifacts.
- Refactors Elasticsearch fork/branch selection into a shared
dev-tools/pick_elasticsearch_clone_target.shhelper and adds a new serverless branch picker. - Extends PR comment/label parsing and pipeline wiring to expose the new actions and forward optional serverless-related env overrides.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
dev-tools/run_es_tests_common.sh |
Reuses shared ES fork/branch selection logic via sourced helper. |
dev-tools/pick_elasticsearch_clone_target.sh |
New shared ES fork/branch selection + remote SHA helper for CI/pipeline triggers. |
dev-tools/pick_elasticsearch_serverless_branch.sh |
New helper to resolve elasticsearch-serverless branch for triggers (with optional override). |
.buildkite/pull-requests.json |
Extends trigger comment regex to include new actions and optional serverless KEY=value tail. |
.buildkite/pipelines/run_serverless_tests.yml.sh |
New pipeline generator to upload ml-cpp deps artifact and trigger serverless E2E validation pipeline. |
.buildkite/pipelines/deploy_serverless_qa.yml.sh |
New pipeline generator to upload ml-cpp deps artifact and trigger serverless QA deployment. |
.buildkite/pipeline.json.py |
Wires new serverless actions into PR pipeline generation and forwards serverless env keys. |
.buildkite/ml_pipeline/config.py |
Adds flags for new actions and parses optional serverless KEY=value comment tail into env. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| SERVERLESS_BRANCH="${ES_SERVERLESS_BRANCH}" | ||
| echo "Using explicit ES_SERVERLESS_BRANCH override: $SERVERLESS_BRANCH" >&2 | ||
| echo "Resolved elasticsearch-serverless branch: $SERVERLESS_BRANCH" >&2 | ||
| return 0 |
There was a problem hiding this comment.
When ES_SERVERLESS_BRANCH is provided, the script accepts it without verifying that the branch exists on elastic/elasticsearch-serverless. If the override is mistyped, the downstream trigger will fail later with a less actionable error. Consider validating the override with isElasticsearchServerlessBranchAtRemote "elastic" "$ES_SERVERLESS_BRANCH" and failing fast (or at least warning) if it doesn’t exist.
| SERVERLESS_BRANCH="${ES_SERVERLESS_BRANCH}" | |
| echo "Using explicit ES_SERVERLESS_BRANCH override: $SERVERLESS_BRANCH" >&2 | |
| echo "Resolved elasticsearch-serverless branch: $SERVERLESS_BRANCH" >&2 | |
| return 0 | |
| if isElasticsearchServerlessBranchAtRemote "elastic" "$ES_SERVERLESS_BRANCH"; then | |
| SERVERLESS_BRANCH="${ES_SERVERLESS_BRANCH}" | |
| echo "Using explicit ES_SERVERLESS_BRANCH override: $SERVERLESS_BRANCH" >&2 | |
| echo "Resolved elasticsearch-serverless branch: $SERVERLESS_BRANCH" >&2 | |
| return 0 | |
| fi | |
| echo "ERROR: ES_SERVERLESS_BRANCH override '$ES_SERVERLESS_BRANCH' was not found on elastic/elasticsearch-serverless." >&2 | |
| echo "Set ES_SERVERLESS_BRANCH to an existing branch on elastic/elasticsearch-serverless." >&2 | |
| return 1 |
| function isCloneTargetValid { | ||
| FORK_TO_CHECK="$1" | ||
| BRANCH_TO_CHECK="$2" | ||
| # Diagnostics must go to stderr: callers (e.g. deploy_serverless_qa.yml.sh) | ||
| # pipe stdout to `buildkite-agent pipeline upload` and expect only YAML. | ||
| echo "Checking for '$BRANCH_TO_CHECK' branch at $FORK_TO_CHECK/elasticsearch" >&2 | ||
| if [ -n "$(git ls-remote --heads "git@github.com:$FORK_TO_CHECK/elasticsearch.git" "$BRANCH_TO_CHECK" 2>/dev/null)" ]; then | ||
| echo "Will use '$BRANCH_TO_CHECK' branch at $FORK_TO_CHECK/elasticsearch for ES integration tests" >&2 | ||
| return 0 | ||
| fi | ||
| return 1 |
There was a problem hiding this comment.
isCloneTargetValid treats an empty branch name as valid because git ls-remote --heads ... "" returns all heads. This can result in SELECTED_BRANCH being set to an empty string (e.g., on manual/non-PR builds where PR_SOURCE_BRANCH is empty), which then breaks downstream consumers (clone with -b "" or selecting an arbitrary SHA). Add an explicit non-empty check for both fork and branch (and ideally make the function return 1 when either is empty).
| if "GITHUB_PR_COMMENT_VAR_ARCH" in os.environ: | ||
| csv_arch = os.environ["GITHUB_PR_COMMENT_VAR_ARCH"] | ||
| for each in [ x.strip().lower() for x in csv_arch.split(",")]: | ||
| if each == "aarch64": | ||
| self.build_aarch64 = "--build-aarch64" | ||
| elif each == "x86_64": | ||
| self.build_x86_64 = "--build-x86_64" | ||
| elif self.run_qa_tests or self.run_pytorch_tests: | ||
| self.build_x86_64 = "--build-x86_64" | ||
| elif self.run_serverless_tests or self.deploy_serverless_qa: | ||
| self.build_aarch64 = "--build-aarch64" | ||
| self.build_x86_64 = "--build-x86_64" |
There was a problem hiding this comment.
For serverless actions, the scripts require both Linux x86_64 and aarch64 builds (the generated serverless pipelines depends_on both build step keys). However, if a PR comment includes an explicit x86_64/aarch64 tail (setting GITHUB_PR_COMMENT_VAR_ARCH), this block honors it and will not force both architectures, which can produce a pipeline that references missing build steps. Consider overriding any arch restriction when run_serverless_tests/deploy_serverless_qa is selected (or failing fast with a clear error).
| # Serverless tests require both x86_64 and aarch64 Linux builds. | ||
| if config.run_serverless_tests: | ||
| pipeline_steps.append(pipeline_steps.generate_step("Upload serverless tests runner pipeline", | ||
| ".buildkite/pipelines/run_serverless_tests.yml.sh")) | ||
| if config.deploy_serverless_qa: |
There was a problem hiding this comment.
run_serverless_tests/deploy_serverless_qa steps are appended unconditionally when the flags are set, but the generated serverless runner pipelines hard-depend on both Linux build step keys. If config parsing ends up with only one arch enabled (e.g. via a PR comment ... x86_64), this will create a pipeline with dangling depends_on keys. Gate these uploads on both config.build_aarch64 and config.build_x86_64 being set, or force-enable both when serverless actions are requested.
| # Serverless tests require both x86_64 and aarch64 Linux builds. | |
| if config.run_serverless_tests: | |
| pipeline_steps.append(pipeline_steps.generate_step("Upload serverless tests runner pipeline", | |
| ".buildkite/pipelines/run_serverless_tests.yml.sh")) | |
| if config.deploy_serverless_qa: | |
| # Serverless tests and deploys require both x86_64 and aarch64 Linux builds. | |
| if config.run_serverless_tests and config.build_aarch64 and config.build_x86_64: | |
| pipeline_steps.append(pipeline_steps.generate_step("Upload serverless tests runner pipeline", | |
| ".buildkite/pipelines/run_serverless_tests.yml.sh")) | |
| if config.deploy_serverless_qa and config.build_aarch64 and config.build_x86_64: |
| message: "ml-cpp PR #${PR_NUM}: ${SAFE_MESSAGE}" | ||
| env: | ||
| UPDATE_SUBMODULE: "false" | ||
| ML_CPP_BUILD_ID: "${BUILDKITE_BUILD_ID}" | ||
| # ml-cpp repo commit at trigger time; serverless folds this into IMAGE_TAG | ||
| # with ML_CPP_BUILD_ID so Docker tags never collide with stock builds. | ||
| ML_CPP_COMMIT: "${BUILDKITE_COMMIT}" | ||
| ELASTICSEARCH_SUBMODULE_COMMIT: "${ES_COMMIT}" | ||
| ELASTICSEARCH_PR_NUMBER: "${ES_PR_NUM}" |
There was a problem hiding this comment.
The PR description documents optional KEEP_DEPLOYMENT/REGION_ID/PROJECT_TYPE/ES_SERVERLESS_BRANCH tail parameters for buildkite run_serverless_tests ..., but this trigger step only passes a fixed env set into the downstream serverless pipeline. If REGION_ID/PROJECT_TYPE (and any other supported serverless overrides) are meant to affect the serverless build/test run, they need to be forwarded in this build.env block (similar to deploy_serverless_qa.yml.sh) so the triggered pipeline can see them.
| function isCloneTargetValid { | ||
| FORK_TO_CHECK="$1" | ||
| BRANCH_TO_CHECK="$2" | ||
| # Diagnostics must go to stderr: callers (e.g. deploy_serverless_qa.yml.sh) | ||
| # pipe stdout to `buildkite-agent pipeline upload` and expect only YAML. | ||
| echo "Checking for '$BRANCH_TO_CHECK' branch at $FORK_TO_CHECK/elasticsearch" >&2 | ||
| if [ -n "$(git ls-remote --heads "git@github.com:$FORK_TO_CHECK/elasticsearch.git" "$BRANCH_TO_CHECK" 2>/dev/null)" ]; then | ||
| echo "Will use '$BRANCH_TO_CHECK' branch at $FORK_TO_CHECK/elasticsearch for ES integration tests" >&2 |
There was a problem hiding this comment.
isCloneTargetValid and pickCloneTarget are intended to be sourced, but isCloneTargetValid writes to FORK_TO_CHECK/BRANCH_TO_CHECK as globals. Because this file is sourced, those variables will leak into the caller’s shell after the function returns. Use local variables inside the function to avoid unintended side effects in scripts that source this helper.
| env: | ||
| ML_CPP_BUILD_ID: "${BUILDKITE_BUILD_ID}" | ||
| ELASTICSEARCH_SUBMODULE_COMMIT: "${ES_COMMIT}" | ||
| KEEP_DEPLOYMENT: "${KEEP_DEPLOYMENT:-false}" | ||
| REGION_ID: "${REGION_ID:-aws-eu-west-1}" | ||
| PROJECT_TYPE: "${PROJECT_TYPE:-elasticsearch}" |
There was a problem hiding this comment.
REGION_ID and PROJECT_TYPE are taken from PR comment input and interpolated directly into double-quoted YAML here without escaping. Because the regex/parser allows any non-whitespace token, a value containing " or other YAML-significant characters can break the generated pipeline YAML (or potentially inject additional YAML). Either restrict these values to a safe character set when parsing the comment tail, or escape them before emitting YAML (similar to how SAFE_MESSAGE is handled).
- Validate ES_SERVERLESS_BRANCH exists on elastic before accepting override - isCloneTargetValid: reject empty fork/branch; use local vars when sourced - Force Linux + both CPU arches for serverless actions after parsing PR comment - Gate serverless pipeline uploads on both Linux arch build flags - Forward KEEP_DEPLOYMENT/REGION_ID/PROJECT_TYPE to serverless test trigger with YAML-safe escaping - Escape deploy QA trigger env values for generated YAML; fail fast on bad serverless branch Made-with: Cursor
Move shared SAFE_MESSAGE/PR metadata, pick_* wiring, ES commit resolution, YAML-safe QA env vars, and the upload_ml_cpp_deps step fragment into dev-tools/serverless_buildkite_trigger_prepare.sh so deploy and es-pr-check pipelines stay in sync. Made-with: Cursor
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Resolve elasticsearch-serverless branch in ordered steps so a fork-only PR_SOURCE warning does not skip the elastic PR_TARGET fallback. Made-with: Cursor
Summary
Adds two automated workflows for testing custom ml-cpp builds in the serverless MKI environment, replacing the manual process documented in elastic/ml-team#1243.
Both workflows inject custom ml-cpp at Docker image build time via a local Ivy repo override (
-Dbuild.ml_cpp.repo), using Buildkite's native cross-pipeline artifact download. No cross-repo git cloning, no AWS credentials, and no manual kubectl/scp steps.How to trigger
Run serverless E2E tests
PR comment:
PR label:
ci:run-serverless-testsDeploy to serverless QA for interactive use
PR comment:
PR label:
ci:deploy-serverless-qaOptional parameters on
buildkite …commentsWhen you trigger via a
buildkite …PR comment (not via labels alone), you may append additional space-separatedKEY=valuetokens after the normal comment text (including optional clauses such asfor ES_BRANCH=…,on linux x86_64, etc.). Those keys are applied to the generated Buildkite pipeline environment for the serverless upload steps.Allowed keys (each value must be a single token—no spaces):
KEEP_DEPLOYMENTtrueorfalse(case-insensitive); any other value is ignored.REGION_IDaws-eu-west-1,gcp-us-central-1PROJECT_TYPEelasticsearch; downstream scripts apply defaults where documented.ES_SERVERLESS_BRANCHelasticsearch-serverlessbranch selection for this build.Examples:
Labels (
ci:run-serverless-tests,ci:deploy-serverless-qa) do not parse this tail. For label-only runs, set overrides as Buildkite environment variables on the build (or in the New Build form) if you need non-defaults.The QA deployment stays up for 1 hour by default. Use
KEEP_DEPLOYMENT=truein the comment tail or as a Buildkite env var to keep it longer. Build annotations show the URL and encrypted credentials.How it works
dev-tools/minimal.zipis uploaded as a Buildkite artifactCoordinated changes
Files
New
.buildkite/pipelines/run_serverless_tests.yml.sh-- E2E test pipeline.buildkite/pipelines/deploy_serverless_qa.yml.sh-- QA deploy pipelineModified
.buildkite/ml_pipeline/config.py--run_serverless_testsanddeploy_serverless_qaflags; parses optional comment tail intoos.environ.buildkite/pipeline.json.py-- wires both pipelines into the PR pipeline; forwards serverless-related env vars when those steps are selected.buildkite/pull-requests.json-- adds both actions to the trigger comment regex and optionalKEY=valuetail captureRequired companion change
elastic/elasticsearch-serverless#6364 -- adds ML_CPP_BUILD_ID support to publish-docker-artifacts.sh
Relates elastic/ml-team#1243