[ML] Add AI-powered build failure analysis to CI pipelines by edsavage · Pull Request #2909 · elastic/ml-cpp

edsavage · 2026-02-20T00:44:41Z

Summary

Adds a new Buildkite pipeline step that automatically analyses failed builds using the Anthropic Claude API
When a build fails, the step fetches logs from failed steps and posts a structured diagnosis (root cause, classification, suggested fix, confidence) as a Buildkite annotation
The step is soft-fail and only runs when the build is actually failing (if: "build.state == 'failed' || build.state == 'failing'")
Wired into all three pipelines: PR builds, nightly snapshot builds, and nightly debug builds
Claude API key stored in Vault at secret/ci/elastic-ml-cpp/anthropic/claude

New files

dev-tools/analyze_build_failure.py — core analysis script
.buildkite/pipelines/analyze_build_failure.yml.sh — pipeline step definition

Test plan

Tested locally with --dry-run against real failed builds (snapshot #5819, debug [7.8][ML] Add a constant to the prediction which minimises the unregularised loss for classification and regression #1194)
Verify Vault secret is accessible from CI agents
Trigger a PR build and confirm the step appears (it should be skipped if the build passes)

Made with Cursor

prodsecmachine · 2026-02-20T00:44:53Z

✅ Snyk checks have passed. No issues have been found so far.

Status	Scan Engine	Critical	High	Medium	Low	Total (0)
✅	Open Source Security	0	0	0	0	0 issues
✅	Licenses	0	0	0	0	0 issues

💻 Catch issues earlier using the plugins for VS Code, JetBrains IDEs, Visual Studio, and Eclipse.

valeriy42

Will this create a GitHub comment on the PR with fix suggestions?

edsavage · 2026-02-26T03:08:02Z

Will this create a GitHub comment on the PR with fix suggestions?

No, just as an annotation on the Buildkite build. I guess it could be a GH comment as well though, which would notify the user and have better visibility if changes are suggested.

edsavage · 2026-02-27T02:12:40Z

Pushed a new commit that adds GitHub PR comments for build failure analysis. When the build is a PR build, the analysis is now posted as a comment directly on the PR (in addition to the Buildkite annotation and optional Slack notification).

Key details:

Uses BUILDKITE_PULL_REQUEST env var (set automatically by Buildkite) to detect PR builds
Posts via GitHub API using a token from Vault (secret/ci/elastic-ml-cpp/github/pr_comment_token)
Uses an HTML comment marker () to find and update existing comments on rebuild/retry, avoiding duplicates
Non-PR builds (nightly snapshots, etc.) continue to use Buildkite annotations only

This addresses @valeriy42's feedback about improving visibility for PR authors.

valeriy42

Thank you for adding GH comment functionality. I think it makes a lot of sense to reduce friction so this information is visible to the developer.

My last concern is about burning the API tokens. Is it possible to activate this function per GitHub comment?

I think we have the following user story:
As a developer, I want to ask the CI system why it failed and what needs to be done to fix it.

edsavage · 2026-03-19T02:24:53Z

User Experience: Examining a Failed Build

After this PR merges, here's what a developer sees when a PR build fails:

1. Immediate feedback — GitHub commit statuses (existing)

Red/green marks appear on the PR for each Buildkite step (e.g. "Build on Linux x86_64 RelWithDebInfo — failed"). Each is a clickable link to the Buildkite step. This is the existing publish_commit_status_per_step behaviour.

2. Native PR comment from `@elasticmachine` (new)

Once the build completes, a PR comment appears listing what failed:

💔 Build Failed

Buildkite Build (link)

Failed CI Steps

Build on Linux x86_64 RelWithDebInfo (link)

Test on Linux x86_64 RelWithDebInfo (link)

History

💛 Build [ML] Platform independent dependency report script #2350 was flaky (previous commit)

This is automatic (no opt-in), and includes build history across commits on the PR, flagging flaky builds. Enabled by ELASTIC_PR_COMMENTS_ENABLED in catalog-info.yaml.

3. AI analysis (opt-in for PR builds)

If the failure isn't obvious, the developer comments on the PR:

buildkite analyze

This triggers the analyze_build_failure step which fetches logs from failed steps, sends them to Claude for diagnosis, and posts a Buildkite annotation at the top of the build page.

4. AI analysis PR comment from `@github-actions[bot]` (new)

A GitHub Actions workflow picks up the analyze step's commit status and posts a PR comment:

🔍 Build Failure Analysis

Root Cause

The CMultiFileDataAdderTest test failed due to a temp file collision...

Classification

test failure

Suggested Fix

Use process ID for unique temp file names...

Confidence

high — The error message clearly indicates...

This comment is updated in-place on subsequent analyses (not duplicated). No personal access token or GitHub App is required — the workflow uses the built-in GITHUB_TOKEN.

5. Slack + email (nightly/snapshot builds)

For non-PR builds, the analysis is also posted to #machine-learn-build via Slack. Email notifications to build-machine-learning@elastic.co continue as before.

Summary

When	Where	What	Source
Per-step	PR checks	Red/green status with Buildkite links	Buildkite (existing)
Build complete	PR comment	Failed steps + build history	`@elasticmachine` (new)
After `buildkite analyze`	Buildkite build page	AI diagnosis annotation	Buildkite annotation
After `buildkite analyze`	PR comment	AI diagnosis with root cause + fix	`@github-actions[bot]` (new)

The developer can stop at any layer depending on how obvious the failure is. Most of the time steps 1–2 are enough; the AI analysis is there for non-obvious cases.

edsavage · 2026-03-20T00:11:12Z

buildkite analyze

edsavage · 2026-03-20T02:07:22Z

buildkite analyze

edsavage · 2026-03-24T20:45:42Z

buildkite test this

edsavage · 2026-03-24T20:50:09Z

buildkite run_qa_tests on linux, macos, windows aarch64, x86_64

edsavage · 2026-03-24T21:21:29Z

buildkite run_qa_tests on linux, macos, windows x86_64, aarch64

When a Buildkite build fails, a new soft-fail step fetches the failed step logs and sends them to Claude for diagnosis. The analysis (root cause, classification, suggested fix, confidence) is posted as a Buildkite annotation directly on the build page. The step uses an `if` guard so it only runs when the build is failing, and the Claude API key is retrieved from Vault at runtime. Co-authored-by: Cursor <cursoragent@cursor.com>

When SLACK_WEBHOOK_URL is set, posts a compact summary of each failed step's AI diagnosis to #machine-learn-build. The message includes the classification emoji, root cause, and a link back to the build page. The webhook URL is retrieved from Vault at runtime; if absent, the Slack step is silently skipped and only the Buildkite annotation is posted. Co-authored-by: Cursor <cursoragent@cursor.com>

When the build is a PR build (BUILDKITE_PULL_REQUEST is set), post the Claude analysis as a comment on the GitHub PR in addition to the Buildkite annotation and Slack notification. Uses an HTML comment marker to find and update existing comments on rebuild/retry, avoiding duplicate comments on the same PR. Addresses review feedback from valeriy42 requesting better visibility of failure analysis for PR authors. Made-with: Cursor

Allows overriding the PR number from the command line, useful for local testing of the GitHub comment feature without being in a Buildkite PR build environment. Tested end-to-end against build elastic#2232 (Bayesian test timeout), posting to a throwaway PR. Both initial post and update-in-place (deduplication) verified working. Made-with: Cursor

Failure analysis now only runs on PR builds when triggered by a `buildkite analyze` comment, avoiding unnecessary API token usage. Nightly and debug pipelines retain automatic analysis on failure. Made-with: Cursor

Enable the ELASTIC_PR_COMMENTS_ENABLED feature on the PR builds pipeline so that elasticmachine posts a summary comment listing failed steps and build history directly on the GitHub PR. Made-with: Cursor

Replace direct GitHub API calls from the Buildkite analyze step with a GitHub Actions workflow that uses the built-in GITHUB_TOKEN. The Buildkite step now saves the analysis as build metadata, and a GitHub Actions workflow triggered by the commit status event fetches it and posts/updates the PR comment. This eliminates the need for a personal access token or GitHub App for PR comments. Made-with: Cursor

Made-with: Cursor

The test confirmed Vault is reachable from GitHub Actions runners and JWT auth paths exist. Actual OIDC login needs to be verified with the infra team. Made-with: Cursor

Apply the same fix as PR elastic#3003 to the analyze_build_failure step: compute which build step keys will exist based on the platform config and pass them as ML_BUILD_STEP_KEYS for the shell script to use in its depends_on section. This prevents "Step dependencies not found" errors when not all platforms are built. Made-with: Cursor

The analyze_build_failure step already guards itself with if: "build.state == 'failed' || build.state == 'failing'" so it is automatically skipped for passing builds. Making it always-on (rather than requiring a special "buildkite analyze" comment trigger) ensures it is available whenever a build fails without needing to be requested in advance. Remove the run_analyze config flag and the "analyze" action from the PR comment trigger regex since they are no longer needed. Made-with: Cursor

Introduce a compile error to test the build failure analysis step. This commit will be reverted immediately after verifying the step. Made-with: Cursor

Remove the Buildkite `if` condition from analyze_build_failure.yml.sh. Buildkite evaluates `if` on dynamically uploaded steps at upload time (not at step execution time), so the condition always saw build.state == 'running' and the step was never created. The Python script already checks the build state via the Buildkite API and exits early if the build passed, so the YAML-level `if` is unnecessary. Also reverts the deliberate compile error in CBuildInfo.cc that was used to test the failure analysis flow. Made-with: Cursor

Made-with: Cursor

Use python:3 instead of python:3-slim for the analyze_build_failure step. The slim image lacks curl and git which the Buildkite agent hooks require. Also reverts the deliberate compile error. Made-with: Cursor

Made-with: Cursor

The "Analyze build failure" step ran successfully on Build elastic#2385, correctly identifying the deliberate #error as a code bug with high confidence. Reverting to restore normal builds. Made-with: Cursor

Instead of always including the analysis step or requiring a full rebuild, "buildkite analyze" now triggers a lightweight pipeline that finds the most recent failed build for the branch via the Buildkite API and analyzes it retroactively — no recompilation needed. Also improves log extraction: instead of blindly taking the last 30K chars (which often misses the actual error), the script now scans for error patterns and extracts matching lines with surrounding context. Made-with: Cursor

Replace BOOST_ERROR/BOOST_FAIL patterns (source-code macro names that don't appear in logs) with a pattern matching the actual Boost.Test summary output: "*** N failure(s) detected in test suite". Made-with: Cursor

…l be reverted Made-with: Cursor

The analysis step correctly identified the Boost.Test failure on all platforms. Reverting to restore normal test behaviour. Made-with: Cursor

edsavage · 2026-03-27T01:19:30Z

buildkite test this

edsavage · 2026-03-27T01:21:03Z

buildkite run_qa_tests

Copilot

Pull request overview

Introduces an AI-based Buildkite failure-analysis capability that fetches logs from failed jobs, generates a structured diagnosis via Anthropic Claude, and publishes results to Buildkite annotations and (intended) GitHub PR comments.

Changes:

Add dev-tools/analyze_build_failure.py to extract error context from Buildkite job logs and request an LLM diagnosis (plus optional Slack posting).
Add a Buildkite pipeline step generator .buildkite/pipelines/analyze_build_failure.yml.sh and wire it into nightly pipelines plus a new buildkite analyze PR-comment action.
Add a GitHub Actions workflow to post/update PR comments by fetching analysis from Buildkite build metadata.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 9 comments.

Show a summary per file

File	Description
`dev-tools/analyze_build_failure.py`	New failure analysis script: Buildkite API log fetch, excerpting, Claude call, Buildkite annotation + metadata, optional Slack
`.buildkite/pipelines/analyze_build_failure.yml.sh`	New Buildkite step definition for running the analysis script and depending on build steps
`.github/workflows/post-build-analysis.yml`	New workflow to read Buildkite metadata and post/update a PR comment
`.buildkite/pipeline.json.py`	Adds “analyze-only” lightweight pipeline path for PR comment trigger
`.buildkite/ml_pipeline/config.py`	Adds `run_analyze` flag parsed from PR comment action
`.buildkite/pull-requests.json`	Extends trigger comment regex to allow `buildkite analyze` action
`.buildkite/job-build-test-all-debug.json.py`	Wires analysis step into nightly debug pipeline and computes build step keys
`.buildkite/branch.json.py`	Wires analysis step into nightly snapshot pipeline
`.buildkite/hooks/post-checkout`	Fetches Buildkite read token + Anthropic key + Slack webhook from Vault for the analysis step
`catalog-info.yaml`	Adds an environment flag intended to enable PR comments in the PR build pipeline configuration

Comments suppressed due to low confidence (1)

.buildkite/pipeline.json.py:65

PR description says the failure-analysis step is wired into PR builds, but in the normal PR build path this file never appends the analyze-build-failure step (it’s only added in the run_analyze lightweight pipeline). Either add the step to the standard PR pipeline (like nightly pipelines do) or update the PR description/comments to match the actual behavior.

    pipeline_steps.append(pipeline_steps.generate_step("Queue a :slack: notification for the pipeline",
                                                       ".buildkite/pipelines/send_slack_notification.sh"))
    pipeline_steps.append(pipeline_steps.generate_step("Queue a :email: notification for the pipeline",
                                                       ".buildkite/pipelines/send_email_notification.sh"))
    pipeline_steps.append(pipeline_steps.generate_step("Upload clang-format validation",
                                                       ".buildkite/pipelines/format_and_validation.yml.sh"))

    # Compute which build step keys will exist so that analytics and
    # failure-analysis steps can emit a correct depends_on list.
    build_step_keys = []
    if config.build_linux and config.build_aarch64:
        build_step_keys.append("build_test_linux-aarch64-RelWithDebInfo")
    if config.build_linux and config.build_x86_64:
        build_step_keys.append("build_test_linux-x86_64-RelWithDebInfo")
    if config.build_macos and config.build_aarch64:
        build_step_keys.append("build_test_macos-aarch64-RelWithDebInfo")
    if config.build_windows and config.build_x86_64:
        build_step_keys.append("build_test_Windows-x86_64-RelWithDebInfo")

    env = {
        "VERSION_QUALIFIER": "",
        "ML_BUILD_STEP_KEYS": ",".join(build_step_keys),
    }

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-30T03:31:29Z

+cat <<EOL
+steps:
+  - label: "Analyze build failure :mag:"
+    key: "analyze_build_failure"
+    command:
+        - "python3 dev-tools/analyze_build_failure.py --pipeline \$BUILDKITE_PIPELINE_SLUG --build \$BUILDKITE_BUILD_NUMBER${EXTRA_FLAGS}"
+EOL


The step definition doesn’t include the conditional mentioned in the PR description (only run when the build is failing). As written, the analysis step will run on successful builds too (and require secrets / call Buildkite APIs), which is unnecessary and could incur external API cost. Add an if: guard on the step (e.g., build.state == 'failed' || build.state == 'failing') or an equivalent gating mechanism.

Copilot · 2026-04-30T03:31:30Z

+            subprocess.run(
+                ["buildkite-agent", "meta-data", "set",
+                 "build-failure-analysis"],
+                input=annotation_body.encode(),
+                check=True,
+            )


buildkite-agent meta-data set is invoked without providing a value argument. The Buildkite agent CLI expects a value (or a --value/--file flag); stdin isn’t used for the value, so this will fail or store an empty value and the GitHub Actions workflow won’t be able to fetch the analysis. Pass the metadata value explicitly (prefer --file to avoid command-length/escaping issues).

Copilot · 2026-04-30T03:31:30Z

+    bk_token = (get_env_or_file("BUILDKITE_TOKEN", "~/.buildkite/token")
+                or get_env_or_file("BUILDKITE_API_READ_TOKEN", ""))
+    claude_key = get_env_or_file("ANTHROPIC_API_KEY", "~/.elastic/claude_api_key")
+
+    if not bk_token:
+        print("Error: No Buildkite token available", file=sys.stderr)
+        sys.exit(1)
+    if not claude_key:
+        print("Error: No Anthropic API key available", file=sys.stderr)
+        sys.exit(1)
+


The script exits if ANTHROPIC_API_KEY is missing before it checks whether the target build actually failed. This means the step still requires the Claude key even when the build passed (or when no failed jobs are found), which defeats the “only run on failure” intent and makes local/dry-run usage harder. Move the Claude key requirement to just before calling Claude (after confirming there are failed jobs to analyze).

Copilot · 2026-04-30T03:31:30Z

+    failed_jobs = [
+        j for j in build.get("jobs", [])
+        if j.get("type") == "script" and j.get("state") == "failed"


Failed jobs are filtered to state == 'failed' only. Buildkite can mark failing jobs as timed_out (and potentially other terminal states), so the script may print “No failed steps found” even when the build is failing. Include additional failure states (at least timed_out) when selecting jobs to analyze.

Suggested change

failed_jobs = [

j for j in build.get("jobs", [])

if j.get("type") == "script" and j.get("state") == "failed"

failed_job_states = {"failed", "timed_out"}

failed_jobs = [

j for j in build.get("jobs", [])

if j.get("type") == "script" and j.get("state") in failed_job_states

Copilot · 2026-04-30T03:31:30Z

+        log_excerpt = extract_error_context(log_text)
+
+        prompt = f"""Analyze this CI build failure.
+
+**Pipeline**: {args.pipeline}
+**Build**: #{args.build}
+**Branch**: {build.get('branch', 'unknown')}
+**Failed step**: {step_label} (key: {step_key})
+
+{KNOWN_FAILURE_PATTERNS}
+
+**Build log (error-relevant sections extracted from full log)**:
+```
+{log_excerpt}
+```
+
+Analyze the root cause and suggest a fix."""


The script forwards (parts of) raw Buildkite logs to an external LLM API. Currently only ANSI/timestamp markers are removed; there’s no redaction for secrets (tokens, API keys, credentials, URLs with embedded auth, etc.). Add a redaction pass before sending log_excerpt to Claude (and before posting to Slack/annotations), and consider explicitly disabling analysis when the build is handling sensitive secrets.

Copilot · 2026-04-30T03:31:30Z

+          ${MARKER}
+          ## :mag: Build Failure Analysis
+
+          ${ANALYSIS}
+
+          ---
+          [View Buildkite build](${BUILD_URL}) | *Analysis generated by Claude. Verify before acting.*
+          EOF


The here-doc that builds BODY is indented, which will inject leading spaces into the posted comment content. In GitHub-flavored Markdown, leading spaces can turn the entire comment into a code block and break formatting. Remove the extra indentation inside the here-doc (or use a <<-EOF heredoc with tabs) so the marker/header start at column 0.

Suggested change

${MARKER}

## :mag: Build Failure Analysis

${ANALYSIS}

---

[View Buildkite build](${BUILD_URL}) | *Analysis generated by Claude. Verify before acting.*

EOF

${MARKER}

## :mag: Build Failure Analysis

${ANALYSIS}

---

[View Buildkite build](${BUILD_URL}) | *Analysis generated by Claude. Verify before acting.*

EOF

Copilot · 2026-04-30T03:31:31Z

+# Emit depends_on dynamically — ML_BUILD_STEP_KEYS is a comma-separated
+# list of step keys set by the pipeline generator.  In analyze-previous
+# mode there are no build steps so this block is skipped.
+if [ -n "${ML_BUILD_STEP_KEYS:-}" ]; then
+    echo '    depends_on:'
+    IFS=',' read -ra STEP_KEYS <<< "$ML_BUILD_STEP_KEYS"
+    for key in "${STEP_KEYS[@]}"; do


depends_on is built only from ML_BUILD_STEP_KEYS. In pipelines that have separate test steps (e.g. branch.json.py sets ML_TEST_STEP_KEYS), this allows the analysis step to start before tests finish and it may miss test failures entirely. Consider including test step keys in depends_on (or using a combined env var) so the analysis runs after all relevant jobs have completed.

Suggested change

# Emit depends_on dynamically — ML_BUILD_STEP_KEYS is a comma-separated

# list of step keys set by the pipeline generator. In analyze-previous

# mode there are no build steps so this block is skipped.

if [ -n "${ML_BUILD_STEP_KEYS:-}" ]; then

echo ' depends_on:'

IFS=',' read -ra STEP_KEYS <<< "$ML_BUILD_STEP_KEYS"

for key in "${STEP_KEYS[@]}"; do

# Emit depends_on dynamically — ML_BUILD_STEP_KEYS and ML_TEST_STEP_KEYS

# are comma-separated lists of step keys set by the pipeline generator.

# In analyze-previous mode there may be no build or test steps, so this

# block is skipped when neither list is present.

DEPENDS_ON_KEYS=()

if [ -n "${ML_BUILD_STEP_KEYS:-}" ]; then

IFS=',' read -ra STEP_KEYS <<< "$ML_BUILD_STEP_KEYS"

DEPENDS_ON_KEYS+=("${STEP_KEYS[@]}")

fi

if [ -n "${ML_TEST_STEP_KEYS:-}" ]; then

IFS=',' read -ra STEP_KEYS <<< "$ML_TEST_STEP_KEYS"

DEPENDS_ON_KEYS+=("${STEP_KEYS[@]}")

fi

if [ "${#DEPENDS_ON_KEYS[@]}" -gt 0 ]; then

echo ' depends_on:'

for key in "${DEPENDS_ON_KEYS[@]}"; do

Copilot · 2026-04-30T03:31:31Z

+def buildkite_get(path, token):
+    url = f"https://api.buildkite.com/v2/organizations/{BUILDKITE_ORG}/{path}"
+    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
+    with urllib.request.urlopen(req) as resp:
+        return json.loads(resp.read())
+
+
+def find_previous_failed_build(pipeline, token, branch=None, exclude_build=None):
+    """Find the most recent failed build for a pipeline, optionally filtered by branch."""
+    params = {"state": "failed", "per_page": "5"}
+    if branch:
+        params["branch"] = branch
+    query = urllib.parse.urlencode(params)
+    builds = buildkite_get(f"pipelines/{pipeline}/builds?{query}", token)
+    for build in builds:
+        if exclude_build and build.get("number") == exclude_build:
+            continue
+        return build
+    return None
+
+
+def get_job_log(log_url, token):
+    """Fetch the raw log for a Buildkite job."""
+    req = urllib.request.Request(
+        log_url,
+        headers={
+            "Authorization": f"Bearer {token}",
+            "Accept": "text/plain",
+        },
+    )
+    try:
+        with urllib.request.urlopen(req) as resp:
+            return resp.read().decode("utf-8", errors="replace")
+    except urllib.error.HTTPError:
+        return None
+


Buildkite API calls don’t set explicit timeouts and have limited error handling. buildkite_get() has no timeout and get_job_log() only catches HTTPError, so transient network/DNS issues can crash the script and prevent annotations/metadata from being published. Add reasonable timeouts and catch URLError/timeouts around Buildkite API and log fetches, returning a clean “could not fetch” analysis instead of raising.

Copilot · 2026-04-30T03:31:31Z

+          ANALYSIS=$(curl -sS -f \
+            -H "Authorization: Bearer ${BK_TOKEN}" \
+            "https://api.buildkite.com/v2/organizations/elastic/pipelines/${PIPELINE}/builds/${BUILD_NUM}/meta-data/build-failure-analysis" \
+            2>/dev/null) || true
+
+          if [ -z "$ANALYSIS" ]; then
+            echo "No analysis metadata found — skipping."
+            echo "skip=true" >> "$GITHUB_OUTPUT"
+            exit 0
+          fi
+
+          # Save to file to avoid shell quoting issues.
+          echo "$ANALYSIS" > /tmp/analysis.md


The Buildkite meta-data endpoint returns JSON (e.g. { key, value }). This workflow currently writes the entire JSON response into analysis.md, so the PR comment will contain JSON rather than the actual analysis text. Parse the response and extract .value (e.g. with jq -r) before writing to the file.

Suggested change

ANALYSIS=$(curl -sS -f \

-H "Authorization: Bearer ${BK_TOKEN}" \

"https://api.buildkite.com/v2/organizations/elastic/pipelines/${PIPELINE}/builds/${BUILD_NUM}/meta-data/build-failure-analysis" \

2>/dev/null) || true

if [ -z "$ANALYSIS" ]; then

echo "No analysis metadata found — skipping."

echo "skip=true" >> "$GITHUB_OUTPUT"

exit 0

fi

# Save to file to avoid shell quoting issues.

echo "$ANALYSIS" > /tmp/analysis.md

ANALYSIS_JSON=$(curl -sS -f \

-H "Authorization: Bearer ${BK_TOKEN}" \

"https://api.buildkite.com/v2/organizations/elastic/pipelines/${PIPELINE}/builds/${BUILD_NUM}/meta-data/build-failure-analysis" \

2>/dev/null) || true

if [ -z "$ANALYSIS_JSON" ]; then

echo "No analysis metadata found — skipping."

echo "skip=true" >> "$GITHUB_OUTPUT"

exit 0

fi

ANALYSIS=$(printf '%s' "$ANALYSIS_JSON" | jq -r '.value // empty')

if [ -z "$ANALYSIS" ]; then

echo "Analysis metadata did not contain a value — skipping."

echo "skip=true" >> "$GITHUB_OUTPUT"

exit 0

fi

# Save to file to avoid shell quoting issues.

printf '%s\n' "$ANALYSIS" > /tmp/analysis.md

- Gate analysis at job start on BUILDKITE_BUILD_STATE (avoid YAML if on build.state, which is evaluated at upload time). Still run when ML_ANALYZE_PREVIOUS is set. - depends_on: merge ML_BUILD_STEP_KEYS and ML_TEST_STEP_KEYS with dedupe. - analyze_build_failure.py: timeouts and RuntimeError on Buildkite API; treat timed_out jobs as failed; redact secrets in log excerpts; defer Anthropic key until Claude call; dry-run without key; meta-data via stdin from temp file; find-previous tolerates API errors. - post-build-analysis: jq extract metadata value; build PR body without leading-indent heredoc; gh api with JSON from jq --rawfile. Made-with: Cursor

elasticsearchmachine · 2026-04-30T04:56:11Z

Pinging @elastic/ml-core (Team:ML)

valeriy42 reviewed Feb 20, 2026

View reviewed changes

edsavage added >build >non-issue :ml v9.4.0 labels Feb 27, 2026

valeriy42 reviewed Feb 27, 2026

View reviewed changes

edsavage force-pushed the ci-build-failure-analyzer branch from 55d3ad2 to 19a3caf Compare March 19, 2026 01:48

edsavage requested a review from valeriy42 March 19, 2026 02:30

edsavage force-pushed the ci-build-failure-analyzer branch from 16df0df to 4cc72f3 Compare March 20, 2026 00:11

edsavage force-pushed the ci-build-failure-analyzer branch from bfe59eb to 934d453 Compare March 24, 2026 20:39

edsavage added the ci:run-qa-tests Run a subset of the QA tests label Mar 24, 2026

edsavage and others added 10 commits March 26, 2026 15:51

[ML] Make AI failure analysis opt-in for PR builds

f87e872

Failure analysis now only runs on PR builds when triggered by a `buildkite analyze` comment, avoiding unnecessary API token usage. Nightly and debug pipelines retain automatic analysis on failure. Made-with: Cursor

[ML] Enable native Buildkite PR comments for build failures

5f13660

Enable the ELASTIC_PR_COMMENTS_ENABLED feature on the PR builds pipeline so that elasticmachine posts a summary comment listing failed steps and build history directly on the GitHub PR. Made-with: Cursor

[ML] Add temporary workflow to test Vault OIDC for GitHub Actions

5a038f4

Made-with: Cursor

[ML] Remove temporary Vault OIDC test workflow

8907f3e

The test confirmed Vault is reachable from GitHub Actions runners and JWT auth paths exist. Actual OIDC login needs to be verified with the infra team. Made-with: Cursor

edsavage added 11 commits March 26, 2026 15:51

[ML] TEMPORARY: deliberate compile error for CI testing

48a036f

Introduce a compile error to test the build failure analysis step. This commit will be reverted immediately after verifying the step. Made-with: Cursor

[ML] TEMPORARY: deliberate compile error for CI testing (take 2)

a567efb

Made-with: Cursor

[ML] Fix analyze step Docker image and revert compile error

729f382

Use python:3 instead of python:3-slim for the analyze_build_failure step. The slim image lacks curl and git which the Buildkite agent hooks require. Also reverts the deliberate compile error. Made-with: Cursor

[ML] TEMPORARY: deliberate compile error for CI testing (take 3)

038527f

Made-with: Cursor

[ML] Revert deliberate compile error after successful CI test

84766f1

The "Analyze build failure" step ran successfully on Build elastic#2385, correctly identifying the deliberate #error as a code bug with high confidence. Reverting to restore normal builds. Made-with: Cursor

[ML] Improve Boost.Test failure detection in log extraction

4d9f3c0

Replace BOOST_ERROR/BOOST_FAIL patterns (source-code macro names that don't appear in logs) with a pattern matching the actual Boost.Test summary output: "*** N failure(s) detected in test suite". Made-with: Cursor

[ML] TEMPORARY: deliberate test failure for CI analysis testing — wil…

27dd2d2

…l be reverted Made-with: Cursor

[ML] Revert deliberate test failure after successful analysis testing

7b7a667

The analysis step correctly identified the Boost.Test failure on all platforms. Reverting to restore normal test behaviour. Made-with: Cursor

edsavage force-pushed the ci-build-failure-analyzer branch from 934d453 to 7b7a667 Compare March 26, 2026 02:51

Merge upstream/main into ci-build-failure-analyzer

31285a5

edsavage added v9.5.0 and removed v9.4.0 labels Apr 13, 2026

edsavage requested a review from Copilot April 30, 2026 03:24

Copilot started reviewing on behalf of edsavage April 30, 2026 03:25 View session

Copilot AI reviewed Apr 30, 2026

View reviewed changes

-# Emit depends_on dynamically — ML_BUILD_STEP_KEYS is a comma-separated
-# list of step keys set by the pipeline generator.  In analyze-previous
-# mode there are no build steps so this block is skipped.
-if [ -n "${ML_BUILD_STEP_KEYS:-}" ]; then
-    echo '    depends_on:'
-    IFS=',' read -ra STEP_KEYS <<< "$ML_BUILD_STEP_KEYS"
-    for key in "${STEP_KEYS[@]}"; do
+# Emit depends_on dynamically — ML_BUILD_STEP_KEYS and ML_TEST_STEP_KEYS
+# are comma-separated lists of step keys set by the pipeline generator.
+# In analyze-previous mode there may be no build or test steps, so this
+# block is skipped when neither list is present.
+DEPENDS_ON_KEYS=()
+if [ -n "${ML_BUILD_STEP_KEYS:-}" ]; then
+    IFS=',' read -ra STEP_KEYS <<< "$ML_BUILD_STEP_KEYS"
+    DEPENDS_ON_KEYS+=("${STEP_KEYS[@]}")
+fi
+if [ -n "${ML_TEST_STEP_KEYS:-}" ]; then
+    IFS=',' read -ra STEP_KEYS <<< "$ML_TEST_STEP_KEYS"
+    DEPENDS_ON_KEYS+=("${STEP_KEYS[@]}")
+fi
+if [ "${#DEPENDS_ON_KEYS[@]}" -gt 0 ]; then
+    echo '    depends_on:'
+    for key in "${DEPENDS_ON_KEYS[@]}"; do

Conversation

edsavage commented Feb 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

New files

Test plan

Uh oh!

prodsecmachine commented Feb 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

✅ Snyk checks have passed. No issues have been found so far.

Uh oh!

valeriy42 left a comment

Choose a reason for hiding this comment

Uh oh!

edsavage commented Feb 26, 2026

Uh oh!

edsavage commented Feb 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

valeriy42 left a comment

Choose a reason for hiding this comment

Uh oh!

edsavage commented Mar 19, 2026

User Experience: Examining a Failed Build

1. Immediate feedback — GitHub commit statuses (existing)

2. Native PR comment from @elasticmachine (new)

3. AI analysis (opt-in for PR builds)

4. AI analysis PR comment from @github-actions[bot] (new)

Root Cause

Classification

Suggested Fix

Confidence

5. Slack + email (nightly/snapshot builds)

Uh oh!

edsavage commented Mar 20, 2026

Uh oh!

edsavage commented Mar 20, 2026

Uh oh!

edsavage commented Mar 24, 2026

Uh oh!

edsavage commented Mar 24, 2026

Uh oh!

edsavage commented Mar 24, 2026

Uh oh!

edsavage commented Mar 27, 2026

Uh oh!

edsavage commented Mar 27, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

edsavage commented Feb 20, 2026 •

edited

Loading

prodsecmachine commented Feb 20, 2026 •

edited

Loading

edsavage commented Feb 27, 2026 •

edited

Loading

2. Native PR comment from `@elasticmachine` (new)

4. AI analysis PR comment from `@github-actions[bot]` (new)