[ML] Fix PyTorch Docker CI: sccache layout and MKL for torch imports#3028

Merged
edsavage merged 3 commits into elastic:main from edsavage:fix/pytorch-docker-sccache-path
Apr 28, 2026

@edsavage edsavage commented Apr 23, 2026

Summary

  • PyTorch nightly Docker image (dev-tools/docker/pytorch_linux_image)
    Install the sccache release binary into /usr/local/gcc133/bin so it is on the existing builder PATH (avoids sccache: command not found / exit 127 when BuildKit mounts the GCS key). Strip sccache from the final image after copying gcc133 so the runtime image does not ship the build-only tool.

  • ml-linux-build image (dev-tools/docker/linux_image)
    Export LD_LIBRARY_PATH and PATH in the final rockylinux stage (aligned with the builder) so dynamically loaded libraries (Intel MKL, libtorch_cpu, etc.) resolve when running tools such as python3 -c "import torch" outside the compile RUN.

  • Buildkite
    Set LD_LIBRARY_PATH on the Validate PyTorch allowlist step so currently published ml-linux-build:34 agents pick up MKL until a new Linux image is published that includes the Dockerfile change.

Root causes

  1. PyTorch image build: sccache was unpacked to /usr/local/bin while PATH listed /usr/local/gcc133/bin and /usr/bin only, so the GCS-enabled compile step could not execute sccache.

  2. PR pipeline: Allowlist validation uses the same Docker image as Linux builds; import torch failed with libmkl_intel_lp64.so.2: cannot open shared object file because MKL lives under /usr/local/gcc133/lib but the runtime environment did not include that on the dynamic linker search path.
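
Root cause 1 can be reproduced in miniature. The following is a minimal sketch, not taken from the PR: a temporary directory stands in for the directory that holds the tool, and `sccache-demo` is a hypothetical fake script rather than the real sccache binary. It shows that a command installed outside every PATH entry fails with exit status 127, and resolves once its directory is listed.

```shell
#!/bin/sh
# Hypothetical stand-in: a temp dir plays the role of the tool directory,
# and "sccache-demo" is a fake script, not the real sccache.
tooldir=$(mktemp -d)
printf '#!/bin/sh\nexit 0\n' > "$tooldir/sccache-demo"
chmod +x "$tooldir/sccache-demo"

# Tool directory missing from PATH: the child shell cannot find the command.
PATH=/usr/bin:/bin sh -c 'sccache-demo' 2>/dev/null
missing_status=$?

# Tool directory on PATH: the same command name now resolves.
PATH="$tooldir:/usr/bin:/bin" sh -c 'sccache-demo'
found_status=$?

echo "missing from PATH: exit $missing_status"
echo "present on PATH:   exit $found_status"
rm -rf "$tooldir"
```

The first status is the shell's "command not found" code, which matches the exit 127 reported for the GCS-enabled compile step.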

prodsecmachine commented Apr 23, 2026

Snyk checks have passed. No issues have been found so far.

| Status | Scan Engine | Critical | High | Medium | Low | Total (0) |
|---|---|---|---|---|---|---|
| | Open Source Security | 0 | 0 | 0 | 0 | 0 issues |
| | Licenses | 0 | 0 | 0 | 0 | 0 issues |

@elasticsearchmachine

Pinging @elastic/ml-core (Team:ML)

@edsavage edsavage force-pushed the fix/pytorch-docker-sccache-path branch from c94c32d to f6bb853 Compare April 23, 2026 22:28
Install the sccache release binary into /usr/local/gcc133/bin so it is on
the existing PATH (no extra directories). The previous layout used
/usr/local/bin without listing it on PATH, which broke the GCS-backed
compile RUN with 'sccache: command not found' (exit 127).

Remove sccache from the final runtime image after copying gcc133; it is
only required during the builder compile step.

Made-with: Cursor
@edsavage edsavage force-pushed the fix/pytorch-docker-sccache-path branch from f6bb853 to c18f430 Compare April 23, 2026 22:33
@edsavage edsavage changed the title from "[ML] Fix PyTorch Docker build: add /usr/local/bin to PATH for sccache" to "[ML] Fix PyTorch Docker build: ensure sccache is on PATH" Apr 23, 2026
PyTorch in ml-linux-build is linked against MKL under /usr/local/gcc133 but
the final image stage did not export LD_LIBRARY_PATH, so import torch failed
in CI (libmkl_intel_lp64.so.2 not found).

Set LD_LIBRARY_PATH on the validate_pytorch_allowlist Buildkite step for
existing agents, and bake the same env into linux_image for future image
releases.

Made-with: Cursor
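
As a rough illustration of the environment change this commit describes, the sketch below prepends the library directory to LD_LIBRARY_PATH before a tool is run. It is an assumption-laden example, not the actual Buildkite step or Dockerfile content: the variable name `ml_lib_dir` is hypothetical, and the directory is the /usr/local/gcc133/lib location named in the PR description.

```shell
#!/bin/sh
# Assumption: MKL lives under this directory, as the PR description states.
ml_lib_dir=/usr/local/gcc133/lib

# Prepend the directory only if it is not already listed, keeping any
# existing entries after it.
case ":${LD_LIBRARY_PATH:-}:" in
  *":$ml_lib_dir:"*) ;;
  *) LD_LIBRARY_PATH="$ml_lib_dir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" ;;
esac
export LD_LIBRARY_PATH

echo "LD_LIBRARY_PATH=$LD_LIBRARY_PATH"

# On an image that has PyTorch installed, the import check from the PR
# description could then be run as:
#   python3 -c "import torch"
```

The `case` guard keeps the export idempotent, so running it on an image that already bakes in the variable does not duplicate the entry.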
@edsavage edsavage changed the title from "[ML] Fix PyTorch Docker build: ensure sccache is on PATH" to "[ML] Fix PyTorch Docker CI: sccache layout and MKL for torch imports" Apr 23, 2026

@valeriy42 valeriy42 left a comment

LGTM


FROM rockylinux:8
COPY --from=builder /usr/local/gcc133 /usr/local/gcc133
RUN rm -f /usr/local/gcc133/bin/sccache

nit: Are you doing this to reduce the image size? It will be present in the Docker image anyway from the previous COPY layer.

Yes, the intent was to reduce the image size. I've restructured things so it never gets copied over in the first place.

Place sccache in /usr/local/bin and add that directory to PATH only in the
builder stage so COPY --from=builder /usr/local/gcc133 no longer carries
sccache into the runtime image. Removes the post-COPY rm workaround and
avoids leaving sccache bytes in a gcc133 layer (per review feedback).

Made-with: Cursor
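
The effect of the new layout can be simulated outside Docker. This is only a sketch under stated assumptions: temporary directories stand in for the builder and runtime filesystems, `cp -R` stands in for `COPY --from=builder /usr/local/gcc133 /usr/local/gcc133`, and the files are empty placeholders. It shows that a tool kept in /usr/local/bin never reaches a tree built by copying only gcc133.

```shell
#!/bin/sh
# Simulation only: temp dirs stand in for the builder and runtime filesystems,
# and cp -R stands in for COPY --from=builder of the gcc133 directory.
builder=$(mktemp -d)
runtime=$(mktemp -d)

mkdir -p "$builder/usr/local/gcc133/bin" "$builder/usr/local/bin" "$runtime/usr/local"
touch "$builder/usr/local/gcc133/bin/gcc"   # placeholder for the toolchain
touch "$builder/usr/local/bin/sccache"      # build-only tool, outside gcc133

# Copy only the gcc133 tree into the "runtime" root.
cp -R "$builder/usr/local/gcc133" "$runtime/usr/local/gcc133"

# Record whether the toolchain arrived and how many sccache files came along.
if [ -e "$runtime/usr/local/gcc133/bin/gcc" ]; then
  toolchain_copied=yes
else
  toolchain_copied=no
fi
leaked=$(find "$runtime" -name sccache | wc -l | tr -d ' ')

echo "toolchain copied: $toolchain_copied"
echo "sccache files in runtime tree: $leaked"
rm -rf "$builder" "$runtime"
```

Because the tool is never part of the copied directory, no later removal step is needed, which is the point raised in the review comment about the earlier COPY layer.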
@edsavage edsavage merged commit 2e8cb19 into elastic:main Apr 28, 2026
22 checks passed