[ML] Fix PyTorch Docker CI: sccache layout and MKL for torch imports by edsavage · Pull Request #3028 · elastic/ml-cpp

edsavage · 2026-04-23T03:39:16Z

Summary

PyTorch nightly Docker image (dev-tools/docker/pytorch_linux_image)
Install the sccache release binary into /usr/local/gcc133/bin so it is on the existing builder PATH (avoids sccache: command not found / exit 127 when BuildKit mounts the GCS key). Strip sccache from the final image after copying gcc133 so the runtime image does not ship the build-only tool.
ml-linux-build image (dev-tools/docker/linux_image)
Export LD_LIBRARY_PATH and PATH in the final rockylinux stage (aligned with the builder) so dynamically loaded libraries (Intel MKL, libtorch_cpu, etc.) resolve when running tools such as python3 -c "import torch" outside the compile RUN.
Buildkite
Set LD_LIBRARY_PATH on the Validate PyTorch allowlist step so currently published ml-linux-build:34 agents pick up MKL until a new Linux image is published that includes the Dockerfile change.

Root causes

PyTorch image build: sccache was unpacked to /usr/local/bin while PATH listed /usr/local/gcc133/bin and /usr/bin only, so the GCS-enabled compile step could not execute sccache.
PR pipeline: Allowlist validation uses the same Docker image as Linux builds; import torch failed with libmkl_intel_lp64.so.2: cannot open shared object file because MKL lives under /usr/local/gcc133/lib but the runtime environment did not include that on the dynamic linker search path.
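The first root cause above can be reproduced outside Docker. The following is a minimal sketch, assuming a stub script in place of the real sccache binary and temporary directories in place of /usr/local/bin and /usr/local/gcc133/bin; none of these stand-ins come from the PR itself. It shows that a shell reports exit status 127 when the command's directory is not on PATH, and that the same command runs once the directory is listed.

```shell
#!/bin/sh
# Sketch only: a stub script stands in for the real sccache binary, and
# temporary directories stand in for /usr/local/bin and /usr/local/gcc133/bin.
tmp=$(mktemp -d)
mkdir -p "$tmp/usr/local/bin" "$tmp/usr/local/gcc133/bin"
printf '#!/bin/sh\necho "sccache stub"\n' > "$tmp/usr/local/bin/sccache"
chmod +x "$tmp/usr/local/bin/sccache"

# Broken layout: the stub's directory is not listed on PATH, so lookup fails.
status_missing=0
PATH="$tmp/usr/local/gcc133/bin" /bin/sh -c 'sccache' 2>/dev/null || status_missing=$?

# Fixed layout: the stub's directory is on PATH, so the same command runs.
status_found=0
output_found=$(PATH="$tmp/usr/local/bin:$tmp/usr/local/gcc133/bin" /bin/sh -c 'sccache') || status_found=$?

echo "missing: exit $status_missing"
echo "found:   exit $status_found ($output_found)"
rm -rf "$tmp"
```

Either fix discussed in this PR (moving the binary into a directory already on PATH, or adding its directory to PATH in the builder stage) corresponds to the second invocation.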

prodsecmachine · 2026-04-23T03:39:28Z

✅ Snyk checks have passed. No issues have been found so far.

Status	Scan Engine	Critical	High	Medium	Low	Total (0)
✅	Open Source Security	0	0	0	0	0 issues
✅	Licenses	0	0	0	0	0 issues


elasticsearchmachine · 2026-04-23T03:39:41Z

Pinging @elastic/ml-core (Team:ML)

Install the sccache release binary into /usr/local/gcc133/bin so it is on the existing PATH (no extra directories). The previous layout used /usr/local/bin without listing it on PATH, which broke the GCS-backed compile RUN with 'sccache: command not found' (exit 127). Remove sccache from the final runtime image after copying gcc133; it is only required during the builder compile step. Made-with: Cursor

PyTorch in ml-linux-build is linked against MKL under /usr/local/gcc133 but the final image stage did not export LD_LIBRARY_PATH, so import torch failed in CI (libmkl_intel_lp64.so.2 not found). Set LD_LIBRARY_PATH on the validate_pytorch_allowlist Buildkite step for existing agents, and bake the same env into linux_image for future image releases. Made-with: Cursor
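As a rough illustration of the environment change described in this commit, the sketch below prepends the MKL directory to LD_LIBRARY_PATH without duplicating it. The helper name prepend_lib_dir is hypothetical, and the exact value set in the Dockerfile and Buildkite step is an assumption; only the /usr/local/gcc133/lib location is taken from the PR description.

```shell
#!/bin/sh
# Sketch only: prepend_lib_dir is a hypothetical helper, not part of ml-cpp.
# It prints the search path with the directory prepended unless already present.
prepend_lib_dir() {
    dir=$1
    current=$2
    case ":$current:" in
        *":$dir:"*) printf '%s\n' "$current" ;;
        *) printf '%s\n' "$dir${current:+:$current}" ;;
    esac
}

# MKL directory as described in the PR; the final exported value is an assumption.
mkl_dir=/usr/local/gcc133/lib
LD_LIBRARY_PATH=$(prepend_lib_dir "$mkl_dir" "${LD_LIBRARY_PATH:-}")
export LD_LIBRARY_PATH
echo "LD_LIBRARY_PATH=$LD_LIBRARY_PATH"

# Inside the image, the import check from the PR could then be run as:
#   python3 -c "import torch"
```

Prepending rather than overwriting keeps any search directories the agent already relies on.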

valeriy42

LGTM

valeriy42 · 2026-04-27T09:10:00Z


 FROM rockylinux:8
 COPY --from=builder /usr/local/gcc133 /usr/local/gcc133
+RUN rm -f /usr/local/gcc133/bin/sccache


nit: Are you doing this to reduce the image size? It will be present in the Docker image anyway from the previous COPY layer.

edsavage · Apr 28, 2026

Yes, the intent was to reduce the image size. I've restructured things so it never gets copied over in the first place.
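The reviewer's point about layers can be illustrated with a simplified simulation. In the sketch below, plain directories stand in for image layers and a 4096-byte file stands in for the sccache binary (both are assumptions for illustration). A later layer that deletes a file records only a whiteout marker (named with a `.wh.` prefix in the OCI layer format), so the bytes added by the earlier COPY layer still count toward the total image size.

```shell
#!/bin/sh
# Sketch only: directories stand in for image layers; sizes are illustrative.
tmp=$(mktemp -d)
mkdir -p "$tmp/layer_copy/usr/local/gcc133/bin" "$tmp/layer_rm/usr/local/gcc133/bin"

# Layer produced by COPY: carries the full (stand-in) sccache binary.
head -c 4096 /dev/zero > "$tmp/layer_copy/usr/local/gcc133/bin/sccache"

# Layer produced by RUN rm: records only an empty whiteout marker.
: > "$tmp/layer_rm/usr/local/gcc133/bin/.wh.sccache"

# Sum the bytes of all regular files in a layer directory.
layer_bytes() {
    find "$1" -type f -exec cat {} + | wc -c | tr -d ' '
}

copy_bytes=$(layer_bytes "$tmp/layer_copy")
rm_bytes=$(layer_bytes "$tmp/layer_rm")
total_bytes=$((copy_bytes + rm_bytes))
echo "COPY layer: $copy_bytes bytes, rm layer: $rm_bytes bytes, total: $total_bytes bytes"
rm -rf "$tmp"
```

Keeping the binary out of the copied directory in the first place, as the follow-up commit does, avoids adding those bytes to any layer of the runtime image.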

Place sccache in /usr/local/bin and add that directory to PATH only in the builder stage so COPY --from=builder /usr/local/gcc133 no longer carries sccache into the runtime image. Removes the post-COPY rm workaround and avoids leaving sccache bytes in a gcc133 layer (per review feedback). Made-with: Cursor

edsavage added >build >non-issue :ml v9.5.0 labels Apr 23, 2026

edsavage force-pushed the fix/pytorch-docker-sccache-path branch from c94c32d to f6bb853 Compare April 23, 2026 22:28

edsavage force-pushed the fix/pytorch-docker-sccache-path branch from f6bb853 to c18f430 Compare April 23, 2026 22:33

edsavage changed the title from "[ML] Fix PyTorch Docker build: add /usr/local/bin to PATH for sccache" to "[ML] Fix PyTorch Docker build: ensure sccache is on PATH" Apr 23, 2026

edsavage changed the title from "[ML] Fix PyTorch Docker build: ensure sccache is on PATH" to "[ML] Fix PyTorch Docker CI: sccache layout and MKL for torch imports" Apr 23, 2026

valeriy42 approved these changes Apr 27, 2026


edsavage merged commit 2e8cb19 into elastic:main Apr 28, 2026
22 checks passed

edsavage merged 3 commits into elastic:main from edsavage:fix/pytorch-docker-sccache-path
