[ML] Fix PyTorch Docker CI: sccache layout and MKL for torch imports #3028
Merged
edsavage merged 3 commits into elastic:main on Apr 28, 2026
✅ Snyk checks have passed. No issues have been found so far.
Pinging @elastic/ml-core (Team:ML)
c94c32d to f6bb853
Install the sccache release binary into /usr/local/gcc133/bin so it is on the existing PATH (no extra directories). The previous layout used /usr/local/bin without listing it on PATH, which broke the GCS-backed compile RUN with 'sccache: command not found' (exit 127). Remove sccache from the final runtime image after copying gcc133; it is only required during the builder compile step. Made-with: Cursor
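The failure mode in this commit message can be reproduced outside Docker. The sketch below is a hypothetical stand-in, not the Dockerfile change itself: a stub script plays the role of the sccache binary, and two temporary directories (local-bin and gcc133-bin, names invented for illustration) play the roles of /usr/local/bin and /usr/local/gcc133/bin. A shell reports status 127 when a command cannot be found on PATH, which matches the exit code quoted above.

```shell
#!/bin/sh
# Hypothetical reproduction: a stub script stands in for the real sccache binary,
# and temp directories stand in for /usr/local/bin and /usr/local/gcc133/bin.
tmp=$(mktemp -d)
mkdir -p "$tmp/local-bin" "$tmp/gcc133-bin"
printf '#!/bin/sh\nexit 0\n' > "$tmp/local-bin/sccache"
chmod +x "$tmp/local-bin/sccache"

# Failing layout: the binary exists, but its directory is not on PATH.
missing_status=0
PATH="$tmp/gcc133-bin" /bin/sh -c 'sccache' 2>/dev/null || missing_status=$?

# Fixed layout: the binary sits in a directory that PATH already lists.
mv "$tmp/local-bin/sccache" "$tmp/gcc133-bin/sccache"
found_status=0
PATH="$tmp/gcc133-bin" /bin/sh -c 'sccache' 2>/dev/null || found_status=$?

echo "not on PATH: exit $missing_status; on PATH: exit $found_status"
rm -rf "$tmp"
```

The inner shell is invoked by absolute path so that the restricted PATH affects only the lookup of sccache.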
f6bb853 to c18f430
PyTorch in ml-linux-build is linked against MKL under /usr/local/gcc133 but the final image stage did not export LD_LIBRARY_PATH, so import torch failed in CI (libmkl_intel_lp64.so.2 not found). Set LD_LIBRARY_PATH on the validate_pytorch_allowlist Buildkite step for existing agents, and bake the same env into linux_image for future image releases. Made-with: Cursor
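For an already-published image, the same environment can be set by hand before rerunning the import check. The following is a minimal sketch rather than the exact value used in the Buildkite step: prepend_path is a hypothetical helper, and the directory /usr/local/gcc133/lib is taken from this PR's root-cause notes. The helper avoids leaving an empty entry in the search path, which the dynamic linker may treat as the current directory.

```shell
#!/bin/sh
# Print search path $2 with directory $1 in front, skipping duplicates and
# avoiding an empty entry when $2 is empty. (prepend_path is a hypothetical helper.)
prepend_path() {
    dir=$1
    current=$2
    case ":$current:" in
        *":$dir:"*) printf '%s\n' "$current" ;;           # already listed
        *) printf '%s\n' "$dir${current:+:$current}" ;;   # put it in front
    esac
}

# Assumed directory: the PR description places MKL under /usr/local/gcc133/lib.
LD_LIBRARY_PATH=$(prepend_path /usr/local/gcc133/lib "${LD_LIBRARY_PATH:-}")
export LD_LIBRARY_PATH
echo "LD_LIBRARY_PATH=$LD_LIBRARY_PATH"

# With the variable exported, the failing check can be rerun in the image:
# python3 -c "import torch"
```

Prepending rather than overwriting keeps any library directories the agent already relies on.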
valeriy42 approved these changes Apr 27, 2026
FROM rockylinux:8
COPY --from=builder /usr/local/gcc133 /usr/local/gcc133
RUN rm -f /usr/local/gcc133/bin/sccache
Contributor
nit: Are you doing this to reduce the image size? It will be present in the Docker image anyway from the previous COPY layer.
Contributor (Author)
Yes, the intent was to reduce the image size. I've restructured things so it never gets copied over in the first place.
Place sccache in /usr/local/bin and add that directory to PATH only in the builder stage so COPY --from=builder /usr/local/gcc133 no longer carries sccache into the runtime image. Removes the post-COPY rm workaround and avoids leaving sccache bytes in a gcc133 layer (per review feedback). Made-with: Cursor
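The effect of this restructure can be simulated without Docker. The sketch below is hypothetical: temporary directories stand in for the builder and runtime stages, empty files stand in for the real binaries, and cp -R plays the role of COPY --from=builder. Because sccache sits outside the gcc133 tree, copying that tree alone should not carry it over, so no later rm is needed.

```shell
#!/bin/sh
# Hypothetical simulation: temp directories stand in for the two image stages.
builder=$(mktemp -d)
runtime=$(mktemp -d)

# Builder stage: toolchain under gcc133, sccache kept outside that tree.
mkdir -p "$builder/usr/local/gcc133/bin" "$builder/usr/local/bin"
: > "$builder/usr/local/gcc133/bin/gcc"
: > "$builder/usr/local/bin/sccache"

# Equivalent of: COPY --from=builder /usr/local/gcc133 /usr/local/gcc133
mkdir -p "$runtime/usr/local"
cp -R "$builder/usr/local/gcc133" "$runtime/usr/local/gcc133"

# Count any sccache files that reached the runtime tree.
leaked=$(find "$runtime" -name sccache | wc -l | tr -d ' ')
echo "sccache files in runtime tree: $leaked"

rm -rf "$builder" "$runtime"
```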
Summary

PyTorch nightly Docker image (dev-tools/docker/pytorch_linux_image)
Install the sccache release binary into /usr/local/gcc133/bin so it is on the existing builder PATH (avoids 'sccache: command not found' / exit 127 when BuildKit mounts the GCS key). Strip sccache from the final image after copying gcc133 so the runtime image does not ship the build-only tool.

ml-linux-build image (dev-tools/docker/linux_image)
Export LD_LIBRARY_PATH and PATH in the final rockylinux stage (aligned with the builder) so dynamically loaded libraries (Intel MKL, libtorch_cpu, etc.) resolve when running tools such as python3 -c "import torch" outside the compile RUN.

Buildkite
Set LD_LIBRARY_PATH on the Validate PyTorch allowlist step so currently published ml-linux-build:34 agents pick up MKL until a new Linux image is published that includes the Dockerfile change.

Root causes

PyTorch image build: sccache was unpacked to /usr/local/bin while PATH listed /usr/local/gcc133/bin and /usr/bin only, so the GCS-enabled compile step could not execute sccache.

PR pipeline: Allowlist validation uses the same Docker image as Linux builds; import torch failed with 'libmkl_intel_lp64.so.2: cannot open shared object file' because MKL lives under /usr/local/gcc133/lib but the runtime environment did not include that on the dynamic linker search path.