fix(training): leave NVLS on + run one rank per GPU in non-torchrun reference#74
Open
wimarkuske wants to merge 3 commits into
Open
fix(training): leave NVLS on + run one rank per GPU in non-torchrun reference#74wimarkuske wants to merge 3 commits into
wimarkuske wants to merge 3 commits into
Conversation
NVLS (NVLink SHARP) should be left on for the reference torch-allreduce sbatch files. Per Peter Salanki, we should not be recommending users disable NVLS in the reference architecture. Removes `export NCCL_NVLS_ENABLE=0` from both allreduce-torchrun.sbatch and allreduce.sbatch so NCCL's default behavior (NVLS on where supported) is preserved. Ref: https://coreweave.slack.com/archives/C06HJF9F16E/p1778789195577419 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The non-torchrun reference previously used SLURM_NODEID as the rank and SLURM_JOB_NUM_NODES as world_size, with ntasks-per-node=1. That collapses to one rank per node, so only 1 of every node's GPUs actually participates in the all-reduce. Switch the canonical mapping to one Slurm task per GPU: - allreduce.sbatch: ntasks-per-node=8, gpus-per-node=8 (all GPUs visible to every task so SLURM_LOCALID-based device pinning works under ConstrainDevices). - allreduce.py: rank = SLURM_PROCID, world_size = SLURM_NTASKS. The rank-0 log gate switches to SLURM_PROCID accordingly. Validated on a 2-node 8xH100 SUNK cluster: world_size=16, AllReduce sum matches expected (120), all 8 IB HCAs (ibp0..ibp7) discovered via GDRDMA, and 16 NVLS channels allocated per rank. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Without a GPU resource directive, srun on a ConstrainDevices=yes cluster hands the torchrun parent zero GPUs and the workers fail with `ProcessGroupNCCL is only supported with GPUs, no GPUs found!`. Add `#SBATCH --gpus-per-node=8` (matching the rewritten allreduce.sbatch) and clarify the existing `ntasks-per-node=1` comment to call out that torchrun spawns the per-GPU workers. Validated on a 2-node 8xH100 SUNK cluster: world_size=16, AllReduce sum matches expected (120). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
MYanello
approved these changes
May 15, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Commit 1 — Remove
export NCCL_NVLS_ENABLE=0fromallreduce-torchrun.sbatchandallreduce.sbatch. NCCL's default (NVLS on where supported) is now what the reference recommends.Commit 2 — Rewrite
allreduce.sbatch+allreduce.pyto launch one rank per GPU. The old mapping (rank=SLURM_NODEID,world_size=SLURM_JOB_NUM_NODES,ntasks-per-node=1) collapsed to one rank per node so only 1 GPU per node participated in the AllReduce.Why
Test plan
NCCL_NVLS_ENABLE=0line removed from each sbatch; rank/world_size mapping switched toSLURM_PROCID/SLURM_NTASKS; ntasks-per-node bumped to 8 withgpus-per-node=8.allreduce-torchrun.sbatchon 1-node and 2-node 8xH100 SUNK clusters: COMPLETED, AllReduce sums correct (28 for 8 ranks, 120 for 16 ranks),16 nvls channelsallocated, NVLS multicast available on all GPUs.allreduce.sbatchon 2-node 8xH100: COMPLETED, world_size=16, AllReduce sum=120 (expected 120), all 8 IB HCAs (ibp0..ibp7) discovered via GDRDMA, 16 NVLS channels per rank.🤖 Generated with Claude Code