Skip to content

fix(training): leave NVLS on + run one rank per GPU in non-torchrun reference#74

Open
wimarkuske wants to merge 3 commits into
mainfrom
wmarkuske/leave-nvls-on
Open

fix(training): leave NVLS on + run one rank per GPU in non-torchrun reference#74
wimarkuske wants to merge 3 commits into
mainfrom
wmarkuske/leave-nvls-on

Conversation

@wimarkuske
Copy link
Copy Markdown
Contributor

@wimarkuske wimarkuske commented May 15, 2026

Summary

Commit 1 — Remove export NCCL_NVLS_ENABLE=0 from allreduce-torchrun.sbatch and allreduce.sbatch. NCCL's default (NVLS on where supported) is now what the reference recommends.

Commit 2 — Rewrite allreduce.sbatch + allreduce.py to launch one rank per GPU. The old mapping (rank=SLURM_NODEID, world_size=SLURM_JOB_NUM_NODES, ntasks-per-node=1) collapsed to one rank per node so only 1 GPU per node participated in the AllReduce.

Why

  • NVLS: Per Peter Salanki (Slack) — "We should fix this to not recommend disabling of NVLS. We should leave NVLS on." The reference sbatch is what gets quoted to customers in support threads; leaving NVLS disabled in the reference propagates the wrong recommendation.
  • Rank-per-GPU: discovered while validating the NVLS change on a 2-node SUNK cluster. With the old mapping, NVLS could never engage because there was only one rank per node (NVLS is an intra-node multi-rank optimization). The rewrite makes the non-torchrun reference actually exercise all GPUs/HCAs.

Test plan

  • Diff inspected — only NCCL_NVLS_ENABLE=0 line removed from each sbatch; rank/world_size mapping switched to SLURM_PROCID/SLURM_NTASKS; ntasks-per-node bumped to 8 with gpus-per-node=8.
  • Submitted allreduce-torchrun.sbatch on 1-node and 2-node 8xH100 SUNK clusters: COMPLETED, AllReduce sums correct (28 for 8 ranks, 120 for 16 ranks), 16 nvls channels allocated, NVLS multicast available on all GPUs.
  • Submitted patched allreduce.sbatch on 2-node 8xH100: COMPLETED, world_size=16, AllReduce sum=120 (expected 120), all 8 IB HCAs (ibp0..ibp7) discovered via GDRDMA, 16 NVLS channels per rank.

🤖 Generated with Claude Code

wimarkuske and others added 2 commits May 15, 2026 10:59
NVLS (NVLink SHARP) should be left on for the reference torch-allreduce
sbatch files. Per Peter Salanki, we should not be recommending users
disable NVLS in the reference architecture.

Removes `export NCCL_NVLS_ENABLE=0` from both allreduce-torchrun.sbatch
and allreduce.sbatch so NCCL's default behavior (NVLS on where
supported) is preserved.

Ref: https://coreweave.slack.com/archives/C06HJF9F16E/p1778789195577419

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The non-torchrun reference previously used SLURM_NODEID as the rank and
SLURM_JOB_NUM_NODES as world_size, with ntasks-per-node=1. That collapses
to one rank per node, so only 1 of every node's GPUs actually
participates in the all-reduce.

Switch the canonical mapping to one Slurm task per GPU:

- allreduce.sbatch: ntasks-per-node=8, gpus-per-node=8 (all GPUs visible
  to every task so SLURM_LOCALID-based device pinning works under
  ConstrainDevices).
- allreduce.py: rank = SLURM_PROCID, world_size = SLURM_NTASKS. The
  rank-0 log gate switches to SLURM_PROCID accordingly.

Validated on a 2-node 8xH100 SUNK cluster: world_size=16, AllReduce sum
matches expected (120), all 8 IB HCAs (ibp0..ibp7) discovered via
GDRDMA, and 16 NVLS channels allocated per rank.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@wimarkuske wimarkuske changed the title fix(training): leave NVLS enabled in reference allreduce sbatch fix(training): leave NVLS on + run one rank per GPU in non-torchrun reference May 15, 2026
Without a GPU resource directive, srun on a ConstrainDevices=yes cluster
hands the torchrun parent zero GPUs and the workers fail with
`ProcessGroupNCCL is only supported with GPUs, no GPUs found!`.

Add `#SBATCH --gpus-per-node=8` (matching the rewritten allreduce.sbatch)
and clarify the existing `ntasks-per-node=1` comment to call out that
torchrun spawns the per-GPU workers.

Validated on a 2-node 8xH100 SUNK cluster: world_size=16, AllReduce sum
matches expected (120).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants