fix(training): leave NVLS on + run one rank per GPU in non-torchrun reference by wimarkuske · Pull Request #74 · coreweave/reference-architecture

wimarkuske · 2026-05-15T17:59:53Z

Summary

Commit 1 — Remove export NCCL_NVLS_ENABLE=0 from allreduce-torchrun.sbatch and allreduce.sbatch. NCCL's default (NVLS on where supported) is now what the reference recommends.

Commit 2 — Rewrite allreduce.sbatch + allreduce.py to launch one rank per GPU. The old mapping (rank=SLURM_NODEID, world_size=SLURM_JOB_NUM_NODES, ntasks-per-node=1) collapsed to one rank per node so only 1 GPU per node participated in the AllReduce.

Why

NVLS: Per Peter Salanki (Slack) — "We should fix this to not recommend disabling of NVLS. We should leave NVLS on." The reference sbatch is what gets quoted to customers in support threads; leaving NVLS disabled in the reference propagates the wrong recommendation.
Rank-per-GPU: discovered while validating the NVLS change on a 2-node SUNK cluster. With the old mapping, NVLS could never engage because there was only one rank per node (NVLS is an intra-node multi-rank optimization). The rewrite makes the non-torchrun reference actually exercise all GPUs/HCAs.

Test plan

Diff inspected — only NCCL_NVLS_ENABLE=0 line removed from each sbatch; rank/world_size mapping switched to SLURM_PROCID/SLURM_NTASKS; ntasks-per-node bumped to 8 with gpus-per-node=8.
Submitted allreduce-torchrun.sbatch on 1-node and 2-node 8xH100 SUNK clusters: COMPLETED, AllReduce sums correct (28 for 8 ranks, 120 for 16 ranks), 16 nvls channels allocated, NVLS multicast available on all GPUs.
Submitted patched allreduce.sbatch on 2-node 8xH100: COMPLETED, world_size=16, AllReduce sum=120 (expected 120), all 8 IB HCAs (ibp0..ibp7) discovered via GDRDMA, 16 NVLS channels per rank.

🤖 Generated with Claude Code

NVLS (NVLink SHARP) should be left on for the reference torch-allreduce sbatch files. Per Peter Salanki, we should not be recommending users disable NVLS in the reference architecture. Removes `export NCCL_NVLS_ENABLE=0` from both allreduce-torchrun.sbatch and allreduce.sbatch so NCCL's default behavior (NVLS on where supported) is preserved. Ref: https://coreweave.slack.com/archives/C06HJF9F16E/p1778789195577419 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The non-torchrun reference previously used SLURM_NODEID as the rank and SLURM_JOB_NUM_NODES as world_size, with ntasks-per-node=1. That collapses to one rank per node, so only 1 of every node's GPUs actually participates in the all-reduce. Switch the canonical mapping to one Slurm task per GPU: - allreduce.sbatch: ntasks-per-node=8, gpus-per-node=8 (all GPUs visible to every task so SLURM_LOCALID-based device pinning works under ConstrainDevices). - allreduce.py: rank = SLURM_PROCID, world_size = SLURM_NTASKS. The rank-0 log gate switches to SLURM_PROCID accordingly. Validated on a 2-node 8xH100 SUNK cluster: world_size=16, AllReduce sum matches expected (120), all 8 IB HCAs (ibp0..ibp7) discovered via GDRDMA, and 16 NVLS channels allocated per rank. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Without a GPU resource directive, srun on a ConstrainDevices=yes cluster hands the torchrun parent zero GPUs and the workers fail with `ProcessGroupNCCL is only supported with GPUs, no GPUs found!`. Add `#SBATCH --gpus-per-node=8` (matching the rewritten allreduce.sbatch) and clarify the existing `ntasks-per-node=1` comment to call out that torchrun spawns the per-GPU workers. Validated on a 2-node 8xH100 SUNK cluster: world_size=16, AllReduce sum matches expected (120). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

wimarkuske and others added 2 commits May 15, 2026 10:59

wimarkuske changed the title ~~fix(training): leave NVLS enabled in reference allreduce sbatch~~ fix(training): leave NVLS on + run one rank per GPU in non-torchrun reference May 15, 2026

MYanello approved these changes May 15, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(training): leave NVLS on + run one rank per GPU in non-torchrun reference#74

fix(training): leave NVLS on + run one rank per GPU in non-torchrun reference#74
wimarkuske wants to merge 3 commits into
mainfrom
wmarkuske/leave-nvls-on

wimarkuske commented May 15, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

wimarkuske commented May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Why

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

wimarkuske commented May 15, 2026 •

edited

Loading