feat: auto-pad FP8 GEMM dimensions for unaligned sequence packing by NoonePauseferg · Pull Request #2911 · NVIDIA/TransformerEngine

NoonePauseferg · 2026-04-21T15:40:05Z

Problem

cuBLAS FP8 GEMM requires lda/ldb % 16 == 0 and m % 8 == 0. With sequence packing (used in RL training frameworks like VERL, OpenRLHF), the total token count per micro-batch is dynamic and almost never aligned to 16:

Micro-batch: 8 sequences, total = 11486 tokens
11486 / TP_size(2) = 5743 tokens per rank
5743 % 16 = 15 ≠ 0 → cuBLAS FP8 GEMM crashes with:
  Assertion failed: ret.lda % 16 == 0

This affects:

RL training (GRPO/PPO with sequence packing and dynamic batch sizes)
MoE models (GroupedLinear per-expert token counts are dynamic after AllToAll dispatch)
Any workload with variable-length packed sequences + FP8

Currently users must manually pad tensors before calling TE modules. External padding corrupts training — padding tokens distort FP8 scale factors, causing 40–500× gradient explosion (documented in #2892).

Solution

Auto-pad m and k dimensions to multiples of 16 inside cublas_gemm() using temporary buffers. No external padding needed, no training corruption.

How it works

m-padding (output dimension, e.g. fprop/dgrad):

Allocate padded output buffer via cudaMallocAsync (stream-ordered, no CPU sync)
Run cuBLAS GEMM into padded buffer (ldd = m_padded)
Copy m_real rows per column back to original output via cudaMemcpy2DAsync
cudaFreeAsync — no sync, no pipeline bubbles

k-padding (contraction dimension, e.g. wgrad where k = num_tokens):

Allocate zero-initialized padded copies of A and/or B via cudaMallocAsync
Copy original data with cudaMemcpy2DAsync (k_real rows per column, rest stays zero)
Run GEMM — zero-padded rows contribute 0 to dot product (mathematically exact)
cudaFreeAsync

Changes

transformer_engine/common/gemm/cublaslt_gemm.cu: auto-pad logic in cublas_gemm(), removed lda%16 / m%8 assertions in CanonicalizeGemmInput()
transformer_engine/pytorch/utils.py: relaxed assert_dim_for_fp8_exec() — C++ handles alignment now

Results

Tested on H100 (SM90), TE 2.12, PyTorch 2.9.1, CUDA 12.9.
Full RL training pipeline: DeepSeek 10B MoE, 4 nodes × 8 H100, TP=2 PP=2 EP=2, sequence packing.

Training quality (FP8 E2E vs BF16 baseline)

Metric	BF16 baseline	FP8 E2E (this PR)	FP8 with external padding
grad_norm	0.29	0.27–0.30	3.3–500
training_log_ppl	1.28	1.34	6.87
log_ppl_diff	0.0003	0.018–0.035	5.30

Per-layer gradient accuracy (FP8 with unaligned M=5743 vs BF16)

Layer type	Dimensions	FP8/BF16 ratio
kv_down_proj (MLA)	1536→512	1.0000×
kv_up_proj (MLA)	512→1536	1.0000×
q_proj	1536→1536	1.0000×
proj (output)	1536→1536	1.0000×
shared_expert fc1	1536→2560	1.0000×
shared_expert fc2	1280→1536	1.0000×
MoE fc1 (32 experts)	1536→2560	1.0000×
MoE fc2 (32 experts)	1280→1536	1.0000×

Memory & performance

Worst-case temp buffer: 15 rows × 4096 × 2B = 120 KB per GEMM
cudaMallocAsync/cudaFreeAsync reuses stream-ordered pool — no fragmentation
Aligned vs unaligned perf: no regression (1.2ms vs 1.1ms per iter)

Addressing previous review comments

The initial version of this PR had issues flagged by the review bot. All have been addressed:

Issue	Status	Fix
P1: Output buffer corruption (writing to undersized buffer with padded ldd)	Fixed	Separate `_pad_D` temp buffer via `cudaMallocAsync`; `cudaMemcpy2DAsync` copies only `m_real` rows back
P1: Block-scaled FP8 scale coupling (padding corrupts per-block scales)	Fixed	Padding uses zero-filled copies; zeros don't affect dot product. Scale factors computed on original data before GEMM
P2: Unused variables (`m_orig`, `k_orig`, `did_pad`)	Fixed	Removed. Now uses `m_real`/`k_real` throughout
P2: Read-beyond-buffer safety (relying on allocator alignment)	Fixed	No out-of-bounds reads. All padded data is in explicitly allocated temp buffers

Related issues

[BUG] SIGSEGV in fused_attn_fwd when cu_seqlens_q != cu_seqlens_q_padded with FP8 blockwise #2892 — SIGSEGV with unpadded cu_seqlens (FP8 + sequence packing)
Issue Using Qwen2-VL HuggingFace Model with Transformer Engine FP8 Training #1889 — cuBLAS assertion with unaligned FP8 dims

greptile-apps · 2026-04-21T15:47:18Z

Greptile Summary

This PR adds auto-padding of FP8 GEMM m and k dimensions inside cublas_gemm() to handle dynamically unaligned token counts from sequence packing, allocating stream-ordered temporary buffers and copying valid rows back after the GEMM. Two P1 defects remain in the padding path:

All three cudaMallocAsync calls (_pad_D, _pad_A, _pad_B) are unchecked; a silent allocation failure leaves ldd = m_padded while the output pointer falls back to the original unpadded buffer, reproducing the exact corruption the PR fixes.
ld_gelumat is set to ldd (padded), so when a GELU epilogue is active with m != m_real and n > 1, cuBLAS writes the pre-GELU auxiliary output out-of-bounds into outputPreGelu->data.dptr.

Confidence Score: 3/5

Not safe to merge — two P1 memory-corruption bugs remain in the new padding path.

Two P1 defects: unchecked cudaMallocAsync return values that silently reintroduce the original buffer-corruption bug on allocation failure, and an unpadded GELU auxiliary output buffer written with a padded stride. Both affect the core FP8 padding path introduced by this PR.

transformer_engine/common/gemm/cublaslt_gemm.cu — padding allocation/error-handling and GELU auxiliary buffer path

Important Files Changed

Filename	Overview
transformer_engine/common/gemm/cublaslt_gemm.cu	Adds auto-padding for FP8 GEMM m/k dimensions; contains two P1 bugs: unchecked cudaMallocAsync return values (silent fallback to corrupt unpadded path) and GELU auxiliary output buffer written with padded stride into an unpadded allocation.
transformer_engine/pytorch/utils.py	assert_dim_for_fp8_exec replaced with a no-op pass; intentional since alignment is now enforced in C++, but removes all Python-side dimension validation.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[cublas_gemm called] --> B{is_fp8_a or is_fp8_b?}
    B -- No --> C[m = m_real, k = k_real, ldd = m_real]
    B -- Yes --> D[m = round_up_16 m_real, k = round_up_16 k_real, ldd = m]
    D --> E{m != m_real AND outputD.dptr != null?}
    E -- Yes --> F[cudaMallocAsync _pad_D - return value unchecked]
    E -- No --> G[_pad_D = nullptr]
    F --> H[CanonicalizeGemmInput m/k padded]
    G --> H
    H --> I{k != k_real AND param.A AND param.B?}
    I -- Yes --> J[cudaMallocAsync _pad_A/_pad_B unchecked, cudaMemcpy2DAsync real rows]
    I -- No --> K
    J --> K[C = _pad_D ? _pad_D : outputD.dptr]
    K --> L{gelu active?}
    L -- Yes --> M[ld_gelumat = ldd = m padded, pre_gelu_out sized for m_real rows, Buffer overflow when n > 1]
    L -- No --> N[Run cublasLtMatmul]
    M --> N
    N --> O{_pad_D != null?}
    O -- Yes --> P[cudaMemcpy2DAsync m_real rows back, cudaFreeAsync _pad_D]
    O -- No --> Q
    P --> Q[cudaFreeAsync _pad_A, _pad_B]
    Q --> R[Done]

Comments Outside Diff (1)

transformer_engine/common/gemm/cublaslt_gemm.cu, line 504 (link)

GELU auxiliary buffer written with padded stride into unpadded allocation

ld_gelumat is set to ldd (= m, the padded row count). When a GELU epilogue is active (CUBLASLT_EPILOGUE_GELU_AUX or CUBLASLT_EPILOGUE_GELU_AUX_BIAS) and m != m_real, cuBLAS writes column j of the pre-GELU auxiliary output at byte offset j × m × sizeof(T) in pre_gelu_out, but that buffer was allocated for m_real rows per column. For n > 1 every column after the first is written out-of-bounds — silent memory corruption structurally identical to the _pad_D bug that was already fixed.

The check at line 484 only prevents an FP8 dtype on outputPreGelu; it does not prevent tensor-scaled FP8 inputs paired with a non-FP8 GELU aux buffer, so the path is reachable. A compatible fix mirrors the _pad_D pattern: allocate a padded GELU aux buffer, use it during the GEMM, then cudaMemcpy2DAsync the m_real valid rows back into outputPreGelu->data.dptr, and cudaFreeAsync the padded buffer.

_{Reviews (3): Last reviewed commit: "Merge branch 'main' into fix/fp8-gemm-au..." | Re-trigger Greptile}

greptile-apps · 2026-04-21T15:47:22Z

+      m += m_pad;
+      k += k_pad;
+      did_pad = true;
+    }
+  }
+  const int ldd = m;


Output-buffer corruption when m is padded

When m_pad > 0, ldd is set to m_padded and Ddesc tells cuBLAS the output matrix has m_padded rows. But the output buffer outputD->data.dptr was allocated for m_orig × n elements, not m_padded × n. Because cuBLAS writes column j at byte offset j × ldd × sizeof(T) = j × m_padded × sizeof(T), every column after the first is written to a position shifted by j × m_pad elements relative to what the caller expects. The caller's buffer for column j starts at j × m_orig, so column 1 onwards is silently misaligned and contains corrupt data.

This manifests in practice with MoE AllToAll (the third use case listed in the PR): after dispatch, each expert receives a variable token count. When transa == CUBLAS_OP_T (e.g., weight-gradient GEMM), m = A0 = token_count_per_expert, which can be unaligned. For n > 1 output columns, the gradient accumulation tensor will have scrambled contents.

A safe approach for the m-dimension is to keep ldd = m_orig and only report m_padded to cuBLAS as the logical number of rows — or, as standard practice suggests, allocate a temporary padded output buffer and copy only the valid m_orig rows back.

greptile-apps · 2026-04-21T15:47:23Z

+  const bool is_fp8_a = is_fp8_dtype(inputA->data.dtype) ||
+                         (inputA->has_columnwise_data() && is_fp8_dtype(inputA->columnwise_data.dtype));
+  const bool is_fp8_b = is_fp8_dtype(inputB->data.dtype) ||
+                         (inputB->has_columnwise_data() && is_fp8_dtype(inputB->columnwise_data.dtype));
+  const int m_orig = m;
+  const int k_orig = k;
+  bool did_pad = false;
+  if (is_fp8_a || is_fp8_b) {


Auto-padding fires for MXFP8 and block-scaled FP8, breaking scale-factor coupling

is_fp8_dtype() returns true for any FP8 data type regardless of the scaling mode — it will trigger for NVTE_MXFP8_SCALING, NVTE_BLOCK_SCALING_1D, and NVTE_BLOCK_SCALING_2D tensors as well as plain tensor-scaled FP8. For MXFP8 and block-scaled modes, scale factors are tied to a fixed-size block (32 or 128 elements) along the contracted dimension k. Padding k by up to 15 elements causes the last block's scale factor to be applied to phantom (out-of-bounds) data, producing incorrect accumulation values for that block.

Consider scoping the auto-padding to tensor-scaled FP8 only:

const bool is_tensor_fp8_a = is_tensor_scaling(inputA->scaling_mode) && (is_fp8_dtype(inputA->data.dtype) || (inputA->has_columnwise_data() && is_fp8_dtype(inputA->columnwise_data.dtype))); const bool is_tensor_fp8_b = is_tensor_scaling(inputB->scaling_mode) && (is_fp8_dtype(inputB->data.dtype) || (inputB->has_columnwise_data() && is_fp8_dtype(inputB->columnwise_data.dtype))); if (is_tensor_fp8_a || is_tensor_fp8_b) { /* pad */ }

greptile-apps · 2026-04-21T15:47:24Z

+  const int m_orig = m;
+  const int k_orig = k;
+  bool did_pad = false;


Unused variables — dead code

m_orig, k_orig, and did_pad are assigned but never referenced again anywhere in cublas_gemm. They appear to be leftovers from a planned output-truncation step that was not implemented. They can be removed.

Suggested change

const int m_orig = m;

const int k_orig = k;

bool did_pad = false;

greptile-apps · 2026-04-21T15:47:25Z

+      // Pad m and k to multiples of 16.
+      // For the GEMM, we pass padded m/k. Input data pointers still point to
+      // the original (unpadded) buffers. cuBLAS will read beyond the valid data
+      // for the padded rows — this is OK as long as:
+      // 1. The padded area is within allocated memory (tensor allocations are
+      //    typically page-aligned, so a few extra rows are safe)
+      // 2. The padded rows' values don't matter (they only affect padded output rows
+      //    which we discard)


Read-beyond-buffer safety relies on allocator implementation details

The comment claims reading beyond the valid input buffer is safe "due to page-aligned GPU allocations." This is generally true for CUDA's caching allocator (allocations are rounded up to 512-byte or larger boundaries), but it is not guaranteed by the CUDA API. A future allocator, a custom allocator (cudaMallocAsync pools, RAPIDS RMM, etc.), or a tensor obtained from a memory-mapped source might not have this guarantee. A brief note acknowledging the reliance on the caching allocator, or a check like NVTE_CHECK(k_pad < 16), would make the intent explicit and defensible.

ptrendx · 2026-04-21T20:16:35Z

Hi @NoonePauseferg - the PR as-is contains many unrelated commits (I think you made it on top of the 2.12 branch), which makes it very difficult to see the actual changes. Could you rebase the PR on top of the current main branch? Thank you!

NoonePauseferg · 2026-04-22T10:10:09Z

Hi @NoonePauseferg - the PR as-is contains many unrelated commits (I think you made it on top of the 2.12 branch), which makes it very difficult to see the actual changes. Could you rebase the PR on top of the current main branch? Thank you!

yeah, working on it - soon gonna fix pr

cuBLAS FP8 GEMM requires lda/ldb % 16 == 0 and m % 8 == 0. RL training frameworks (VERL, OpenRLHF) use sequence packing where total token counts are dynamic and rarely aligned. Manual pre-padding corrupts training by distorting FP8 scale factors (proven: BF16 with padding tokens = grad_norm 1064x explosion). Changes in cublas_gemm(): - Detect FP8 inputs, round up m and k to multiples of 16 - Allocate padded temp buffers via cudaMallocAsync (stream-ordered) - For k-padding: zero-pad A/B columns beyond k_real with cudaMemcpy2D - For m-padding: GEMM into padded output, copy m_real rows back - cudaFreeAsync for cleanup (no CPU-GPU sync, no pipeline bubbles) Changes in utils.py: - Relax assert_dim_for_fp8_exec — C++ now handles alignment internally Tested on H100 (SM90), TE 2.12, PyTorch 2.9.1, CUDA 12.9: - DeepSeek 10B MoE, 4 nodes x 8 GPUs, TP=2 PP=2 EP=2 - FP8/BF16 grad ratio: 0.99-1.00 across all layer types - grad_norm: 0.27-0.30 (BF16 baseline: 0.29) - Memory overhead: <120KB per GEMM (worst case +15 pad rows) - No performance regression (cudaMallocAsync reuses pool) Related: NVIDIA#2892 NVIDIA#1889

for more information, see https://pre-commit.ci

greptile-apps · 2026-04-29T11:27:10Z

+    cudaMallocAsync(&_pad_D, (size_t)m * n * d_elem, stream);
+    cudaMemsetAsync(_pad_D, 0, (size_t)m * n * d_elem, stream);


cudaMallocAsync failures silently fall back to the unpadded buffer

All three cudaMallocAsync calls (for _pad_D, _pad_A, _pad_B) are unchecked. If a call fails—stream-ordered pool exhausted, allocation limit hit, etc.—the pointer stays nullptr and the code silently falls back to the original (unpadded) buffer while ldd = m (padded) has already been set. For _pad_D, this means cuBLAS writes column j at offset j × m_padded × sizeof(T) into a buffer sized for m_real rows, which is exactly the out-of-bounds corruption the PR was written to fix. Wrap all three allocations with NVTE_CHECK_CUDA:

NVTE_CHECK_CUDA(cudaMallocAsync(&_pad_D, (size_t)m * n * d_elem, stream));

Same pattern for _pad_A and _pad_B.

greptile-apps Bot reviewed Apr 21, 2026

View reviewed changes

ptrendx added community-contribution PRs from external contributor outside the core maintainers, representing community-driven work. labels Apr 21, 2026

NoonePauseferg force-pushed the fix/fp8-gemm-auto-alignment-padding branch from 756b153 to b65b244 Compare April 22, 2026 10:19

[pre-commit.ci] auto fixes from pre-commit.com hooks

b6f0ea4

for more information, see https://pre-commit.ci

ptrendx self-assigned this Apr 23, 2026

Merge branch 'main' into fix/fp8-gemm-auto-alignment-padding

7edbdef

greptile-apps Bot reviewed Apr 29, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: auto-pad FP8 GEMM dimensions for unaligned sequence packing#2911

feat: auto-pad FP8 GEMM dimensions for unaligned sequence packing#2911
NoonePauseferg wants to merge 3 commits intoNVIDIA:mainfrom
NoonePauseferg:fix/fp8-gemm-auto-alignment-padding

NoonePauseferg commented Apr 21, 2026 •

edited

Loading

Uh oh!

greptile-apps Bot commented Apr 21, 2026 •

edited

Loading

Comments Outside Diff (1)

Uh oh!

greptile-apps Bot Apr 21, 2026

Uh oh!

greptile-apps Bot Apr 21, 2026

Uh oh!

greptile-apps Bot Apr 21, 2026

Uh oh!

greptile-apps Bot Apr 21, 2026

Uh oh!

ptrendx commented Apr 21, 2026

Uh oh!

NoonePauseferg commented Apr 22, 2026

Uh oh!

greptile-apps Bot Apr 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

	const int m_orig = m;
	const int k_orig = k;
	bool did_pad = false;

		cudaMallocAsync(&_pad_D, (size_t)m * n * d_elem, stream);
		cudaMemsetAsync(_pad_D, 0, (size_t)m * n * d_elem, stream);

Conversation

NoonePauseferg commented Apr 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Solution

How it works

Changes

Results

Training quality (FP8 E2E vs BF16 baseline)

Per-layer gradient accuracy (FP8 with unaligned M=5743 vs BF16)

Memory & performance

Addressing previous review comments

Related issues

Uh oh!

greptile-apps Bot commented Apr 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 3/5

Important Files Changed

Flowchart

Comments Outside Diff (1)

Uh oh!

greptile-apps Bot Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

ptrendx commented Apr 21, 2026

Uh oh!

NoonePauseferg commented Apr 22, 2026

Uh oh!

greptile-apps Bot Apr 29, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

NoonePauseferg commented Apr 21, 2026 •

edited

Loading

greptile-apps Bot commented Apr 21, 2026 •

edited

Loading