
Correctly pad scaling factor inverses to satisfy cuteDSL requirements#2924

Open
ksivaman wants to merge 10 commits into NVIDIA:main from ksivaman:pad_weight_scale_inv

Conversation

@ksivaman
Member

Description

Fix grouped MXFP8 swizzle when per-expert rows aren't a multiple of 128 and pad each expert's scales to (128, 4).

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

  • Ensure each tensor's scaling factor inverses are padded to multiples of (128, 4).

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
@ksivaman
Member Author

/te-ci

@greptile-apps
Contributor

greptile-apps Bot commented Apr 24, 2026

Greptile Summary

This PR fixes grouped MXFP8 swizzle when each expert's row count is not a multiple of 128 by teaching the swizzle kernels to accept a "compact" input buffer (per-tensor stride = m × padded_k rather than padded_m × padded_k) and allocating the output in the correct per-tensor padded layout. The kernel refactor adds compile-time IS_PADDED_K/IS_PADDED_M template specializations to avoid out-of-bounds loads past the unpadded extent of each expert's buffer, dispatching at the block level where the decision is uniform across all threads.

Confidence Score: 5/5

Safe to merge; all findings are P2 documentation nits with no impact on correctness

The compact-layout detection, per-tensor stride separation, and IS_PADDED_K/IS_PADDED_M dispatch are logically sound. The out-of-bounds guard correctly prevents reading past the unpadded per-tensor extent in every grouped kernel variant. No P1/P0 issues found.

transformer_engine/common/swizzle/swizzle.cu — most complex change; worth a final read of the compact colwise stride semantics

Important Files Changed

Filename Overview
transformer_engine/common/swizzle/swizzle.cu Core fix: introduces IS_PADDED_K/IS_PADDED_M compile-time dispatch to skip out-of-bounds loads from compact input buffers; adds compact-layout detection and separate input/output strides for grouped kernels; one minor comment typo (DIVUP(original_K, 1))
transformer_engine/pytorch/csrc/extensions/swizzle.cpp Output scale buffers are now allocated with the per-tensor padded shape (num_tensors * padded_m, padded_k) instead of the raw input shape, ensuring the swizzle kernel receives correctly sized output regardless of whether the input is compact or padded
transformer_engine/common/common.h Adds TRANSFORMER_ENGINE_VECTORIZED_LOAD_INTEGER_TYPE_SWITCH macro to replace the repeated switch-case boilerplate for vec_load_size in the grouped swizzle dispatch
tests/cpp/operator/test_swizzle.cu Adds SwizzleGroupedCompactInputTestSuite covering aligned/unaligned M and K shapes, including the originally-failing 2880×2880 case; also refactors existing ceiling-division calls to divide_round_up

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[swizzle_grouped_scaling_factors] --> B{Detect input layout}
    B -->|numel == num_tensors x padded_scale_elems| C[Padded layout\ninput_stride = padded_m x padded_k]
    B -->|numel == compact_total_scale_elems| D[Compact layout\ninput_stride = m x padded_k]
    B -->|neither| E[NVTE_ERROR]
    C --> F[output_stride = num_tensors x padded_scale_elems]
    D --> F
    F --> G{rowwise?}
    G -->|yes| H[grouped_swizzle_row_scaling_uniform_shape_kernel]
    G -->|no| I[grouped_swizzle_col_scaling_uniform_shape_kernel]
    H --> J{boundary block?}
    I --> J
    J -->|IS_PADDED_M| K[Skip __ldg for row >= original_M, write 0]
    J -->|IS_PADDED_K| L[Skip __ldg for k_coord >= original_K, write 0]
    J -->|neither| M[Normal __ldg load + shuffle + store]
    K --> N[Output: num_tensors x padded_m x padded_k, padded regions = 0]
    L --> N
    M --> N

Reviews (4): Last reviewed commit: "Merge branch 'main' into pad_weight_scal..."

const auto logical_shape_nvte = input.logical_shape();
NVTE_CHECK(logical_shape_nvte.ndim >= 2,
           "Grouped GEMM swizzle expects logical_shape with ndim >= 2.");
const size_t per_tensor_first_dim = logical_shape_nvte.data[0] / num_tensors;
Contributor


P2 Silent truncation when logical_shape_nvte.data[0] is not divisible by num_tensors

per_tensor_first_dim is computed with plain integer division. If logical_shape_nvte.data[0] is not an exact multiple of num_tensors (e.g. due to a caller bug or unexpected grouped layout), the result is silently truncated, causing padded_m to be underestimated and the output buffer to be too small. A divisibility assertion would catch this much earlier with a clear error message.

Suggested change
  const size_t per_tensor_first_dim = logical_shape_nvte.data[0] / num_tensors;
+ NVTE_CHECK(logical_shape_nvte.data[0] % num_tensors == 0,
+            "Grouped GEMM swizzle expects logical_shape first dim to be divisible by num_tensors.");

Comment on lines +2077 to 2087
bool input_is_compact;
if (input_scale_numel == input->num_tensors * padded_scale_elems) {
  input_is_compact = false;
} else if (input_scale_numel == compact_total_scale_elems) {
  input_is_compact = true;
} else {
  NVTE_CHECK(input->columnwise_scale_inv.numel() == input->num_tensors * scale_elems,
             "Grouped input columnwise_scale_inv size does not match expected packed size.");
  NVTE_CHECK(output->columnwise_scale_inv.numel() == output->num_tensors * scale_elems,
             "Grouped output columnwise_scale_inv size does not match expected packed size.");
  NVTE_ERROR("Grouped input ", (rowwise ? "scale_inv" : "columnwise_scale_inv"),
             " size does not match expected packed size (got ", input_scale_numel,
             ", expected either ", input->num_tensors * padded_scale_elems,
             " (per-tensor padded) or ", compact_total_scale_elems, " (compact)).");
}
Contributor


P2 Implicit contract on compact-buffer alignment is not validated

The compact_total_scale_elems formula assumes the upstream quantize kernel allocates the compact scale buffer with its total first dim rounded up to 128 (rowwise) or 4 (colwise). If a caller passes a "plain compact" buffer of size exactly num_tensors * m * padded_k (without trailing alignment slack), neither branch matches and NVTE_ERROR fires with a size-mismatch message that may be hard to diagnose.

Consider also accepting num_tensors * compact_scale_elems as a valid compact size, or documenting this alignment requirement in the error message.

@ptrendx
Member

ptrendx commented Apr 24, 2026

@ksivaman Could you add a test exercising the change?

@ksivaman
Member Author

/te-ci

Collaborator

@Oleg-Goncharov Oleg-Goncharov left a comment


LGTM overall

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
@ksivaman
Member Author

/te-ci

Collaborator

@Oleg-Goncharov Oleg-Goncharov left a comment


LGTM
