Add NVFP4 1x64 Local Encode Recipe #2941
Draft
cael-ling wants to merge 10 commits into NVIDIA:main from
Conversation
Description
Adds a hierarchical NVFP4 cast: the FP32 encoding scale S_enc is computed per 1x64 K-window instead of from a single per-tensor amax, and the four 1x16 sub-blocks inside a window share their parent's S_enc. FP4 grid, E4M3 decode scale, and the saturating cast itself are unchanged.
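To make the scale hierarchy concrete, here's a hedged PyTorch sketch of the scale computation, assuming the usual NVFP4 constants (E2M1 max 6.0, E4M3 max 448.0); function and variable names are mine, not the kernel's:

```python
import torch

E2M1_MAX = 6.0    # largest E2M1 (FP4) magnitude
E4M3_MAX = 448.0  # largest E4M3 magnitude

def local_encode_scales(x: torch.Tensor):
    """x: (M, N) FP32 with N a multiple of 64. Returns the per-1x64-window
    FP32 encoding scale S_enc and the per-1x16 decode scales (stored as E4M3
    in the real path), computed against the parent window's S_enc."""
    M, N = x.shape
    windows = x.reshape(M, N // 64, 64)
    # Per-window encoding scale, where production uses one per-tensor amax.
    win_amax = windows.abs().amax(dim=-1)                        # (M, N/64)
    s_enc = (win_amax / (E2M1_MAX * E4M3_MAX)).clamp_min(
        torch.finfo(torch.float32).tiny)
    # The four 1x16 sub-blocks share the parent window's S_enc.
    blk_amax = x.reshape(M, N // 64, 4, 16).abs().amax(dim=-1)   # (M, N/64, 4)
    s_dec = blk_amax / (E2M1_MAX * s_enc.unsqueeze(-1))
    return s_enc, s_dec
```

The saturating cast itself then proceeds as in production: divide each element by its block's s_dec times the window's s_enc and round onto the E2M1 grid.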
The motivation is outliers. When activations have a few rows or K-positions that blow up the per-tensor amax, the rest of the tensor gets squashed into a tiny corner of the FP4 grid. Letting each 64-element K-window pick its own S_enc preserves resolution in the calm regions.
The byte layout is identical to the production NVFP4 path (FP4 packed bytes + 1x16 E4M3 scales + transposed columnwise twin), but the per-block scales are now relative to a per-window S_enc rather than a per-tensor one. That makes them semantically incompatible with the existing cuBLAS NVFP4 GEMM — a matching GEMM-side change is needed, and that's a follow-up PR.
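For concreteness, a minimal dequantization sketch under the new semantics (names are mine); the only change from production is which S_enc multiplies each block, which is exactly what the existing epilogue doesn't know about:

```python
import torch

def dequantize_window(fp4_vals: torch.Tensor,    # (64,) decoded E2M1 values
                      e4m3_scales: torch.Tensor, # (4,) per-1x16 block scales
                      s_enc: float) -> torch.Tensor:
    # Each 1x16 sub-block is scaled by its E4M3 factor times the *parent
    # window's* S_enc; production would use one per-tensor S_enc here.
    return (fp4_vals.reshape(4, 16)
            * e4m3_scales.reshape(4, 1)
            * s_enc).reshape(64)
```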
The kernel emits both rowwise and (transposed) columnwise outputs in one fused pass per 64x64 tile, with the same shapes as quantize_transpose_nvfp4, so the columnwise twin is wgrad-ready for a future 1x64-aware backward. Either direction can be skipped if only the other is requested. Since downstream needs the per-window S_enc to dequantize, the kernel also writes per-window FP32 amax tensors into the existing amax / columnwise_amax slots; in 1x64 mode their shapes grow from (1,) to (M, N/64) and (N, M/64). (Taking the max over either recovers the old per-tensor amax if you still need it; see the sketch below.)
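A quick sketch of the new amax shapes and the per-tensor recovery (the tensors below are computed by hand rather than read back from the kernel):

```python
import torch

M, N = 128, 256  # both multiples of 64
x = torch.randn(M, N)

# What the kernel writes into the amax / columnwise_amax slots in 1x64 mode:
amax_rowwise = x.abs().reshape(M, N // 64, 64).amax(dim=-1)                   # (M, N/64)
amax_colwise = x.t().contiguous().abs().reshape(N, M // 64, 64).amax(dim=-1)  # (N, M/64)

# Reducing either tensor recovers the old (1,)-shaped per-tensor amax.
assert amax_rowwise.max() == x.abs().max()
assert amax_colwise.max() == x.abs().max()
```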
Gated by NVTE_NVFP4_ROWWISE_1X64_LOCAL_ENCODE=1 (env-var name kept for backward-compat with the original rowwise-only kernel). Requires no random Hadamard transform (RHT), no 2D quantization, no stochastic rounding (SR), and M and N both multiples of 64; all four conditions are NVTE_CHECK'd at dispatch so a bad config can't silently fall back to the per-tensor kernel.
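A hedged usage sketch of the gating and preconditions (no quantizer call shown, since the exact PyTorch entry point belongs to the integration rather than this description):

```python
import os
import torch

# Opt in; the env-var name is kept from the original rowwise-only kernel.
os.environ["NVTE_NVFP4_ROWWISE_1X64_LOCAL_ENCODE"] = "1"

x = torch.randn(4096, 4096, device="cuda")
# M and N must be multiples of 64; RHT, 2D quantization, and stochastic
# rounding must be off. Dispatch NVTE_CHECKs all four conditions instead of
# silently falling back to the per-tensor kernel.
assert x.shape[0] % 64 == 0 and x.shape[1] % 64 == 0
```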
Filing as draft. Looking for early feedback on the kernel and on whether reference_hierarchical_nvfp4/ belongs in-tree.
Caveats
Fixes # (issue)
Type of change
Changes
Kernel:
PyTorch:
Bit-exact PyTorch oracle:
Tests:
Reference:
Checklist: