
Add NVFP4 1x64 Local Encode Recipe #2941

Draft
cael-ling wants to merge 10 commits into NVIDIA:main from cael-ling:feature/nvfp4-1x64-local-encode

Conversation

Contributor

@cael-ling commented Apr 29, 2026

Description

Adds a hierarchical NVFP4 cast: the FP32 encoding scale S_enc is computed per 1x64 K-window instead of from a single per-tensor amax, and the four 1x16 sub-blocks inside a window share their parent's S_enc. FP4 grid, E4M3 decode scale, and the saturating cast itself are unchanged.
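
To make the windowing concrete, here is a minimal PyTorch sketch of the scale computation only, under stated assumptions: the FP4_MAX / E4M3_MAX constants and the exact S_enc formula mirror the usual NVFP4 recipe rather than being copied from this kernel, and the E4M3 rounding of the per-block scale and the packed FP4 output are omitted.

```python
# Illustration only, not the kernel's bit-exact math.
import torch

FP4_MAX = 6.0     # assumed max magnitude of the E2M1 (FP4) grid
E4M3_MAX = 448.0  # assumed max magnitude of the E4M3 scale format

def hierarchical_encode_scales(x: torch.Tensor):
    """x: (M, N) FP32/BF16 input with N a multiple of 64."""
    M, N = x.shape
    # One FP32 encoding scale per 1x64 K-window, instead of one per tensor.
    win = x.reshape(M, N // 64, 64)
    win_amax = win.abs().amax(dim=-1)                         # (M, N/64)
    s_enc = (win_amax / (FP4_MAX * E4M3_MAX)).clamp_min(1e-12)

    # The four 1x16 sub-blocks inside a window share the parent window's S_enc.
    blk = x.reshape(M, N // 64, 4, 16)
    blk_amax = blk.abs().amax(dim=-1)                         # (M, N/64, 4)
    s_dec = blk_amax / FP4_MAX / s_enc.unsqueeze(-1)          # per-1x16 scale;
                                                              # rounded to E4M3 in the kernel
    return s_enc, s_dec
```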

The motivation is outliers. When activations have a few rows or K-positions that blow up the per-tensor amax, the rest of the tensor gets squashed into a tiny corner of the FP4 grid. Letting each 64-element K-window pick its own S_enc keeps the calm regions' resolution.

The byte layout is identical to the production NVFP4 path (FP4 packed bytes + 1x16 E4M3 scales + transposed columnwise twin), but the per-block scales are now relative to a per-window S_enc rather than a per-tensor one. That makes them semantically incompatible with the existing cuBLAS NVFP4 GEMM — a matching GEMM-side change is needed, and that's a follow-up PR.

The kernel emits both rowwise and (transposed) columnwise outputs in one fused pass per 64x64 tile, with the same shapes as quantize_transpose_nvfp4, so the columnwise twin is wgrad-ready for a future 1x64-aware backward. Either direction can be skipped if you only ask for the other. Since downstream needs the per-window S_enc to dequantize, the kernel also writes per-window FP32 amax tensors into the existing amax / columnwise_amax slots; in 1x64 mode their shapes grow from (1,) to (M, N/64) and (N, M/64). (Taking the max over those recovers the old per-tensor amax if you still need it.)
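
For illustration, a reader-side sketch of recovering the old scalar amax from the per-window tensors; the plain-tensor arguments below are hypothetical stand-ins for whatever the amax / columnwise_amax slots surface as in Python.

```python
import torch

def per_tensor_amax(rowwise_amax: torch.Tensor, columnwise_amax: torch.Tensor) -> torch.Tensor:
    # In 1x64 mode the slots hold (M, N/64) and (N, M/64) per-window amaxes;
    # a max-reduction over either (or both) recovers the old per-tensor amax.
    return torch.maximum(rowwise_amax.max(), columnwise_amax.max())
```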

The path is gated by NVTE_NVFP4_ROWWISE_1X64_LOCAL_ENCODE=1 (the env-var name is kept for backward compatibility with the original rowwise-only kernel). It requires non-RHT, non-2D, non-SR quantization with M and N both multiples of 64; all four constraints are NVTE_CHECK'd at dispatch, so a bad config fails loudly instead of silently falling back to the per-tensor kernel.
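
A sketch of how a caller might opt in and what the dispatch checks enforce; the env-var name is the one this PR uses, but the helper function below is hypothetical and only mirrors the four NVTE_CHECKs described above.

```python
import os

# Opt in before the quantize call is dispatched.
os.environ["NVTE_NVFP4_ROWWISE_1X64_LOCAL_ENCODE"] = "1"

def check_1x64_config(M: int, N: int, use_rht: bool, use_2d: bool, use_sr: bool) -> None:
    # Mirrors the dispatch-time checks: non-RHT, non-2D, non-SR,
    # and M, N both multiples of 64. A bad config raises instead of
    # silently falling back to the per-tensor kernel.
    assert not (use_rht or use_2d or use_sr), "1x64 local encode: RHT/2D/SR are unsupported"
    assert M % 64 == 0 and N % 64 == 0, "1x64 local encode: M and N must be multiples of 64"
```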

Filing as a draft; looking for early feedback on the kernel and on whether reference_hierarchical_nvfp4/ belongs in-tree.

Caveats

  • The per-1x16 E4M3 scales look byte-identical to the per-tensor NVFP4 path, but they're now encoded relative to a per-window S_enc. The cuBLAS NVFP4 GEMM cannot consume this output as-is; matching GEMM-side changes are a follow-up PR.
  • In 1x64 mode Tensor::amax / Tensor::columnwise_amax are no longer scalar (1,) buffers. Anything that reads those slots assuming a scalar must guard on the env-var first.

Fixes # (issue)

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Kernel:

PyTorch:

Bit-exact PyTorch oracle:

Tests:

Reference:

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

cael-ling and others added 10 commits April 25, 2026 03:47