Add NVFP4 1x64 Local Encode Recipe #2941
Draft
cael-ling wants to merge 10 commits into NVIDIA:main from
Conversation
Description
Adds a hierarchical NVFP4 cast: the FP32 encoding scale S_enc is computed per 1x64 K-window instead of from a single per-tensor amax, and the four 1x16 sub-blocks inside a window share their parent's S_enc. FP4 grid, E4M3 decode scale, and the saturating cast itself are unchanged.
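To make the scale hierarchy concrete, here's a hedged PyTorch sketch of the scale computation, assuming the usual NVFP4 constants (E2M1 max 6.0, E4M3 max 448.0); function and variable names are mine, not the kernel's:

```python
import torch

E2M1_MAX = 6.0    # largest E2M1 (FP4) magnitude
E4M3_MAX = 448.0  # largest E4M3 magnitude

def local_encode_scales(x: torch.Tensor):
    """x: (M, N) FP32 with N a multiple of 64. Returns the per-1x64-window
    FP32 encoding scale S_enc and the per-1x16 decode scales (stored as E4M3
    in the real path), computed against the parent window's S_enc."""
    M, N = x.shape
    windows = x.reshape(M, N // 64, 64)
    # Per-window encoding scale, where production uses one per-tensor amax.
    win_amax = windows.abs().amax(dim=-1)                        # (M, N/64)
    s_enc = (win_amax / (E2M1_MAX * E4M3_MAX)).clamp_min(
        torch.finfo(torch.float32).tiny)
    # The four 1x16 sub-blocks share the parent window's S_enc.
    blk_amax = x.reshape(M, N // 64, 4, 16).abs().amax(dim=-1)   # (M, N/64, 4)
    s_dec = blk_amax / (E2M1_MAX * s_enc.unsqueeze(-1))
    return s_enc, s_dec
```

The saturating cast itself then proceeds as in production: divide each element by its block's s_dec times the window's s_enc and round onto the E2M1 grid.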
The motivation is outliers. When activations have a few rows or K-positions that blow up the per-tensor amax, the rest of the tensor gets squashed into a tiny corner of the FP4 grid. Letting each 64-element K-window pick its own S_enc preserves resolution in the calm regions.
The byte layout is identical to the production NVFP4 path (FP4 packed bytes + 1x16 E4M3 scales + transposed columnwise twin), but the per-block scales are now relative to a per-window S_enc rather than a per-tensor one. That makes them semantically incompatible with the existing cuBLAS NVFP4 GEMM — a matching GEMM-side change is needed, and that's a follow-up PR.
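For concreteness, a minimal dequantization sketch under the new semantics (names are mine); the only change from production is which S_enc multiplies each block, which is exactly what the existing epilogue doesn't know about:

```python
import torch

def dequantize_window(fp4_vals: torch.Tensor,    # (64,) decoded E2M1 values
                      e4m3_scales: torch.Tensor, # (4,) per-1x16 block scales
                      s_enc: float) -> torch.Tensor:
    # Each 1x16 sub-block is scaled by its E4M3 factor times the *parent
    # window's* S_enc; production would use one per-tensor S_enc here.
    return (fp4_vals.reshape(4, 16)
            * e4m3_scales.reshape(4, 1)
            * s_enc).reshape(64)
```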
The kernel emits both rowwise and (transposed) columnwise outputs in one fused pass per 64x64 tile, with the same shapes as quantize_transpose_nvfp4, so the columnwise twin is wgrad-ready for a future 1x64-aware backward. Either direction can be skipped if only the other is requested. Since downstream needs the per-window S_enc to dequantize, the kernel also writes per-window FP32 amax tensors into the existing amax / columnwise_amax slots; in 1x64 mode their shapes grow from (1,) to (M, N/64) and (N, M/64). (Taking the max over either recovers the old per-tensor amax if you still need it; see the sketch below.)
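A quick sketch of the new amax shapes and the per-tensor recovery (the tensors below are computed by hand rather than read back from the kernel):

```python
import torch

M, N = 128, 256  # both multiples of 64
x = torch.randn(M, N)

# What the kernel writes into the amax / columnwise_amax slots in 1x64 mode:
amax_rowwise = x.abs().reshape(M, N // 64, 64).amax(dim=-1)                   # (M, N/64)
amax_colwise = x.t().contiguous().abs().reshape(N, M // 64, 64).amax(dim=-1)  # (N, M/64)

# Reducing either tensor recovers the old (1,)-shaped per-tensor amax.
assert amax_rowwise.max() == x.abs().max()
assert amax_colwise.max() == x.abs().max()
```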
Gated by NVTE_NVFP4_ROWWISE_1X64_LOCAL_ENCODE=1 (env-var name kept for backward-compat with the original rowwise-only kernel). Requires no random Hadamard transform (RHT), no 2D quantization, no stochastic rounding (SR), and M and N both multiples of 64; all four conditions are NVTE_CHECK'd at dispatch so a bad config can't silently fall back to the per-tensor kernel.
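A hedged usage sketch of the gating and preconditions (no quantizer call shown, since the exact PyTorch entry point belongs to the integration rather than this description):

```python
import os
import torch

# Opt in; the env-var name is kept from the original rowwise-only kernel.
os.environ["NVTE_NVFP4_ROWWISE_1X64_LOCAL_ENCODE"] = "1"

x = torch.randn(4096, 4096, device="cuda")
# M and N must be multiples of 64; RHT, 2D quantization, and stochastic
# rounding must be off. Dispatch NVTE_CHECKs all four conditions instead of
# silently falling back to the per-tensor kernel.
assert x.shape[0] % 64 == 0 and x.shape[1] % 64 == 0
```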
Filing as draft. Looking for early feedback on the kernel and on whether reference_hierarchical_nvfp4/ belongs in-tree.
Caveats
Fixes # (issue)
Type of change
Changes
Kernel:
PyTorch:
Bit-exact PyTorch oracle:
Tests:
Reference:
Checklist: