[Common/PyTorch/JAX] make offset of ClampedSwiGLU configurable #2938
hxbai wants to merge 2 commits into NVIDIA:main
Conversation
Greptile Summary
This PR adds a configurable glu_linear_offset to ClampedSwiGLU. Confidence Score: 4/5. Safe to merge once the breaking public C API change (insertion of glu_linear_offset before cudaStream_t) is acknowledged in the PR checklist and a compatibility plan is confirmed. The implementation is mathematically correct across all kernel paths and backward passes; default values preserve backward compatibility at the Python/PyTorch level. The unresolved concern is the ABI-breaking change to the public versioned C header (flagged in a prior review thread), which is still unchecked in the PR checklist and undocumented as a breaking change. From a compatibility standpoint, transformer_engine/common/include/transformer_engine/activation.h (the changed public C API signatures) is the only file that needs careful attention.
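For concreteness, here is a minimal PyTorch sketch of the forward pass with the now-configurable offset. The clamping and gating structure follows the GPT-OSS-style ClampedSwiGLU this PR generalizes, but the tensor layout, function name, and default values are assumptions for illustration, not TE's actual kernel code:

```python
import torch

def clamped_swiglu_ref(x: torch.Tensor, limit: float = 7.0, alpha: float = 1.702,
                       glu_linear_offset: float = 1.0) -> torch.Tensor:
    """Reference sketch; glu_linear_offset=1.0 reproduces the old hard-coded behavior."""
    x_glu, x_linear = x.chunk(2, dim=-1)              # assumed layout: [act | linear] halves
    x_glu = x_glu.clamp(max=limit)                    # one-sided clamp on the gated branch
    x_linear = x_linear.clamp(min=-limit, max=limit)  # two-sided clamp on the linear branch
    gate = x_glu * torch.sigmoid(alpha * x_glu)       # SiLU-style activation with alpha scaling
    return gate * (x_linear + glu_linear_offset)      # offset applied after clamping
```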
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["ClampedSwiGLU / ScaledClampedQGeGLU\n(Python API — glu_linear_offset param added)"] --> B{"glu_linear_offset == 1.0\nAND alpha == 1.702?"}
    B -- Yes --> C["cuDNN Fusion Path\n(fuse_grouped_mlp_ops)"]
    B -- No --> D["TE Kernel Path"]
    D --> E["tex.clamped_swiglu / tex.clamped_dswiglu\n(PyTorch pybind11 — glu_linear_offset added)"]
    D --> F["nvte_clamped_swiglu / nvte_clamped_dswiglu\n(JAX FFI — glu_linear_offset added)"]
    E --> G["ClampedSwiGLUParam\n{limit, alpha, glu_linear_offset}"]
    F --> G
    G --> H["vectorized_pointwise.h\nFWD: clamp(x_linear) + glu_linear_offset\nBWD: offset in ∂act, not in ∂gate"]
    G --> I["gated_fp8.cuh\ngated_mxfp8.cuh\nFWD+BWD: same offset logic"]
    style C fill:#90EE90
    style D fill:#ADD8E6
```
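The BWD note in the flowchart ("offset in ∂act, not in ∂gate") falls out of the product rule: the offset multiplies the derivative of the activated branch but drops out of the gradient flowing to the linear branch. A hedged sketch, assuming the reference forward above (the masks model the clamp's zero-gradient region):

```python
def clamped_swiglu_bwd_ref(grad_out, x, limit=7.0, alpha=1.702, glu_linear_offset=1.0):
    # Illustrative backward for the sketch above, not TE's kernel code.
    x_glu, x_linear = x.chunk(2, dim=-1)
    glu_mask = (x_glu < limit).to(x.dtype)             # clamp(max=limit) gradient mask
    lin_mask = (x_linear.abs() < limit).to(x.dtype)    # clamp(+/-limit) gradient mask
    x_glu_c = x_glu.clamp(max=limit)
    x_lin_c = x_linear.clamp(min=-limit, max=limit)
    sig = torch.sigmoid(alpha * x_glu_c)
    act = x_glu_c * sig                                   # silu-with-alpha, as in the forward
    dact = sig * (1.0 + alpha * x_glu_c * (1.0 - sig))    # d(act)/d(x_glu_c)
    grad_glu = grad_out * dact * (x_lin_c + glu_linear_offset) * glu_mask  # offset in ∂act
    grad_linear = grad_out * act * lin_mask                                # no offset in ∂gate
    return torch.cat([grad_glu, grad_linear], dim=-1)
```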
Reviews (3). Last reviewed commit: "fix fusion pattern check".
* \param[in] glu_linear_offset Offset added to the linear component after clamping (default 1.0).
* \param[in] stream             CUDA stream used for the operation.
*/
nvte_clamped_swiglu and nvte_clamped_dswiglu are public symbols declared in a versioned public header. Inserting glu_linear_offset before cudaStream_t is an ABI-breaking change: any external binary or shared library compiled against the old header will silently pass the stream pointer as the offset and a garbage value as the stream, leading to undefined behavior at runtime rather than a clean compile error if called via a pre-compiled library. This should be acknowledged as a breaking change in the PR checklist, and — if this library follows semantic versioning or a compatibility guarantee — a deprecation/transition path or version bump is needed.
timmoon10 left a comment:
The fused op for grouped MLP is hard-coded for GPT-OSS, so we should make sure not to fuse if glu_linear_offset != 1:
TransformerEngine/transformer_engine/pytorch/ops/_common.py, lines 180 to 183 in df0025b
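A minimal sketch of the guard timmoon10 is asking for, with an illustrative helper name (the real check lives in the _common.py lines referenced above): the cuDNN grouped-MLP fusion bakes in the GPT-OSS constants, so fusion must be skipped whenever the configurable values deviate from them.

```python
def _is_gpt_oss_clamped_swiglu(op) -> bool:
    # Hypothetical helper: only the hard-coded GPT-OSS parameterization may take
    # the fused path; anything else must fall back to the TE kernel path.
    return op.alpha == 1.702 and op.glu_linear_offset == 1.0
```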
/te-ci
Signed-off-by: Hongxiao Bai <hongxiaob@nvidia.com>
Description
The previous ClampedSwiGLU implementation follows GPT-OSS, which hard-codes the offset to 1.0.
DeepSeek-V4 uses ClampedSwiGLU without the alpha scaling and without the offset.
This PR makes the offset of ClampedSwiGLU configurable to support DeepSeek-V4.
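A hedged usage sketch of the new parameter; the import path and constructor signature are assumptions based on the names in this PR, not a verified API, and the limit values are illustrative:

```python
from transformer_engine.pytorch.ops import ClampedSwiGLU  # assumed import path

# GPT-OSS behavior (matches the previous hard-coded defaults):
gpt_oss_act = ClampedSwiGLU(limit=7.0, alpha=1.702, glu_linear_offset=1.0)

# DeepSeek-V4 style: no alpha scaling, no linear offset:
deepseek_v4_act = ClampedSwiGLU(limit=7.0, alpha=1.0, glu_linear_offset=0.0)
```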
Type of change
Changes
Please list the changes introduced in this PR:
Checklist: