feat(nvidia): add ntops rms norm backend #616
Conversation
option(WITH_TORCH "Enable PyTorch C++ backend" OFF)

option(WITH_NINETOOTHED "Enable NineToothed-generated NVIDIA kernels" OFF)
set(NINETOOTHED_PYTHON_EXECUTABLE "" CACHE FILEPATH "Python executable used to run ninetoothed code generation")
This part is mainly for declaring options; please move the batch of set calls below into a dedicated section.

Done. WITH_NINETOOTHED stays in the option area, and the cache variables below it have been moved into a separate "NineToothed code generation configuration" section.
_SUPPORTED_OPS = ("rms_norm",)


def _import_ninetoothed(source_dir):
Why not just import ninetoothed directly? Also, please don't use the abbreviation; use the full name ninetoothed.

Done. The implementation now lives in src/native/ninetoothed/codegen.py. _import_ninetoothed only adjusts sys.path when the optional source dir is needed and then runs import ninetoothed directly, and the variable name no longer uses the nt abbreviation.
    return nt


def _import_ntops():
Why not just import ntops directly?

Done. _rms_norm_premake now imports ntops directly instead of wrapping it through importlib or _import_ntops.
_DEFAULT_DTYPES = ("float32", "float16", "bfloat16")

_DEFAULT_RMS_NORM_SHAPES = (
Operator-specific content should live under src; scripts should only contain purely functional utilities or build-related scripts.

Done. scripts/generate_ninetoothed_ops.py is now only the build entry point: it adds src to sys.path and delegates to native.ninetoothed.codegen.main(). The operator-specific and generation logic has moved to src/native/ninetoothed/codegen.py.
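A thin entry point of the kind described here might be sketched as follows. The module path `native.ninetoothed.codegen` and the `main()` call come from the reply; the function name `run_codegen`, the `repo_root` parameter, and the idea that `main()` returns an exit status are assumptions made for illustration.

```python
import pathlib
import sys


def run_codegen(repo_root):
    """Make `<repo_root>/src` importable, then delegate to the codegen module."""
    src_dir = str(pathlib.Path(repo_root).resolve() / "src")

    if src_dir not in sys.path:
        sys.path.insert(0, src_dir)

    # Imported lazily because the package is only importable after the
    # `sys.path` adjustment above.
    from native.ninetoothed import codegen

    # Assumption: `main()` returns a value suitable for `sys.exit`.
    return codegen.main()
```

In a script placed under scripts/, the repository root could be derived from the script location, for example `sys.exit(run_codegen(pathlib.Path(__file__).resolve().parents[1]))`.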
    return importlib.import_module("ntops")


def _rms_norm_premake_rank2(dim0, dim1, dtype, block_size):
Same as above: this belongs somewhere appropriate under src, not under scripts. I won't repeat this for the similar issues below, but please fix them all together.

All fixed together. The RmsNorm premake wrapper, the rank/dtype config, and the manifest generation have all moved to src/native/ninetoothed/codegen.py; scripts no longer holds operator details.
    return arrangement, application, tensors


def _rms_norm_premake_rank3(dim0, dim1, dim2, dtype, block_size):
Why split by rank? Isn't the shape the only difference, so wouldn't passing a shape be enough?

Done. The functions were previously split by rank so that ninetoothed.build would generate different launcher parameters. The code now uses the dynamic-rank premake that ships with ntops and generates a single infiniops_ninetoothed_rms_norm dispatcher keyed only by ndim/dtype, so the Python side no longer has separate rank2/rank3 premakes.
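Generating "only by ndim/dtype" amounts to enumerating the cross product of supported ranks and dtypes rather than writing one function per rank. A minimal sketch, assuming the dtype names from the `_DEFAULT_DTYPES` tuple in the diff and a hypothetical helper name `rms_norm_variants`:

```python
import itertools

# Dtype names as they appear in the diff context above.
_DEFAULT_DTYPES = ("float32", "float16", "bfloat16")


def rms_norm_variants(ndims, dtypes=_DEFAULT_DTYPES):
    """Return one `(ndim, dtype)` pair per kernel variant to generate."""
    return list(itertools.product(ndims, dtypes))
```

With `ndims=(2, 3)` this yields six pairs, one per rank and dtype combination. How each pair is passed to the `ntops` premake and to `ninetoothed.build` is not shown in this conversation and is left out here.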
    return arrangement, application, tensors


def _parse_shape(value):
Removed. Nothing is compiled per concrete shape anymore, so there is no need to parse shape strings such as 1x64; the configuration is now INFINIOPS_NINETOOTHED_RMS_NORM_NDIMS / --rms-norm-ndims.
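Parsing a list of ranks is simpler than parsing shape strings. The sketch below shows one way a `--rms-norm-ndims` flag could be handled with `argparse`. The accepted separators (`,` and `;`), the default of `(2, 3)`, and the helper names are assumptions; the PR does not show the actual format.

```python
import argparse


def parse_ndims(value):
    """Parse a list such as "2,3" or "2;3" into a sorted tuple of unique ranks."""
    tokens = [token.strip() for token in value.replace(";", ",").split(",")]
    ndims = sorted({int(token) for token in tokens if token})

    if not ndims or ndims[0] < 1:
        raise argparse.ArgumentTypeError(f"invalid ndims list: {value!r}")

    return tuple(ndims)


def build_parser():
    """Build a parser exposing only the rank list (hypothetical subset)."""
    parser = argparse.ArgumentParser(description="Generate NineToothed kernels")
    # Assumed default: the ranks exercised by the smoke test in this PR.
    parser.add_argument("--rms-norm-ndims", type=parse_ndims, default=(2, 3))

    return parser
```

Accepting `;` as well as `,` lets a CMake list value be forwarded unchanged, since CMake separates list elements with semicolons.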
namespace detail {

inline int NineToothedRmsNormDTypeIndex(DataType dtype) {
Shouldn't helpers like this be common to all of NineToothed? Don't put them under rms_norm.

Done. DTypeIndex, SizeArg, FromTensor, and FromScalar now live in src/native/ninetoothed/tensor.h as common NineToothed helpers; rms_norm/ninetoothed.h keeps only operator-specific adaptation such as ExpandedRmsNormWeight and the generated launcher call.
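The operator-specific adaptation mentioned here presents a 1D weight as a tensor with the same rank as the input. One common way to do that without copying is a zero-stride broadcast view, sketched below in Python for brevity. This is an assumption about the technique: the PR does not show how `ExpandedRmsNormWeight` is implemented, the real helper is C++, and `expanded_weight_view` is a hypothetical name.

```python
def expanded_weight_view(input_shape, weight_shape):
    """Return `(shape, strides)` that broadcast a 1D weight over `input_shape`.

    Strides are counted in elements and assume a contiguous weight; a stride
    of 0 makes every index along that dimension read the same element.
    """
    if len(weight_shape) != 1:
        raise ValueError("weight must be 1D")

    if not input_shape or input_shape[-1] != weight_shape[0]:
        raise ValueError("weight length must match the last input dimension")

    # Leading dimensions reuse the same weight row (stride 0); only the last
    # dimension walks through the weight elements (stride 1).
    strides = (0,) * (len(input_shape) - 1) + (1,)

    return tuple(input_shape), strides
```

For an input of shape `(4, 8, 64)` and a weight of shape `(64,)`, the view has shape `(4, 8, 64)` and strides `(0, 0, 1)`.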
fa89de9 to 0ad2354
Summary
- […] `WITH_NINETOOTHED`, driven by `ntops.kernels.rms_norm.premake` and `ninetoothed.build`.
- […] `RmsNorm` implementation that launches generated NineToothed kernels and adapts InfiniOps' 1D weight tensor to the expanded tensor view expected by `ntops`.
- […] `scripts/generate_ninetoothed_ops.py` as a build entrypoint and move the operator-specific codegen implementation under `src/native/ninetoothed`.
- […] `ntops` premake path and generated CMake manifest behavior.

Motivation
`ntops` already provides the operator premake/application logic, so InfiniOps should only own the integration layer: selecting supported ranks/dtypes, running `ninetoothed.build`, compiling the generated sources, and dispatching through the existing implementation index mechanism.

This PR starts with `RmsNorm` because it is currently available in `InfiniTensor/ntops`; `Swiglu` is intentionally left out until `ntops` exposes a corresponding premake.

Closes # N/A
Type of Change
- `feat` - new feature / new operator / new platform
- `fix` - bug fix
- `perf` - performance improvement (no behavioral change)
- `refactor` - code restructuring without behavior change
- `test` - adding or fixing tests only
- `docs` - documentation only
- `build` / `ci` - build system or CI configuration
- `chore` - tooling, formatting, or other non-code changes
- […] `!` in the Conventional Commits prefix or a `BREAKING CHANGE:` footer)

Platforms Affected
- […] `WITH_CPU`)
- […] `WITH_NVIDIA`)
- […] `WITH_ILUVATAR`)
- […] `WITH_METAX`)
- […] `WITH_CAMBRICON`)
- […] `WITH_MOORE`)
- […] `WITH_ASCEND`)
- […] `WITH_TORCH`)

Test Results on Supported Platforms
`pytest` Result: `ssh nvidia`, `infiniops-ci/nvidia:latest`; built `ops` with `WITH_NVIDIA=ON`, `WITH_NINETOOTHED=ON`, `GENERATE_PYTHON_BINDINGS=ON`; ran `RmsNorm` slot 9 smoke vs. PyTorch on rank-2 and rank-3 `float32`, max error `2.38e-07`, `allclose=True`.

Full `pytest` output (optional)
Benchmark / Performance Impact
N/A. This PR wires a generated backend path and only includes correctness smoke tests; no benchmark claim is made.
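For context on what such a correctness smoke check measures, the sketch below computes a reference RMS norm for one row in pure Python and compares two results by maximum absolute error and by an `allclose`-style test. It is an illustration only: the PR's actual test compares against PyTorch, the default `eps` here is an assumed value, and the tolerances mirror the documented defaults of `torch.allclose` rather than anything stated in the PR.

```python
import math


def rms_norm_reference(row, weight, eps=1e-6):
    """Reference RMS norm over one row: x / sqrt(mean(x^2) + eps) * w."""
    mean_square = sum(x * x for x in row) / len(row)
    scale = 1.0 / math.sqrt(mean_square + eps)

    return [x * scale * w for x, w in zip(row, weight)]


def max_abs_error(actual, expected):
    """Largest element-wise absolute difference."""
    return max(abs(a - e) for a, e in zip(actual, expected))


def all_close(actual, expected, rtol=1e-5, atol=1e-8):
    """Element-wise check |a - e| <= atol + rtol * |e|."""
    return all(
        abs(a - e) <= atol + rtol * abs(e) for a, e in zip(actual, expected)
    )
```

A backend under test would supply `actual`, with `rms_norm_reference` supplying `expected` for each row along the normalized dimension.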
Notes for Reviewers
- `scripts/generate_ninetoothed_ops.py` is now only a build entrypoint; operator-specific generation lives in `src/native/ninetoothed/codegen.py`.
- `ntops.kernels.rms_norm.premake` is used directly for the arrangement/application. The generator no longer replaces premake tensors with concrete-shape tensors; it generates by supported ranks via `INFINIOPS_NINETOOTHED_RMS_NORM_NDIMS`.
- […] `NineToothedTensor` even for scalar `eps` and `num_normalized_elements`; the slot 9 adapter wraps those scalars accordingly.
- […] `src/native/ninetoothed/tensor.h`; the `RmsNorm` header only keeps the 1D-weight expansion and generated launcher call.
- `ntops` is expected to be importable in the selected Python environment. A local `ninetoothed` checkout is still optionally supported through `NINETOOTHED_SOURCE_DIR` for development.

Checklist
Title, Branch, and Commits
- […] `feat(nvidia): ...`, `fix(cuda/gemm): ...`).
- […] `<type>/xxx-yyyy-zzzz` where `<type>` matches the PR title's Conventional Commits type and words are joined with hyphens (see `CONTRIBUTING.md` §Branches).
- […] `CONTRIBUTING.md` §Pull Requests).
- […] `master` - the branch is rebased cleanly on top of the current `master`.
- […] `fixup!` / `squash!` / `wip` commits remain.

Scope and Design
- […] `CONTRIBUTING.md` §Code/General).
- […] `printf` / `std::cout` / `print(...)` left behind, or `TODO` without an owner and issue link.

General Code Hygiene (applies to all languages)
- […] `CONTRIBUTING.md` §Code/General).
- […] `CONTRIBUTING.md` §Code/General).
- […] the `seqlens_k` tensor) (`CONTRIBUTING.md` §Code/General).
- […] `CONTRIBUTING.md` §Code/General).
- […] `CONTRIBUTING.md` §Code/General; §Python).

C++ Specific (if C++ files changed)
- `clang-format` (version 21, per `.github/workflows/clang-format.yml`) has been run against all modified `.h`, `.cc`, `.cuh`, and `.mlu` files; the diff is clean. Local and NVIDIA CI-container `clang-format` were not available.
- `clang-tidy` concerns (per `.clang-tidy`) have been reviewed - no new warnings beyond the existing baseline. Not run locally.
- […] `CONTRIBUTING.md` §C++).
- […] `assert` style.
- […] `CONTRIBUTING.md` §C++).
- […] `ninetoothed.build` into the build directory.
- […] `ninetoothed.build`; this PR adds only the InfiniOps adapter header.
- […] `CONTRIBUTING.md` §C++).
- […] `CONTRIBUTING.md` §C++).
- […] `CONTRIBUTING.md` §C++).
- […] `RmsNorm`.
- […] `new` / `delete`; RAII / smart pointers / existing allocators are used.

Python Specific (if Python files changed)
- `ruff check` passes cleanly on CI (see `.github/workflows/ruff.yml`). Verified in `infiniops-ci/nvidia:latest`.
- `ruff format --check` passes cleanly. Verified in `infiniops-ci/nvidia:latest`.
- […] `CONTRIBUTING.md` §Python).
- […] `pytest.skip` messages without terminal period) are honored where applicable (`CONTRIBUTING.md` §Python).
- […] `CONTRIBUTING.md` §Python).
- […] `if`, `for`, and similar control-flow statements (`CONTRIBUTING.md` §Python).
- […] `return`, except when it directly follows a control-flow statement (`CONTRIBUTING.md` §Python).

Testing
- `pytest` was run locally on every supported platform that this PR can affect, and the results are recorded in the "Test Results" table above (`CONTRIBUTING.md` §Pull Requests). Only targeted local and NVIDIA smoke checks were run.
- […] `tests/` following `tests/test_add.py` / `tests/test_gemm.py` patterns (`CONTRIBUTING.md` §Adding an Operator).
- […] `pytest.mark.parametrize` parameters.
- […] `unittest` test for code generation.

Build, CI, and Tooling
- […] `pip install .[dev]` on at least one affected platform. Not run; targeted CMake/Ninja build was run in the NVIDIA CI image.
- `compile_commands.json` still regenerates (CMake option `CMAKE_EXPORT_COMPILE_COMMANDS=ON` in `pyproject.toml` - required by the `code-lint` skill and `clang-tidy -p`).
- […] `CMakeLists.txt` is not broken.
- […] `clang-format.yml`, `ruff.yml`) are green locally (or expected to be green on CI). `ruff` was verified in the NVIDIA CI container; `clang-format` was unavailable locally and in that container.
- […] `pyproject.toml`'s `[project.optional-dependencies]` (or justified in the PR description). `ntops` is required only when explicitly enabling `WITH_NINETOOTHED` codegen.

Documentation
- `README.md`, `CONTRIBUTING.md`, or inline docs updated when behavior, build flags, or developer workflow changed. Build flags are introduced in CMake only; no user-facing docs were added in this PR.
- […] `CONTRIBUTING.md` §Some Code Explanations).

Security and Safety