
feat(nvidia): add ntops rms norm backend #616

Draft
voltjia wants to merge 1 commit into master from feat/nvidia-ntops-rms-norm

Conversation

voltjia (Collaborator) commented May 20, 2026

Summary

  • Add optional NineToothed code generation for NVIDIA behind WITH_NINETOOTHED, driven by ntops.kernels.rms_norm.premake and ninetoothed.build.
  • Add an NVIDIA slot 9 RmsNorm implementation that launches generated NineToothed kernels and adapts InfiniOps' 1D weight tensor to the expanded tensor view expected by ntops.
  • Keep scripts/generate_ninetoothed_ops.py as a build entrypoint and move the operator-specific codegen implementation under src/native/ninetoothed.
  • Add a small generator unit test covering the ntops premake path and generated CMake manifest behavior.
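
The 1D-weight adaptation mentioned in the second bullet can be pictured as a zero-stride broadcast. The following is a minimal sketch, assuming the expanded view simply repeats the weight over every leading dimension of the input. The helper name `expand_weight_view` and the (shape, strides) tuple representation are hypothetical and are not taken from this PR.

```python
def expand_weight_view(input_shape, weight_shape, weight_strides):
    """Return (shape, strides) of a 1D weight viewed with the input's rank.

    Leading dimensions get stride 0, so every row reads the same weight
    elements without copying. Hypothetical helper for illustration only.
    """
    if len(weight_shape) != 1 or len(weight_strides) != 1:
        raise ValueError("weight must be 1D")

    if not input_shape or input_shape[-1] != weight_shape[0]:
        raise ValueError("weight length must match the last input dimension")

    leading = len(input_shape) - 1
    shape = tuple(input_shape)
    strides = (0,) * leading + (weight_strides[0],)

    return shape, strides


shape, strides = expand_weight_view((2, 3, 17), (17,), (1,))
print(shape, strides)  # (2, 3, 17) (0, 0, 1)
```

With this representation the kernel can index the weight exactly like the input, which may be why the adapter expands the view instead of special-casing a 1D argument.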

Motivation

ntops already provides the operator premake/application logic, so InfiniOps should only own the integration layer: selecting supported ranks/dtypes, running ninetoothed.build, compiling the generated sources, and dispatching through the existing implementation index mechanism.

This PR starts with RmsNorm because it is currently available in InfiniTensor/ntops; Swiglu is intentionally left out until ntops exposes a corresponding premake.

Closes # N/A

Type of Change

  • feat - new feature / new operator / new platform
  • fix - bug fix
  • perf - performance improvement (no behavioral change)
  • refactor - code restructuring without behavior change
  • test - adding or fixing tests only
  • docs - documentation only
  • build / ci - build system or CI configuration
  • chore - tooling, formatting, or other non-code changes
  • Breaking change (requires a ! in the Conventional Commits prefix or a BREAKING CHANGE: footer)

Platforms Affected

  • CPU (WITH_CPU)
  • NVIDIA (WITH_NVIDIA)
  • Iluvatar (WITH_ILUVATAR)
  • MetaX (WITH_METAX)
  • Cambricon (WITH_CAMBRICON)
  • Moore (WITH_MOORE)
  • Ascend (WITH_ASCEND)
  • PyTorch C++ bindings (WITH_TORCH)
  • Build system / CMake / CI
  • Python bindings / user-facing API

Test Results on Supported Platforms

| Platform | Built | pytest Result | Notes / Hardware |
| --- | --- | --- | --- |
| NVIDIA | Yes | Targeted smoke passed | Remote ssh nvidia, infiniops-ci/nvidia:latest; built ops with WITH_NVIDIA=ON, WITH_NINETOOTHED=ON, GENERATE_PYTHON_BINDINGS=ON; ran RmsNorm slot 9 smoke vs. PyTorch on rank-2 and rank-3 float32, max error 2.38e-07, allclose=True. |
| Iluvatar | No | Not run | Not touched; no hardware used for this PR. |
| MetaX | No | Not run | Not touched; no hardware used for this PR. |
| Cambricon | No | Not run | Not touched; no hardware used for this PR. |
| Moore | No | Not run | Not touched; no hardware used for this PR. |
| Ascend | No | Not run | Not touched; no hardware used for this PR. |
Full `pytest` output (optional)
$ python3 -m py_compile scripts/generate_ninetoothed_ops.py src/native/ninetoothed/codegen.py tests/test_generate_ninetoothed_ops.py && python3 tests/test_generate_ninetoothed_ops.py && git diff --check
.
----------------------------------------------------------------------
Ran 1 test in 0.011s

OK

Remote Python checks:
$ ruff check scripts/generate_ninetoothed_ops.py src/native/ninetoothed/codegen.py tests/test_generate_ninetoothed_ops.py && ruff format --check scripts/generate_ninetoothed_ops.py src/native/ninetoothed/codegen.py tests/test_generate_ninetoothed_ops.py
All checks passed!
3 files already formatted

Remote NVIDIA build:
- `ssh nvidia`, `infiniops-ci/nvidia:latest`.
- `ntops` installed with `python3 -m pip install --no-deps git+https://github.com/InfiniTensor/ntops.git` after installing `ninetoothed==0.25.0`, `triton==3.7.0`, and `numpy>=1.26.4,<2` in the validation container.
- Built `ops` target with `WITH_NVIDIA=ON`, `WITH_NINETOOTHED=ON`, and `GENERATE_PYTHON_BINDINGS=ON`.
- Ran `ops.RmsNorm.active_implementation_indices('cuda')`: `[0, 9]`.
- Ran `ops.rms_norm(..., implementation_index=9)` on shapes `(13, 4)` and `(2, 3, 17)`, dtype `float32`.
- Compared against `torch.nn.functional.rms_norm`: max errors `2.384185791015625e-07` and `2.384185791015625e-07`, `torch.allclose(...)=True` for both.
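
The reference side of this comparison can be approximated without a GPU. The following is a minimal pure-Python sketch, assuming normalization over the last dimension only and the usual definition y = x / sqrt(mean(x^2) + eps) * weight. The names `rms_norm_reference` and `max_abs_error` are illustrative; the smoke test itself used `torch.nn.functional.rms_norm`, not this code.

```python
import math


def rms_norm_reference(rows, weight, eps=1e-6):
    """Apply RMS normalization to each row (a list of floats)."""
    out = []

    for row in rows:
        mean_square = sum(v * v for v in row) / len(row)
        scale = 1.0 / math.sqrt(mean_square + eps)
        out.append([v * scale * w for v, w in zip(row, weight)])

    return out


def max_abs_error(actual, expected):
    """Largest element-wise absolute difference between two 2D lists."""
    return max(
        abs(a - e)
        for row_a, row_e in zip(actual, expected)
        for a, e in zip(row_a, row_e)
    )


x = [[3.0, 4.0], [1.0, -1.0]]
y = rms_norm_reference(x, weight=[1.0, 2.0])
print(max_abs_error(y, y))  # 0.0
```

A rank-3 input such as `(2, 3, 17)` can be checked the same way by flattening the leading dimensions into rows first.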

Benchmark / Performance Impact

N/A. This PR wires a generated backend path and only includes correctness smoke tests; no benchmark claim is made.

Notes for Reviewers

  • scripts/generate_ninetoothed_ops.py is now only a build entrypoint; operator-specific generation lives in src/native/ninetoothed/codegen.py.
  • ntops.kernels.rms_norm.premake is used directly for the arrangement/application. The generator no longer replaces premake tensors with concrete-shape tensors; it generates by supported ranks via INFINIOPS_NINETOOTHED_RMS_NORM_NDIMS.
  • NineToothed uses NineToothedTensor even for scalar eps and num_normalized_elements; the slot 9 adapter wraps those scalars accordingly.
  • Common NineToothed tensor/dtype/size/scalar adapters live in src/native/ninetoothed/tensor.h; the RmsNorm header only keeps the 1D-weight expansion and generated launcher call.
  • ntops is expected to be importable in the selected Python environment. A local ninetoothed checkout is still optionally supported through NINETOOTHED_SOURCE_DIR for development.
  • Full default pytest and all-platform CI were not run in this iteration.
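
Generating by supported ranks rather than concrete shapes implies a small configuration step. The sketch below assumes `INFINIOPS_NINETOOTHED_RMS_NORM_NDIMS` holds a comma-separated list such as `2,3`; the actual format is not stated in this PR, and `parse_ndims` and `enumerate_configs` are hypothetical names. The dtype tuple mirrors `_DEFAULT_DTYPES` from the outdated script diff quoted in the review threads.

```python
import itertools
import os

_DEFAULT_DTYPES = ("float32", "float16", "bfloat16")


def parse_ndims(value):
    """Parse a comma-separated rank list like "2,3" into a sorted tuple."""
    ndims = {int(part) for part in value.split(",") if part.strip()}

    if not ndims or min(ndims) < 1:
        raise ValueError(f"invalid ndims list: {value!r}")

    return tuple(sorted(ndims))


def enumerate_configs(ndims, dtypes=_DEFAULT_DTYPES):
    """Return every (ndim, dtype) pair that one kernel variant would cover."""
    return list(itertools.product(ndims, dtypes))


# Assumed default of "2,3" when the variable is unset.
raw = os.environ.get("INFINIOPS_NINETOOTHED_RMS_NORM_NDIMS", "2,3")
configs = enumerate_configs(parse_ndims(raw))
```

Each resulting pair would correspond to one generated kernel variant, so the number of variants grows with ranks times dtypes instead of with the number of shapes.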

Checklist

Title, Branch, and Commits

  • PR title follows Conventional Commits (e.g. feat(nvidia): ..., fix(cuda/gemm): ...).
  • Branch name follows <type>/xxx-yyyy-zzzz where <type> matches the PR title's Conventional Commits type and words are joined with hyphens (see CONTRIBUTING.md §Branches).
  • Each commit message follows Conventional Commits.
  • Small PR is a single squashable commit; or, for a large PR, every commit is meaningful, well-formed, and independently reviewable (see CONTRIBUTING.md §Pull Requests).
  • No stray merge commits from master - the branch is rebased cleanly on top of the current master.
  • No fixup! / squash! / wip commits remain.

Scope and Design

  • Changes are minimal - nothing unrelated to the stated motivation was added (CONTRIBUTING.md §Code/General).
  • No dead code, commented-out blocks, debug prints, printf/std::cout/print(...) left behind, or TODO without an owner and issue link.
  • No unrelated formatting churn that would obscure the diff.
  • Public API changes (if any) are intentional, documented, and reflected in affected callers/tests.

General Code Hygiene (applies to all languages)

  • The code is self-explanatory; comments were added only where the why is non-obvious (CONTRIBUTING.md §Code/General).
  • Every modified or added file ends with a single trailing newline (CONTRIBUTING.md §Code/General).
  • No trailing whitespace, tab/space mixing, or stray BOMs.
  • Identifiers in comments and error messages are wrapped in backticks (e.g. the `seqlens_k` tensor) (CONTRIBUTING.md §Code/General).
  • All comments and error messages are in English (CONTRIBUTING.md §Code/General).
  • Comments and error messages are complete sentences - capitalized first letter, terminal punctuation - unless the language/framework convention says otherwise (CONTRIBUTING.md §Code/General; §Python).

C++ Specific (if C++ files changed)

  • Code follows the Google C++ Style Guide strictly.
  • clang-format (version 21, per .github/workflows/clang-format.yml) has been run against all modified .h, .cc, .cuh, and .mlu files; the diff is clean. Local and NVIDIA CI-container clang-format were not available.
  • clang-tidy concerns (per .clang-tidy) have been reviewed - no new warnings beyond the existing baseline. Not run locally.
  • Operator parameter order is inputs first, outputs last; attributes are between inputs and outputs; naming follows PyTorch -> ONNX -> CUDA API precedence (CONTRIBUTING.md §C++).
  • No exceptions are thrown. Error paths use existing project assert style.
  • Error and warning message wording follows the LLVM Coding Standards (CONTRIBUTING.md §C++).
  • N/A - Kernel files are generated by ninetoothed.build into the build directory.
  • N/A - Kernel and kernel launcher are generated by ninetoothed.build; this PR adds only the InfiniOps adapter header.
  • N/A - No constructor initializer lists were added.
  • Exactly one blank line between classes, between classes and functions, and between functions (CONTRIBUTING.md §C++).
  • Exactly one blank line between members (functions and variables) within a class (CONTRIBUTING.md §C++).
  • Exactly one blank line before and after the contents of a namespace (CONTRIBUTING.md §C++).
  • N/A - No new base operator was added; this implements an additional NVIDIA backend slot for existing RmsNorm.
  • No raw new/delete; RAII / smart pointers / existing allocators are used.

Python Specific (if Python files changed)

  • Code is PEP 8 compliant; ruff check passes cleanly on CI (see .github/workflows/ruff.yml). Verified in infiniops-ci/nvidia:latest.
  • ruff format --check passes cleanly. Verified in infiniops-ci/nvidia:latest.
  • Comments are complete English sentences, starting with a capital letter and ending with punctuation; Markdown backticks are used for code references (CONTRIBUTING.md §Python).
  • Framework-specific conventions (e.g. lowercase pytest.skip messages without terminal period) are honored where applicable (CONTRIBUTING.md §Python).
  • No blank line between the function signature and the body when there is no docstring or comment (CONTRIBUTING.md §Python).
  • A blank line is present before and after if, for, and similar control-flow statements (CONTRIBUTING.md §Python).
  • A blank line appears before each return, except when it directly follows a control-flow statement (CONTRIBUTING.md §Python).
  • N/A - No docstrings were added.
  • Type hints are consistent with the surrounding script style.

Testing

  • pytest was run locally on every supported platform that this PR can affect, and the results are recorded in the "Test Results" table above (CONTRIBUTING.md §Pull Requests). Only targeted local and NVIDIA smoke checks were run.
  • For any platform that could not be tested, an explicit reason is given in the table and a reviewer with access has been tagged.
  • New functionality has matching tests under tests/ following tests/test_add.py / tests/test_gemm.py patterns (CONTRIBUTING.md §Adding an Operator).
  • N/A - The new generator unit test does not use dependent pytest.mark.parametrize parameters.
  • N/A - The new generator unit test is a unittest test for code generation.
  • N/A - The new generator unit test does not use default dtype/device parameterization.
  • N/A - No known parallel flakiness.
  • N/A - This is a feature PR, not a bug fix.

Build, CI, and Tooling

  • The project builds cleanly from a fresh directory with pip install .[dev] on at least one affected platform. Not run; targeted CMake/Ninja build was run in the NVIDIA CI image.
  • compile_commands.json still regenerates (CMake option CMAKE_EXPORT_COMPILE_COMMANDS=ON in pyproject.toml - required by the code-lint skill and clang-tidy -p).
  • N/A - No new backend/device was added.
  • Only one CUDA-like GPU backend is selectable at a time - the existing mutual-exclusion check in CMakeLists.txt is not broken.
  • Both CI workflows (clang-format.yml, ruff.yml) are green locally (or expected to be green on CI). ruff was verified in the NVIDIA CI container; clang-format was unavailable locally and in that container.
  • No new runtime dependency was added without updating pyproject.toml's [project.optional-dependencies] (or justified in the PR description). ntops is required only when explicitly enabling WITH_NINETOOTHED codegen.

Documentation

  • README.md, CONTRIBUTING.md, or inline docs updated when behavior, build flags, or developer workflow changed. Build flags are introduced in CMake only; no user-facing docs were added in this PR.
  • New operators, new dispatch helpers, or new public utilities are documented (docstring, header comment, or an addition to CONTRIBUTING.md §Some Code Explanations).
  • N/A - No user-visible breaking change.

Security and Safety

  • No secrets, access tokens, internal URLs, customer data, or personal hardware identifiers have been committed.
  • Third-party code is license-compatible and attributed.
  • No unsafe pointer arithmetic, uninitialized reads, or missing bounds checks were introduced.

Comment thread CMakeLists.txt Outdated
option(WITH_TORCH "Enable PyTorch C++ backend" OFF)

option(WITH_NINETOOTHED "Enable NineToothed-generated NVIDIA kernels" OFF)
set(NINETOOTHED_PYTHON_EXECUTABLE "" CACHE FILEPATH "Python executable used to run ninetoothed code generation")
Collaborator Author

This part is mainly for declaring options. Please move the set calls below into a dedicated section.

Collaborator Author

Fixed. WITH_NINETOOTHED stays in the option area, and the cache variables below it have been moved into a separate NineToothed code generation configuration section.

Comment thread scripts/generate_ninetoothed_ops.py Outdated
_SUPPORTED_OPS = ("rms_norm",)


def _import_ninetoothed(source_dir):
Collaborator Author

Why not simply import ninetoothed? Also, please do not use an abbreviation; use the full name ninetoothed.

Collaborator Author

Fixed. The implementation now lives in src/native/ninetoothed/codegen.py. _import_ninetoothed only adjusts sys.path when the optional source dir is needed and then does a plain import ninetoothed; the variable name no longer uses the nt abbreviation.

Comment thread scripts/generate_ninetoothed_ops.py Outdated
return nt


def _import_ntops():
Collaborator Author

Why not simply import ntops?

Collaborator Author

Fixed. _rms_norm_premake now imports ntops directly instead of wrapping the import in an importlib-based _import_ntops helper.

Comment thread scripts/generate_ninetoothed_ops.py Outdated

_DEFAULT_DTYPES = ("float32", "float16", "bfloat16")

_DEFAULT_RMS_NORM_SHAPES = (
Collaborator Author

Operator-specific content belongs in src; scripts should contain only general-purpose utilities or build-related scripts.

Collaborator Author

Fixed. scripts/generate_ninetoothed_ops.py is now only the build entrypoint: it adds src to sys.path and delegates to native.ninetoothed.codegen.main(). The operator-specific and generation logic now lives in src/native/ninetoothed/codegen.py.

Comment thread scripts/generate_ninetoothed_ops.py Outdated
return importlib.import_module("ntops")


def _rms_norm_premake_rank2(dim0, dim1, dtype, block_size):
Collaborator Author

Same as above: this belongs in an appropriate place under src, not under scripts. I will not repeat this for the other similar cases below, but please fix them all together.

Collaborator Author

All fixed. The RmsNorm premake wrapper, the rank/dtype config, and the manifest generation have all moved to src/native/ninetoothed/codegen.py; scripts no longer holds operator details.

Comment thread scripts/generate_ninetoothed_ops.py Outdated
return arrangement, application, tensors


def _rms_norm_premake_rank3(dim0, dim1, dim2, dtype, block_size):
Collaborator Author

Why split by rank? Isn't the shape the only difference, so passing a shape would be enough?

Collaborator Author

Fixed. The functions were previously split by rank so that ninetoothed.build would generate different launcher parameters. The code now uses the dynamic-rank premake that ships with ntops and generates a single infiniops_ninetoothed_rms_norm dispatcher keyed only by ndim/dtype; the Python side no longer has separate rank2/rank3 premakes.
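
The idea of a single dispatcher keyed by ndim/dtype can be modeled in a few lines of Python. This is only a conceptual sketch: the real infiniops_ninetoothed_rms_norm dispatcher is generated native code, and `make_rms_norm_dispatcher` together with the stand-in kernels below are hypothetical.

```python
def make_rms_norm_dispatcher(kernels):
    """Build a dispatcher over a {(ndim, dtype): callable} table."""

    def dispatch(ndim, dtype, *args):
        kernel = kernels.get((ndim, dtype))

        if kernel is None:
            raise NotImplementedError(
                f"no rms_norm kernel generated for ndim={ndim}, dtype={dtype}"
            )

        return kernel(*args)

    return dispatch


# Stand-in kernels that only report which variant was selected.
kernels = {
    (ndim, dtype): (lambda *args, n=ndim, d=dtype: f"rms_norm_{n}d_{d}")
    for ndim in (2, 3)
    for dtype in ("float32", "float16")
}
dispatch = make_rms_norm_dispatcher(kernels)
print(dispatch(3, "float16"))  # rms_norm_3d_float16
```

Callers see one entry point regardless of rank, and an unsupported (ndim, dtype) pair fails at the lookup rather than inside a kernel.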

Comment thread scripts/generate_ninetoothed_ops.py Outdated
return arrangement, application, tensors


def _parse_shape(value):
Collaborator Author

What is this function for?

Collaborator Author

Removed. Kernels are no longer compiled per concrete shape, so there is no need to parse shape strings such as 1x64; the configuration is now INFINIOPS_NINETOOTHED_RMS_NORM_NDIMS / --rms-norm-ndims.


namespace detail {

inline int NineToothedRmsNormDTypeIndex(DataType dtype) {
Collaborator Author

Shouldn't helpers like this be common to all of NineToothed? Do not put them under rms_norm.

Collaborator Author

Fixed. DTypeIndex, SizeArg, FromTensor, and FromScalar now live in src/native/ninetoothed/tensor.h as common NineToothed helpers; rms_norm/ninetoothed.h keeps only operator-specific adaptation such as ExpandedRmsNormWeight and the generated launcher call.

@voltjia voltjia force-pushed the feat/nvidia-ntops-rms-norm branch from fa89de9 to 0ad2354 Compare May 20, 2026 11:23