feat(pt_expt): multi-rank LAMMPS support for GNN models (DPA3 / DPA2 / spin)#5430
wanghan-iapcm wants to merge 34 commits into deepmodeling:master
Conversation
Lifts the per-layer node_ebd_ext gather inside DescrptBlockRepflows.call
and DescrptBlockRepformers.call into a new _exchange_ghosts(...) method
so subclasses can override it. Default behaviour is byte-identical to
before for non-parallel inference (comm_dict is None).
Threads an optional comm_dict kwarg through:
- make_model.call_common_lower / forward_common_atomic
- {base,dp,linear,pairtab}_atomic_model
- dpa1/dpa2/dpa3/hybrid/se_* descriptors
- repflows/repformers blocks
Non-GNN descriptors accept and ignore comm_dict (noqa-marked unused).
DPA2 routes around its pre-block gather when comm_dict is supplied so
the repformers' per-layer override drives ghost exchange instead.
This is the dpmodel-side groundwork for pt_expt multi-rank LAMMPS
support; default behaviour unchanged.
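A minimal sketch of the hook pattern this describes, with hypothetical simplified names and shapes (numpy stands in for the array API; the real blocks thread many more tensors per layer):

```python
import numpy as np

class RepBlockSketch:
    """Hypothetical stand-in for DescrptBlockRepflows/Repformers."""

    def _exchange_ghosts(self, node_ebd, mapping_tiled, comm_dict):
        # Default (serial) path: rebuild the extended embedding for all
        # nall atoms by gathering from the nloc local rows; comm_dict is
        # accepted but ignored, matching the dpmodel default.
        del comm_dict
        return np.take_along_axis(node_ebd, mapping_tiled, axis=1)

    def call(self, node_ebd, mapping_tiled, comm_dict=None, nlayers=2):
        nloc = node_ebd.shape[1]
        for _ in range(nlayers):
            # One hook per layer: a parallel subclass overrides
            # _exchange_ghosts to run MPI ghost exchange here instead.
            node_ebd_ext = self._exchange_ghosts(node_ebd, mapping_tiled, comm_dict)
            node_ebd = node_ebd + node_ebd_ext[:, :nloc]  # toy layer update
        return node_ebd

nf, nloc, nall, dim = 1, 4, 6, 8
ebd = np.random.rand(nf, nloc, dim)
mapping = np.tile(np.arange(nall)[None, :, None] % nloc, (nf, 1, dim))
out = RepBlockSketch().call(ebd, mapping)
```

The point of the refactor is the single override seam: serial and parallel behaviour differ only in `_exchange_ghosts`.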
Refactors Border::backward into a free-function take/return interface (positional comm tensors + grad_g1, returns grad_in) and registers it as ``torch.ops.deepmd.border_op_backward``. The autograd Function's backward delegates to the new symbol, so existing pt-backend behaviour is unchanged; the new symbol is what pt_expt's opaque op wrapper (``deepmd_export::border_op``) dispatches to from its ``register_autograd`` callback.
The standalone op is needed because the ``custom_op`` API requires the backward to be expressible as a registered op (it cannot reference the autograd Function directly), and AOTInductor must serialise the call into the compiled .pt2.
…errides
Three changes that together let GNN models drive MPI ghost-atom
exchange through the pt_expt forward pass:
1. ``deepmd/pt_expt/utils/comm.py`` registers a NEW torch op
``deepmd_export::border_op`` via ``torch.library.custom_op``. The
wrapper:
- Forwards to the existing ``torch.ops.deepmd.border_op`` (clones
the in-place output to satisfy custom_op aliasing rules).
- Has a ``register_fake`` impl returning ``empty_like(g1)`` so
``torch.export`` / ``make_fx`` can trace through it.
- Has a ``register_autograd`` callback that dispatches to
``torch.ops.deepmd.border_op_backward`` (the standalone op
added in the previous commit).
The existing ``deepmd::border_op`` is registered as
``CompositeImplicitAutograd`` and therefore tries to decompose
into primitive aten ops during export — which fails because the
C++ kernel calls ``data_ptr()`` on FakeTensors. The new opaque
wrapper sidesteps this by being registered as an opaque op that
``torch.export`` records as a single black-box call (a sketch follows
after this commit's notes).
2. ``deepmd/pt_expt/descriptor/{repflows,repformers}.py`` add pt_expt
subclasses of ``DescrptBlockRepflows`` / ``DescrptBlockRepformers``
that override ``_exchange_ghosts``. When ``comm_dict is None`` the
override defers to the dpmodel default; otherwise it pads
``node_ebd`` to nall and calls the opaque wrapper. Includes the
spin-aware ``has_spin`` path (split real/virtual +
``concat_switch_virtual``) ported from pt's repflows.
3. ``forward_common_lower_exportable_with_comm`` is added on the
pt_expt CM (and SpinModel) classes. Same as the existing
``forward_common_lower_exportable`` but accepts the 8 comm tensors
as additional positional inputs and reconstructs ``comm_dict``
inside the traced function (spin variant injects ``has_spin`` so
the override takes the spin branch). This becomes the new traced
entry point for the with-comm AOTI artifact (next commit).
Existing pt_expt descriptor wrappers (dpa1, dpa2, se_*) and the
``CM.forward_common_atomic`` override get an extra ``comm_dict`` kwarg
that is plumbed straight through to the dpmodel call — no behavioural
change for ``comm_dict is None``.
Phase 0 de-risk experiment (scratch/derisk_border_op.py) verified that
the opaque wrapper survives ``torch.export.export`` +
``aoti_compile_and_package`` + ``aoti_load_package`` round-trips for
both forward and backward.
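A self-contained sketch of the opaque-wrapper pattern from item 1, under a hypothetical `demo::` namespace (the real wrapper forwards to the C++ `torch.ops.deepmd.border_op` / `border_op_backward` symbols):

```python
import torch

@torch.library.custom_op("demo::border_op", mutates_args=())
def demo_border_op(g1: torch.Tensor, send_list: torch.Tensor) -> torch.Tensor:
    # Real impl: call the C++ op, then clone the in-place output so the
    # custom_op aliasing contract (no input/output aliasing) holds.
    return g1.clone()

@demo_border_op.register_fake
def _(g1, send_list):
    # Shape/dtype-only result lets torch.export trace without data_ptr().
    return torch.empty_like(g1)

def _backward(ctx, grad_out):
    # Real impl dispatches to the standalone registered backward op; the
    # int tensor input gets no gradient.
    return grad_out.clone(), None

demo_border_op.register_autograd(_backward)

class Wrap(torch.nn.Module):
    def forward(self, g1, send_list):
        return demo_border_op(g1, send_list)

ep = torch.export.export(
    Wrap(), (torch.randn(4, 8), torch.zeros(4, dtype=torch.int32))
)  # exported as a single opaque call instead of being decomposed
```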
Three small follow-ups uncovered by the spin export-with-comm test:
1. ``dpmodel/model/spin_model.py::call_common_lower`` was missing
the ``comm_dict`` kwarg added by the Phase 1 plumbing. Added it
and forward to ``backbone_model.call_common_lower`` so spin GNN
models can drive parallel inference.
2. ``pt_expt/descriptor/repflows.py`` raises a clear ``RuntimeError``
when ``use_loc_mapping=True`` is combined with a non-None
``comm_dict``. The local-mapping codepath skips per-layer ghost
exchange entirely so combining it with ``comm_dict`` would
silently drop the parallel behaviour.
3. ``pt_expt/utils/comm.py`` ``_check_underlying_ops_loaded`` is
called on first wrapper invocation; surfaces a clearer error
when libdeepmd_op_pt.so is unloaded ("rebuild the pt custom-op
library") rather than the cryptic "torch.ops.deepmd has no
attribute 'border_op'" from torch's dispatcher.
Adds a ``with_comm_dict: bool`` flag to ``_trace_and_export`` and
``_make_sample_inputs``/``_build_dynamic_shapes``. When True, the
trace runs through ``forward_common_lower_exportable_with_comm``
(which threads 8 comm tensors as positional inputs and reconstructs
``comm_dict`` inside the traced function), and the resulting export
accepts comm tensors as additional positional inputs.
Constraints enforced for the with-comm trace:
* ``nframes=1`` static (the pt-parity override uses
squeeze(0)/unsqueeze(0) which only works for nb=1; LAMMPS always
drives one frame anyway). Avoids the regular-variants
``nframes=2`` collision-avoidance bumping (irrelevant when
nframes is static — duck-sizing only unifies dynamic dims).
* ``nswap`` static at the trace value. ``nswap`` is fixed once at
  LAMMPS init (it depends on the processor grid, which doesn't change
  at runtime), so the dim doesn't need to be dynamic.
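A sketch of how such constraints are typically expressed with `torch.export` dynamic shapes; `Toy`, the tensor shapes, and the spec layout here are hypothetical reductions of `_build_dynamic_shapes` (`None` marks a static dim, a `Dim` marks a dynamic one):

```python
import torch
from torch.export import Dim, export

class Toy(torch.nn.Module):
    def forward(self, coord, send_list):
        return coord.sum() + send_list.to(coord.dtype).sum()

nall = Dim("nall", min=1)  # atom axis stays dynamic
dynamic_shapes = {
    "coord": {1: nall},  # (nframes=1, nall, 3): the frame dim stays static
    "send_list": None,   # all dims static: nswap pinned at the trace value
}
ep = export(
    Toy(),
    (torch.randn(1, 6, 3), torch.zeros(4, dtype=torch.int32)),
    dynamic_shapes=dynamic_shapes,
)
```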
For GNN models, ``_deserialize_to_file_pt2`` now compiles BOTH the
regular and with-comm artifacts and packs the latter inside the .pt2
ZIP at ``extra/forward_lower_with_comm.pt2``. Metadata gains:
* ``has_message_passing`` (true if the descriptor has a GNN block).
* ``has_comm_artifact`` (true iff a with-comm artifact was packed).
Old .pt2 files lack these keys; the C++ loader (Phase 4) must default
to False when the field is missing.
The non-GNN path is unchanged: a single regular artifact + the
existing metadata layout, so existing .pt2 readers keep working.
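A sketch of the packing step, assuming the `extra/forward_lower_with_comm.pt2` entry name from this commit and a top-level `metadata.json` entry (the exact in-archive metadata path is an assumption; `pack_dual_artifact` is a hypothetical helper name):

```python
import json
import zipfile

def pack_dual_artifact(pt2_path, with_comm_bytes, metadata):
    # Nest the with-comm AOTI archive inside the regular .pt2 ZIP and
    # record the discovery flags the C++ loader reads (both must default
    # to False when missing, for old archives).
    meta = dict(metadata)
    meta["has_message_passing"] = True
    meta["has_comm_artifact"] = with_comm_bytes is not None
    with zipfile.ZipFile(pt2_path, "a") as zf:
        if with_comm_bytes is not None:
            zf.writestr("extra/forward_lower_with_comm.pt2", with_comm_bytes)
        zf.writestr("metadata.json", json.dumps(meta, indent=2))
```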
Five new test files covering the GNN MPI plumbing:
* test_repflow_parallel.py / test_repformer_parallel.py: eager parity for the DescrptBlockRepflows / DescrptBlockRepformers override. Single-rank self-exchange via a ctypes pointer-array sendlist; verifies the override output equals the dpmodel default for both with-mapping and no-mapping variants. Includes a structural test for the spin branch and a guard test that use_loc_mapping=True + comm_dict raises RuntimeError.
* test_border_op_backward.py: direct unit tests for torch.ops.deepmd.border_op_backward (float32 + float64) and the autograd path through deepmd_export::border_op.
* test_export_with_comm.py: Phase 3 round-trip for the dual-artifact .pt2 layout. GNN models produce both regular and forward_lower_with_comm artifacts; both load via aoti_load_package; outputs match for self-exchange. Plus three coverage tests for previously untested branches: the zero-nghost clamp in _make_comm_sample_inputs, hybrid-with-GNN detection in _has_message_passing, and a .pte with-comm trace round-trip.
* test_spin_export_with_comm.py: spin model trace machinery (smoke test on se_e2_a) and end-to-end eager value parity for spin DPA3 models running through SpinModel.call_common_lower with comm_dict.
Without TORCH_LIBRARIES on the test binary, the
``__has_include(<torch/csrc/inductor/aoti_package/model_package_loader.h>)``
check in DeepPotPTExpt.h evaluates to false, the test files compile
with BUILD_PT_EXPT=0, and every pt_expt test case silently hits
GTEST_SKIP("PyTorch support is not enabled"). The bug was masked by
ctest reporting a green run, with all skips counted as success.
Adding ``target_link_libraries(runUnitTests_cc "${TORCH_LIBRARIES}")``
under the existing ``ENABLE_PYTORCH`` branch makes the AOTI header
visible to the test compilation. After this fix, the 148 pt_expt
tests actually run instead of being silently skipped.
Phase 4 of the GNN MPI plumbing. When a .pt2 archive carries a
nested forward_lower_with_comm.pt2 (added by Phase 3 for GNN models),
the C++ inference path now optionally extracts and loads it as a
second AOTInductor module. Each compute() call dispatches between
the regular and with-comm artifacts based on lmp_list.nswap: LAMMPS
sets nswap=0 in single-rank mode and >0 in multi-rank, so single-rank
inference keeps using the regular artifact (mapping-tensor gather)
and multi-rank routes to the with-comm artifact (MPI ghost exchange).
Three additions:
1. commonPTExpt.h adds:
- TempFile RAII handle for the extracted nested artifact (mkstemp,
unlinked at destruction).
- TempFile::from_zip_entry reads a ZIP entry from the outer .pt2
and writes it to a temp file (atomic, 0600).
- build_comm_tensors_positional packs the 8 comm tensors in
canonical positional order (send_list, send_proc, recv_proc,
send_num, recv_num, communicator, nlocal, nghost) for the
with-comm AOTI module input vector (see the sketch after this list).
2. DeepPotPTExpt:
- Reads has_comm_artifact from metadata.json (defaults false for
old .pt2 files lacking the field).
- When true, extracts extra/forward_lower_with_comm.pt2 to a
TempFile and loads it as with_comm_loader.
- run_model_with_comm appends the 8 comm tensors to the base
inputs and dispatches to with_comm_loader->run.
- compute() chooses regular vs with-comm based on nswap.
3. DeepSpinPTExpt:
- Same pattern; the Phase 3 export injects has_spin=1 into the
traced graph comm_dict, so the C++ side passes the same 8 comm
tensors as the non-spin case. nlocal/nghost carry the real-atom
counts (the spin override halves them internally to get the
atom-doubled counts).
All 148 existing pt_expt C++ tests pass; the with-comm path is
gated behind nswap > 0, so single-rank tests don't exercise it (that
coverage comes with the Phase 5 multi-rank LAMMPS test).
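For reference, the canonical positional order and the dict reconstruction that the traced with-comm entry point performs can be sketched as follows (a hypothetical reduction; the real code lives in `forward_common_lower_exportable_with_comm`):

```python
import torch

COMM_ARG_ORDER = (
    "send_list", "send_proc", "recv_proc", "send_num",
    "recv_num", "communicator", "nlocal", "nghost",
)

def rebuild_comm_dict(*comm_tensors, has_spin=False):
    # AOTI inputs are a flat positional tensor list, so the dict the
    # descriptor override consumes is rebuilt inside the traced function.
    comm_dict = dict(zip(COMM_ARG_ORDER, comm_tensors, strict=True))
    if has_spin:
        # The spin variant bakes the flag in so the override takes its branch.
        comm_dict["has_spin"] = torch.ones(1, dtype=torch.bool)
    return comm_dict

tensors = tuple(torch.zeros(1, dtype=torch.int32) for _ in COMM_ARG_ORDER)
cd = rebuild_comm_dict(*tensors, has_spin=True)
```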
Phase 5 — final integration after Phases 1-4 land the dpmodel plumbing, opaque op wrappers, two-mode AOTI export, and C++ dispatch. Three pieces had to fall into place to make multi-rank LAMMPS actually run a GNN .pt2:
1. Move the deepmd_export op schema declarations to C++. torch.library.custom_op only registers the op in the Python process, but a LAMMPS run loads the .pt2 in pure C++ (no Python interpreter). Add TORCH_LIBRARY_FRAGMENT(deepmd_export, m) + TORCH_LIBRARY_IMPL blocks under explicit CPU/CUDA dispatch keys in source/op/pt/comm.cc; the C++ impls clone the underlying deepmd::* op outputs to satisfy AOTI no-aliasing. Python comm.py now layers register_fake + register_autograd on top of the C++-defined ops instead of defining new ones.
2. Call deepmd::load_op_library at DeepPot/SpinPTExpt init so libdeepmd_op_pt.so loads before AOTIModelPackageLoader; the LAMMPS plugin doesn't pre-load it. Without this, a multi-rank GNN .pt2 aborts at pair_style time with a missing-schema error.
3. Gate dual-artifact production on use_loc_mapping=False. _has_message_passing now walks into the GNN block to inspect use_loc_mapping; if True, only the regular artifact is produced (the override would raise on parallel mode anyway). gen_dpa3.py produces a second deeppot_dpa3_mpi.pt2 with use_loc_mapping=False so the new mpirun test has a real dual-artifact .pt2 to load.
Plus the multi-rank test itself:
- run_mpi_pair_deepmd_dpa3_pt2.py: subprocess driver. Uses PyLammps + processors 2 1 1 so nswap > 0 on every rank, forcing the C++ side to dispatch to the with-comm artifact. Forces are gathered via lammps.lmp.gather_atoms (rank-local atoms[i] doesn't see other ranks); pe via lammps.eval on rank 0.
- test_pair_deepmd_mpi_dpa3 in test_lammps_dpa3_pt2.py: invokes the driver under mpirun -n 2 and asserts energy + per-atom forces match the single-rank reference within atol=1e-8.
Also: register_fake for the backward op too. Without it, make_fx tracing autograd.grad inside forward_common_lower_exportable hits the same FakeTensor data_ptr error we solved for forward in Phase 0.
All 31 pt_expt LAMMPS tests pass.
…gaps
Three fixes targeting the limitations from the previous Phase 5 commit:
1. NULL-type atoms (the build_comm_dict_with_virtual_atoms equivalent). When ``select_real_atoms_coord`` filters atoms with atype < 0, the LAMMPS-supplied sendlist still indexes the original atom array. ``DeepPotPTExpt::compute`` (and Spin) now check ``has_null_atoms = (nall_real < nall)`` and route to the new ``build_comm_tensors_positional_with_virtual_atoms`` helper in commonPTExpt.h, which calls ``remap_comm_sendlist`` to translate indices through ``fwd_map`` (mirrors what ``commonPT.h::build_comm_dict_with_virtual_atoms`` does for the torch.jit pt-backend). Untested numerically (no test fixture produces NULL-type atoms in multi-rank); the code path is structurally identical to the validated pt-backend equivalent.
2. nlist-rebuild test (test_pair_deepmd_mpi_dpa3_nlist_rebuild). Runs 50 MD steps under mpirun -n 2 with neigh_modify every=10, forcing >=5 neighbor-list rebuilds. Validates that the with-comm dispatch path stays consistent across rebuilds (the comm tensors are reconstructed when ``ago == 0`` triggers). Asserts forces stay finite and bounded; no exact-value comparison since round-off accumulates over the trajectory and cross-rank ordering can shift the LSBs.
3. Spin multi-rank dispatch wiring (DeepSpinPTExpt::compute). The same has_null_atoms branch as DeepPotPTExpt. The code path is structurally identical to the validated DeepPotPTExpt path; no spin-specific multi-rank test yet (it would need a spin DPA3 .pt2 with use_loc_mapping=False to exercise it end-to-end).
Note: a virial check via LAMMPS compute pressure NULL virial caused a PyLammps multi-rank deadlock; deferred to a follow-up. Forces ARE the autograd output of energy through the with-comm graph, so force parity already validates the with-comm backward path.
All 26 pt_expt LAMMPS tests pass (including the new multi-rank ones); 9 model_devi_pt2 tests confirm DeepPotModelDevi delegates correctly through the dispatch.
- run_mpi_pair_deepmd_dpa3_pt2.py: gather atom ids alongside forces and sort by id explicitly. Output ordering is now robust to subdomain layout, empty subdomains, or future LAMMPS gather_atoms changes. Add atom_modify map yes so single-rank dispatch on the dual-artifact .pt2 (which uses the mapping) works; expose --processors so the runner can produce a same-archive single-rank reference.
- test_pair_deepmd_mpi_dpa3_nlist_rebuild: replace the finite/bounded sanity check with a value comparison against a single-rank reference of the same trajectory (mpirun -n 1, processors "1 1 1"). 25 MD steps cross two nlist rebuilds, atol=1e-6 forces / rel=1e-8 energy. This catches a wrong-but-finite force from a dispatch bug that the previous assertion would have missed.
…with virial
- common.cc: NeighborListData::copy_from_nlist used &ilist[0] /
&jlist[ii][0] for the memcpy destination, which is OOB on an empty
vector (libstdc++ debug-mode assertion) and undefined behaviour in
general. Switch to .data() and skip the copy when the count is zero.
Surfaced by the new empty-subdomain MPI test where rank 1 owns
nloc=0 atoms; the same latent bug also applied to atoms with no
neighbours.
- run_mpi_pair_deepmd_dpa3_pt2.py: also gather the per-atom virial
via ``compute centroid/stress/atom NULL pair`` and
``lmp.gather("c_virial", 1, 9)``. Output rows are now (3 force) +
(9 virial) per atom, id-sorted.
- test_lammps_dpa3_pt2.py:
* test_pair_deepmd_mpi_dpa3 now asserts virial against expected_v
(with the same column permutation as test_pair_deepmd_virial),
closing the previous "virial multi-rank deferred" gap.
* test_pair_deepmd_mpi_dpa3_nlist_rebuild now also compares virial
between the multi-rank and single-rank reference runs.
* New test_pair_deepmd_mpi_dpa3_empty_subdomain: 30 x 13 x 13 box
with all atoms in x in [0.25, 12.83]; under processors "2 1 1"
rank 1 owns zero local atoms. Compares forces + virial + energy
against a same-archive single-rank reference.
Brings in PR deepmodeling#5407 (perf(pt2): optimize .pt2 C++ inference path) and unrelated upstream changes. Conflict resolution:
- source/api_cc/src/common.cc: kept the .data() form (strictly safer than upstream's &vec[0]) for the empty-subdomain / zero-neighbour guards added on this branch. Both branches added the same guard concept.
- source/api_cc/src/commonPTExpt.h: deleted the obsolete buildTypeSortedNlist (PR 5407 replaced it with createNlistTensor in commonPT.h, which the C++ side now uses). Kept this branch's TempFile RAII + build_comm_tensors_positional[_with_virtual_atoms] for the Phase 4 with-comm dispatch.
- source/api_cc/tests/CMakeLists.txt: kept the TORCH_LIBRARIES link with the explanatory comment from this branch (BUILD_PT_EXPT detection requires torch headers in the test binary).
- deepmd/pt_expt/utils/serialization.py: combined both signatures. _trace_and_export now takes both with_comm_dict (this branch) and do_atomic_virial (PR 5407). _build_dynamic_shapes accepts both with_comm_dict and model_nnei. Apply PR 5407's CUDA realize_opcount_threshold=0 workaround around BOTH artifact compiles. Switch the .pte path's _trace_and_export call to kwargs so the new with_comm_dict positional doesn't shift the bind.
- deepmd/pt_expt/model/{make_model,spin_model}.py: align forward_common_lower_exportable_with_comm with PR 5407's regular variant: apply _pad_nlist_for_export and need_sorted_nlist_for_lower=True so the with-comm trace keeps its nnei axis dynamic. Without these, the with-comm trace generated an "nnei <= sum(sel)" upper-bound guard that conflicted with PR 5407's "nnei >= sum(sel)" lower bound.
- source/tests/infer/gen_dpa3.py: the regular pt2 keeps PR 5407's do_atomic_virial=True; the multi-rank pt2 also gets do_atomic_virial=True so the new MPI virial test has reference values to compare against.
Verified post-merge:
- All 33 LAMMPS PT2 tests pass (including 9 DPA3 tests with 3 MPI tests).
- All 3 ctest targets (lib + cc + c) pass.
- The source/tests/pt_expt/ pytest suite passes.
- source/lmp/tests/test_lammps_dpa2_pt2.py (NEW): runs DPA2 .pt2 under
mpirun -n 2 with the with-comm artifact and asserts pe + per-atom
forces + per-atom virial match a same-archive single-rank reference.
Closes the recorded gap "DPA2 multi-rank dispatch never exercised
end-to-end" (gnn_mpi_untested_paths.md). The runner script
(run_mpi_pair_deepmd_dpa3_pt2.py) is descriptor-agnostic so no new
driver is needed.
- source/tests/infer/gen_dpa2.py: drop dead config_mpi block accidentally
added during planning. DPA2's repformer has no use_loc_mapping knob
(unlike DPA3), so the single deeppot_dpa2.pt2 already carries the
dual-artifact layout — _has_message_passing returns True for any
DPA2 model.
- source/tests/pt_expt/conftest.py: ``import deepmd.pt`` at conftest
evaluation time so libdeepmd_op_pt.so is loaded and
``deepmd_export::{border_op, border_op_backward}`` are registered
before any pt_expt test module imports ``deepmd.pt_expt.utils``
(which transitively imports ``comm.py`` and its
``_check_underlying_ops_loaded()`` runtime check). Previously this
worked only when the test was collected alongside earlier modules
that happened to import deepmd.pt first; running the spin/export
tests in isolation crashed at collection.
- source/tests/pt_expt/model/test_spin_export_with_comm.py: fix
pre-existing test data bug — model has sel=[20,20,20] (sum=60) but
the trace test was passing nlist with width 6, tripping the
_format_nlist post-condition assertion. Now uses the correct
sum(sel) width. Surfaced once the conftest fix above made the test
reliably runnable in isolation.
…it test
Closes two gaps from the GNN-MPI untested-paths catalog:
- ``test_pair_deepmd_mpi_dpa3_decomposition`` (parametrized): runs the DPA3 .pt2 under three additional processor grids: ``4@2x2x1`` (2D), ``4@4x1x1`` (1D-deep chain), and ``8@2x2x2`` (3D). All three must match the gen_dpa3.py reference for energy / per-atom force / per-atom virial within atol=1e-8. The 2x2x2 split leaves several subdomains empty, so this also exercises the ``copy_from_nlist`` empty-subdomain guard in a 3D layout.
- ``source/tests/pt_expt/utils/test_has_message_passing.py``: pins ``_has_message_passing`` against schema drift. The detection chain (``model.atomic_model.descriptor`` -> ``descriptor.has_message_passing()`` -> ``block.use_loc_mapping``) is brittle to attribute renames in the dpmodel descriptor layer; a silent regression would disable the with-comm artifact and break multi-rank LAMMPS for GNN users with no test failure to flag it. The test asserts the documented value for 5 baseline configs (se_e2_a, dpa1, dpa3 use_loc_mapping=True/False, dpa2) plus two stub-model defensive cases.
The runner helper ``_run_mpi_subprocess`` gains an optional ``processors`` arg so the new parametrized test can dictate the LAMMPS ``processors`` grid; existing tests keep their previous defaults (``2 1 1`` for nprocs=2, ``1 1 1`` for nprocs=1).
Closes the recorded gap "NULL-type atoms under mpirun" — until now
``build_comm_tensors_positional_with_virtual_atoms`` and the
``fwd_map``-based comm-tensor remap had never been exercised in a
multi-rank LAMMPS run despite being reachable any time a user runs
a model on a system with atom types outside its ``type_map``.
Fixture (``data_dpa3_pt2_null_type.lmp``): the canonical 6 real
atoms (types 1, 2) plus 2 LAMMPS type-3 atoms placed at (5.5, 6, 6)
and (7.5, 7, 7) — straddling the x=6.5 rank boundary under
``processors 2 1 1`` and within rcut (=6) of multiple real atoms.
The pair_coeff ``* * O H NULL`` maps LAMMPS type 3 to deepmd
atype=-1, so ``select_real_atoms_coord`` filters them and
``DeepPotPTExpt::compute`` takes the
``build_comm_tensors_positional_with_virtual_atoms`` branch.
The NULL atoms appear in cross-rank sendlists because both sit in
the boundary's rcut window, so the remap must:
- drop the -1 fwd_map slots from each swap's sendlist;
- decrement sendnum/recvnum by the number dropped;
- translate surviving indices into real-atom space.
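A sketch of that remap in Python (the real ``remap_comm_sendlist`` is C++ in commonPTExpt.h; the names and per-swap layout here are simplified assumptions):

```python
import numpy as np

def remap_sendlist(sendlist, sendnum, fwd_map):
    """Hypothetical sketch of the fwd_map-based sendlist remap."""
    new_sendlist, new_sendnum = [], []
    for swap_idx, n in zip(sendlist, sendnum):
        mapped = fwd_map[swap_idx[:n]]   # -1 marks NULL (filtered) atoms
        kept = mapped[mapped >= 0]       # drop the NULL slots...
        new_sendlist.append(kept)        # ...indices now in real-atom space
        new_sendnum.append(len(kept))    # sendnum shrinks by the dropped count
    # recvnum shrinks symmetrically on the receiving side of each swap.
    return new_sendlist, np.asarray(new_sendnum, dtype=np.int32)

fwd_map = np.array([0, 1, -1, 2], dtype=np.int64)       # atom 2 is NULL
lists, nums = remap_sendlist([np.array([0, 2, 3])], [3], fwd_map)
assert lists[0].tolist() == [0, 2] and nums[0] == 2
```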
Test asserts:
- forces on the 6 real atoms match the no-NULL baseline
``expected_f`` (atol 1e-8);
- NULL atom forces are zero (atol 1e-12) — deepmd is the only
pair_style and skips them;
- total potential energy matches ``expected_e``;
- per-atom virial on real atoms matches ``expected_v``.
Runner script (``run_mpi_pair_deepmd_dpa3_pt2.py``) gains two
optional flags: ``--pair-coeff`` (override the default ``"* *"``)
and ``--mass3`` (mass for a third LAMMPS atom type). Existing tests
keep their previous defaults unchanged.
The ``_run_mpi_subprocess`` helper gains a ``runner_args`` kwarg
to forward arbitrary flags to the runner; existing call sites are
unaffected.
…ist-rebuild)
Closes three more entries from the GNN-MPI untested-paths catalog, all variations on the multi-rank NULL-type filter path:
- ``test_pair_deepmd_mpi_dpa3_null_isolated``: a large box (30 x 13 x 13) puts a NULL atom at x=7.5, in rank 0's subdomain interior. With rcut=6 the boundary rcut-windows on rank 0 are x in [0, 6] (PBC of the right wall via x=30) and [9, 15] (rank 1's left wall); atoms in (6, 9) are local but never appear in any sendlist. Exercises ``has_null_atoms == True`` with a no-op remap (the sendlists contain no NULL entries to drop), complementary to ``test_pair_deepmd_mpi_dpa3_null_type``, which exercises the remap-with-NULLs case.
- ``test_pair_deepmd_mpi_dpa3_all_null_rank``: rank 1 owns ONLY NULL atoms (the intersection of the empty-subdomain and NULL-type paths). After ``select_real_atoms_coord`` rank 1 has nloc_real=0, so the ``copy_from_nlist`` empty-subdomain guard must fire AND the ``_with_virtual_atoms`` remap must handle a sendlist whose entire local section was NULL.
- ``test_pair_deepmd_mpi_dpa3_null_type_nlist_rebuild``: rebuilds the nlist 3 times in 3 MD steps using ``neigh_modify every 1``. NULL atoms drift across the boundary so the sendlist composition changes per rebuild, validating that the remap re-runs correctly on every ago=0 trigger and stays consistent with the cached ``mapping_tensor`` / ``firstneigh_tensor`` in ``DeepPotPTExpt::compute``.
Also speeds up ``test_pair_deepmd_mpi_dpa3_nlist_rebuild`` (the existing non-NULL test) by switching from ``every 10`` + 25 steps to ``every 1`` + 3 steps: the same 3 rebuilds at ~1/3 the wall time. The runner script gains a ``--neigh-every`` flag (default 10). All three new tests compare mpirun -n 2 against an mpirun -n 1 reference on the same fixture.
Two related changes that strengthen the NULL-type rebuild test and trim the decomposition variant set:
- ``test_pair_deepmd_mpi_dpa3_null_type_nlist_rebuild`` now sets a high initial velocity (v_x = 2000 A/ps) on LAMMPS type-3 atoms via the runner's new ``--null-vx`` flag and a per-type ``velocity`` command. With timestep 0.0005 ps each NULL atom moves 1.0 A per step, enough to physically cross the x=6.5 rank boundary in step 1 (NULL @ 5.5 -> 6.5 -> 7.5 -> 8.5). NULL atoms therefore migrate ranks across rebuilds, exercising the case where a NULL's fwd_map index moves between the local section and the ghost section of per-rank sendlists. Real atoms keep v=0 so their dynamics are stable; the deepmd model never sees NULL atoms (filtered by ``select_real_atoms_coord``), so the unphysical NULL velocity is harmless. mpi-2 matches the mpi-1 reference within atol=1e-6 / rel=1e-8.
- ``test_pair_deepmd_mpi_dpa3_decomposition``: drop the ``[4-2 2 1]`` variant. Its 2D coverage is fully subsumed by ``[8-2 2 2]`` (which is 3D, so 2D face exchange is a strict subset). The two remaining variants, ``[4-4 1 1]`` for 1D-deep sendlist chains and ``[8-2 2 2]`` for 3D borders, are complementary and not subsumable. Saves ~5.5s of suite wall time.
The runner script gains a ``--null-vx`` flag (a no-op when not passed, so existing tests are unaffected).
Strengthens ``test_pair_deepmd_mpi_dpa3_null_type_nlist_rebuild`` so the rebuilds see non-trivial sendlist composition changes:
- NULL atoms now move in OPPOSITE directions via the new ``--null-vx-split`` flag. NULL id=7 (at x=5.5) gets v_x=-2000 A/ps and drifts left (and via PBC into rank 1's far ghost region). NULL id=8 (at x=7.5) gets v_x=+2000 A/ps and drifts right (deeper into rank 1's domain). The +/- split means each rebuild sees one NULL entering rank 0's sendlist while the other leaves it.
- Real atoms get thermal velocities at T=10000 K via the new ``--real-temp`` flag (LAMMPS ``velocity realgroup create T seed``). Each real atom gets a different random direction, so the sendlist composition is also perturbed by real-atom motion (small but detectable under ``every 1`` rebuilds).
NULL atoms still don't contribute to the deepmd model (filtered by ``select_real_atoms_coord``), so their unphysical velocity is harmless. Real-atom thermal motion at T=10000 K corresponds to an RMS speed of ~3-9 A/ps (mass-weighted), i.e. ~0.005-0.015 A of motion per step; small enough that NVE stays stable but enough to perturb the sendlists. Both new flags are no-ops when not passed; existing tests are unaffected.
Extends ``test_pair_deepmd_mpi_dpa3_empty_subdomain`` to run 5 MD
steps with ``neigh_modify every 100`` instead of a single
``lammps.run(0)``. This forces:
step 0 -> ago=0 (full rebuild; mapping_tensor + firstneigh_tensor
populated for the first time on the empty-subdomain rank)
step 1 -> ago=1 (cache HIT — mapping_tensor and firstneigh_tensor
reused)
step 2 -> ago=2 (cache hit)
step 3 -> ago=3 (cache hit)
step 4 -> ago=4 (cache hit)
step 5 -> ago=5 (cache hit)
Closes the catalog gap "Empty subdomain under PR 5407's
mapping_tensor cache". Previously only step 0 was tested, which is
always ago=0; the cache-hit branch in DeepPotPTExpt::compute on a
rank with nloc=0 was unexercised.
Compares mpi-2 vs mpi-1 trajectory with the same atol=1e-6 / rel=1e-8
tolerances as the other rebuild tests.
Adds the smallest reachable test that exercises the full spin GNN multi-rank dispatch chain (Tier-1 #2 in gnn_mpi_untested_paths).
- gen_spin.py: also produce deeppot_dpa3_spin_mpi.pt2 (DPA3 + use_loc_mapping=False + use_spin=[True, False]) so a dual-artifact spin GNN .pt2 exists for testing.
- run_mpi_pair_deepmd_spin_dpa3_pt2.py: MPI runner that drives the spin pair_style and gathers force / force_mag / virial across ranks. fm goes via 'compute property/atom fmx fmy fmz' since the legacy extract/gather_atoms registry doesn't expose 'fm'.
- test_lammps_spin_dpa3_pt2.py: mpirun -n 2 vs a same-archive mpirun -n 1 reference for energy / force / force_mag / virial (atol 1e-8). A divergence is necessarily a problem in DeepSpinPTExpt multi-rank dispatch, the spin branch of _exchange_ghosts, the C++ deepmd_export::border_op invocation, or the comm-tensor builder.
- _build_dynamic_shapes: bump the nall_dim min from 1 to 4 when has_spin. Without this, torch.export raises CONSTRAINT_VIOLATION on the pre-doubling nall axis when tracing GNN spin with with_comm_dict (the suggested fix in the error matches min=4).
Eager parity (test_spin_dpa3_eager_parity) and trace-only validation already existed; this PR closes the gap by adding AOTI compile + LAMMPS load + real MPI exchange.
Known limitations:
- Single configuration tested (4 atoms, 2 ranks, type_map ["Ni", "O"], use_spin=[True, False]). No NULL-type, empty-subdomain, or nlist-rebuild variants for spin yet; the non-spin DPA3 path covers those code branches, and the spin override differs only in the real/virtual split, which the new test exercises.
- do_atomic_virial=True only (matches all current multi-rank tests; Tier-1 #3 still open).
- N=2 only; no decomposition/N>2 spin variant.
- CPU only.
Resolves the two spin-specific gaps left open by the previous commit:
- test_pair_deepmd_mpi_dpa3_spin_empty_subdomain: an elongated 30 A box + processors '2 1 1' leaves rank 1 with nloc=0. Exercises the copy_from_nlist empty-rank guard for the spin path (the with-comm artifact still runs on rank 1 with nloc_real=0).
- test_pair_deepmd_mpi_dpa3_spin_null_type: 2 NULL (LAMMPS type-3, deepmd atype=-1) atoms straddle the x=6.5 rank boundary, within rcut of real atoms on both sides. Goes through DeepSpinPTExpt with nall_real < nall, triggering the has_null_atoms branch that calls build_comm_tensors_positional_with_virtual_atoms (the fwd_map-based sendlist remap) for spin. Asserts NULL atoms get zero forces from the deepmd model and real-atom values match the mpi-1 reference.
Both compare mpi-2 vs a same-archive mpi-1 (atol 1e-8), so any divergence is necessarily in the multi-rank dispatch, not in tracing precision. The runner is generalised with --pair-coeff and --mass3 flags (mirroring the non-spin DPA3 runner).
gen_spin.py rebuilds the yaml from the deterministic config + seed in _build_dpa3_mpi_yaml when missing, and the multi-rank test compares mpi-2 vs same-archive mpi-1 (no hardcoded numerical references depend on these weights). Unlike deeppot_dpa_spin.yaml -- whose committed weights are pinned because C++ tests hardcode reference values against them -- nothing requires deeppot_dpa3_spin_mpi.yaml to be checkpointed in git.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 695b7368a8
    # Register opaque deepmd_export::border_op wrapper (used by GNN MPI
    # parallel inference; see comm.py module docstring).
    # Register fake tensor implementations for custom tabulate ops
    from deepmd.pt_expt.utils import comm  # noqa: F401
Defer comm registration until op library is loaded
Importing deepmd.pt_expt.utils now unconditionally imports comm, which immediately executes _check_underlying_ops_loaded() and raises if deepmd_export::border_op is not already registered. In practice this makes pt_expt import order-sensitive: any code path that imports pt_expt modules before deepmd.pt (including normal pt_expt entrypoints) can fail at import time even when not using multi-rank GNN comm ops. This is a regression from prior lazy/optional registration behavior and can break startup for single-rank workflows.
    // source/op/pt/comm.cc — diff excerpts (elided hunks marked "...")
    ...                          nghost_tensor);
    }

    TORCH_LIBRARY_FRAGMENT(deepmd, m) {
      ...
    }
    // ============================================================================
    namespace {
    torch::Tensor border_op_export(const torch::Tensor& sendlist_tensor,
                                   ...) {
      ...
      return out.empty() ? torch::empty_like(g1_tensor) : out[0].clone();
    }

    torch::Tensor border_op_backward_export(...) {
      ...
    }
    }  // namespace

    TORCH_LIBRARY_FRAGMENT(deepmd_export, m) {
      ...
    }
    // Register CPU + CUDA implementations under explicit dispatch keys so
    // torch.export sees opaque external calls (vs CompositeImplicitAutograd
    // which gets decomposed during trace).
    TORCH_LIBRARY_IMPL(deepmd_export, CPU, m) {
      ...
      m.impl("border_op_backward", border_op_backward_export);
    }
    #if defined(GOOGLE_CUDA) || defined(TENSORFLOW_USE_ROCM)
    TORCH_LIBRARY_IMPL(deepmd_export, CUDA, m) {
      ...
The dpmodel layer threads a new ``comm_dict=None`` kwarg through
``forward_common_atomic`` (model and atomic-model levels) so the
pt_expt backend can wire MPI ghost-atom exchange for GNN multi-rank
LAMMPS. The JAX backend overrides ``forward_common_atomic`` with
explicit kwarg lists; without accepting ``comm_dict``, ``dp
convert-backend ... savedmodel`` fails at trace time:
TypeError: jax_model.forward_common_atomic() got an unexpected
keyword argument 'comm_dict'
Affected the entire CI matrix on PR deepmodeling#5430 (every Python shard goes
through the savedmodel build prep). Fix: add ``comm_dict: dict |
None = None`` to each JAX override and ``del comm_dict`` (the JAX
path has no MPI ghost exchange).
Files touched: dp_atomic_model, linear_atomic_model,
pairtab_atomic_model (atomic-model level), plus base_model,
dp_model, dp_zbl_model (model level). Paddle's
forward_common_atomic already accepts comm_dict and needs no change.
DDP-spawned worker subprocesses re-import modules from scratch and
never run the test conftest's ``import deepmd.pt``, so when
``pt_expt.utils.comm`` is imported the underlying
``deepmd_export::{border_op,border_op_backward}`` ops are not yet
registered and the import-time guard raises:
RuntimeError: torch.ops.deepmd_export.{border_op,border_op_backward}
are not registered. Build libdeepmd_op_pt.so and ensure deepmd.pt
is imported before this module.
Repro: test_training_ddp.py::TestDDPRestart::test_ddp_restart on every
Python CI shard.
Fix: ``_check_underlying_ops_loaded`` now triggers ``import deepmd.pt``
as a side effect when the ops aren't yet registered. ``deepmd/pt/cxx_op.py``
loads ``libdeepmd_op_pt.so`` which registers the schemas. The original
RuntimeError stays as a fallback if ``import deepmd.pt`` itself fails.
Verified locally: importing ``deepmd.pt_expt.utils.comm`` in a fresh
process (without explicit ``import deepmd.pt`` first) now succeeds and
``torch.ops.deepmd_export.border_op`` is available.
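A sketch of the resulting guard, assuming the simplified name `check_underlying_ops_loaded` (the real function is private to `deepmd/pt_expt/utils/comm.py`):

```python
import torch

def check_underlying_ops_loaded():
    # Try to self-bootstrap the schemas by importing deepmd.pt (which
    # loads libdeepmd_op_pt.so), keeping the explicit error as a fallback.
    if hasattr(torch.ops.deepmd_export, "border_op"):
        return
    try:
        import deepmd.pt  # noqa: F401
    except Exception:
        pass  # fall through to the clearer error below
    if not hasattr(torch.ops.deepmd_export, "border_op"):
        raise RuntimeError(
            "torch.ops.deepmd_export.border_op is not registered. "
            "Build libdeepmd_op_pt.so and ensure deepmd.pt is importable."
        )
```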
🧹 Nitpick comments (1)
deepmd/pt_expt/utils/comm.py (1)
59-65 (💤 Low value): Consider logging the swallowed exception for debuggability.
The intent to fall through to a clearer RuntimeError is sound, but completely discarding the original exception loses diagnostic information when deepmd.pt fails to import for non-obvious reasons.
♻️ Optional: log the exception before passing:
    +import logging
    +
    +log = logging.getLogger(__name__)
    +
     ...
         try:
             import deepmd.pt  # noqa: F401
    -    except Exception:
    +    except Exception as e:
             # If deepmd.pt itself fails to import, fall through to the
             # explicit RuntimeError below — clearer than re-raising a
             # potentially-unrelated import error.
    +        log.debug("Failed to import deepmd.pt: %s", e)
             pass
After 87c9f3f ``deepmd.pt_expt.utils.comm`` self-bootstraps ``libdeepmd_op_pt.so`` via ``_check_underlying_ops_loaded()``, so the explicit ``import deepmd.pt`` preloads in conftest.py and test_border_op_backward.py are no longer needed. Closes 2 of the 13 GitHub Advanced Security CodeQL "unused import" alerts on the PR. The remaining 5 Python alerts (other tests' ``import deepmd.pt_expt.utils.comm`` for opaque-op registration) and 6 C++ alerts (TORCH_LIBRARY_* / border_op_export reachable only through macro-expanded static initialization) are CodeQL false positives that need to be dismissed in the GitHub Security UI rather than fixed in source.
Applies the substantive coderabbitai suggestions from the PR review.
Defensive guards (no behavioral change for existing callers):
- dpmodel/descriptor/{repflows,repformers}.py: raise ValueError when
the default `_exchange_ghosts` is hit with `mapping_tiled=None` and
`use_loc_mapping=False` instead of returning a cryptic
array-backend error.
- pt_expt/descriptor/{repflows,repformers}.py: refuse `comm_dict` path
when `nf != 1`. The squeeze(0)/unsqueeze(0) dance only works for a
single frame; failing here surfaces the unsupported case loudly
instead of producing a malformed border_op tensor.
Init robustness:
- api_cc/src/{DeepPotPTExpt,DeepSpinPTExpt}.cc: wrap the optional
with-comm artifact load in try/catch. If `has_comm_artifact` is set
in metadata but the nested artifact fails to extract or compile,
log and fall back to single-rank-only dispatch instead of aborting
init -- the hard error then surfaces only when multi-rank actually
needs the missing artifact.
Code hygiene:
- dpmodel/descriptor/hybrid.py: rename unused unpacks (`g2/h2/sw` ->
`_g2/_h2/_sw`) for ruff RUF059 cleanliness.
- tests/infer/gen_dpa3.py: deepcopy `config_mpi` before passing to
`get_model()` so `data_mpi["model_def_script"]` retains the
intended MPI export config even if the call mutates its argument.
- tests/pt_expt/model/test_export_with_comm.py: mirror the zero-ghost
clamp from `serialization.py::_make_comm_sample_inputs` in the test
helper, so no zero-length sendlist pointer is ever materialised.
Also update `extra/...` -> `model/extra/...` archive paths to match
PT2_EXTRA_PREFIX after the upstream/master merge.
Verified locally: pt_expt python (24/24), ctest (3/3, 498 tests
including 198 PtExpt), LAMMPS multi-rank GNN (19/19) all green.
Codecov Report
❌ Patch coverage is
Additional details and impacted files:

    @@            Coverage Diff             @@
    ##           master    #5430      +/-   ##
    ==========================================
    + Coverage   82.47%   82.49%   +0.02%
    ==========================================
      Files         825      828       +3
      Lines       87721    88117     +396
      Branches     4206     4229      +23
    ==========================================
    + Hits        72344    72693     +349
    - Misses      14093    14140      +47
      Partials     1284     1284

☔ View full report in Codecov by Sentry.
Removes the private-attribute fishing in ``deepmd/pt_expt/utils/serialization.py`` (which read ``descriptor.repflows.use_loc_mapping`` and friends) and replaces it with a public method on ``BaseDescriptor``: ``has_message_passing_across_ranks()``.
Why
---
The old helper conflated two questions:
1. "Is this a GNN-style descriptor?" (the existing ``has_message_passing()``)
2. "Do per-layer node embeddings need MPI exchange across rank boundaries to be correct under multi-rank LAMMPS?"
Only #2 governs whether to compile a with-comm AOTI artifact. The old function answered #2 by special-casing the ``repflows``/``repformers`` attribute names and the ``use_loc_mapping`` flag: silent breakage on any rename, and it never recursed into hybrid wrappers (Tier-1 #1 in the gnn_mpi_untested_paths catalog).
Note: every LAMMPS pair_style already exchanges ghost-atom *coords and forces* via the standard pair-style comm topology; that's not GNN-specific. The new method asks specifically about per-layer atomic feature exchange (the ``node_ebd`` tensor that flows between message-passing layers), which is the actual concern that gates the with-comm artifact.
How
---
``BaseDescriptor.has_message_passing_across_ranks()`` returns ``False`` by default. GNN paths override:
- ``DescrptBlockRepflows``: ``not self.use_loc_mapping``
- ``DescrptBlockRepformers``: ``True`` (no ``use_loc_mapping`` opt-out exists)
- ``DescrptDPA3`` / ``DescrptDPA2``: delegate to their block
- ``DescrptHybrid``: ``any(child.has_message_passing_across_ranks() ...)`` (closes the structural side of catalog Tier-1 #1)
Non-GNN dpmodel descriptors (``se_e2_a``, ``se_r``, ``se_t``, ``se_t_tebd``, ``dpa1``) get explicit ``return False`` overrides pinning the contract; pt and pd backend descriptors inherit the default (no edits needed there).
The serialization helper ``_has_message_passing`` is renamed to ``_needs_with_comm_artifact`` and just calls ``descriptor.has_message_passing_across_ranks()``. The metadata key ``has_message_passing`` is dropped from the .pt2 archive (C++ readers only consume ``has_comm_artifact``).
Per-descriptor tests
--------------------
The standalone ``source/tests/pt_expt/utils/test_has_message_passing.py`` is deleted; per-descriptor coverage of *both* APIs is added to the existing descriptor test files at ``source/tests/pt_expt/descriptor/``:

| File    | has_message_passing | has_message_passing_across_ranks |
|---------|---------------------|----------------------------------|
| se_e2_a | False               | False                            |
| dpa1    | False               | False                            |
| dpa3    | True                | not use_loc_mapping              |
| dpa2    | True                | True                             |
| hybrid  | depends on child    | True if any child needs it       |

Bonus: also includes a CUDA segfault fix
----------------------------------------
While running the post-refactor verification, the CUDA-runner CI exposed a latent bug in ``source/op/pt/comm.cc`` (forward + backward kernels): when built with ``USE_MPI`` but invoked single-rank (world_size==0), ``cuda_aware`` defaults to 0 and the CPU-fallback ``recv_g1_tensor.to(kCPU)`` block (guarded by ``world_size >= 1``) is skipped, so the tensor stays on CUDA. The inner self-send branch then did a host ``memcpy`` on what were still CUDA pointers and segfaulted.
Fix: gate the host-memcpy / CPU-copy-back paths on ``world_size >= 1 && cuda_aware == 0`` so single-rank deployments correctly use ``gpuMemcpy DeviceToDevice``. Mirrored in four sites (forward inner, forward post-loop, backward inner, backward post-loop).
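The override scheme reduces to ordinary polymorphism; a sketch with hypothetical stub classes mirroring the table above:

```python
class BaseDescriptorSketch:
    # Concrete default (not abstract): descriptors that never exchange
    # per-layer embeddings across ranks inherit False.
    def has_message_passing_across_ranks(self) -> bool:
        return False

class RepflowsBlockSketch(BaseDescriptorSketch):
    def __init__(self, use_loc_mapping: bool) -> None:
        self.use_loc_mapping = use_loc_mapping

    def has_message_passing_across_ranks(self) -> bool:
        # Local mapping gathers ghosts locally: no MPI exchange needed.
        return not self.use_loc_mapping

class HybridSketch(BaseDescriptorSketch):
    def __init__(self, children) -> None:
        self.children = list(children)

    def has_message_passing_across_ranks(self) -> bool:
        # Recurse so hybrid-of-GNN is detected (the old helper never did).
        return any(c.has_message_passing_across_ranks() for c in self.children)

assert HybridSketch([RepflowsBlockSketch(False)]).has_message_passing_across_ranks()
```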
Float32 multi-rank fixture + test
---------------------------------
Adds ``test_lammps_dpa3_pt2_fp32.py`` and a paired ``deeppot_dpa3_mpi_fp32.pt2`` fixture (a gen_dpa3.py addition). Validates that the comm_dict path is dtype-agnostic in practice (template dispatch on ``g1.dtype()``, ``register_fake``'s ``empty_like(g1)``, and ``MPI_FLOAT`` exchange), not just by inspection. Compares mpi-2 vs a same-archive mpi-1 with float32-appropriate tolerances (atol 1e-4 / rel 1e-3 for force/virial; rel 1e-5 for energy).
Verified locally (CPU build): pt_expt python 965 passed / 32 skipped, ctest 3/3 (498 C++ tests), LAMMPS multi-rank 20/20 (DPA3 + DPA2 + spin DPA3 + DPA3 fp32).
Trade-off note
--------------
The plan called for ``has_message_passing_across_ranks()`` to be abstract on ``BaseDescriptor`` (mirroring ``has_message_passing``). Implementing that requires touching all 49 subclasses across the pt and pd backends, well outside the scope of "GNN MPI for pt_expt". Kept the method concrete with a ``return False`` default; pt and pd backend descriptors inherit that. They can override later if they grow a multi-rank GNN path of their own.
    @pytest.mark.parametrize(
        "child_factory,expected_hmp,expected_hmp_ar",
        [
            (lambda: _se_e2_a_child(), False, False),
            ...
The C++ ``border_op`` host code dereferences ``send_proc``, ``recv_proc``, ``send_num``, ``recv_num`` (and ``send_list`` / ``communicator`` / ``nlocal`` / ``nghost``) directly via ``data_ptr<int>()`` from host code; see forward_t/backward_t in ``source/op/pt/comm.cc``. Production code in ``source/api_cc/src/commonPTExpt.h::build_comm_tensors_positional`` explicitly creates them on ``torch::kCPU``.
The test helper ``_build_self_comm_dict`` was constructing them on ``device`` (which on a CUDA build is ``cuda:0``). On CPU-only builds this happened to work; on a CUDA-enabled build the host read of ``recvnum[iswap]`` walks a CUDA pointer and segfaults.
This is a test bug, not a runtime contract change. Fix by forcing the control tensors to CPU regardless of the caller-supplied device, matching production semantics, and document why in the docstring.
Reproduces the intermittent CUDA CI segfault on PR deepmodeling#5430: ``test_repflow_parallel.py`` was the failure point in https://github.com/deepmodeling/deepmd-kit/actions/runs/25264766026
Same bug as the previous commit in ``test_repflow_parallel.py``: ``_build_self_comm_dict`` constructs the control tensors (send_proc / recv_proc / send_num / recv_num / send_list / communicator / nlocal / nghost) on the caller-supplied ``device``, which is ``cuda`` on a CUDA build. The C++ ``border_op`` host code dereferences these via ``data_ptr<int>()`` from the host, so a CUDA-device control tensor segfaults the read. Production code in ``commonPTExpt.h::build_comm_tensors_positional`` explicitly builds them on CPU. Force CPU regardless of the caller-supplied device, matching the production contract.
This was the second segfault revealed on PR deepmodeling#5430 CI after 08805b6 fixed test_repflow_parallel.py:

    test_repflow_parallel.py ....                 [ 13%]
    Segmentation fault (core dumped)
    test_repformer_parallel.py
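A sketch of the fixed helper's device discipline (a hypothetical reduction; the real helper also builds the MPI communicator handle and the swap-shaped control arrays):

```python
import torch

def build_self_comm_control_tensors(nloc: int, nall: int):
    # Control tensors are read host-side via data_ptr<int>() in the C++
    # border_op, so they must be CPU tensors even when the embedding
    # itself lives on CUDA.
    cpu = torch.device("cpu")
    return {
        "send_list": torch.arange(nloc, dtype=torch.int32, device=cpu),
        "send_proc": torch.zeros(1, dtype=torch.int32, device=cpu),
        "recv_proc": torch.zeros(1, dtype=torch.int32, device=cpu),
        "send_num": torch.tensor([nloc], dtype=torch.int32, device=cpu),
        "recv_num": torch.tensor([nall - nloc], dtype=torch.int32, device=cpu),
        "nlocal": torch.tensor([nloc], dtype=torch.int32, device=cpu),
        "nghost": torch.tensor([nall - nloc], dtype=torch.int32, device=cpu),
        # Only the embedding tensor follows the caller-supplied device.
    }
```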
# Conflicts:
#   source/lmp/tests/test_lammps_dpa3_pt2.py
The CUDA self-send branch in ``Border::forward_t`` and ``backward_t``
was guarded by ``if (world_size >= 1 && cuda_aware == 0)`` to choose
between host ``memcpy`` and ``gpuMemcpy(D2D)``. The intent was
"world_size >= 1 means MPI is initialised so the pre-loop CPU
fallback ran and the buffer is now on CPU; otherwise it's still on
its original device (assumed CUDA)".
That assumption is wrong for one important case: a USE_MPI build
called from Python with CPU tensors and no MPI init (``world_size
== 0``). Unit tests in ``source/tests/pt_expt/utils/test_border_op_backward.py``
do exactly this: they construct CPU comm tensors and
a CPU ``grad_g1``, never call MPI_Init, and expect the kernel to do
plain CPU accumulation. The old guard fell through to ``gpuMemcpy
(...DeviceToDevice)`` on host pointers. CUDA returns
``cudaErrorInvalidValue`` from that call; the return code is
unchecked and ``recv_g1`` is left uninitialised. Subsequent
``index_add_`` then writes garbage into ``d_local_g1_tensor`` —
the test sees mixed denormals + sigmoid-shaped values from leaked
buffer memory.
Same bug bit ``test_spin_export_with_comm.py::test_spin_dpa3_eager_parity``:
it compares the no-comm path against the comm_dict path
for a spin DPA3, and the comm_dict path went through the broken
self-send. Energy diverged by ~0.1 instead of being bit-identical.
Fix: dispatch the self-send memcpy on the actual device of the
buffer (``recv_g1_tensor.is_cuda()``). The post-loop copy-back to
``g1.device()`` is changed analogously to use ``!is_alias_of(g1)``
— the buffer was moved if and only if the pre-loop CPU fallback
created a fresh tensor. Both checks are precise correctness
conditions that work for every combination of (USE_MPI on/off,
GOOGLE_CUDA on/off, MPI initialised or not, CUDA or CPU tensors).
Verified on remote with CUDA build + USE_MPI:
test_border_op_backward.py 5 passed
test_spin_export_with_comm.py 1 passed
test_repflow_parallel.py + sibling 6 passed
broader pt_expt sweep 58 passed
The empty-subdomain spin LAMMPS test (``processors 2 1 1`` with all atoms on rank 0, rank 1 nloc=0) failed at MPI_Finalize with "Communicator (handle=0x44000000) being freed has 2 unmatched message(s)". Test outputs were correct; the failure was purely in the MPI cleanup path.
The root cause is the asymmetric ghost-exchange pattern that arises when one rank only Sends and the other only Irecvs at a given swap (no local atoms means nothing to send back). Under the MPICH eager protocol:
* The sender's MPI_Send returns once the message is queued in the eager buffer; the receiver's ACK round-trip is processed asynchronously by MPI's progress engine.
* In symmetric swaps the sender also calls MPI_Wait on its own Irecv, which advances the progress engine and drains pending ACKs.
* In asymmetric swaps the sender makes no further MPI call inside border_op, so the ACK stays unprocessed. The "in-flight" counter remains nonzero, and MPI_Finalize reports it as unmatched.
Fix: add a single ``MPI_Barrier(world)`` at the end of ``Border::forward_t`` and ``Border::backward_t``. The barrier forces a round-trip on every rank, which advances every rank's progress engine and drains pending ACKs. The cost is one collective per ghost-exchange call; on a 2-rank, 6-swap, 4-atom case this is in the noise vs the surrounding model forward.
Verified on remote (CUDA + MPICH):

    test_lammps_spin_dpa3_pt2.py ...          [3 passed]
    test_lammps_dpa3_pt2.py ...............   [15 passed]

Restores multi-rank LAMMPS spin GNN with empty-subdomain support (PR deepmodeling#5430 CI's last failing case).
Summary
Adds multi-rank LAMMPS support to the pt_expt (.pt2 / AOTInductor) backend for GNN models — DPA3 (repflows) and DPA2 (repformers), plus spin GNN — at parity with the existing pt (.pth / torch.jit) backend. Without this, multi-rank LAMMPS users with GNN .pt2 models fall back to single-rank-only, and the C++ side crashes on the first ghost exchange when given a non-use_loc_mapping GNN .pt2.
The mechanism mirrors the pt backend's per-layer ghost-atom MPI exchange: each repflow/repformer block exchanges g1 across ranks via border_op so each rank sees up-to-date ghost embeddings. To survive torch.export + AOTInductor packaging, border_op is wrapped as an opaque torch.library.custom_op (deepmd_export::border_op) with a separate border_op_backward C++ symbol for autograd.
Design
- Thread comm_dict: dict | None = None through make_model, base_atomic_model, the descriptor wrappers (dpa1/dpa2/dpa3/hybrid/se_*), and the repflows/repformers blocks. Lift the per-layer node_ebd_ext construction into a _exchange_ghosts method (the default array-api impl ignores comm_dict).
- Add a deepmd::border_op_backward C++ op (additive accumulation into local atom slots — the symmetric exchange used by autograd backward).
- Add a deepmd_export::border_op Python custom_op wrapper with register_fake and register_autograd so the op is opaque to torch.export.
- Add pt_expt/descriptor/repflows.py and repformers.py block subclasses with _exchange_ghosts overrides that call the opaque op (with the spin real/virtual split + concat_switch_virtual when comm_dict["has_spin"] is set).
- When _has_message_passing(model) is true, compile two artifacts into the .pt2 ZIP:
  - forward_lower_no_comm.pt2 — current behavior (single-rank, mapping-based gather).
  - forward_lower_with_comm.pt2 — adds the positional comm tensors (send_list, send_proc, recv_proc, send_num, recv_num, communicator, nlocal, nghost) to the trace input list, plus has_spin=tensor([1]) baked in for spin GNN.
- Metadata gains has_message_passing + has_comm_artifact flags so the C++ loader picks the right artifact.
- DeepPotPTExpt::compute and DeepSpinPTExpt::compute route to the with-comm artifact when lmp_list.nswap > 0.
- commPTExpt adds build_comm_tensors_positional and build_comm_tensors_positional_with_virtual_atoms (the latter remaps sendlists via fwd_map when NULL atoms drop out of the model's view).
Coverage matrix
Tests compare mpi-N vs a same-archive mpi-1 for force / force_mag / virial / energy (atol 1e-8); no hardcoded numerical references.
Plus a unit test (test_has_message_passing.py) pinning the _has_message_passing schema-drift contract.
Co-existence with #5407
This branch was rebased onto upstream master after PR #5407 (.pt2 perf) merged. The merge required:
- forward_common_lower_exportable_with_comm (spin and non-spin variants) now applies the _pad_nlist_for_export + need_sorted_nlist_for_lower=True workarounds matching PR #5407's regular variant, keeping the with-comm trace's nnei axis dynamic.
- The realize_opcount_threshold=0 workaround is applied around BOTH artifact compiles.
- do_atomic_virial=True is used for all multi-rank fixtures to avoid AOTI compile-time changes from PR #5407's default=False. (Multi-rank with do_atomic_virial=False is a known coverage gap; see Limitations.)
Known limitations
- _has_message_passing doesn't recurse into hybrid children → no with-comm artifact is produced for hybrid-of-GNN. Multi-rank LAMMPS with such a model would fall back to single-artifact and crash on the first ghost exchange. Out of scope for this PR; if a user hits this, they get the same crash they had before.
- do_atomic_virial=False under multi-rank: the production default. All multi-rank tests use =True (matches PR #5407 fixture conventions). Not exercised end-to-end yet.
- The TORCH_LIBRARY_IMPL(deepmd_export, CUDA, m) registration exists and the symbol is callable, but no GPU end-to-end test runs in this PR (CPU-only local build environment).
- use_spin=[True, True, ...]: only [True, False] tested.
- comm_dict: _exchange_ghosts uses .squeeze(0)/.unsqueeze(0) (mirrors pt). LAMMPS feeds nb=1 — fine in practice; breaks if reused outside LAMMPS.
- comm_dict=None neutrality: the dpmodel default _exchange_ghosts is the original code lifted into a method, behaviorally equivalent. Not separately re-tested by running the consistency tests with comm_dict=None explicitly threaded.
- has_comm_artifact metadata: C++ defaults to single-artifact when the key is missing. Not negative-tested; TempFile in /tmp also pre-exists.
A full catalog of touched-but-untested paths is maintained at
memory/gnn_mpi_untested_paths.md (local to the author).
Test plan
- pytest source/tests/pt_expt/ — eager-parity, export round-trip, and schema-drift unit tests
- pytest source/tests/consistent/descriptor/test_dpa3.py source/tests/consistent/descriptor/test_dpa2.py — non-regression for the single-rank path
- pytest source/lmp/tests/test_lammps_dpa3_pt2.py source/lmp/tests/test_lammps_dpa2_pt2.py source/lmp/tests/test_lammps_spin_dpa3_pt2.py — multi-rank end-to-end (requires mpirun + mpi4py)
- ctest -R PtExpt — C++ tests (single-rank, since multi-rank is exercised via LAMMPS)
New Features
Bug Fixes
Tests