feat: support `send` and `receive` with OpenMPI backend implementation by GordonYang1 · Pull Request #13 · InfiniTensor/InfiniCCL

GordonYang1 · 2026-05-19T02:50:56Z

Summary

This PR introduces blocking point-to-point Send / Recv support for InfiniCCL with an OpenMPI-based backend implementation, along with an example program for functionality verification and basic performance evaluation.

The public APIs follow the NCCL point-to-point parameter order through infiniSend() and infiniRecv(). The current implementation uses host-staging for device buffers and blocking MPI_Send / MPI_Recv internally.

Changes

Public Send/Recv API
- add public API declarations for:
  - infiniSend();
  - infiniRecv().
- expose point-to-point communication through the common communicator interface.
Base Send/Recv Wrappers
- add src/base/send.h;
- add src/base/recv.h;
- validate communicator handles, datatype values, peer rank ranges, and non-null buffers for non-zero counts before dispatching to backend implementations;
- return infiniInvalidArgument for invalid user inputs.
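
The wrapper checks listed above can be sketched roughly as follows. This is a minimal illustration, not the code in src/base/send.h: the `Status`, `DataType`, and `Comm` types and the `ValidateP2PArgs` helper are hypothetical stand-ins, and only the NCCL-style argument order (buffer, count, datatype, peer, communicator) is taken from the description.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for the real InfiniCCL types; the names and values
// are assumptions made for this sketch only.
enum Status { kSuccess = 0, kInvalidArgument = 1 };
enum DataType { kFloat32 = 0, kFloat16 = 1, kInt32 = 2, kNumDataTypes = 3 };
struct Comm {
  int rank;
  int size;
};

// Applies the checks from the list above before any backend dispatch:
// communicator handle, datatype value, peer rank range, and a non-null
// buffer whenever count is non-zero.
inline Status ValidateP2PArgs(const void* buff, std::size_t count,
                              int datatype, int peer, const Comm* comm) {
  if (comm == nullptr) return kInvalidArgument;
  if (datatype < 0 || datatype >= kNumDataTypes) return kInvalidArgument;
  if (peer < 0 || peer >= comm->size) return kInvalidArgument;
  if (count != 0 && buff == nullptr) return kInvalidArgument;
  return kSuccess;
}
```

In this sketch a zero-count call with a null buffer is accepted, which matches the "non-null buffers for non-zero counts" wording.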
OpenMPI-based Send/Recv Implementation
- add src/ompi/impl/send.h;
- add src/ompi/impl/recv.h;
- implement blocking point-to-point communication with MPI_Send and MPI_Recv;
- use temporary host-staging buffers for device memory transfer;
- split large byte counts into INT_MAX-bounded MPI chunks.
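
Chunking is needed because the count parameter of MPI_Send and MPI_Recv is an `int`. One way the INT_MAX-bounded split could look is sketched below; `Chunk` and `SplitIntoChunks` are hypothetical names, and the actual code in src/ompi/impl may be organized differently.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <vector>

// One piece of a larger transfer: where it starts and how many bytes it
// covers. `bytes` is an int because that is what the MPI count accepts.
struct Chunk {
  std::size_t offset;
  int bytes;
};

// Splits total_bytes into consecutive pieces of at most max_chunk bytes.
// max_chunk defaults to INT_MAX; a smaller value is handy for testing.
inline std::vector<Chunk> SplitIntoChunks(std::size_t total_bytes,
                                          std::size_t max_chunk = INT_MAX) {
  assert(max_chunk > 0 && max_chunk <= static_cast<std::size_t>(INT_MAX));
  std::vector<Chunk> chunks;
  std::size_t offset = 0;
  while (offset < total_bytes) {
    const std::size_t remaining = total_bytes - offset;
    const std::size_t n = remaining < max_chunk ? remaining : max_chunk;
    chunks.push_back({offset, static_cast<int>(n)});
    offset += n;
  }
  return chunks;
}
```

Each chunk would then map to one blocking call on the staging buffer at `base + offset` with `bytes` elements of MPI_BYTE. Sender and receiver must use the same chunk boundaries so that the calls pair up.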
Send/Recv Example
- add examples/send_recv.cc;
- align the example structure and output style with the existing example programs such as all_reduce, all_gather, reduce_scatter, broadcast, and all_to_all;
- verify accelerator-memory Send/Recv correctness from rank 0 to rank 1;
- report basic timing and bandwidth metrics in the same style as other examples.
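
For reference, the bandwidth figure mentioned in the last item is typically derived as bytes moved divided by elapsed time. The helper below is a hypothetical sketch of that arithmetic (the name `BandwidthGBps` and the decimal GB convention are assumptions); the exact formula and units used in examples/send_recv.cc are not shown in this PR description.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Average bandwidth in GB/s (1 GB = 1e9 bytes) for `iterations` transfers of
// `bytes_per_iter` bytes that together took `elapsed` seconds.
inline double BandwidthGBps(std::size_t bytes_per_iter, int iterations,
                            std::chrono::duration<double> elapsed) {
  assert(iterations > 0 && elapsed.count() > 0.0);
  const double total_bytes = static_cast<double>(bytes_per_iter) * iterations;
  return total_bytes / elapsed.count() / 1e9;
}
```

With host staging, the measured time includes the device-to-host and host-to-device copies, so the reported number reflects end-to-end throughput and not raw network bandwidth.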

Known Issues & Future Work

The current OpenMPI Send/Recv implementation is blocking and does not overlap communication with computation. Future work may introduce non-blocking point-to-point APIs and stream-aware asynchronous execution.
The current implementation allocates temporary host-staging buffers using malloc/free on every invocation. This may introduce overhead in high-frequency workloads. Future work may add reusable buffer pools, allocator caching, and pinned host memory support to improve transfer efficiency and reduce allocation overhead.
The current implementation uses a fixed MPI tag (0) internally. Future extensions may expose tags or add request-based APIs if more advanced point-to-point patterns are needed.
The current implementation performs GPU-to-Host and Host-to-GPU copies for point-to-point communication. While functionally correct, this is not optimal for GPU-intensive workloads. Future work may implement zero-copy GPU-GPU transfers or GPUDirect RDMA where supported.
The current example intentionally follows the lightweight style of the existing example programs. Future dedicated tests may add broader point-to-point coverage such as ping-pong, invalid peer validation, zero-count calls, and large-count stress cases.
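
As an illustration of the reusable staging buffer idea mentioned above, a grow-only host buffer could look like the sketch below. `StagingBuffer` is a hypothetical class, not part of this PR; it is not thread-safe, and it uses plain malloc where a production version might use pinned host memory.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical grow-only host staging buffer. It reallocates only when a
// request exceeds the current capacity, so repeated calls with similar sizes
// avoid a malloc/free pair per invocation.
class StagingBuffer {
 public:
  StagingBuffer() = default;
  StagingBuffer(const StagingBuffer&) = delete;
  StagingBuffer& operator=(const StagingBuffer&) = delete;
  ~StagingBuffer() { std::free(data_); }

  // Returns storage for at least `bytes` bytes, or nullptr if allocation
  // fails. The previous contents are not preserved across growth.
  void* Acquire(std::size_t bytes) {
    if (bytes > capacity_) {
      void* grown = std::malloc(bytes);
      if (grown == nullptr) return nullptr;  // old storage stays valid
      std::free(data_);
      data_ = grown;
      capacity_ = bytes;
      ++allocations_;
    }
    return data_;
  }

  std::size_t capacity() const { return capacity_; }
  int allocations() const { return allocations_; }

 private:
  void* data_ = nullptr;
  std::size_t capacity_ = 0;
  int allocations_ = 0;  // counts real allocations, useful for diagnostics
};
```

A communicator-owned instance would let consecutive Send/Recv calls share one allocation. Concurrent calls would each need their own buffer or a lock.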

Logs & Screenshots

all_reduce test (MetaX-NVIDIA heterogeneous)
all_reduce.log

all_gather test (MetaX-NVIDIA heterogeneous)
all_gather.log

reduce_scatter test (MetaX-NVIDIA heterogeneous)
reduce_scatter.log

broadcast test (MetaX-NVIDIA heterogeneous)
broadcast.log

all_to_all test (MetaX-NVIDIA heterogeneous)
all_to_all.log

send_recv test (MetaX-NVIDIA heterogeneous)
send_recv.log

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 92412fdd4f

chatgpt-codex-connector · 2026-05-19T02:56:41Z

+    Runtime<kDev>::Memcpy(host_buf, send_buff, total_bytes,
+                          Runtime<kDev>::MemcpyDeviceToHost);
+    Runtime<kDev>::StreamSynchronize(
+        static_cast<Runtime<kDev>::Stream>(stream));


Propagate failed device-to-host sends

When the send buffer is an invalid/stale device pointer or the selected runtime reports a stream error, this path still continues to MPI_Send and returns success because both Runtime<kDev>::Memcpy and StreamSynchronize results are ignored. That can send uninitialized staging data and hide the actual failure; these runtime calls should be checked and returned, as the existing collective implementations do with CHECK_STATUS.
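
A rough sketch of the suggested fix is shown below. It uses a mock runtime and a hypothetical `SKETCH_RETURN_IF_ERROR` macro written in the spirit of the CHECK_STATUS mentioned above; the real macro's definition and the real `Runtime<kDev>` return types are not visible in this PR, so both are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

enum Status { kSuccess = 0, kRuntimeError = 2 };

// Hypothetical early-return macro; the project's CHECK_STATUS may differ.
#define SKETCH_RETURN_IF_ERROR(expr)         \
  do {                                       \
    Status status_ = (expr);                 \
    if (status_ != kSuccess) return status_; \
  } while (0)

// Mock runtime: the copy reports an error for a null pointer, standing in
// for an invalid or stale device pointer.
struct MockRuntime {
  static Status Memcpy(void* dst, const void* src, std::size_t bytes) {
    if (bytes == 0) return kSuccess;
    if (dst == nullptr || src == nullptr) return kRuntimeError;
    std::memcpy(dst, src, bytes);
    return kSuccess;
  }
  static Status StreamSynchronize() { return kSuccess; }
};

// Stages the payload on the host and reaches the transport only if both
// runtime calls succeeded. `sent` counts how often the transport was reached.
inline Status StagedSend(const void* send_buff, std::size_t bytes, int* sent) {
  std::vector<unsigned char> host_buf(bytes);
  SKETCH_RETURN_IF_ERROR(
      MockRuntime::Memcpy(host_buf.data(), send_buff, bytes));
  SKETCH_RETURN_IF_ERROR(MockRuntime::StreamSynchronize());
  ++*sent;  // stand-in for MPI_Send on host_buf
  return kSuccess;
}
```

With the checks in place, a failed copy returns before the transport call, so no uninitialized staging data leaves the process.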


chatgpt-codex-connector · 2026-05-19T02:56:41Z

+    Runtime<kDev>::Memcpy(recv_buff, host_buf, total_bytes,
+                          Runtime<kDev>::MemcpyHostToDevice);


Propagate failed host-to-device receives

When the receive destination is an invalid/stale device pointer or the H2D copy fails, infiniRecv still frees the staging buffer and returns success because the Runtime<kDev>::Memcpy result is ignored. In that scenario the MPI receive completed but the user's device buffer was not updated, so callers get silent data corruption instead of an error status.
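
One possible shape for the receive-side fix is sketched below: the staging buffer is owned by a `std::unique_ptr`, so it is released on every path, and the copy status is returned to the caller. `MockCopyToDevice` and `StagedRecv` are hypothetical names, and the `incoming` parameter stands in for data that MPI_Recv would deliver.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <memory>

enum Status { kSuccess = 0, kOutOfMemory = 1, kRuntimeError = 2 };

// Deleter so that malloc-allocated staging memory is freed automatically.
struct FreeDeleter {
  void operator()(unsigned char* p) const { std::free(p); }
};
using HostBuffer = std::unique_ptr<unsigned char, FreeDeleter>;

// Mock host-to-device copy; a null destination stands in for an invalid or
// stale device pointer.
inline Status MockCopyToDevice(void* dst, const void* src, std::size_t bytes) {
  if (dst == nullptr) return kRuntimeError;
  std::memcpy(dst, src, bytes);
  return kSuccess;
}

// Receives `bytes` bytes into recv_buff through a host staging buffer and
// reports a failed final copy to the caller.
inline Status StagedRecv(void* recv_buff, const void* incoming,
                         std::size_t bytes) {
  if (bytes == 0) return kSuccess;
  HostBuffer host_buf(static_cast<unsigned char*>(std::malloc(bytes)));
  if (!host_buf) return kOutOfMemory;
  std::memcpy(host_buf.get(), incoming, bytes);  // stand-in for MPI_Recv
  // host_buf is freed by its deleter whether the copy succeeds or fails.
  return MockCopyToDevice(recv_buff, host_buf.get(), bytes);
}
```

The RAII buffer removes the need for a manual free on each early-return path, which is where an unchecked copy result tends to slip through.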


chatgpt-codex-connector Bot reviewed May 19, 2026

GordonYang1 force-pushed the feat/support-send-recv-minimal branch 2 times, most recently from 41e0b95 to fca4f46 · May 20, 2026 02:26

feat: support send and receive

b11f3e1

GordonYang1 force-pushed the feat/support-send-recv-minimal branch 2 times, most recently from 4181112 to b379193 · May 20, 2026 04:53

GordonYang1 changed the title ~~feat: support send and receive~~ feat: support ‘send’ and ‘receive’ with OpenMPI backend implementation May 20, 2026

GordonYang1 changed the title ~~feat: support ‘send’ and ‘receive’ with OpenMPI backend implementation~~ feat: support send and receive with OpenMPI backend implementation May 20, 2026

style: align send recv example with contributing guide

4531920

GordonYang1 force-pushed the feat/support-send-recv-minimal branch from b379193 to 4531920 · May 20, 2026 05:21

GordonYang1 wants to merge 2 commits into InfiniTensor:master from GordonYang1:feat/support-send-recv-minimal
