feat: support send and receive with OpenMPI backend implementation #13

Open: GordonYang1 wants to merge 2 commits into InfiniTensor:master from GordonYang1:feat/support-send-recv-minimal

@GordonYang1 (Collaborator) commented May 19, 2026

Summary

This PR introduces blocking point-to-point Send / Recv support for InfiniCCL with an OpenMPI-based backend implementation, along with an example program for functionality verification and basic performance evaluation.

The public APIs follow the NCCL point-to-point parameter order through infiniSend() and infiniRecv(). The current implementation uses host-staging for device buffers and blocking MPI_Send / MPI_Recv internally.
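
As a rough illustration of that parameter order, the sketch below mirrors the argument sequence of ncclSend (buffer, count, datatype, peer, communicator, stream) and applies the kind of argument checks this PR describes for the base wrappers. All names prefixed with Sketch are hypothetical stand-ins, since the actual InfiniCCL types and signatures are not shown here.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins; the real InfiniCCL types are not shown in this PR.
enum SketchResult { kSketchSuccess = 0, kSketchInvalidArgument = 1 };
enum SketchDataType { kSketchFloat32 = 0, kSketchInt32 = 1, kSketchNumTypes = 2 };
struct SketchComm {
  int rank;
  int size;
};

// Argument order mirrors ncclSend: buffer, count, datatype, peer, comm, stream.
// Only validation is shown; dispatch to a backend is omitted.
inline SketchResult SketchValidateSend(const void* send_buff, std::size_t count,
                                       int datatype, int peer,
                                       const SketchComm* comm,
                                       void* /*stream*/) {
  if (comm == nullptr) return kSketchInvalidArgument;
  if (datatype < 0 || datatype >= kSketchNumTypes) return kSketchInvalidArgument;
  if (peer < 0 || peer >= comm->size) return kSketchInvalidArgument;
  // A null buffer is only acceptable when nothing is transferred.
  if (count > 0 && send_buff == nullptr) return kSketchInvalidArgument;
  return kSketchSuccess;
}
```

Whether a peer equal to the caller's own rank should also be rejected is not stated in the PR, so the sketch only checks the range.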

Changes

  • Public Send/Recv API

    • add public API declarations for:
      • infiniSend();
      • infiniRecv().
    • expose point-to-point communication through the common communicator interface.
  • Base Send/Recv Wrappers

    • add src/base/send.h;
    • add src/base/recv.h;
    • validate communicator handles, datatype values, peer rank ranges, and non-null buffers for non-zero counts before dispatching to backend implementations;
    • return infiniInvalidArgument for invalid user inputs.
  • OpenMPI-based Send/Recv Implementation

    • add src/ompi/impl/send.h;
    • add src/ompi/impl/recv.h;
    • implement blocking point-to-point communication with MPI_Send and MPI_Recv;
    • use temporary host-staging buffers for device memory transfer;
    • split large byte counts into INT_MAX-bounded MPI chunks.
  • Send/Recv Example

    • add examples/send_recv.cc;
    • align the example structure and output style with the existing example programs such as all_reduce, all_gather, reduce_scatter, broadcast, and all_to_all;
    • verify accelerator-memory Send/Recv correctness from rank 0 to rank 1;
    • report basic timing and bandwidth metrics in the same style as other examples.
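
The INT_MAX-bounded chunking mentioned above exists because MPI_Send and MPI_Recv take an int count. A minimal sketch of the splitting logic, assuming the payload is sent as raw bytes, might look like the following. SketchChunkSizes is a hypothetical helper, not a function from this PR, and the MPI call appears only as a comment.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <vector>

// Hypothetical helper: splits total_bytes into chunk sizes of at most
// max_chunk bytes, so that each size fits the int count taken by MPI_Send.
inline std::vector<int> SketchChunkSizes(std::size_t total_bytes,
                                         std::size_t max_chunk = INT_MAX) {
  assert(max_chunk > 0 && max_chunk <= static_cast<std::size_t>(INT_MAX));
  std::vector<int> chunks;
  std::size_t offset = 0;
  while (offset < total_bytes) {
    const std::size_t remaining = total_bytes - offset;
    const std::size_t n = remaining < max_chunk ? remaining : max_chunk;
    chunks.push_back(static_cast<int>(n));
    // A real send loop would issue one call per chunk here, for example:
    //   MPI_Send(base + offset, static_cast<int>(n), MPI_BYTE, peer, 0, comm);
    offset += n;
  }
  return chunks;
}
```

The receiver has to walk the same chunk sequence, so both sides must derive it from the same total byte count.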

Known Issues & Future Work

  • The current OpenMPI Send/Recv implementation is blocking and does not overlap communication with computation. Future work may introduce non-blocking point-to-point APIs and stream-aware asynchronous execution.
  • The current implementation allocates temporary host-staging buffers using malloc/free on every invocation. This may introduce overhead in high-frequency workloads. Future work may add reusable buffer pools, allocator caching, and pinned host memory support to improve transfer efficiency and reduce allocation overhead.
  • The current implementation uses a fixed MPI tag (0) internally. Future extensions may expose tags or add request-based APIs if more advanced point-to-point patterns are needed.
  • The current implementation performs GPU-to-Host and Host-to-GPU copies for point-to-point communication. While functionally correct, this is not optimal for GPU-intensive workloads. Future work may implement zero-copy GPU-GPU transfers or GPUDirect RDMA where supported.
  • The current example intentionally follows the lightweight style of the existing example programs. Future dedicated tests may add broader point-to-point coverage such as ping-pong, invalid peer validation, zero-count calls, and large-count stress cases.
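
To make the per-call malloc/free concern concrete, one possible direction is a grow-only staging buffer that is reused across invocations. The class below is a hypothetical sketch, not part of this PR. It uses plain pageable host memory, is not thread-safe, and leaves pinned allocation out entirely.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical grow-only host staging buffer, reused across calls.
class SketchStagingBuffer {
 public:
  SketchStagingBuffer() = default;
  SketchStagingBuffer(const SketchStagingBuffer&) = delete;
  SketchStagingBuffer& operator=(const SketchStagingBuffer&) = delete;
  ~SketchStagingBuffer() { std::free(data_); }

  // Returns a buffer of at least `bytes` bytes, or nullptr if growing fails.
  // Contents are not preserved in any meaningful way between calls.
  void* Acquire(std::size_t bytes) {
    if (bytes <= capacity_) return data_;  // Reuse the existing allocation.
    void* grown = std::realloc(data_, bytes);
    if (grown == nullptr) return nullptr;  // Old block stays valid and owned.
    data_ = grown;
    capacity_ = bytes;
    return data_;
  }

  std::size_t capacity() const { return capacity_; }

 private:
  void* data_ = nullptr;
  std::size_t capacity_ = 0;
};
```

A buffer like this would typically live per communicator or per thread, so that repeated transfers of similar size allocate only once.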

Logs & Screenshots

all_reduce test (MetaX-NVIDIA heterogeneous)
all_reduce.log

all_gather test (MetaX-NVIDIA heterogeneous)
all_gather.log

reduce_scatter test (MetaX-NVIDIA heterogeneous)
reduce_scatter.log

broadcast test (MetaX-NVIDIA heterogeneous)
broadcast.log

all_to_all test (MetaX-NVIDIA heterogeneous)
all_to_all.log

send_recv test (MetaX-NVIDIA heterogeneous)
send_recv.log

@chatgpt-codex-connector (Bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 92412fdd4f

Comment thread src/ompi/impl/send.h
Comment on lines +44 to +47
    Runtime<kDev>::Memcpy(host_buf, send_buff, total_bytes,
                          Runtime<kDev>::MemcpyDeviceToHost);
    Runtime<kDev>::StreamSynchronize(
        static_cast<Runtime<kDev>::Stream>(stream));

[P2] Propagate failed device-to-host sends

When the send buffer is an invalid or stale device pointer, or the selected runtime reports a stream error, this path still continues to MPI_Send and returns success, because the results of both Runtime<kDev>::Memcpy and StreamSynchronize are ignored. That can send uninitialized staging data and hide the actual failure. The results of these runtime calls should be checked and propagated to the caller, as the existing collective implementations do with CHECK_STATUS.
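
A minimal sketch of the suggested propagation follows, using placeholder callables in place of the runtime and MPI calls. SKETCH_CHECK is a hypothetical early-return macro written for this example; the project's CHECK_STATUS may be defined differently, and the status type here is likewise an assumption.

```cpp
#include <cassert>
#include <functional>

// Hypothetical status type standing in for the project's result codes.
enum SketchStatus { kSketchOk = 0, kSketchRuntimeError = 1 };

// Hypothetical early-return macro; the real CHECK_STATUS may differ.
#define SKETCH_CHECK(expr)                                  \
  do {                                                      \
    SketchStatus sketch_status_ = (expr);                   \
    if (sketch_status_ != kSketchOk) return sketch_status_; \
  } while (0)

// Placeholders for the device-to-host copy, stream sync, and MPI send.
struct SketchSendSteps {
  std::function<SketchStatus()> copy_to_host;
  std::function<SketchStatus()> sync_stream;
  std::function<SketchStatus()> mpi_send;
};

// Stops at the first failing step, so MPI send never runs on bad staging data.
inline SketchStatus SketchSend(const SketchSendSteps& steps) {
  SKETCH_CHECK(steps.copy_to_host());
  SKETCH_CHECK(steps.sync_stream());
  return steps.mpi_send();
}
```

A real implementation would also need to release the host staging buffer on the early-return path, for example by holding it in an owning wrapper.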

Comment thread src/ompi/impl/recv.h
Comment on lines +59 to +60
    Runtime<kDev>::Memcpy(recv_buff, host_buf, total_bytes,
                          Runtime<kDev>::MemcpyHostToDevice);

[P2] Propagate failed host-to-device receives

When the receive destination is an invalid/stale device pointer or the H2D copy fails, infiniRecv still frees the staging buffer and returns success because the Runtime<kDev>::Memcpy result is ignored. In that scenario the MPI receive completed but the user's device buffer was not updated, so callers get silent data corruption instead of an error status.
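
One way to both free the staging buffer and report the copy failure is to let an owning pointer handle cleanup and return the copy status directly. The sketch below assumes hypothetical status codes and uses callables in place of the MPI receive and the host-to-device copy. Zero-count handling is left out and stated as a precondition.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <functional>
#include <memory>

// Hypothetical status codes for this sketch only.
enum SketchRecvStatus {
  kSketchRecvOk = 0,
  kSketchRecvNoMemory = 1,
  kSketchRecvCopyFailed = 2
};

// Deleter so std::unique_ptr can own memory obtained from std::malloc.
struct SketchFree {
  void operator()(void* p) const { std::free(p); }
};

// mpi_recv fills the host buffer; copy_to_device stands in for the H2D copy.
inline SketchRecvStatus SketchRecv(
    std::size_t total_bytes,
    const std::function<SketchRecvStatus(void*, std::size_t)>& mpi_recv,
    const std::function<SketchRecvStatus(const void*, std::size_t)>&
        copy_to_device) {
  assert(total_bytes > 0);  // Zero-count calls are outside this sketch.
  std::unique_ptr<void, SketchFree> host_buf(std::malloc(total_bytes));
  if (!host_buf) return kSketchRecvNoMemory;
  const SketchRecvStatus status = mpi_recv(host_buf.get(), total_bytes);
  if (status != kSketchRecvOk) return status;
  // The copy result is returned; host_buf is freed on every path.
  return copy_to_device(host_buf.get(), total_bytes);
}
```

With this shape, a failed copy surfaces to the caller as an error status instead of a silent success.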

@GordonYang1 force-pushed the feat/support-send-recv-minimal branch 2 times, most recently from 41e0b95 to fca4f46 (May 20, 2026 02:26)
@GordonYang1 force-pushed the feat/support-send-recv-minimal branch 2 times, most recently from 4181112 to b379193 (May 20, 2026 04:53)
@GordonYang1 changed the title from "feat: support send and receive" to "feat: support ‘send’ and ‘receive’ with OpenMPI backend implementation" (May 20, 2026)
@GordonYang1 changed the title from "feat: support ‘send’ and ‘receive’ with OpenMPI backend implementation" to "feat: support send and receive with OpenMPI backend implementation" (May 20, 2026)
@GordonYang1 force-pushed the feat/support-send-recv-minimal branch from b379193 to 4531920 (May 20, 2026 05:21)