Conversation

@hzhou hzhou commented Aug 4, 2025

Pull Request Description

A new pipeline implementation using the ofi native RNDV path.

[skip warnings]

Reason

They promised to handle everything in the best ways, but when they fall short ...

Design

  • direct path: data_sz < MPIR_CVAR_CH4_OFI_EAGER_THRESHOLD
    • [optional packing]
    • fi_tinject
    • fi_tsend
    • fi_tsendmsg (iov)
  • RNDV path
    • pipeline
    • rdma read - host contig send
    • rdma write - host contig recv
    • direct - up to MPIDI_OFI_global.max_msg_size
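The path decision in the design above can be sketched as a small selector. This is an illustrative sketch only: the parameter and enum names are hypothetical, standing in for MPIR_CVAR_CH4_OFI_EAGER_THRESHOLD and MPIDI_OFI_global.max_msg_size rather than the actual MPICH code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names; a sketch of the design bullets, not MPICH source. */
enum ofi_path {
    PATH_EAGER,         /* fi_tinject / fi_tsend / fi_tsendmsg (iov) */
    PATH_RNDV_DIRECT,   /* single rndv transfer, up to max_msg_size */
    PATH_RNDV_CHUNKED   /* pipeline or rdma read/write in chunks */
};

static enum ofi_path select_path(size_t data_sz, size_t eager_threshold,
                                 size_t max_msg_size, int need_pack)
{
    /* small messages go out directly on the eager path */
    if (data_sz < eager_threshold)
        return PATH_EAGER;
    /* contiguous, registerable data can use a single rndv transfer */
    if (!need_pack && data_sz <= max_msg_size)
        return PATH_RNDV_DIRECT;
    /* otherwise chunk it: pipeline or rdma read/write */
    return PATH_RNDV_CHUNKED;
}
```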

TODO

  • pipeline
  • rdma read
  • rdma write
  • multi-nic striping
  • Auto protocol selection
  • correct status counts
  • squash fixup commits

Author Checklist

  • Provide Description
    Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • Commits Follow Good Practice
    Commits are self-contained and do not do two things at once.
    Commit message is of the form: module: short description
    Commit message explains what's in the commit.
  • Passes All Tests
    Whitespace checker. Warnings test. Additional tests via comments.
  • Contribution Agreement
    For non-Argonne authors, check contribution agreement.
    If necessary, request an explicit comment from your company's PR approval manager.

@hzhou hzhou force-pushed the 2507_pipeline branch 5 times, most recently from b6ce6e2 to 9030579 Compare August 6, 2025 19:50
hzhou commented Aug 6, 2025

Currently on Aurora, without GPU Direct RDMA:

  • default path (before this PR)

    • H2H: 23.9GB/sec -- full RDMA bandwidth
    • D2D: 3.3GB/sec -- limited by D2H/H2D local pack/unpack
    • D2H: 10.5GB/sec -- limited by D2H sender pack
    • H2D: 3.3GB/sec -- limited by H2D receiver unpack, penalized by non-repeating recv pack buffer
  • pipeline

    • H2H: 5.9GB/sec - limited by host memcpy, lack of threads/offloading
    • D2D: 23.7GB/sec - near full bandwidth
    • D2H: 5.9GB/sec - limited by recv memcpy
    • H2D: 9.9GB/sec - limited by send memcpy
  • rndv read

    • H2H: 22.9GB/sec - near full bandwidth
    • D2D: failed at fi_mr_reg; need to select pipeline
    • D2H: failed at fi_mr_reg
    • H2D: hangs; with pipelined read, 23.7GB/sec!
  • rndv write

    • H2H: 21.2GB/sec
    • D2D: x
    • D2H: 24.0GB/sec !
    • H2D: x
  • auto:

    • H2H: 23.9 GB/sec -- direct, unless num_nics > 1
    • D2D: 23.7 GB/sec -- pipeline
    • D2H: 24.1 GB/sec -- write
    • H2D: 23.6 GB/sec -- read
  • TODO: add pipelined read/write

@hzhou hzhou mentioned this pull request Aug 7, 2025
hzhou commented Aug 7, 2025

  • rndv read 2 NICs: 24.0 GB/sec

    • looks like libfabric is limiting it
    • I hit 47.7GB/sec if I limit chunks per NIC to 1. Somehow, if I issue multiple fi_read per mr, it drops to 24GB/sec 😕
  • Hitting maximum pair-wise bandwidth of 23.9GB/sec per NIC up to 4 pairs

@hzhou hzhou force-pushed the 2507_pipeline branch 4 times, most recently from 9381aff to 7a0941d Compare August 7, 2025 03:59
@hzhou hzhou force-pushed the 2507_pipeline branch 14 times, most recently from a0bf9d0 to cc7e39c Compare August 11, 2025 14:39
@hzhou hzhou marked this pull request as ready for review August 11, 2025 14:40
hzhou added 3 commits August 20, 2025 11:39
Simply posting a large buffer in e.g. fi_trecv may incur significant latency,
undermining the benefit of pipelining. Since we won't directly
send a message larger than MPIDI_OFI_EAGER_THRESH, let's limit the
recv buffer size to this threshold.
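The clamping described in this commit amounts to capping the posted size at the eager threshold. A minimal sketch, with a hypothetical function name rather than the actual MPICH symbol:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: no direct-path message exceeds the eager
 * threshold, so a recv buffer posted to fi_trecv can be capped at it,
 * avoiding the latency of posting a huge buffer. */
static size_t clamp_recv_size(size_t user_buf_sz, size_t eager_thresh)
{
    return (user_buf_sz < eager_thresh) ? user_buf_sz : eager_thresh;
}
```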
Add a new pipeline send/recv to the native rndv path.
A reimplementation of the huge protocol. It supports the multinic
striping feature.

To support rdma read into a non-contig buffer, or when directly reading
into a gpu buffer is not supported, we should allocate a pipeline chunk
buffer and perform Ilocalcopy_gpu after the read.
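The receiver-side decision in this commit can be sketched as a tiny predicate: RDMA read targets the user buffer directly only when the buffer is contiguous and the provider can read into it; otherwise the read lands in a chunk buffer and is copied out afterwards. Names here are illustrative, not MPICH identifiers.

```c
#include <assert.h>

/* Hypothetical predicate: stage through a host chunk buffer (and copy
 * out with an async local copy, per the commit message) unless the user
 * buffer is contiguous and directly readable via RDMA. */
static int need_chunk_staging(int buf_is_contig, int can_rdma_into_buf)
{
    return !(buf_is_contig && can_rdma_into_buf);
}
```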
hzhou commented Aug 20, 2025

test:mpich/ch4/ofi
test:mpich/ch4/ofi/more
test:mpich/ch4/gpu/ofi - bcast timeout

hzhou commented Aug 21, 2025

test:mpich/ch4/gpu/ofi ✔️

p->u.recv.num_infly = 0;

/* issue fi_read */
mpi_errno = MPIR_Async_things_add(rndvread_read_poll, rreq, NULL);
Contributor:
this is just my ignorance of the async things interface, but why do we only "add" items in the rndv read protocol, but we "spawn" them in the pipelined send protocol?

Contributor Author:
If it is creating a new async task within an async callback, we need to use "spawn". This prevents corrupting the task list and avoids recursive locking.
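The hazard being avoided here is the classic one of appending to a list while walking it. A toy model of the add-vs-spawn distinction, purely illustrative and not the actual MPIX Async code: "add" pushes onto the live task list (safe outside the poll loop), while "spawn" defers to a pending list merged after the loop, so a callback can create tasks safely.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy model; names and structure are hypothetical. */
struct task { int (*poll)(struct task *); struct task *next; };

static struct task *task_list, *pending_list;

static void things_add(struct task *t)      /* safe outside callbacks */
{ t->next = task_list; task_list = t; }

static void things_spawn(struct task *t)    /* safe inside callbacks */
{ t->next = pending_list; pending_list = t; }

static void things_progress(void)
{
    struct task **pp = &task_list;
    while (*pp) {
        if ((*pp)->poll(*pp)) {             /* task done: unlink and free */
            struct task *done = *pp;
            *pp = done->next;
            free(done);
        } else {
            pp = &(*pp)->next;
        }
    }
    /* merge tasks spawned during the loop */
    while (pending_list) {
        struct task *t = pending_list;
        pending_list = t->next;
        t->next = task_list;
        task_list = t;
    }
}

/* demo: a parent task that spawns a child from inside its callback */
static int npolls;
static int child_poll(struct task *t) { (void) t; npolls++; return 1; }
static int parent_poll(struct task *t)
{
    struct task *c = malloc(sizeof *c);
    c->poll = child_poll;
    things_spawn(c);    /* must NOT use things_add here: the list is being walked */
    (void) t;
    return 1;
}
```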

Contributor:
So it's the case that all the tasks for rndv read and write protocols are created ahead of time? But with the in flight limits of pipeline, they are spawned as other chunks complete?

Contributor Author:
Not really. In rndvread, first we create a parent task, rndvread_read_poll. The parent task watches "infly" chunks, allocates chunk buffers, and issues chunk read requests until all chunks are dispatched. Similarly for rndvwrite. ... Wait, in rndvwrite, the async copy in send_copy_poll is issued from rndvwrite_write_poll, so it should use _spawn. Good catch! Let me add a fixup patch.
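The parent-task pattern described here can be modeled with plain counters: one poll function retires completed chunks, then issues new chunk reads up to an in-flight limit, returning done only when every chunk has completed. This is a simulation sketch, not the PR's code; the field names loosely echo `num_infly` from the snippet above.

```c
#include <assert.h>

/* Toy state for the parent poll task; names are illustrative. */
struct rndv_state {
    int total_chunks;   /* chunks in the whole transfer */
    int next_chunk;     /* next chunk index to dispatch */
    int num_infly;      /* chunk reads currently outstanding */
    int max_infly;      /* in-flight limit */
    int completed;      /* chunks fully retired */
};

/* One poll iteration; returns 1 when the whole transfer is complete.
 * For simplicity this simulation assumes every previously issued chunk
 * read has completed by the next poll. */
static int read_poll(struct rndv_state *p)
{
    p->completed += p->num_infly;
    p->num_infly = 0;
    /* issue more chunk reads (allocate chunk buffer + fi_read in the PR) */
    while (p->next_chunk < p->total_chunks && p->num_infly < p->max_infly) {
        p->next_chunk++;
        p->num_infly++;
    }
    return p->completed == p->total_chunks;
}
```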

Contributor:
OK, so the read ops are not themselves individual async tasks rather they are something the real poll task monitors. But the rndv write code generates async copy tasks that do need to be tracked as their own async tasks within the already started task 😅.

@hzhou (Contributor Author) Aug 21, 2025:
Not exactly. Are you available for offline chat?

@hzhou (Contributor Author) Aug 21, 2025:
Read ops are async tasks, but they are ofi async tasks managed by libfabric; the callbacks for ofi async tasks are the event functions. Our MPIX Async facility manages "local" async tasks (all async tasks that require manual polling). In this PR, both the parent chunk-launching task and the individual chunk copy tasks are managed by the MPIX Async facility.

hzhou commented Aug 21, 2025

test:mpich/custom
netmod: ch4:ofi
env: MPIR_CVAR_CH4_OFI_EAGER_THRESHOLD=16384
✔️

@raffenet (Contributor) left a comment:

I don't think I have any other concerns as long as tests clear and the one suggested change is incorporated.

hzhou added 6 commits August 21, 2025 16:51
Similar structure to pipeline and rndv read.

Some refactoring to avoid code duplication.
The original gpu pipeline code can't work unless the application
exclusively uses gpu-to-gpu for all large messages.

The new code in the RNDV paths should handle all cases.

Update tests and replace the use of gpu pipeline with the new rndv
pipeline algorithm.

Update document doc/mpich/tuning_parameters.md regarding gpu pipeline.
The huge protocol, including the multi-nic striping option, is fully
replaced by the new rndv read algorithm.
Removing leftover constants from removed features (huge messages and gpu
pipeline).
This is effectively replaced with MPIR_CVAR_CH4_OFI_EAGER_THRESHOLD.
hzhou added 8 commits August 21, 2025 16:52
Pick between pipeline, rdma read, and rdma write based on whether the
sender side and recv side require pipelined packing.

We are completely separating the ofi native rndv path from the mpidig path.
The native path will not touch the MPIDIG_REQUEST fields, only the
other union members.

All the native rndv unions share common fields, initialized in
MPIDI_OFI_send and MPIDI_OFI_recv_rndv_event.

The auto protocol determines the best rndv protocol based on whether it
is beneficial to do pipelined packing. When neither side needs to pack,
we use the direct method instead.

The direct method uses am_tag_{send,recv}. Use the special constant -1 for
handler_id so it does not call mpidig callbacks.
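The auto selection described in this commit can be sketched from the benchmark table earlier in the thread (H2H direct, D2D pipeline, D2H write, H2D read). This is a hedged reconstruction; the enum and function names are illustrative, not the actual MPICH identifiers, and the mapping is inferred from the measurements above.

```c
#include <assert.h>

/* Hypothetical names; mapping inferred from the PR's benchmark results. */
enum rndv_proto { RNDV_DIRECT, RNDV_PIPELINE, RNDV_RDMA_READ, RNDV_RDMA_WRITE };

static enum rndv_proto auto_select(int send_need_pack, int recv_need_pack)
{
    if (send_need_pack && recv_need_pack)
        return RNDV_PIPELINE;       /* e.g. D2D without GPU direct RDMA */
    if (send_need_pack)
        return RNDV_RDMA_WRITE;     /* e.g. D2H: sender packs, then writes */
    if (recv_need_pack)
        return RNDV_RDMA_READ;      /* e.g. H2D: receiver reads, then unpacks */
    return RNDV_DIRECT;             /* neither side packs, e.g. H2H */
}
```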
Especially in the pipeline protocols, deadlocks may happen if the recv
data size disagrees with the send data size. Add a data size sync message
if necessary.

Update recv status count and set error if truncation happens.

For the direct protocol, the truncation error is caught in am_tag_recv. Just
need to make sure to transfer status.MPI_ERROR upon completion.
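The status-count and truncation logic amounts to: the delivered count is the smaller of the two sizes, and a sender size larger than the posted recv size is a truncation error. A minimal sketch with hypothetical names (the real code sets MPI_ERR_TRUNCATE on the request's status):

```c
#include <assert.h>
#include <stddef.h>

#define ERR_TRUNCATE 1  /* stand-in for MPI_ERR_TRUNCATE */

/* Hypothetical helper: compute the delivered count and report truncation. */
static int set_recv_status(size_t send_sz, size_t recv_sz, size_t *count_out)
{
    *count_out = (send_sz < recv_sz) ? send_sz : recv_sz;
    return (send_sz > recv_sz) ? ERR_TRUNCATE : 0;
}
```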
Setting MPIDI_OFI_REQUEST(*request, am_req) to NULL may overwrite the
rndv fields if MPIDI_OFI_send uses the rndv path. This is because
the rndv fields are in a union with the MPIDI_OFI_REQUEST fields.
Initialize am_req right after request creation instead.
MPIX async progress is invoked without the vci critical section. Make sure
to enter the VCI CS when we need to access the genq private pool, call
libfabric, or free a request.
Registered host memory works with RDMA and should be preferred over
pipeline.
Explicitly include ofi_impl.h in ofi_rndv.c for noinline build option.

Move MPIDI_OFI_gpu_get_{send,recv}_engine_type to ofi_impl.h. Move both
CVAR comment blocks to ofi_init.c.
Pass the request object to MPIDI_NM_am_can_do_tag so the netmod can
decide based on message size etc.

Do the same for MPIDI_SHM_am_can_do_tag for consistency.
@hzhou hzhou merged commit 059ea00 into pmodels:main Aug 21, 2025
4 checks passed
@hzhou hzhou deleted the 2507_pipeline branch August 21, 2025 21:59