Conversation

@hzhou hzhou commented Aug 4, 2025

Pull Request Description

A new pipeline implementation using the ofi native RNDV path.

[skip warnings]

Reason

They promised to handle everything in the best ways, but when they fall short ...

Design

  • direct path: data_sz < MPIR_CVAR_CH4_OFI_EAGER_THRESHOLD
    • [optional packing]
    • fi_tinject
    • fi_tsend
    • fi_tsendmsg (iov)
  • RNDV path
    • pipeline
    • rdma read - host contig send
    • rdma write - host contig recv
    • direct - up to MPIDI_OFI_global.max_msg_size
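The path decision in the design above can be sketched as a small selector. This is an illustrative sketch only: the parameter and enum names are hypothetical, standing in for MPIR_CVAR_CH4_OFI_EAGER_THRESHOLD and MPIDI_OFI_global.max_msg_size rather than the actual MPICH code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names; a sketch of the design bullets, not MPICH source. */
enum ofi_path {
    PATH_EAGER,         /* fi_tinject / fi_tsend / fi_tsendmsg (iov) */
    PATH_RNDV_DIRECT,   /* single rndv transfer, up to max_msg_size */
    PATH_RNDV_CHUNKED   /* pipeline or rdma read/write in chunks */
};

static enum ofi_path select_path(size_t data_sz, size_t eager_threshold,
                                 size_t max_msg_size, int need_pack)
{
    /* small messages go out directly on the eager path */
    if (data_sz < eager_threshold)
        return PATH_EAGER;
    /* contiguous, registerable data can use a single rndv transfer */
    if (!need_pack && data_sz <= max_msg_size)
        return PATH_RNDV_DIRECT;
    /* otherwise chunk it: pipeline or rdma read/write */
    return PATH_RNDV_CHUNKED;
}
```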

TODO

  • pipeline
  • rdma read
  • rdma write
  • multi-nic striping
  • Auto protocol selection
  • correct status counts
  • squash fixup commits

Author Checklist

  • Provide Description
    Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • Commits Follow Good Practice
    Commits are self-contained and do not do two things at once.
    Commit message is of the form: module: short description
    Commit message explains what's in the commit.
  • Passes All Tests
    Whitespace checker. Warnings test. Additional tests via comments.
  • Contribution Agreement
    For non-Argonne authors, check contribution agreement.
    If necessary, request an explicit comment from your company's PR approval manager.

@hzhou hzhou force-pushed the 2507_pipeline branch 5 times, most recently from b6ce6e2 to 9030579 Compare August 6, 2025 19:50
hzhou commented Aug 6, 2025

Currently on Aurora, without GPU Direct RDMA:

  • default path (before this PR)

    • H2H: 23.9GB/sec -- full RDMA bandwidth
    • D2D: 3.3GB/sec -- limited by D2H/H2D local pack/unpack
    • D2H: 10.5GB/sec -- limited by D2H sender pack
    • H2D: 3.3GB/sec -- limited by H2D receiver unpack, penalized by non-repeating recv pack buffer
  • pipeline

    • H2H: 5.9GB/sec - limited by host memcpy, lack of threads/offloading
    • D2D: 23.7GB/sec - near full bandwidth
    • D2H: 5.9GB/sec - limited by recv memcpy
    • H2D: 9.9GB/sec - limited by send memcpy
  • rndv read

    • H2H: 22.9GB/sec - near full bandwidth
    • D2D: failed at fi_mr_reg; need to select pipeline
    • D2H: failed at fi_mr_reg
    • H2D: hangs; with pipelined read, 23.7GB/sec!
  • rndv write

    • H2H: 21.2GB/sec
    • D2D: x
    • D2H: 24.0GB/sec !
    • H2D: x
  • auto:

    • H2H: 23.9 GB/sec -- direct, unless num_nics > 1
    • D2D: 23.7 GB/sec -- pipeline
    • D2H: 24.1 GB/sec -- write
    • H2D: 23.6 GB/sec -- read
  • TODO: add pipelined read/write

@hzhou hzhou mentioned this pull request Aug 7, 2025
hzhou commented Aug 7, 2025

  • rndv read 2 NICs: 24.0 GB/sec

    • looks like libfabric is limiting it
    • I hit 47.7GB/sec if I limit chunks per NIC to 1. Somehow, if I issue multiple fi_read per mr, it drops to 24GB/sec 😕
  • Hitting maximum pair-wise bandwidth of 23.9GB/sec per NIC up to 4 pairs

@hzhou hzhou force-pushed the 2507_pipeline branch 4 times, most recently from 9381aff to 7a0941d Compare August 7, 2025 03:59
@hzhou hzhou force-pushed the 2507_pipeline branch 14 times, most recently from a0bf9d0 to cc7e39c Compare August 11, 2025 14:39
@hzhou hzhou marked this pull request as ready for review August 11, 2025 14:40
hzhou added 3 commits August 20, 2025 11:39
Simply posting a large buffer in e.g. fi_trecv may incur significant latency,
undermining the benefit of pipelining. Since we won't directly
send a message larger than MPIDI_OFI_EAGER_THRESH, let's limit the
recv buffer size to this threshold.
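The clamping described in this commit amounts to capping the posted size at the eager threshold. A minimal sketch, with a hypothetical function name rather than the actual MPICH symbol:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: no direct-path message exceeds the eager
 * threshold, so a recv buffer posted to fi_trecv can be capped at it,
 * avoiding the latency of posting a huge buffer. */
static size_t clamp_recv_size(size_t user_buf_sz, size_t eager_thresh)
{
    return (user_buf_sz < eager_thresh) ? user_buf_sz : eager_thresh;
}
```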
Add a new pipeline send/recv to the native rndv path.
A reimplementation of the huge protocol. It supports the multinic
striping feature.

To support rdma read into a non-contig buffer, or when directly reading
into a gpu buffer is not supported, we should allocate a pipeline chunk
buffer and perform Ilocalcopy_gpu after the read.
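The receiver-side decision in this commit can be sketched as a tiny predicate: RDMA read targets the user buffer directly only when the buffer is contiguous and the provider can read into it; otherwise the read lands in a chunk buffer and is copied out afterwards. Names here are illustrative, not MPICH identifiers.

```c
#include <assert.h>

/* Hypothetical predicate: stage through a host chunk buffer (and copy
 * out with an async local copy, per the commit message) unless the user
 * buffer is contiguous and directly readable via RDMA. */
static int need_chunk_staging(int buf_is_contig, int can_rdma_into_buf)
{
    return !(buf_is_contig && can_rdma_into_buf);
}
```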
hzhou commented Aug 20, 2025

test:mpich/ch4/ofi
test:mpich/ch4/ofi/more
test:mpich/ch4/gpu/ofi - bcast timeout

hzhou commented Aug 21, 2025

test:mpich/ch4/gpu/ofi ✔️

p->u.recv.num_infly = 0;

/* issue fi_read */
mpi_errno = MPIR_Async_things_add(rndvread_read_poll, rreq, NULL);
Contributor:
this is just my ignorance of the async things interface, but why do we only "add" items in the rndv read protocol, but we "spawn" them in the pipelined send protocol?

Contributor Author:
If it is creating a new async task within an async callback, we need to use "spawn". This prevents corrupting the task list and avoids recursive locking.
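The hazard being avoided here is the classic one of appending to a list while walking it. A toy model of the add-vs-spawn distinction, purely illustrative and not the actual MPIX Async code: "add" pushes onto the live task list (safe outside the poll loop), while "spawn" defers to a pending list merged after the loop, so a callback can create tasks safely.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy model; names and structure are hypothetical. */
struct task { int (*poll)(struct task *); struct task *next; };

static struct task *task_list, *pending_list;

static void things_add(struct task *t)      /* safe outside callbacks */
{ t->next = task_list; task_list = t; }

static void things_spawn(struct task *t)    /* safe inside callbacks */
{ t->next = pending_list; pending_list = t; }

static void things_progress(void)
{
    struct task **pp = &task_list;
    while (*pp) {
        if ((*pp)->poll(*pp)) {             /* task done: unlink and free */
            struct task *done = *pp;
            *pp = done->next;
            free(done);
        } else {
            pp = &(*pp)->next;
        }
    }
    /* merge tasks spawned during the loop */
    while (pending_list) {
        struct task *t = pending_list;
        pending_list = t->next;
        t->next = task_list;
        task_list = t;
    }
}

/* demo: a parent task that spawns a child from inside its callback */
static int npolls;
static int child_poll(struct task *t) { (void) t; npolls++; return 1; }
static int parent_poll(struct task *t)
{
    struct task *c = malloc(sizeof *c);
    c->poll = child_poll;
    things_spawn(c);    /* must NOT use things_add here: the list is being walked */
    (void) t;
    return 1;
}
```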

Contributor:
So it's the case that all the tasks for rndv read and write protocols are created ahead of time? But with the in flight limits of pipeline, they are spawned as other chunks complete?

Contributor Author:
Not really. In rndvread, first we create a parent task, rndvread_read_poll. The parent task watches "infly" chunks, allocates chunk buffers, and issues chunk read requests until all chunks are dispatched. Similarly for rndvwrite. ... Wait, in rndvwrite, the async copy in send_copy_poll is issued from rndvwrite_write_poll, so it should use _spawn. Good catch! Let me add a fixup patch.
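The parent-task pattern described here can be modeled with plain counters: one poll function retires completed chunks, then issues new chunk reads up to an in-flight limit, returning done only when every chunk has completed. This is a simulation sketch, not the PR's code; the field names loosely echo `num_infly` from the snippet above.

```c
#include <assert.h>

/* Toy state for the parent poll task; names are illustrative. */
struct rndv_state {
    int total_chunks;   /* chunks in the whole transfer */
    int next_chunk;     /* next chunk index to dispatch */
    int num_infly;      /* chunk reads currently outstanding */
    int max_infly;      /* in-flight limit */
    int completed;      /* chunks fully retired */
};

/* One poll iteration; returns 1 when the whole transfer is complete.
 * For simplicity this simulation assumes every previously issued chunk
 * read has completed by the next poll. */
static int read_poll(struct rndv_state *p)
{
    p->completed += p->num_infly;
    p->num_infly = 0;
    /* issue more chunk reads (allocate chunk buffer + fi_read in the PR) */
    while (p->next_chunk < p->total_chunks && p->num_infly < p->max_infly) {
        p->next_chunk++;
        p->num_infly++;
    }
    return p->completed == p->total_chunks;
}
```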

Contributor:
OK, so the read ops are not themselves individual async tasks rather they are something the real poll task monitors. But the rndv write code generates async copy tasks that do need to be tracked as their own async tasks within the already started task 😅.

@hzhou (Contributor Author) Aug 21, 2025:
Not exactly. Are you available for offline chat?

@hzhou (Contributor Author) Aug 21, 2025:
Read ops are async tasks, but they are ofi async tasks managed by libfabric; the callbacks for ofi async tasks are the event functions. Our MPIX Async facility manages "local" async tasks (all async tasks that require manual polling). In this PR, both the parent chunk-launching task and the individual chunk copy tasks are managed by the MPIX Async facility.

hzhou commented Aug 21, 2025

test:mpich/custom
netmod: ch4:ofi
env: MPIR_CVAR_CH4_OFI_EAGER_THRESHOLD=16384
✔️

@raffenet (Contributor) left a comment:

I don't think I have any other concerns as long as tests clear and the one suggested change is incorporated.

hzhou added 6 commits August 21, 2025 16:51
Similar structure to pipeline and rndv read.

Some refactoring to avoid code duplication.
The original gpu pipeline code can't work unless the application
exclusively uses gpu-to-gpu for all large messages.

The new code in the RNDV paths should handle all cases.

Update tests and replace the use of gpu pipeline with the new rndv
pipeline algorithm.

Update document doc/mpich/tuning_parameters.md regarding gpu pipeline.
The huge protocol, including the multi-nic striping option, is fully
replaced by the new rndv read algorithm.
Removing leftover constants from removed features (huge messages and gpu
pipeline).
This is effectively replaced with MPIR_CVAR_CH4_OFI_EAGER_THRESHOLD.
hzhou added 8 commits August 21, 2025 16:52
Pick between pipeline, rdma read, and rdma write based on whether the
sender side and recv side require pipelined packing.

We are completely separating the ofi native rndv path from the mpidig path.
The native path will not touch the MPIDIG_REQUEST fields, only the
other union members.

All the native rndv unions share common fields, initialized in
MPIDI_OFI_send and MPIDI_OFI_recv_rndv_event.

The auto protocol determines the best rndv protocol based on whether it
is beneficial to do pipelined packing. When neither side needs to pack,
we use the direct method instead.

The direct method uses am_tag_{send,recv}. Use the special constant -1 for
handler_id so it does not call mpidig callbacks.
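The auto selection described in this commit can be sketched from the benchmark table earlier in the thread (H2H direct, D2D pipeline, D2H write, H2D read). This is a hedged reconstruction; the enum and function names are illustrative, not the actual MPICH identifiers, and the mapping is inferred from the measurements above.

```c
#include <assert.h>

/* Hypothetical names; mapping inferred from the PR's benchmark results. */
enum rndv_proto { RNDV_DIRECT, RNDV_PIPELINE, RNDV_RDMA_READ, RNDV_RDMA_WRITE };

static enum rndv_proto auto_select(int send_need_pack, int recv_need_pack)
{
    if (send_need_pack && recv_need_pack)
        return RNDV_PIPELINE;       /* e.g. D2D without GPU direct RDMA */
    if (send_need_pack)
        return RNDV_RDMA_WRITE;     /* e.g. D2H: sender packs, then writes */
    if (recv_need_pack)
        return RNDV_RDMA_READ;      /* e.g. H2D: receiver reads, then unpacks */
    return RNDV_DIRECT;             /* neither side packs, e.g. H2H */
}
```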
Especially in the pipeline protocols, deadlocks may happen if the recv
data size disagrees with the send data size. Add a data size sync message
if necessary.

Update recv status count and set error if truncation happens.

For the direct protocol, the truncation error is caught in am_tag_recv. Just
need to make sure to transfer status.MPI_ERROR upon completion.
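The status-count and truncation logic amounts to: the delivered count is the smaller of the two sizes, and a sender size larger than the posted recv size is a truncation error. A minimal sketch with hypothetical names (the real code sets MPI_ERR_TRUNCATE on the request's status):

```c
#include <assert.h>
#include <stddef.h>

#define ERR_TRUNCATE 1  /* stand-in for MPI_ERR_TRUNCATE */

/* Hypothetical helper: compute the delivered count and report truncation. */
static int set_recv_status(size_t send_sz, size_t recv_sz, size_t *count_out)
{
    *count_out = (send_sz < recv_sz) ? send_sz : recv_sz;
    return (send_sz > recv_sz) ? ERR_TRUNCATE : 0;
}
```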
Setting MPIDI_OFI_REQUEST(*request, am_req) to NULL may overwrite the
rndv fields if MPIDI_OFI_send uses the rndv path. This is because
the rndv fields are in a union with the MPIDI_OFI_REQUEST fields.
Initialize am_req right after request creation instead.
MPIX async progress is invoked without the vci critical section. Make sure
to enter the VCI CS when we need to access the genq private pool, call
libfabric, or free a request.
Registered host memory works with RDMA and should be preferred over
pipeline.
Explicitly include ofi_impl.h in ofi_rndv.c for noinline build option.

Move MPIDI_OFI_gpu_get_{send,recv}_engine_type to ofi_impl.h. Move both
CVAR comment blocks to ofi_init.c.
Pass the request object to MPIDI_NM_am_can_do_tag so the netmod can
decide based on message size etc.

Do the same for MPIDI_SHM_am_can_do_tag for consistency.
@hzhou hzhou merged commit 059ea00 into pmodels:main Aug 21, 2025
4 checks passed
@hzhou hzhou deleted the 2507_pipeline branch August 21, 2025 21:59