ch4/ofi: add new pipeline implementation #7529
Conversation
Currently on Aurora, without GPU Direct RDMA:
Simply posting a large buffer in e.g. fi_trecv can incur significant latency, undermining the benefit of pipelining. Since we never directly send a message larger than MPIDI_OFI_EAGER_THRESH, limit the recv buffer size to this threshold.
Add a new pipeline send/recv to the native rndv path.
A reimplementation of the huge protocol. It supports the multi-NIC striping feature. To support RDMA read into a non-contiguous buffer, or when reading directly into a GPU buffer is not supported, we allocate a pipeline chunk buffer and perform Ilocalcopy_gpu after the read.
test:mpich/ch4/ofi
test:mpich/ch4/gpu/ofi ✔️
p->u.recv.num_infly = 0;
...
/* issue fi_read */
mpi_errno = MPIR_Async_things_add(rndvread_read_poll, rreq, NULL);
this is just my ignorance of the async things interface, but why do we only "add" items in the rndv read protocol, but we "spawn" them in the pipelined send protocol?
If we are creating a new async task from within an async callback, we need to use "spawn". This prevents corrupting the task list and avoids recursive locking.
So it's the case that all the tasks for rndv read and write protocols are created ahead of time? But with the in flight limits of pipeline, they are spawned as other chunks complete?
Not really. In rndvread, we first create a parent task, rndvread_read_poll. The parent task watches the "infly" chunks, allocates chunk buffers, and issues chunk read requests until all chunks are dispatched. rndvwrite works similarly. ... Wait, in rndvwrite the async copy in send_copy_poll is issued from rndvwrite_write_poll, so it should use _spawn. Good catch! Let me add a fixup patch.
OK, so the read ops are not themselves individual async tasks rather they are something the real poll task monitors. But the rndv write code generates async copy tasks that do need to be tracked as their own async tasks within the already started task 😅.
Not exactly. Are you available for offline chat?
Read ops are async tasks, but they are OFI async tasks managed by libfabric; the callbacks for OFI async tasks are the event functions. Our MPIX Async facility manages "local" async tasks (that is, any async task that requires manual polling). In this PR, both the parent chunk-launching task and the individual chunk copy tasks are managed by the MPIX Async facility.
test:mpich/custom
raffenet left a comment:
I don't think I have any other concerns as long as tests clear and the one suggested change is incorporated.
Similar structure as pipeline and rndv read. Some refactoring to avoid some code duplication.
The original gpu pipeline code only works if the application exclusively uses GPU-to-GPU transfers for all large messages. The new code in the RNDV paths handles all cases. Update tests to replace the use of gpu pipeline with the new rndv pipeline algorithm, and update doc/mpich/tuning_parameters.md regarding gpu pipeline.
The huge protocol, including the multi-nic striping option, is fully replaced by the new rndv read algorithm.
Remove leftover constants from the removed features (huge messages and gpu pipeline).
This is effectively replaced with MPIR_CVAR_CH4_OFI_EAGER_THRESHOLD.
Add missing error check.
Pick between pipeline, rdma read, and rdma write based on whether sender
side and recv side require pipelined packing.
We are completely separating the ofi native rndv path from the mpidig path.
The native path will not touch the MPIDIG_REQUEST fields, but only use
the other union members.
All the native rndv unions share common fields and are initialized in
MPIDI_OFI_send and MPIDI_OFI_recv_rndv_event.
The auto protocol determines the best rndv protocol based on whether it is
beneficial to do pipeline packing. When neither side needs packing, we
use the direct method instead.
The direct method uses am_tag_{send,recv}. Use the special constant -1 for
handler_id so it does not invoke mpidig callbacks.
Especially in the pipeline protocols, deadlocks may happen if the recv data size disagrees with the send data size. Add a data-size sync message if necessary. Update the recv status count and set an error if truncation happens. For the direct protocol, the truncation error is caught in am_tag_recv; just make sure to transfer status.MPI_ERROR upon completion.
Setting MPIDI_OFI_REQUEST(*request, am_req) to NULL may overwrite the rndv fields if MPIDI_OFI_send takes the rndv path. This is because the rndv fields are in a union with the MPIDI_OFI_REQUEST fields. Initialize am_req right after request creation instead.
MPIX async progress is invoked without the VCI critical section. Make sure to enter the VCI CS whenever we need to access the genq private pool, call libfabric, or free a request.
Registered host memory works with RDMA and should be preferred vs. pipeline.
Explicitly include ofi_impl.h in ofi_rndv.c for noinline build option.
Move MPIDI_OFI_gpu_get_{send,recv}_engine_type to ofi_impl.h. Move both
CVAR comment blocks to ofi_init.c.
Pass the request object to MPIDI_NM_am_can_do_tag so the netmod can decide based on message size etc. Do the same for MPIDI_SHM_am_can_do_tag for consistency.
Pull Request Description
A new pipeline implementation using the ofi native RNDV path.
[skip warnings]
Reason
They promised to handle everything in the best ways, but when they fall short ...
Design
data_sz < MPIR_CVAR_CH4_OFI_EAGER_THRESHOLD
MPIDI_OFI_global.max_msg_size
TODO
Author Checklist
Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
Commits are self-contained and do not do two things at once.
Commit message is of the form:
module: short description. Commit message explains what's in the commit.
Whitespace checker. Warnings test. Additional tests via comments.
For non-Argonne authors, check contribution agreement.
If necessary, request an explicit comment from your company's PR approval manager.