comm: reimplement nonblocking contextid allocation using MPIX Async #7648
Open: hzhou wants to merge 12 commits into pmodels:main from hzhou:2510_idup_nb
a73f4ba to 8e59515
test:mpich/ch3/most ✔️
We need to enter the thread critical section (CS) before calling the async poll functions, so a recursive situation can arise when a callback makes blocking MPI calls and thereby invokes MPI progress from within the poll function. To allow that, we need to skip async progress when re-entering progress.
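The re-entrancy rule described above can be sketched as follows. This is a minimal single-threaded model, not MPICH's code: `progress`, `async_callback`, `in_async_poll`, and the counters are all hypothetical names, and the "blocking MPI call" inside the callback is reduced to a direct nested call to `progress`.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names throughout; this models the idea only. */
static bool in_async_poll = false;  /* true while an async callback is running */
static int async_poll_count = 0;    /* times the async callback was entered */
static int network_steps = 0;       /* times the non-async part of progress ran */

static void progress(void);

/* A callback that makes a "blocking call", which re-enters progress. */
static void async_callback(void)
{
    assert(in_async_poll);          /* callbacks only run under the guard */
    async_poll_count++;
    progress();                     /* nested entry: must not poll async things again */
}

static void progress(void)
{
    if (!in_async_poll) {
        /* Outermost entry: poll async things, guarded against recursion. */
        in_async_poll = true;
        async_callback();
        in_async_poll = false;
    }
    /* Both outer and nested entries still make ordinary progress. */
    network_steps++;
}
```

Without the `in_async_poll` check, the nested `progress` call would invoke `async_callback` again and recurse without bound. With it, one outer `progress` call runs the callback once and the ordinary progress step twice (once nested, once outer).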
Re-implement the nonblocking contextid allocation algorithm using MPIX Async.
Add internal nonblocking collective interfaces that accept an explicit tag. This allows asynchronous algorithms to call nonblocking collectives internally without being tied to a specific schedule framework. Specifically, it allows nonblocking algorithms that use the MPIX Async interface.
Use the nonblocking collective interface with an explicit tag in the nonblocking context_id allocation algorithm.
The basic generalized request relies on an external progress mechanism to complete the request rather than on the extension with wait_fn. We can create a generalized request using the MPIX Async mechanism, and MPID_Progress_wait will complete the request.
MPIR_SCHED_KIND_GENERALIZED is no longer needed.
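To illustrate the idea of a request that is completed by polling from the progress loop rather than by a dedicated wait function, here is a minimal model. It does not use the real MPIX Async or MPID interfaces: `struct greq`, `greq_poll`, `progress_wait`, and the `POLL_*` values are hypothetical stand-ins, and the pending work is simulated by a countdown.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the status an async poll function reports. */
enum poll_result { POLL_PENDING, POLL_DONE };

/* Hypothetical request: completes after a fixed number of pending polls. */
struct greq {
    bool complete;
    int polls_left;
};

/* Poll function: advances the work and marks the request complete when done. */
static enum poll_result greq_poll(struct greq *r)
{
    if (r->polls_left > 0) {
        r->polls_left--;
        return POLL_PENDING;
    }
    r->complete = true;
    return POLL_DONE;
}

/* Model of a progress wait: keeps polling until the request is complete.
 * Returns the number of poll invocations it took. */
static int progress_wait(struct greq *r)
{
    int iterations = 0;
    while (!r->complete) {
        enum poll_result res = greq_poll(r);
        iterations++;
        /* The reported status and the completion flag must agree. */
        assert((res == POLL_DONE) == r->complete);
    }
    return iterations;
}
```

The point of the design is that the waiter needs no request-specific wait hook: any caller that drives progress will eventually drive the poll function to completion.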
It's easier for debugging when we can track the iteration number between retries.
Refactor the blocking and nonblocking algorithms to avoid duplication and inconsistencies. Fix a potentially missed thread-safety issue in the nonblocking code.
Ch3 needs to be informed whether it can enter a blocking receive during progress or whether it needs to poll the progress continuously.
Re-organize code for better readability. Re-do the comments to remove stale parts and reflect the current code.
The dynamic_sendrecv is used in MPI_Intercomm_create. Mismatches between threads are prevented by the user-provided tag, so it is okay to yield during the blocking progress. Without the yield, MPI_Intercomm_create may block another thread's progress when the remote processes are not present (blocked by other communications). In the dynamic process accept/connect path, we force peer_comm's context id to 0. This is okay because the leader exchange is established with a specific pair of addresses and there are no other communications yet during leader_exchange.
Pull Request Description
The nonblocking contextid allocation algorithm is currently implemented using Sched. It requires a few hacks and is very difficult to debug. Re-implement it using the MPIX Async API instead.
NOTE: Hopefully, this will resolve the outstanding test xfails. Now that I understand the algorithm better, if we still encounter lock contention issues, we can try inserting a heavy yield when we know we are not getting the masks.
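For readers unfamiliar with the "masks" mentioned above, the following sketch shows one common way a mask-based ID agreement can work. It is an assumption for illustration, not taken from this PR: each process holds a bitmask of free IDs, the masks are combined with a bitwise AND (standing in for an allreduce), and the lowest bit set in the result is the ID every process can use. `NPROCS`, `combine_masks`, `lowest_free_id`, and `try_allocate` are hypothetical names, and the processes are simulated by an array.

```c
#include <assert.h>
#include <stdint.h>

#define NPROCS 3   /* hypothetical number of simulated processes */

/* Bitwise AND of every process's free-ID mask; stands in for an allreduce. */
static uint32_t combine_masks(const uint32_t masks[NPROCS])
{
    uint32_t common = UINT32_MAX;
    for (int i = 0; i < NPROCS; i++)
        common &= masks[i];
    return common;
}

/* Index of the lowest set bit, or -1 if no ID is free on every process. */
static int lowest_free_id(uint32_t common)
{
    for (int bit = 0; bit < 32; bit++)
        if (common & (UINT32_C(1) << bit))
            return bit;
    return -1;
}

/* One allocation attempt. On success the chosen bit is cleared in every
 * mask; on failure (-1) the caller would retry later. */
static int try_allocate(uint32_t masks[NPROCS])
{
    int id = lowest_free_id(combine_masks(masks));
    assert(id < 32);
    if (id >= 0)
        for (int i = 0; i < NPROCS; i++)
            masks[i] &= ~(UINT32_C(1) << id);
    return id;
}
```

A failed attempt here corresponds to a retry in a real nonblocking algorithm, which is where tracking an iteration number and yielding to other threads would come into play.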
[skip warnings]
Author Checklist
Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
Commits are self-contained and do not do two things at once.
Commit message is of the form: module: short description
Commit message explains what's in the commit.
Whitespace checker. Warnings test. Additional tests via comments.
For non-Argonne authors, check contribution agreement.
If necessary, request an explicit comment from your company's PR approval manager.