@raffenet raffenet commented May 21, 2025

Pull Request Description

Add a test to verify the non-collective behavior of
MPI_Session_init. First, run a collective init/finalize sequence over
all launched processes. Second, try to re-initialize and finalize a new
session from just a single process and ensure it does not hang.

Author Checklist

  • Provide Description
    Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • Commits Follow Good Practice
    Commits are self-contained and do not do two things at once.
    Commit message is of the form: module: short description
    Commit message explains what's in the commit.
  • Passes All Tests
    Whitespace checker. Warnings test. Additional tests via comments.
  • Contribution Agreement
    For non-Argonne authors, check contribution agreement.
    If necessary, request an explicit comment from your company's PR approval manager.

@raffenet
Contributor Author

test:mpich/ch4/most

@raffenet raffenet force-pushed the session-non-coll branch from 8bd3d15 to 05bfcfb Compare May 21, 2025 19:43
@raffenet
Contributor Author

test:mpich/ch4/most


MPI_Session_init(MPI_INFO_NULL, MPI_ERRORS_ABORT, &session);
MPI_Group_from_session_pset(session, "mpi://world", &group);
MPI_Group_rank(group, &rank);
Contributor

Also check the size; the test requires size > 1.

Contributor Author

The new version does this.

@raffenet raffenet force-pushed the session-non-coll branch from 05bfcfb to 694d395 Compare May 22, 2025 18:34
@raffenet raffenet requested a review from hzhou May 28, 2025 15:17
@raffenet
Contributor Author

test:mpich/ch4/most

@raffenet raffenet requested a review from hzhou June 20, 2025 19:40
@raffenet
Contributor Author

test:mpich/ch4/most

@raffenet
Contributor Author

test:mpich/ch4/most

@raffenet
Contributor Author

The test fails with ch4:ofi when the PSM3 provider is selected. FI_PROVIDER=^psm3 allows it to pass.
ch4:ucx fails because of a PMI world barrier in MPIDI_UCX_mpi_finalize_hook.

Will address these issues in a follow-up patch.

@raffenet
Contributor Author

> The test fails with ch4:ofi when the PSM3 provider is selected. FI_PROVIDER=^psm3 allows it to pass.

This is actually the same issue. When psm3 is in use, the ch4:ofi netmod will do a PMI barrier in finalize.

@raffenet
Contributor Author

test:mpich/ch4/most
test:mpich/ch3/most

@raffenet
Contributor Author

test:mpich/ch4/most
test:mpich/ch3/most

@raffenet
Contributor Author

test:mpich/ch4/most
test:mpich/ch3/most

@raffenet
Contributor Author

test:mpich/ch4/most
test:mpich/ch3/most

Add a test to verify the local-only behavior of MPI_Session_init. To do
so, we launch an extra process via mpiexec that never calls
MPI_Session_init.

If there is no MPI_COMM_WORLD, we should skip the global PMI barrier
during finalization. World processes are not guaranteed to have
initialized MPI (and thus PMI), so a barrier could hang.
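
The finalize rule described above can be sketched in plain C. This is only an illustration of the decision, not MPICH code: `runtime_state_t`, `fake_world_barrier`, `needs_world_barrier`, and `finalize_hook` are hypothetical names, and the counter stands in for a real PMI world barrier so the sketch has no MPI or PMI dependency.

```c
#include <stdbool.h>

/* Hypothetical stand-in for library state; not MPICH's real data structure. */
typedef struct {
    bool world_initialized;  /* true only if MPI_COMM_WORLD was set up */
} runtime_state_t;

/* Hypothetical counter standing in for a PMI world barrier call. */
static int barrier_calls = 0;

static void fake_world_barrier(void)
{
    barrier_calls++;
}

/* A world barrier is only safe when every world process is known to
 * participate, which the sketch equates with MPI_COMM_WORLD existing. */
static bool needs_world_barrier(const runtime_state_t *st)
{
    return st->world_initialized;
}

/* Finalize path: skip the barrier for sessions-only usage, because some
 * launched processes may never have initialized MPI (and thus PMI). */
static void finalize_hook(const runtime_state_t *st)
{
    if (needs_world_barrier(st)) {
        fake_world_barrier();
    }
    /* Local cleanup would follow here in a real implementation. */
}
```

With `world_initialized` set to false, `finalize_hook` returns without touching the barrier, which is the behavior the new test depends on when one launched process never calls MPI_Session_init.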
@raffenet
Contributor Author

test:mpich/ch4/most
test:mpich/ch3/most
