
Conversation


@nmorey nmorey commented Nov 8, 2025

The number of received bytes in release_gather_release is incorrectly cast between int and MPI_Aint. On most architectures this is not an issue, but on big-endian 64-bit architectures (s390x) the actual value is lost.
Fix the issue by writing the whole MPI_Aint into the shm_buf instead of just an int.
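
For illustration only, here is a minimal standalone sketch of why the bug shows up only on big-endian 64-bit hosts such as s390x. It is not the actual MPICH code; the slot variable and the use of memcpy are assumptions standing in for the int-sized write into the 8-byte shared-memory slot:

```c
/* Sketch (assumed names, not MPICH source): storing the byte count through
 * an int-sized view of an 8-byte slot works by accident on little-endian
 * hosts but loses the value on big-endian ones. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void)
{
    /* Stand-in for the shared-memory slot holding the received-byte count;
     * MPI_Aint is 8 bytes on a 64-bit build. */
    int64_t shm_slot = 0;

    /* Buggy pattern: only an int is written into the slot. */
    int nbytes = 4;
    memcpy(&shm_slot, &nbytes, sizeof(int));   /* fills bytes 0..3 only */

    /* The reader treats the slot as the full 8-byte type. On little-endian
     * bytes 0..3 are the low half, so the value survives; on big-endian they
     * are the high half, so the reader sees 4 << 32 and a later truncation
     * back to int yields 0 ("Received 0 but expected 4"). */
    printf("reader sees: %lld\n", (long long) shm_slot);

    /* Fixed pattern: write and read the slot with the full-width type. */
    int64_t full_count = 4;
    memcpy(&shm_slot, &full_count, sizeof(full_count));
    printf("after fix:   %lld\n", (long long) shm_slot);
    return 0;
}
```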

This bug was found on MPICH 4.3.2 while debugging on s390x with ch4:ofi:

```
> mpiexec -np 4     ./file_info -fname test
Abort(476133135) on node 1 (rank 1 in comm 0): Fatal error in internal_Bcast: Other MPI error, error stack:
internal_Bcast(116)........................: MPI_Bcast(buffer=0x1004174, count=1, MPI_INT, 0, MPI_COMM_WORLD) failed
MPID_Bcast(295)............................: 
MPIDI_Bcast_allcomm_composition_json(239)..: 
MPIDI_Bcast_intra_composition_alpha(292)...: 
MPIDI_POSIX_mpi_bcast(278).................: 
MPIDI_POSIX_mpi_bcast_release_gather(127)..: 
MPIDI_POSIX_mpi_release_gather_release(225): message sizes do not match across processes in the collective routine: Received 0 but expected 4
```

@nmorey nmorey force-pushed the dev/master/aint branch 2 times, most recently from 5e84157 to e793354 on November 18, 2025 18:34
…her_release

The number of received bytes in release_gather_release is incorrectly cast between
int and MPI_Aint. On most architectures this is not an issue, but on big-endian 64-bit
architectures (s390x) the actual value is lost because only the 4 most-significant bytes are copied.
Fix the issue by writing the whole MPI_Aint into the shm_buf instead of just an int.

Signed-off-by: Nicolas Morey <[email protected]>

@hzhou hzhou left a comment


LG. Thanks for the fix!


hzhou commented Nov 19, 2025

test:mpich/ch3/most
test:mpich/ch4/most

