Conversation

@gppezzi (Collaborator) commented Feb 11, 2026

$ export UENV=prgenv-gnu/25.11:v1
$ reframe -C config/cscs.py -c checks -n MPIIntranodePinned  -r

@gppezzi gppezzi self-assigned this Feb 11, 2026
@gppezzi gppezzi marked this pull request as ready for review February 11, 2026 16:09
@gppezzi gppezzi requested a review from jgphpc February 11, 2026 16:09
@msimberg (Contributor) left a comment

Looks good to me, but I'm a bit biased. Importantly, thank you @gppezzi for adding this!

How does this work in terms of the reference? The performance currently does not match the reference, so will the test fail? Or pass with xfail? I don't know enough about reframe to judge how to best handle something like this.

@jgphpc jgphpc requested a review from Copilot February 11, 2026 16:13
@jgphpc jgphpc changed the title from "Reproducer for slow performance with pinned memory (Known issue on Alps)" to "Reproducer for slow performance with pinned memory" Feb 11, 2026
Copilot AI left a comment

Pull request overview

Adds a new ReFrame check plus a small CUDA+MPI microbenchmark meant to reproduce slow intranode performance when using pinned host memory on Alps (VCUE-1292).

Changes:

  • Introduces a CUDA+MPI reproducer (intranode_pinned_host_comm.cpp) that times MPI Send/Recv using selectable memory types (host/pinned_host/device).
  • Adds a minimal CMake build for the reproducer.
  • Adds a ReFrame regression test (MPIIntranodePinned) to build and run the reproducer and extract a timing metric (a rough sketch of such a test follows this list).
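For context, here is a rough sketch of what such a ReFrame test could look like. The class name matches the test described above, but the system/environment constraints, source layout, metric name, and regex patterns are assumptions for illustration, not the actual contents of cuda_mpi_intranode_pinned.py.

```python
# Minimal sketch, not the actual test: build the CMake reproducer, run it on
# two ranks of a single node, and report the pinned_host timing as a metric.
import reframe as rfm
import reframe.utility.sanity as sn


@rfm.simple_test
class MPIIntranodePinned(rfm.RegressionTest):
    valid_systems = ['*']              # placeholder; the real test is likely narrower
    valid_prog_environs = ['+cuda']    # placeholder feature-based selection
    build_system = 'CMake'
    sourcesdir = 'src'
    executable = './intranode_pinned_host_comm'
    num_tasks = 2
    num_tasks_per_node = 2

    @sanity_function
    def ran_to_completion(self):
        # The reproducer prints lines of the form '... time: <seconds>'.
        return sn.assert_found(r'time:', self.stdout)

    @performance_function('s')
    def pinned_host_time(self):
        # Pick up the pinned_host timing from stdout.
        return sn.extractsingle(r'pinned_host.*time:\s*(?P<t>\S+)',
                                self.stdout, 't', float)
```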

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 10 comments.

| File | Description |
| --- | --- |
| checks/prgenv/cuda/src/intranode_pinned_host_comm.cpp | New MPI/CUDA reproducer program that allocates the selected memory type and measures intranode Send/Recv timings. |
| checks/prgenv/cuda/src/CMakeLists.txt | New CMake build definition for the reproducer executable and the required MPI/CUDA runtime links. |
| checks/prgenv/cuda/cuda_mpi_intranode_pinned.py | New ReFrame test that builds/runs the reproducer and reports a performance value. |


Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.



@jgphpc (Collaborator) commented Feb 11, 2026

For reference:

srun -t1 --cpu-bind=sockets ./intranode_pinned_host_comm 2 5 ... + prgenv-gnu/25.11:v1 gives me:

  • on daint:
-N1 -n2 + host 134217728 = 0: [1:4] time: 0.00298300
-N2 -n2 + host 134217728 = 0: [1:4] time: 0.00592534

-N1 -n2 + pinned_host 134217728 = 0: [1:4] time: 0.0157693
-N2 -n2 + pinned_host 134217728 = 0: [1:4] time: 0.00559603
  • but on starlex:
-N1 -n2 + host 134217728 = 0: [1:4] time: 0.0112828
-N2 -n2 + host 134217728 = 0: [1:4] time: 0.0109703

-N1 -n2 + pinned_host 134217728 = 0: [1:4] time: 0.0113354
-N2 -n2 + pinned_host 134217728 = 0: [1:4] time: 0.00553771

So I'm not sure what the reference should be.
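If/when we settle on numbers, one option would be per-system reference entries with generous tolerances, along the lines of the sketch below. The partition names, the metric name, and the ±50% tolerances are placeholders; the values are just the -N1 -n2 pinned_host timings above.

```python
# Sketch only: per-system references for the pinned_host timing.
reference = {
    'daint:normal': {
        'pinned_host_time': (0.0158, -0.5, 0.5, 's'),    # single-node pinned_host time on daint
    },
    'starlex:normal': {
        'pinned_host_time': (0.0113, -0.5, 0.5, 's'),    # single-node pinned_host time on starlex
    },
}
```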

@jgphpc (Collaborator) commented Feb 11, 2026

cscs-ci run alps-daint-uenv;MY_UENV=prgenv-gnu/25.11:v1

@gppezzi (Collaborator, Author) commented Feb 12, 2026

> Looks good to me, but I'm a bit biased. Importantly, thank you @gppezzi for adding this!
>
> How does this work in terms of the reference? The performance currently does not match the reference, so will the test fail? Or pass with xfail? I don't know enough about reframe to judge how to best handle something like this.

@jpcoles suggested that we keep this as a failure, to remind us to ping the vendor about the issue.

If we close the case, then we can disable the test...

@msimberg (Contributor)

>> Looks good to me, but I'm a bit biased. Importantly, thank you @gppezzi for adding this!
>> How does this work in terms of the reference? The performance currently does not match the reference, so will the test fail? Or pass with xfail? I don't know enough about reframe to judge how to best handle something like this.
>
> @jpcoles suggested that we keep this as a failure, to remind us to ping the vendor about the issue.
>
> If we close the case, then we can disable the test...

Is this a case where xfail would be useful? There's absolutely nothing we can do currently to fix this test, so having it report failures on reframe pipelines all the time is just noise IMO. We ideally want failures to report something that we actually need to act on. If we mark it xfail it's allowed to fail without creating noise, but if the issue is fixed the test should become xpass and cause a pipeline to fail, so we know when the behaviour has changed. If/when that happens, we can then require that the test passes to make sure it doesn't regress. Does that sound reasonable? I'm not 100% sure this is how xfail works in reframe, but it's how I expect it to work at least.
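One way to get something close to that with plain reframe references, in case an explicit xfail isn't available, is to pin the reference to the currently observed (slow) timing and keep a lower threshold: the test passes while the vendor issue persists, and if the timing ever improves well beyond the reference (the "xpass" case) the performance check fails and the pipeline flags the change. A sketch, with placeholder numbers and metric name:

```python
# Sketch: encode the known-bad timing as the reference while VCUE-1292 is open.
# The -0.5 lower threshold makes the performance check fail if the timing
# improves by more than 50%, i.e. roughly when the issue gets fixed.
reference = {
    '*': {
        'pinned_host_time': (0.0158, -0.5, 0.5, 's'),
    },
}
```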

