Reproducer for slow performance with pinned memory #519
base: main
Conversation
msimberg left a comment
Looks good to me, but I'm a bit biased. Importantly, thank you @gppezzi for adding this!
How does this work in terms of the reference? The performance currently does not match the reference, so will the test fail? Or pass with xfail? I don't know enough about reframe to judge how to best handle something like this.
Pull request overview
Adds a new ReFrame check plus a small CUDA+MPI microbenchmark meant to reproduce slow intranode performance when using pinned host memory on Alps (VCUE-1292).
Changes:
- Introduces a CUDA+MPI reproducer (intranode_pinned_host_comm.cpp) that times MPI Send/Recv using selectable memory types (host/pinned_host/device).
- Adds a minimal CMake build for the reproducer.
- Adds a ReFrame regression test (MPIIntranodePinned) to build and run the reproducer and extract a timing metric (a rough sketch of such a test follows the file table below).
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| checks/prgenv/cuda/src/intranode_pinned_host_comm.cpp | New MPI/CUDA reproducer program that allocates the selected memory type and measures intranode Send/Recv timings. |
| checks/prgenv/cuda/src/CMakeLists.txt | New CMake build definition for the reproducer executable and required MPI/CUDA runtime links. |
| checks/prgenv/cuda/cuda_mpi_intranode_pinned.py | New ReFrame test that builds/runs the reproducer and reports a performance value. |
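For readers unfamiliar with ReFrame, a check like the one described above roughly takes the following shape. This is a minimal, hypothetical sketch and not the contents of cuda_mpi_intranode_pinned.py from this PR: the system/environment selectors, executable arguments, task layout, and output pattern are all assumptions.

```python
import reframe as rfm
import reframe.utility.sanity as sn


@rfm.simple_test
class MPIIntranodePinned(rfm.RegressionTest):
    # Hypothetical configuration; the actual test in this PR may differ.
    valid_systems = ['*']
    valid_prog_environs = ['+cuda']   # assumes a CUDA-capable environment
    build_system = 'CMake'
    sourcesdir = 'src'
    executable = 'intranode_pinned_host_comm'
    executable_opts = ['pinned_host', '134217728']  # memory type, message size (assumed CLI)
    num_tasks = 2
    num_tasks_per_node = 2            # intranode: both ranks on the same node

    @sanity_function
    def found_timing(self):
        return sn.assert_found(r'time:', self.stdout)

    @performance_function('s')
    def send_recv_time(self):
        # Assumes the output format shown in the timings quoted below.
        return sn.extractsingle(r'time:\s*(\S+)', self.stdout, 1, float)
```

The open question in the discussion below is what reference value, if any, to attach to a metric like send_recv_time.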
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.
```
-N1 -n2 + host 134217728 = 0: [1:4] time: 0.00298300
-N2 -n2 + host 134217728 = 0: [1:4] time: 0.00592534
-N1 -n2 + pinned_host 134217728 = 0: [1:4] time: 0.0157693
-N2 -n2 + pinned_host 134217728 = 0: [1:4] time: 0.00559603
-N1 -n2 + host 134217728 = 0: [1:4] time: 0.0112828
-N2 -n2 + host 134217728 = 0: [1:4] time: 0.0109703
-N1 -n2 + pinned_host 134217728 = 0: [1:4] time: 0.0113354
-N2 -n2 + pinned_host 134217728 = 0: [1:4] time: 0.00553771
```
so not sure what the reference should be.
cscs-ci run alps-daint-uenv;MY_UENV=prgenv-gnu/25.11:v1
@jpcoles suggested that we keep this as a failure, to remind us to ping the vendor about the issue. If we close the case, then we can disable the test...
Is this a case where xfail would be useful? There's absolutely nothing we can do currently to fix this test, so having it report failures on reframe pipelines all the time is just noise IMO. We ideally want failures to report something that we actually need to act on. If we mark it xfail it's allowed to fail without creating noise, but if the issue is fixed the test should become xpass and cause a pipeline to fail, so we know when the behaviour has changed. If/when that happens, we can then require that the test passes to make sure it doesn't regress. Does that sound reasonable? I'm not 100% sure this is how xfail works in reframe, but it's how I expect it to work at least.
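To make the proposed semantics concrete, here is how strict xfail behaves in pytest; this is pytest's API rather than ReFrame's, and the timing constants and reference threshold below are illustrative assumptions.

```python
import pytest

# Numbers loosely based on the 128 MiB timings quoted above; the reference
# threshold is an arbitrary stand-in for whatever the check would use.
PINNED_HOST_TIME_S = 0.0157
REFERENCE_TIME_S = 0.006


@pytest.mark.xfail(reason="VCUE-1292: slow intranode Send/Recv with pinned host memory",
                   strict=True)
def test_pinned_host_send_recv_time():
    # Fails today and is reported quietly as xfail. If the vendor issue is
    # fixed and the timing drops below the reference, this becomes an
    # unexpected pass (xpass); with strict=True that fails the run, flagging
    # that the behaviour has changed.
    assert PINNED_HOST_TIME_S <= REFERENCE_TIME_S
```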