Fix cuda::memcpy async edge cases and add more tests #6608

bernhardmgruber merged 23 commits into NVIDIA:main

Conversation
Title changed: "cuda::memcpy async edge cases" → "cuda::memcpy async edge cases and add more tests"
/ok to test cca4271
```cpp
const unsigned int tid = threadIdx.z * blockDim.y * blockDim.x + threadIdx.y * blockDim.x + threadIdx.x;
const unsigned int warp_id = tid / 32;
const unsigned int uniform_warp_id = __shfl_sync(0xFFFFFFFF, warp_id, 0); // broadcast from lane 0
return uniform_warp_id == 0 && ::cuda::ptx::elect_sync(0xFFFFFFFF); // elect a leader thread among warp 0
```
The old logic is wrong for any _Group that is not a full thread block.
```diff
 [[nodiscard]] _CCCL_DEVICE _CCCL_FORCEINLINE bool
 __elect_from_group(const cooperative_groups::thread_block& __g) noexcept
 {
-  // cooperative groups maps a multidimensional thread id into the thread rank the same way as warps do
-  const unsigned int tid = __g.thread_rank();
+  // Cannot call __g.thread_rank(), because we only forward declared the thread_block type
+  // cooperative groups (and we here) maps a multidimensional thread id into the thread rank the same way as warps do
+  const unsigned int tid = threadIdx.z * blockDim.y * blockDim.x + threadIdx.y * blockDim.x + threadIdx.x;
```
@pciolkosz if we had a cooperative_groups::thread_block<1> or some other way to detect that the block is 1D, we could save a lot of special register reads here!
Alternatively, we could just add a cuda::thread_block_group<1> which would fulfill the Group concept and give us efficient codegen here. @miscco and @pciolkosz what do you think?
Force-pushed 9ee0408 to ce7f528 (compare)
/ok to test ce7f528
Resolved review threads (outdated):

- libcudacxx/include/cuda/__memcpy_async/cp_async_bulk_shared_global.h (4 threads)
- libcudacxx/test/libcudacxx/cuda/memcpy_async/group_memcpy_async.h
```cpp
// use 2 groups of 4 threads to copy 8 items each, but spread them 16 bytes
auto tiled_groups = cg::tiled_partition<4>(cg::this_thread_block());
if (threadIdx.x < 8)
{
  static_assert(thread_block_size >= 8);
  printf("%u copying 8 items at meta group rank %u\n", threadIdx.x, tiled_groups.meta_group_rank());
  cuda::memcpy_async(
    tiled_groups,
    &dest->data[tiled_groups.meta_group_rank() * 16],
    &source->data[tiled_groups.meta_group_rank() * 16],
    sizeof(T) * 8,
    *bar);
```
Remark: the possibility of this is incredibly clever and unholy at the same time.
Resolved review threads (outdated):

- libcudacxx/test/libcudacxx/cuda/memcpy_async/group_memcpy_async_16b.pass.cpp
- libcudacxx/include/cuda/__memcpy_async/cp_async_bulk_shared_global.h (2 threads)
Force-pushed c4a1509 to c23d96d (compare), then 97cddd0 to 3099002 (compare)
```rst
Additionally:

- If *Shape* is :ref:`cuda::aligned_size_t <libcudacxx-extended-api-memory-aligned-size>`, ``source``
```
Question: are these constraints evaluated in assertions?
We already assert that the pointers are aligned. I have now also added an assertion that the pipeline has not been quit.

I cannot easily check whether the parameters are the same across all threads of a group, or whether all threads of that group also called the API. It may be possible with some block-wide operations, but that seems a bit much for an assertion.
error: A __device__ variable template cannot have a const qualified type on Windows
Force-pushed 3115170 to 4fc9b4e (compare)
```diff
 int main(int argc, char** argv)
 {
-  NV_IF_TARGET(NV_IS_HOST, cuda_thread_count = 4;)
+  NV_IF_TARGET(NV_IS_HOST, cuda_thread_count = thread_block_size;)
```
I could finally reproduce and hunt down this bug, and the problematic line is here. nvrtcc (a driver executable for nvrtc) searches the input source for a line like `cuda_thread_count = ...`, where `...` is supposed to be an integer literal. Because I put a named constant here, nvrtc ran the tests with a block size of 1, which led to the hang in the kernel.
Here is a PR to save us such a long hunt next time: #7035
This reverts commit 4fc9b4e.
🥳 CI Workflow Results

🟩 Finished in 1h 03m: Pass: 100%/91 | Total: 16h 09m | Max: 37m 09s | Hits: 99%/211081

See results here.
(cherry picked from commit 7d389d4)
Successfully created backport PR for
(cherry picked from commit 7d389d4) Co-authored-by: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>
- `cuda::memcpy_async` hangs in some examples (#6601): does not hang anymore
- `cuda::memcpy_async` with `cuda::barrier` implementation is inefficient on sm90+ (#5995): is still optimal, we just have more code now for computing the thread rank of the CG group
- `is_thread_block_group_v` optimal
- I pulled the core fix out into #6710, so it can ship on time for 3.2.
Fixes: #6601