
Fix cuda::memcpy async edge cases and add more tests #6608

Merged
bernhardmgruber merged 23 commits into NVIDIA:main from bernhardmgruber:fix_memcpy_async
Dec 19, 2025
Conversation


@bernhardmgruber bernhardmgruber commented Nov 12, 2025

I pulled the core fix out into #6710, so it can ship on time for 3.2.

Fixes: #6601


copy-pr-bot bot commented Nov 12, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


@bernhardmgruber bernhardmgruber changed the title Fix cuda::memcpy async edge cases Fix cuda::memcpy async edge cases and add more tests Nov 12, 2025
@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Progress in CCCL Nov 12, 2025
@bernhardmgruber
Contributor Author

/ok to test cca4271

const unsigned int tid = threadIdx.z * blockDim.y * blockDim.x + threadIdx.y * blockDim.x + threadIdx.x;
const unsigned int warp_id = tid / 32;
const unsigned int uniform_warp_id = __shfl_sync(0xFFFFFFFF, warp_id, 0); // broadcast from lane 0
return uniform_warp_id == 0 && ::cuda::ptx::elect_sync(0xFFFFFFFF); // elect a leader thread among warp 0
Contributor Author


The old logic is wrong for any _Group that is not a full thread block.

Comment on lines +61 to +66
[[nodiscard]] _CCCL_DEVICE _CCCL_FORCEINLINE bool
__elect_from_group(const cooperative_groups::thread_block& __g) noexcept
{
-  // cooperative groups maps a multidimensional thread id into the thread rank the same way as warps do
-  const unsigned int tid = __g.thread_rank();
+  // Cannot call __g.thread_rank(), because we only forward declared the thread_block type
+  // cooperative groups (and we here) maps a multidimensional thread id into the thread rank the same way as warps do
+  const unsigned int tid = threadIdx.z * blockDim.y * blockDim.x + threadIdx.y * blockDim.x + threadIdx.x;
Contributor Author


@pciolkosz if we had a cooperative_groups::thread_block<1> or some other way to detect that the block is 1D, we could save a lot of special register reads here!

Contributor Author


Alternatively, we could just add a cuda::thread_block_group<1> which would fulfill the Group concept and give us efficient codegen here. @miscco and @pciolkosz what do you think?

@bernhardmgruber bernhardmgruber force-pushed the fix_memcpy_async branch 2 times, most recently from 9ee0408 to ce7f528 Compare November 13, 2025 09:38
@bernhardmgruber
Contributor Author

/ok to test ce7f528

Comment on lines +132 to +143
// use 2 groups of 4 threads to copy 8 items each, but spread the copies 16 elements apart
auto tiled_groups = cg::tiled_partition<4>(cg::this_thread_block());
if (threadIdx.x < 8)
{
static_assert(thread_block_size >= 8);
printf("%u copying 8 items at meta group rank %u\n", threadIdx.x, tiled_groups.meta_group_rank());
cuda::memcpy_async(
tiled_groups,
&dest->data[tiled_groups.meta_group_rank() * 16],
&source->data[tiled_groups.meta_group_rank() * 16],
sizeof(T) * 8,
*bar);
Contributor Author


Remark: the possibility of this is incredibly clever and unholy at the same time.

@bernhardmgruber bernhardmgruber marked this pull request as ready for review November 13, 2025 12:17
@bernhardmgruber bernhardmgruber requested review from a team as code owners November 13, 2025 12:17
@cccl-authenticator-app cccl-authenticator-app bot moved this from In Progress to In Review in CCCL Nov 13, 2025


Additionally:

- If *Shape* is :ref:`cuda::aligned_size_t <libcudacxx-extended-api-memory-aligned-size>`, ``source``
Contributor


question. Are these constraints evaluated in assertions?

Contributor Author


We already assert that the pointers are aligned. I have now also added an assertion that the pipeline has not been quit.

I cannot easily check whether the parameters are the same across all threads of a group, or whether all threads of that group actually called the API. It may be possible with some block-wide operations, but that seems a bit much for an assertion.

@bernhardmgruber bernhardmgruber force-pushed the fix_memcpy_async branch 2 times, most recently from 3115170 to 4fc9b4e Compare December 19, 2025 02:55

int main(int argc, char** argv)
{
-  NV_IF_TARGET(NV_IS_HOST, cuda_thread_count = 4;)
+  NV_IF_TARGET(NV_IS_HOST, cuda_thread_count = thread_block_size;)
Contributor Author


I could finally reproduce and hunt down this bug, and the problematic line is here. nvrtcc (a driver executable for nvrtc) searches the input source for a line like cuda_thread_count = ..., where ... is supposed to be an integer literal. Because I put a named constant here, nvrtcc ran the tests with a block size of 1, which led to the hang in the kernel.

Contributor Author


Here is a PR to save us such a long hunt next time: #7035


@github-actions
Contributor

🥳 CI Workflow Results

🟩 Finished in 1h 03m: Pass: 100%/91 | Total: 16h 09m | Max: 37m 09s | Hits: 99%/211081


@bernhardmgruber bernhardmgruber merged commit 7d389d4 into NVIDIA:main Dec 19, 2025
102 of 103 checks passed
@github-project-automation github-project-automation bot moved this from In Review to Done in CCCL Dec 19, 2025
github-actions bot pushed a commit that referenced this pull request Dec 19, 2025
@github-actions
Contributor

Successfully created backport PR for branch/3.2.x:

bernhardmgruber added a commit that referenced this pull request Dec 19, 2025
(cherry picked from commit 7d389d4)

Co-authored-by: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>
fbusato pushed a commit to fbusato/cccl that referenced this pull request Dec 23, 2025

Successfully merging this pull request may close these issues.

[BUG] cuda::memcpy_async hangs in some examples

3 participants