Fix cuda::memcpy async edge cases and add more tests #6608

bernhardmgruber merged 23 commits into NVIDIA:main

Conversation
Title changed: "cuda::memcpy async edge cases" → "cuda::memcpy async edge cases and add more tests"
/ok to test cca4271
```cpp
const unsigned int tid = threadIdx.z * blockDim.y * blockDim.x + threadIdx.y * blockDim.x + threadIdx.x;
const unsigned int warp_id = tid / 32;
const unsigned int uniform_warp_id = __shfl_sync(0xFFFFFFFF, warp_id, 0); // broadcast from lane 0
return uniform_warp_id == 0 && ::cuda::ptx::elect_sync(0xFFFFFFFF); // elect a leader thread among warp 0
```
The old logic is wrong for any _Group that is not a full thread block.
```diff
 [[nodiscard]] _CCCL_DEVICE _CCCL_FORCEINLINE bool
 __elect_from_group(const cooperative_groups::thread_block& __g) noexcept
 {
-  // cooperative groups maps a multidimensional thread id into the thread rank the same way as warps do
-  const unsigned int tid = __g.thread_rank();
+  // Cannot call __g.thread_rank(), because we only forward declared the thread_block type
+  // cooperative groups (and we here) maps a multidimensional thread id into the thread rank the same way as warps do
+  const unsigned int tid = threadIdx.z * blockDim.y * blockDim.x + threadIdx.y * blockDim.x + threadIdx.x;
```
@pciolkosz if we had a cooperative_groups::thread_block<1> or some other way to detect that the block is 1D, we could save a lot of special register reads here!
Alternatively, we could just add a cuda::thread_block_group<1> which would fulfill the Group concept and give us efficient codegen here. @miscco and @pciolkosz what do you think?
Force-pushed 9ee0408 to ce7f528 (compare)
/ok to test ce7f528
Resolved review threads (outdated):

- libcudacxx/include/cuda/__memcpy_async/cp_async_bulk_shared_global.h (4 threads)
- libcudacxx/test/libcudacxx/cuda/memcpy_async/group_memcpy_async.h
```cpp
// use 2 groups of 4 threads to copy 8 items each, but spread them 16 bytes
auto tiled_groups = cg::tiled_partition<4>(cg::this_thread_block());
if (threadIdx.x < 8)
{
  static_assert(thread_block_size >= 8);
  printf("%u copying 8 items at meta group rank %u\n", threadIdx.x, tiled_groups.meta_group_rank());
  cuda::memcpy_async(
    tiled_groups,
    &dest->data[tiled_groups.meta_group_rank() * 16],
    &source->data[tiled_groups.meta_group_rank() * 16],
    sizeof(T) * 8,
    *bar);
```
Remark: the possibility of this is incredibly clever and unholy at the same time.
Resolved review threads (outdated):

- libcudacxx/test/libcudacxx/cuda/memcpy_async/group_memcpy_async_16b.pass.cpp
- libcudacxx/include/cuda/__memcpy_async/cp_async_bulk_shared_global.h (2 threads)
Force-pushed c4a1509 to c23d96d (compare), then 97cddd0 to 3099002 (compare)
```rst
Additionally:

- If *Shape* is :ref:`cuda::aligned_size_t <libcudacxx-extended-api-memory-aligned-size>`, ``source``
```
Question: are these constraints evaluated in assertions?
We already assert that the pointers are aligned. I have now also added an assertion that the pipeline has not been quit.

I cannot easily check whether the parameters are the same across all threads of a group, or whether all threads of that group also called the API. It may be possible with some block-wide operations, but that seems a bit much for an assertion.
error: A __device__ variable template cannot have a const qualified type on Windows
Force-pushed 3115170 to 4fc9b4e (compare)
```diff
 int main(int argc, char** argv)
 {
-  NV_IF_TARGET(NV_IS_HOST, cuda_thread_count = 4;)
+  NV_IF_TARGET(NV_IS_HOST, cuda_thread_count = thread_block_size;)
```
I could finally reproduce and hunt down this bug, and the problematic line is here. nvrtcc (a driver executable for nvrtc) searches the input source for a line like `cuda_thread_count = ...`, where `...` is supposed to be an integer literal. Because I put a named constant here, nvrtc ran the tests with a block size of 1, which led to the hang in the kernel.
Here is a PR to save us such a long hunt next time: #7035
This reverts commit 4fc9b4e.
🥳 CI Workflow Results

🟩 Finished in 1h 03m: Pass: 100%/91 | Total: 16h 09m | Max: 37m 09s | Hits: 99%/211081

See results here.
(cherry picked from commit 7d389d4)
Successfully created backport PR for
(cherry picked from commit 7d389d4) Co-authored-by: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>
- `cuda::memcpy_async` hangs in some examples (#6601): does not hang anymore
- `cuda::memcpy_async` with `cuda::barrier` implementation is inefficient on sm90+ (#5995): is still optimal, we just have more code now for computing the thread rank of the CG group
- `is_thread_block_group_v` optimal
- I pulled the core fix out into #6710, so it can ship on time for 3.2.
Fixes: #6601