Add experimental cuda_async_pinned_memory_resource #2164

bdice wants to merge 3 commits into rapidsai:main
Conversation
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
Force-pushed from 40dfa09 to e99ed6e.
Adds a new cuda_async_pinned_memory_resource that provides stream-ordered pinned (page-locked) host memory allocation using CUDA 13.0's cudaMemGetDefaultMemPool API with cudaMemAllocationTypePinned. This parallels the cuda_async_managed_memory_resource added in rapidsai#2056 and addresses part of rapidsai#2054.

Key features:
- Uses the default pinned memory pool for stream-ordered allocation
- Accessible from both host and device
- Requires CUDA 13.0+ (matches managed version for consistency)

Implementation includes:
- C++ header and implementation in cuda_async_pinned_memory_resource.hpp
- Runtime capability check in runtime_capabilities.hpp
- C++ tests in cuda_async_pinned_mr_tests.cpp
- Python bindings in experimental module
- Python tests in test_cuda_async_pinned_memory_resource.py
Force-pushed from e99ed6e to e671b34.
Enables pinned memory pool support on CUDA 12.6+ using cudaMemPoolCreate for CUDA 12.6-12.x and cudaMemGetDefaultMemPool for CUDA 13.0+. Uses unique_ptr with a deleter for automatic pool cleanup. Updates version requirements: 12.6+ for pinned.
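In rough outline, that dispatch might look like this (a minimal sketch based on the commit description, not the PR's actual code: open_pinned_pool is a hypothetical helper, error checking is omitted, and the pool location setup, which the review below turns on, is deliberately elided):

```cpp
// Sketch of the CUDA 12.6 vs. 13.0 dispatch described in the commit message.
#include <cuda_runtime_api.h>

inline cudaMemPool_t open_pinned_pool()  // hypothetical helper name
{
  cudaMemPool_t pool{};
#if CUDART_VERSION >= 13000
  // CUDA 13.0+: query the driver-owned default pinned pool; no cleanup needed.
  cudaMemLocation location{};  // location setup elided; see the review threads below
  cudaMemGetDefaultMemPool(&pool, &location, cudaMemAllocationTypePinned);
#else
  // CUDA 12.6-12.x: create a pool explicitly; the owner must later call
  // cudaMemPoolDestroy(pool), which the PR automates via unique_ptr + deleter.
  cudaMemPoolProps props{};
  props.allocType = cudaMemAllocationTypePinned;
  cudaMemPoolCreate(&pool, &props);
#endif
  return pool;
}
```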
nirandaperera left a comment:
I have some questions on the mem pool location type.
```cpp
// CUDA 12.6-12.x: Create a new pinned memory pool (needs cleanup)
cudaMemPoolProps pool_props{};
pool_props.allocType = cudaMemAllocationTypePinned;
pool_props.location.type = cudaMemLocationTypeDevice;
```
This sets the location type to DEVICE. Is this correct?
In the CCCL pinned mem pool, it's marked as host/host_numa:
https://github.com/NVIDIA/cccl/blob/main/libcudacxx/include/cuda/__memory_resource/pinned_memory_pool.h#L113-L154
I'm wondering what it means by pinned device memory 🤔
Yeah, this is wrong: this allocates device memory.
```cpp
  }
};

TEST_F(AsyncPinnedMRTest, BasicAllocateDeallocate)
```
I feel like all the test cases could be parameterized/templated over both the sync and async allocation and deallocation operations, along the lines of the sketch below.
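For instance, a value-parameterized GoogleTest fixture could cover both paths (a sketch only; it assumes the resource exposes allocate/deallocate as well as allocate_async/deallocate_async overloads, and AsyncPinnedMRParamTest is an illustrative name):

```cpp
// Sketch: one fixture parameterized over sync vs. async alloc/dealloc.
#include <gtest/gtest.h>

#include <rmm/cuda_stream.hpp>
#include <rmm/mr/cuda_async_pinned_memory_resource.hpp>

#include <cstddef>

class AsyncPinnedMRParamTest : public ::testing::TestWithParam<bool> {};

TEST_P(AsyncPinnedMRParamTest, AllocateDeallocate)
{
  bool const use_async = GetParam();
  rmm::mr::cuda_async_pinned_memory_resource mr{};
  rmm::cuda_stream stream{};
  constexpr std::size_t size{1024};

  // Choose the sync or async overload based on the test parameter.
  void* ptr = use_async ? mr.allocate_async(size, stream.view()) : mr.allocate(size);
  EXPECT_NE(ptr, nullptr);

  if (use_async) {
    mr.deallocate_async(ptr, size, stream.view());
    stream.synchronize();
  } else {
    mr.deallocate(ptr, size);
  }
}

INSTANTIATE_TEST_SUITE_P(SyncAndAsync, AsyncPinnedMRParamTest, ::testing::Bool());
```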
```cpp
  cudaMemPool_t pool_handle = mr.pool_handle();
  EXPECT_NE(pool_handle, nullptr);
}
```
Should we also add a device -> pinned host stream-ordered copy? Maybe using a device_vector and checking that the copy produces the same values, something like the sketch below.
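Something along these lines, perhaps (a sketch; it reuses the PR's AsyncPinnedMRTest fixture name, the test name is illustrative, and error checks on the CUDA calls are omitted):

```cpp
// Sketch of a device -> pinned-host stream-ordered round trip.
#include <gtest/gtest.h>

#include <rmm/cuda_stream.hpp>
#include <rmm/device_uvector.hpp>
#include <rmm/mr/cuda_async_pinned_memory_resource.hpp>

#include <cuda_runtime_api.h>

#include <cstring>
#include <numeric>
#include <vector>

TEST_F(AsyncPinnedMRTest, DeviceToPinnedHostCopy)
{
  rmm::mr::cuda_async_pinned_memory_resource mr{};
  rmm::cuda_stream stream{};
  constexpr std::size_t n{100};

  // Fill a device vector with known values.
  std::vector<int> expected(n);
  std::iota(expected.begin(), expected.end(), 0);
  rmm::device_uvector<int> d_vec{n, stream.view()};
  cudaMemcpyAsync(d_vec.data(), expected.data(), n * sizeof(int),
                  cudaMemcpyHostToDevice, stream.value());

  // Stream-ordered copy back into pinned host memory from this resource.
  auto* h_ptr = static_cast<int*>(mr.allocate(n * sizeof(int), stream.view()));
  cudaMemcpyAsync(h_ptr, d_vec.data(), n * sizeof(int),
                  cudaMemcpyDeviceToHost, stream.value());
  stream.synchronize();

  EXPECT_EQ(0, std::memcmp(h_ptr, expected.data(), n * sizeof(int)));
  mr.deallocate(h_ptr, n * sizeof(int), stream.view());
}
```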
```cpp
// CUDA 13.0+: Use the default pinned memory pool (no cleanup needed)
cudaMemLocation location{.type = cudaMemLocationTypeDevice,
                         .id = rmm::get_current_cuda_device().value()};
RMM_CUDA_TRY(
  cudaMemGetDefaultMemPool(pool_handle_.get(), &location, cudaMemAllocationTypePinned));
```
This provides a mempool that allocates on device. If you want a mempool that allocates on host and is page-locked, you need to do:

```cpp
// Note: if we don't specify HostNuma (we might want to...) then .id is ignored
cudaMemLocation location{.type = cudaMemLocationTypeHost, .id = 0};
// Non-migratable memory allocated on host.
cudaMemGetDefaultMemPool(&handle, &location, cudaMemAllocationTypePinned);
cudaMemAccessDesc desc{};
desc.location.type = cudaMemLocationTypeDevice;
desc.location.id   = rmm::get_current_cuda_device().value();
desc.flags         = cudaMemAccessFlagsProtReadWrite;
cudaMemPoolSetAccess(handle, &desc, 1);
```

Note, moreover, that if you don't set the accessibility, then the allocations from this resource are not device-accessible.
```cpp
  // Pinned memory should be accessible from host
  // Write from host
  EXPECT_NO_THROW({
    for (int i = 0; i < 100; ++i) {
      ptr[i] = i;
    }
  });

  // Verify we can read back
  EXPECT_EQ(ptr[0], 0);
  EXPECT_EQ(ptr[50], 50);
```
We need to test that the memory is accessible from device too (via some kernel probably, or maybe a DtoD memcpy?), e.g. something like the sketch below.
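For example, with a trivial kernel (a sketch; it assumes the test translation unit is compiled as CUDA, reuses the PR's fixture name, and omits error checking):

```cpp
// Sketch: have the device write into the pinned allocation, then verify the
// values on the host.
#include <gtest/gtest.h>

#include <rmm/cuda_stream.hpp>
#include <rmm/mr/cuda_async_pinned_memory_resource.hpp>

#include <cstddef>

__global__ void write_indices(int* data, std::size_t n)
{
  auto const i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) { data[i] = static_cast<int>(i); }
}

TEST_F(AsyncPinnedMRTest, DeviceAccessible)
{
  rmm::mr::cuda_async_pinned_memory_resource mr{};
  rmm::cuda_stream stream{};
  constexpr std::size_t n{100};

  auto* ptr = static_cast<int*>(mr.allocate(n * sizeof(int), stream.view()));
  write_indices<<<1, 128, 0, stream.value()>>>(ptr, n);
  stream.synchronize();

  EXPECT_EQ(ptr[0], 0);
  EXPECT_EQ(ptr[50], 50);
  mr.deallocate(ptr, n * sizeof(int), stream.view());
}
```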
```cpp
RMM_EXPECTS(rmm::detail::runtime_async_pinned_alloc::is_supported(),
            "cuda_async_pinned_memory_resource requires CUDA 12.6 or higher runtime");

pool_handle_.reset(new cudaMemPool_t{});
```
As below, there's no need to manage this handle through a smart pointer; this class can do that.
```cpp
  }
};

std::unique_ptr<cudaMemPool_t, pool_deleter> pool_handle_;
```
Since this is an owning object, it seems unnecessary to also have a unique_ptr. Prefer to store a raw cudaMemPool_t handle and deal with cleanup in the dtor, roughly as in the sketch below.
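That is, something of this shape (a sketch; pinned_pool_holder and owns_pool_ are illustrative names, error checking is omitted, and it folds in the host-location fix from the thread above):

```cpp
// Sketch: raw handle plus destructor cleanup instead of unique_ptr.
// owns_pool_ distinguishes the explicitly created CUDA 12.6-12.x pool
// (must be destroyed) from the driver-owned CUDA 13.0+ default pool
// (must not be).
#include <cuda_runtime_api.h>

class pinned_pool_holder {
 public:
  pinned_pool_holder()
  {
#if CUDART_VERSION >= 13000
    cudaMemLocation location{.type = cudaMemLocationTypeHost, .id = 0};
    cudaMemGetDefaultMemPool(&pool_handle_, &location, cudaMemAllocationTypePinned);
#else
    cudaMemPoolProps props{};
    props.allocType     = cudaMemAllocationTypePinned;
    props.location.type = cudaMemLocationTypeHost;
    cudaMemPoolCreate(&pool_handle_, &props);
    owns_pool_ = true;  // only the explicitly created pool needs destroying
#endif
  }

  ~pinned_pool_holder()
  {
    if (owns_pool_) { cudaMemPoolDestroy(pool_handle_); }
  }

  pinned_pool_holder(pinned_pool_holder const&)            = delete;
  pinned_pool_holder& operator=(pinned_pool_holder const&) = delete;

  cudaMemPool_t pool_handle() const { return pool_handle_; }

 private:
  cudaMemPool_t pool_handle_{};
  bool owns_pool_{false};
};
```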
Description
Contributes to #2054.
Adds a new cuda_async_pinned_memory_resource that provides stream-ordered pinned (page-locked) host memory allocation using CUDA 13.0's cudaMemGetDefaultMemPool API with cudaMemAllocationTypePinned. This parallels the cuda_async_managed_memory_resource added in #2056.

Key Features

- Uses a pinned memory pool for stream-ordered allocation
- Accessible from both host and device
- Requires CUDA 12.6 or newer (CUDA 13.0+ uses the default pinned pool)
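A minimal usage sketch (assuming the resource lives in namespace rmm::mr alongside its managed counterpart and follows RMM's usual stream-ordered allocate/deallocate interface):

```cpp
#include <rmm/cuda_stream.hpp>
#include <rmm/mr/cuda_async_pinned_memory_resource.hpp>

int main()
{
  rmm::mr::cuda_async_pinned_memory_resource mr{};
  rmm::cuda_stream stream{};

  // Stream-ordered pinned host allocation; the pointer is host-accessible
  // and usable as the host side of async copies.
  void* ptr = mr.allocate(4096, stream.view());
  mr.deallocate(ptr, 4096, stream.view());
  stream.synchronize();
  return 0;
}
```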
Implementation
- C++ header and implementation in cpp/include/rmm/mr/cuda_async_pinned_memory_resource.hpp
- Adds a runtime_async_pinned_alloc struct to runtime_capabilities.hpp
- C++ tests in cpp/tests/mr/cuda_async_pinned_mr_tests.cpp covering allocation, host accessibility, and pool equality
- Python bindings in the experimental module, with tests in python/rmm/rmm/tests/test_cuda_async_pinned_memory_resource.py

Follow-up Tasks
- pinned_host_memory_resource

Checklist