Add experimental cuda_async_managed_memory_resource (#2056)

rapids-bot[bot] merged 31 commits into rapidsai:branch-25.12 from
Conversation
There is parallel design work happening in NVIDIA/cccl#5998. I don't want to offer this as a stable API in RMM, given that we are fairly close to being able to use CCCL's MR implementations directly. I will mark this as
TomAugspurger left a comment
Python changes look good, though I had one small question.
```cpp
location.type = cudaMemLocationTypeDevice;
location.id   = rmm::get_current_cuda_device().value();
cudaMemAllocationType type = cudaMemAllocationTypeManaged;
RMM_CUDA_TRY(cudaMemGetDefaultMemPool(&managed_pool_handle, &location, type));
```
Should this use `cudaMemGetMemPool` (the current memory pool for this device) instead of `cudaMemGetDefaultMemPool` (the default memory pool for this device)?

CCCL is using the default memory pool, so we should probably match: https://github.com/NVIDIA/cccl/blob/de213a108b12aa5fdd4b7c8889aec4120734b4f1/cudax/include/cuda/experimental/__memory_resource/managed_memory_resource.cuh#L63
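For context, a minimal sketch of the default-pool lookup being discussed, mirroring the diff above (this assumes the CUDA 13.0+ runtime API signatures and is not code from this PR; error handling is omitted):

```cpp
#include <cuda_runtime_api.h>

// Sketch only: look up the default managed memory pool for a given device,
// as CCCL's managed_memory_resource does. Assumes CUDA 13.0+.
cudaMemPool_t default_managed_pool(int device_id)
{
  cudaMemLocation location{};
  location.type = cudaMemLocationTypeDevice;
  location.id   = device_id;

  cudaMemPool_t pool{};
  cudaMemGetDefaultMemPool(&pool, &location, cudaMemAllocationTypeManaged);
  return pool;
}
```

By contrast, `cudaMemGetMemPool` would return whatever pool is currently set for the device, which a user may have replaced; the default pool is stable regardless of user configuration.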
```cpp
 * @brief Determine at runtime if the CUDA driver/runtime supports the stream-ordered
 * managed memory allocator functions.
 *
 * Stream-ordered managed memory pools were introduced in CUDA 13.0.
```
Yes, using cudart >= 13 on a 12.9 driver is forward compatibility mode. If a user did not have the cuda-compat package installed in this scenario, then everything would have failed already. By not checking the driver version, though, we are effectively assuming that RMM_MIN_ASYNC_MANAGED_ALLOC_CUDA_VERSION is 13.0: if the feature had instead been introduced in e.g. 13.1 or 13.2, we could be missing user-mode driver support as well if the user had only the 13.0 compat driver.

Minor note: there are also edge cases where forward compatibility is not sufficient. I don't see any documentation indicating that async managed allocations are one of them, though.
…managed_memory_resource
…/rmm into cuda_async_managed_memory_resource
…mory resources (#2083)

Precursor to #2056. This refactors the Python/Cython code for memory resources to make it easier to add a new `experimental` namespace. This is a small breaking change in the Cython API, for some Cython names that were unintentionally exposed in multiple modules. [For example](rapidsai/rapidsmpf#575), `device_memory_resource` should be cimported from `rmm.librmm.memory_resource` rather than `rmm.pylibrmm.memory_resource`.

Authors:
- Bradley Dice (https://github.com/bdice)

Approvers:
- Vyas Ramasubramani (https://github.com/vyasr)

URL: #2083
@wence- @vyasr I have responded to all the threads above and requested another round of review. I'd like to finish this up and get something merged so it can be more readily tested. It is experimental, so I am happy to break things if we decide on a better plan for implementation later on. My current hope is that this feature is never stabilized, and is replaced entirely by the CCCL implementations once they are available.
/merge
Adds a new cuda_async_pinned_memory_resource that provides stream-ordered pinned (page-locked) host memory allocation using CUDA 13.0's cudaMemGetDefaultMemPool API with cudaMemAllocationTypePinned. This parallels the cuda_async_managed_memory_resource added in rapidsai#2056 and addresses part of rapidsai#2054.

Key features:
- Uses default pinned memory pool for stream-ordered allocation
- Accessible from both host and device
- Requires CUDA 13.0+ (matches managed version for consistency)
- Simpler requirements than managed (no concurrent managed access needed)
- Works on WSL2 and other systems where managed memory is not supported

Implementation includes:
- C++ header and implementation in cuda_async_pinned_memory_resource.hpp
- Runtime capability check in runtime_capabilities.hpp
- C++ tests in cuda_async_pinned_mr_tests.cpp
- Python bindings in experimental module
- Python tests in test_cuda_async_pinned_memory_resource.py
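The allocation path described above can be sketched roughly as follows. This is a hedged illustration assuming the CUDA 13.0 runtime API, not the PR's actual implementation; in particular, the location type used for the pinned pool and all error handling are simplified:

```cpp
#include <cstddef>
#include <cuda_runtime_api.h>

// Sketch: stream-ordered pinned (page-locked) host allocation from the
// default pinned pool. Assumes CUDA 13.0+; error handling omitted.
void* allocate_pinned_async(std::size_t bytes, cudaStream_t stream, int device_id)
{
  cudaMemLocation location{};
  location.type = cudaMemLocationTypeDevice;  // simplified; the real resource may use a host location
  location.id   = device_id;

  cudaMemPool_t pool{};
  cudaMemGetDefaultMemPool(&pool, &location, cudaMemAllocationTypePinned);

  void* ptr = nullptr;
  cudaMallocFromPoolAsync(&ptr, bytes, pool, stream);
  return ptr;  // later freed in stream order with cudaFreeAsync(ptr, stream)
}
```

The key property is that both allocation and deallocation are ordered on the stream, so no device-wide synchronization is needed, while the returned memory is host-accessible pinned memory.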
Description
Contributes to #2054.
Some follow-up tasks (after this PR):
- `managed_memory_resource` on CUDA 13 with this?

Checklist