
Add experimental cuda_async_managed_memory_resource.#2056

Merged
rapids-bot[bot] merged 31 commits into rapidsai:branch-25.12 from bdice:cuda_async_managed_memory_resource
Oct 13, 2025

Conversation

@bdice
Collaborator

@bdice bdice commented Sep 27, 2025

Description

Contributes to #2054.

Some follow-up tasks (after this PR):

  • Decide whether to reimplement the existing managed_memory_resource on CUDA 13 with this?
  • Determine whether decompression engine flags work, if provided (if so, should we use the default pool?)
  • Determine whether we need to implement a release threshold argument, or provide docs on how to set that.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@bdice
Collaborator Author

bdice commented Sep 30, 2025

There is parallel design work happening in NVIDIA/cccl#5998. I don't want to offer this as a stable API in RMM given that we are fairly close to being able to use CCCL's MR implementations directly. I will mark this as experimental.

@bdice bdice self-assigned this Oct 6, 2025
@bdice bdice added the feature request (New feature or request) and non-breaking (Non-breaking change) labels Oct 6, 2025
@bdice bdice marked this pull request as ready for review October 7, 2025 04:00
@bdice bdice requested review from a team as code owners October 7, 2025 04:00
@bdice bdice requested review from ttnghia and vyasr October 7, 2025 04:00
@rapidsai rapidsai deleted a comment from copy-pr-bot bot Oct 7, 2025
@bdice bdice changed the title Add cuda_async_managed_memory_resource. Add experimental cuda_async_managed_memory_resource. Oct 7, 2025
Contributor

@TomAugspurger TomAugspurger left a comment


Python changes look good, though I had one small question.

location.type = cudaMemLocationTypeDevice;
location.id = rmm::get_current_cuda_device().value();
cudaMemAllocationType type = cudaMemAllocationTypeManaged;
RMM_CUDA_TRY(cudaMemGetDefaultMemPool(&managed_pool_handle, &location, type));
Collaborator Author


Should this use cudaMemGetMemPool (the current memory pool for this device) instead of cudaMemGetDefaultMemPool (the default memory pool for this device)?

CCCL is using the default memory pool so we should probably match. https://github.com/NVIDIA/cccl/blob/de213a108b12aa5fdd4b7c8889aec4120734b4f1/cudax/include/cuda/experimental/__memory_resource/managed_memory_resource.cuh#L63

Comment on lines +62 to +65
* @brief Determine at runtime if the CUDA driver/runtime supports the stream-ordered
* managed memory allocator functions.
*
* Stream-ordered managed memory pools were introduced in CUDA 13.0.
Contributor


Yes, using cudart>=13 on a 12.9 driver is forward compatibility mode, and if a user did not have the cuda-compat package installed in this scenario then everything would have failed already. By not checking the driver version, though, we are implicitly assuming that RMM_MIN_ASYNC_MANAGED_ALLOC_CUDA_VERSION is 13.0: if the feature had instead been introduced in, say, 13.1 or 13.2, a user on the 13.0 compat driver could be missing user-mode driver support as well.

Minor note, there are also edge cases where forward compatibility is not sufficient. I don't see any documentation indicating that async managed allocations are one of them, though.

rapids-bot bot pushed a commit that referenced this pull request Oct 10, 2025
…mory resources (#2083)

Precursor to #2056. This refactors the Python/Cython code for memory resources to make it easier to add a new `experimental` namespace.

This is a small breaking change in the Cython API, for some Cython names that were unintentionally exposed in multiple modules. For example (see rapidsai/rapidsmpf#575), `device_memory_resource` should be cimported from `rmm.librmm.memory_resource` rather than `rmm.pylibrmm.memory_resource`.

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #2083
@bdice
Collaborator Author

bdice commented Oct 13, 2025

@wence- @vyasr I have responded to all the threads above and requested another round of review. I'd like to finish this up and get something merged so it can be more readily tested. It is experimental so I am happy to break things if we decide on a better plan for implementation later on. My current hope is that this feature is never stabilized, and is replaced entirely by the CCCL implementations once they are available.

@bdice
Collaborator Author

bdice commented Oct 13, 2025

/merge

@rapids-bot rapids-bot bot merged commit 82cc74e into rapidsai:branch-25.12 Oct 13, 2025
78 checks passed
@github-project-automation github-project-automation bot moved this from Review to Done in RMM Project Board Oct 13, 2025
bdice added a commit to bdice/rmm that referenced this pull request Nov 25, 2025
Adds a new cuda_async_pinned_memory_resource that provides stream-ordered
pinned (page-locked) host memory allocation using CUDA 13.0's
cudaMemGetDefaultMemPool API with cudaMemAllocationTypePinned.

This parallels the cuda_async_managed_memory_resource added in rapidsai#2056 and
addresses part of rapidsai#2054.

Key features:
- Uses default pinned memory pool for stream-ordered allocation
- Accessible from both host and device
- Requires CUDA 13.0+ (matches managed version for consistency)
- Simpler requirements than managed (no concurrent managed access needed)
- Works on WSL2 and other systems where managed memory is not supported

Implementation includes:
- C++ header and implementation in cuda_async_pinned_memory_resource.hpp
- Runtime capability check in runtime_capabilities.hpp
- C++ tests in cuda_async_pinned_mr_tests.cpp
- Python bindings in experimental module
- Python tests in test_cuda_async_pinned_memory_resource.py

Labels

feature request (New feature or request), non-breaking (Non-breaking change)

Projects

Status: Done


4 participants