-
Notifications
You must be signed in to change notification settings - Fork 245
Description
Summary
This proposal addresses undefined behavior in all RMM memory resource adaptors by changing them to store cuda::mr::any_resource instead of non-owning device_async_resource_ref. This will ensure adaptors maintain ownership of their upstream resources through "reification" - converting resource references to owning copies.
Impact: This change will eliminate a common source of undefined behavior in RMM and make memory resource adaptors safe to use with temporary resources.
Problem
RMM adaptors currently store non-owning references (device_async_resource_ref) to upstream resources, leading to undefined behavior when the upstream resource is destroyed before the adaptor.
Example of the Problem
A particularly dangerous scenario occurs when setting the global device resource:
void setup_statistics_tracking() {
// Create base resource
cuda::mr::any_resource<cuda::mr::device_accessible> cuda_mr =
rmm::mr::cuda_memory_resource();
// Wrap it in statistics adaptor
auto stats = rmm::mr::statistics_resource_adaptor<rmm::mr::device_memory_resource>(cuda_mr);
// Set as current device resource
rmm::mr::set_current_device_resource_ref(stats);
// cuda_mr and stats go out of scope here!
// The global device resource now points to destroyed memory
}
int main() {
setup_statistics_tracking();
// UNDEFINED BEHAVIOR: Global resource points to destroyed adaptor
// which points to destroyed cuda_memory_resource
rmm::device_buffer buf(1024); // May crash, corrupt memory, or appear to work
return 0;
}This is a classic use-after-free bug - the global resource reference points to stack memory that has been deallocated.
Current Workarounds
Users must currently ensure both the upstream resource AND the adaptor outlive their usage:
static cuda::mr::any_resource<...> cuda_mr = ...; // Must be static
static auto stats = statistics_resource_adaptor(cuda_mr); // Must be static
rmm::mr::set_current_device_resource_ref(stats);This is error-prone and requires managing two separate lifetimes.
Proposed Solution
Change all adaptors to store cuda::mr::any_resource and reify resource references in constructors. This will eliminate the need to separately manage the upstream resource's lifetime.
After the Fix
Users will only need to ensure the adaptor outlives its usage:
cuda::mr::any_resource<...> cuda_mr = ...; // Can be local!
static auto stats = statistics_resource_adaptor(cuda_mr); // Only this needs to be static
rmm::mr::set_current_device_resource_ref(stats);
// cuda_mr goes out of scope - but stats owns a copy via reification ✅Implementation Pattern
Each adaptor should be modified to:
- Store
cuda::mr::any_resource<cuda::mr::device_accessible>(owning) instead ofdevice_async_resource_ref(non-owning) - "Reify" the resource reference in constructors:
cuda::mr::any_resource{upstream}creates an owning copy - Mark the member
mutableto allow conversion toresource_refin const methods
class statistics_resource_adaptor {
public:
// Constructor takes resource_ref and reifies to any_resource
statistics_resource_adaptor(device_async_resource_ref upstream)
: upstream_{cuda::mr::any_resource<cuda::mr::device_accessible>{upstream}} {}
// Legacy constructor for raw pointer (also reifies)
statistics_resource_adaptor(Upstream* upstream)
: upstream_{cuda::mr::any_resource<cuda::mr::device_accessible>{
to_device_async_resource_ref_checked(upstream)}} {}
// Getter converts any_resource back to resource_ref for API compatibility
[[nodiscard]] device_async_resource_ref get_upstream_resource() const noexcept {
return device_async_resource_ref{upstream_};
}
private:
// Stores owning any_resource (mutable for conversion in const methods)
cuda::mr::any_resource<cuda::mr::device_accessible> mutable upstream_;
};Adaptors to Modify
The following 12 adaptors should be updated:
aligned_resource_adaptor.hpparena_memory_resource.hppbinning_memory_resource.hppfailure_callback_resource_adaptor.hppfixed_size_memory_resource.hpplimiting_resource_adaptor.hpplogging_resource_adaptor.hpppool_memory_resource.hppprefetch_resource_adaptor.hppstatistics_resource_adaptor.hppthread_safe_resource_adaptor.hpptracking_resource_adaptor.hpp
Implementation Plan
Step 1: Update Member Declarations
Change from:
device_async_resource_ref upstream_;To:
cuda::mr::any_resource<cuda::mr::device_accessible> mutable upstream_;Step 2: Update Constructors
Change constructor initializer lists to reify the upstream resource:
// Before:
explicit adaptor(device_async_resource_ref upstream)
: upstream_{upstream} {}
// After:
explicit adaptor(device_async_resource_ref upstream)
: upstream_{cuda::mr::any_resource<cuda::mr::device_accessible>{upstream}} {}Step 3: Update Getters
Modify get_upstream_resource() to convert from any_resource to resource_ref:
// Before:
[[nodiscard]] device_async_resource_ref get_upstream_resource() const noexcept {
return upstream_;
}
// After:
[[nodiscard]] device_async_resource_ref get_upstream_resource() const noexcept {
return device_async_resource_ref{upstream_};
}Step 4: Add Required Include
Add to the top of each modified header:
#include <cuda/memory_resource>Expected Benefits
- Safety by default: Adaptors automatically own their upstream resources
- No API breakage: Existing code continues to work unchanged
- Simpler lifetime management: Users only manage the adaptor's lifetime, not the upstream resource
- Minimal overhead: Only adds the size of
any_resource(2-3 pointers) per adaptor
Key Concept: Reification
Reification is the process of converting a non-owning device_async_resource_ref to an owning cuda::mr::any_resource:
// Non-owning reference
device_async_resource_ref ref = /* ... */;
// Reification creates owning copy
cuda::mr::any_resource<cuda::mr::device_accessible> owner{ref};This is analogous to converting std::string_view to std::string - the owning type makes a copy and manages its lifetime.
Metadata
Metadata
Assignees
Labels
Type
Projects
Status