Skip to content

Use any_resource in RMM adaptors to fix lifetime issue #2207

@bdice

Description

@bdice

Summary

This proposal addresses undefined behavior in all RMM memory resource adaptors by changing them to store cuda::mr::any_resource instead of non-owning device_async_resource_ref. This will ensure adaptors maintain ownership of their upstream resources through "reification" - converting resource references to owning copies.

Impact: This change will eliminate a common source of undefined behavior in RMM and make memory resource adaptors safe to use with temporary resources.

Problem

RMM adaptors currently store non-owning references (device_async_resource_ref) to upstream resources, leading to undefined behavior when the upstream resource is destroyed before the adaptor.

Example of the Problem

A particularly dangerous scenario occurs when setting the global device resource:

void setup_statistics_tracking() {
  // Create base resource
  cuda::mr::any_resource<cuda::mr::device_accessible> cuda_mr =
      rmm::mr::cuda_memory_resource();

  // Wrap it in statistics adaptor
  auto stats = rmm::mr::statistics_resource_adaptor<rmm::mr::device_memory_resource>(cuda_mr);

  // Set as current device resource
  rmm::mr::set_current_device_resource_ref(stats);

  // cuda_mr and stats go out of scope here!
  // The global device resource now points to destroyed memory
}

int main() {
  setup_statistics_tracking();

  // UNDEFINED BEHAVIOR: Global resource points to destroyed adaptor
  // which points to destroyed cuda_memory_resource
  rmm::device_buffer buf(1024);  // May crash, corrupt memory, or appear to work

  return 0;
}

This is a classic use-after-free bug - the global resource reference points to stack memory that has been deallocated.

Current Workarounds

Users must currently ensure both the upstream resource AND the adaptor outlive their usage:

static cuda::mr::any_resource<...> cuda_mr = ...;  // Must be static
static auto stats = statistics_resource_adaptor(cuda_mr);  // Must be static
rmm::mr::set_current_device_resource_ref(stats);

This is error-prone and requires managing two separate lifetimes.

Proposed Solution

Change all adaptors to store cuda::mr::any_resource and reify resource references in constructors. This will eliminate the need to separately manage the upstream resource's lifetime.

After the Fix

Users will only need to ensure the adaptor outlives its usage:

cuda::mr::any_resource<...> cuda_mr = ...;  // Can be local!
static auto stats = statistics_resource_adaptor(cuda_mr);  // Only this needs to be static
rmm::mr::set_current_device_resource_ref(stats);
// cuda_mr goes out of scope - but stats owns a copy via reification ✅

Implementation Pattern

Each adaptor should be modified to:

  1. Store cuda::mr::any_resource<cuda::mr::device_accessible> (owning) instead of device_async_resource_ref (non-owning)
  2. "Reify" the resource reference in constructors: cuda::mr::any_resource{upstream} creates an owning copy
  3. Mark the member mutable to allow conversion to resource_ref in const methods
class statistics_resource_adaptor {
 public:
  // Constructor takes resource_ref and reifies to any_resource
  statistics_resource_adaptor(device_async_resource_ref upstream)
    : upstream_{cuda::mr::any_resource<cuda::mr::device_accessible>{upstream}} {}

  // Legacy constructor for raw pointer (also reifies)
  statistics_resource_adaptor(Upstream* upstream)
    : upstream_{cuda::mr::any_resource<cuda::mr::device_accessible>{
        to_device_async_resource_ref_checked(upstream)}} {}

  // Getter converts any_resource back to resource_ref for API compatibility
  [[nodiscard]] device_async_resource_ref get_upstream_resource() const noexcept {
    return device_async_resource_ref{upstream_};
  }

 private:
  // Stores owning any_resource (mutable for conversion in const methods)
  cuda::mr::any_resource<cuda::mr::device_accessible> mutable upstream_;
};

Adaptors to Modify

The following 12 adaptors should be updated:

  • aligned_resource_adaptor.hpp
  • arena_memory_resource.hpp
  • binning_memory_resource.hpp
  • failure_callback_resource_adaptor.hpp
  • fixed_size_memory_resource.hpp
  • limiting_resource_adaptor.hpp
  • logging_resource_adaptor.hpp
  • pool_memory_resource.hpp
  • prefetch_resource_adaptor.hpp
  • statistics_resource_adaptor.hpp
  • thread_safe_resource_adaptor.hpp
  • tracking_resource_adaptor.hpp

Implementation Plan

Step 1: Update Member Declarations

Change from:

device_async_resource_ref upstream_;

To:

cuda::mr::any_resource<cuda::mr::device_accessible> mutable upstream_;

Step 2: Update Constructors

Change constructor initializer lists to reify the upstream resource:

// Before:
explicit adaptor(device_async_resource_ref upstream)
  : upstream_{upstream} {}

// After:
explicit adaptor(device_async_resource_ref upstream)
  : upstream_{cuda::mr::any_resource<cuda::mr::device_accessible>{upstream}} {}

Step 3: Update Getters

Modify get_upstream_resource() to convert from any_resource to resource_ref:

// Before:
[[nodiscard]] device_async_resource_ref get_upstream_resource() const noexcept {
  return upstream_;
}

// After:
[[nodiscard]] device_async_resource_ref get_upstream_resource() const noexcept {
  return device_async_resource_ref{upstream_};
}

Step 4: Add Required Include

Add to the top of each modified header:

#include <cuda/memory_resource>

Expected Benefits

  1. Safety by default: Adaptors automatically own their upstream resources
  2. No API breakage: Existing code continues to work unchanged
  3. Simpler lifetime management: Users only manage the adaptor's lifetime, not the upstream resource
  4. Minimal overhead: Only adds the size of any_resource (2-3 pointers) per adaptor

Key Concept: Reification

Reification is the process of converting a non-owning device_async_resource_ref to an owning cuda::mr::any_resource:

// Non-owning reference
device_async_resource_ref ref = /* ... */;

// Reification creates owning copy
cuda::mr::any_resource<cuda::mr::device_accessible> owner{ref};

This is analogous to converting std::string_view to std::string - the owning type makes a copy and manages its lifetime.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    Status

    To-do

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions