Opt-in pinned-host memory pool for high-throughput CPU→GPU transfers #1872

@heyong4725

Context

dora today supports two zero-copy transports between nodes:

| Path | Memory | Pinned? | GPU-direct? |
|---|---|---|---|
| Zenoh SHM (≥4 KiB messages) | Host shared memory, page-aligned | No | No; requires cudaMemcpyHostToDevice from non-pinned memory |
| cudaIpc handles | GPU memory | N/A | Yes, but only useful when both producer and consumer are on GPU |

There's a gap: CPU-produced data destined for GPU consumption. Camera frames, sensor batches, real-time perception inputs — common dora workloads where the producer is a CPU node (driver, decoder, preprocessor) and the consumer is a CUDA-using node.

Today the consumer has to either:

  • Allocate its own pinned host buffer, copy from zenoh-SHM into it, then issue a cudaMemcpy with cudaMemcpyHostToDevice (two copies)
  • Skip pinning and accept the slower non-pinned cudaMemcpy (cudaMemcpyHostToDevice) straight from zenoh-SHM (one copy, slower)
  • Implement custom pinned-memory infrastructure outside dora (complex, not portable across users)

None of these are great. A daemon-brokered opt-in pinned-host-buffer pool would let producers write directly into pinned memory that the consumer can DMA from. The expected throughput improvement on this specific CPU→GPU path is 3-5× over the non-pinned baseline (claim from PR #1623; needs independent measurement).
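
For illustration, the first workaround (caller-side pinning) might look roughly like the sketch below. `load_cudart` and `StagingBuffer` are hypothetical names, not dora API. The only CUDA calls used are the runtime's `cudaHostRegister` and `cudaHostUnregister`, reached through `ctypes`, and the sketch assumes that silently staying unpinned is acceptable when no CUDA runtime or device is found. The host-to-device copy itself is omitted.

```python
import ctypes
import ctypes.util


def load_cudart():
    """Return a ctypes handle to the CUDA runtime, or None if it cannot be loaded."""
    name = ctypes.util.find_library("cudart")
    if name is None:
        return None
    try:
        lib = ctypes.CDLL(name)
    except OSError:
        return None
    # cudaError_t cudaHostRegister(void* ptr, size_t size, unsigned int flags)
    lib.cudaHostRegister.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_uint]
    lib.cudaHostRegister.restype = ctypes.c_int
    # cudaError_t cudaHostUnregister(void* ptr)
    lib.cudaHostUnregister.argtypes = [ctypes.c_void_p]
    lib.cudaHostUnregister.restype = ctypes.c_int
    return lib


class StagingBuffer:
    """Hypothetical consumer-side staging buffer that is pinned when possible."""

    def __init__(self, size, cudart=None):
        self.buf = bytearray(size)
        # Keep a ctypes view alive so the buffer address stays valid.
        self._view = (ctypes.c_char * size).from_buffer(self.buf)
        self._cudart = cudart
        self.pinned = False
        if cudart is not None:
            # flags=0 is cudaHostRegisterDefault; a return value of 0 is cudaSuccess.
            status = cudart.cudaHostRegister(ctypes.addressof(self._view), size, 0)
            self.pinned = status == 0

    def stage(self, payload):
        """First copy of the two-copy path: received bytes -> staging buffer."""
        memoryview(self.buf)[: len(payload)] = payload

    def close(self):
        """Unregister the buffer if it was pinned."""
        if self.pinned:
            self._cudart.cudaHostUnregister(ctypes.addressof(self._view))
            self.pinned = False


staging = StagingBuffer(1920 * 1080 * 4, load_cudart())
staging.stage(b"\x00" * 16)  # stand-in for a frame received over zenoh-SHM
staging.close()
```

Every consumer that wants fast transfers has to carry this kind of code today, which is the duplication a shared pool would remove.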

What this issue is, and isn't

This issue documents the gap and invites proposals. It does NOT commit to an architecture. dora has historically moved AWAY from custom shared-memory infrastructure (see #1745 which removed the previous SHM cache in favor of zenoh-SHM). Any new transport has to clear a high bar.

PR #1623 attempted a 2,717-line implementation of one possible architecture (daemon-brokered /dev/shm/dora_pool_* ring buffers with a custom protocol). It was closed because the implementation was 7× larger than the feature needed and the design was framed as a fork for a paper rather than as a focused proposal. The underlying gap it tried to address is real; this issue keeps that gap visible for a future, leaner attempt.

What a reviewable proposal would address

Before writing code, an issue/RFC should answer:

  1. Why a new transport, not extending an existing one?

    • Can zenoh-shm be extended to support optionally-pinned host memory? Zenoh has a SHM provider interface; could a PinnedHostShmProviderBackend plug in cleanly?
    • Can the existing Arrow IPC path be CUDA-aware (e.g., produce-into-pinned, consume-from-pinned, with metadata tracking)?
    • Why are these worse than a new daemon-brokered pool?
  2. What's the cross-platform story?

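    • cudaHostRegister works on Linux and Windows. macOS is not covered: NVIDIA dropped macOS support after CUDA 10.2. Acceptable, provided macOS builds take the non-CUDA fallback.
    • POSIX shared memory (shm_open) is available on Linux and macOS, though the /dev/shm/ path itself is Linux-specific; Windows has named shared memory (CreateFileMapping). Cross-platform shm is feasible but adds code.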
    • What's the graceful fallback when CUDA isn't available at all? Today most dora users don't use CUDA — the feature has to be invisible to them.
  3. Lifecycle: who owns pinned memory, and when does it get freed?

    • Pinned memory is a finite system resource. Without bounded pool size, a misbehaving node could exhaust it.
    • When does a registered pool get freed? On producer drop? On daemon shutdown? On explicit free? PR #1623 ("dora manage pinned memory about 1000MB/S") had a "ring buffer with self-release", which needs design clarity.
    • What happens if the producer crashes mid-write? Daemon needs to detect and reclaim.
  4. API shape: how does a node opt in?

    • Python: node.send_output_pinned(...) vs. flag on existing send_output(..., pinned=True)?
    • Rust: same question.
    • C/C++: do we need the API there too?
    • How does the consumer know it can DMA-directly vs. needs to fall back?
  5. Measurement and validation:

    • Benchmark against zenoh-shm + caller-side pinning (the workaround that exists today), not just against non-pinned zenoh-shm. The relevant question is "is daemon-brokered pinning worth the complexity vs. caller-side pinning?"
    • Real workload (camera frames at 30 Hz, 1080p RGBA: 1920 × 1080 × 4 B ≈ 8.3 MB per frame, about 249 MB/s sustained), not a synthetic micro-benchmark.
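
To make the lifecycle and opt-in questions (items 3 and 4) more concrete, here is a minimal bookkeeping sketch of what a bounded, daemon-side pool could track. It is a sketch under stated assumptions, not a proposed design. `PinnedPoolBroker` and its methods are hypothetical, no memory is actually pinned, and it assumes that returning `None` is how the daemon tells a node to fall back to the regular zenoh-shm path.

```python
from dataclasses import dataclass, field


@dataclass
class PinnedPoolBroker:
    """Hypothetical daemon-side bookkeeping for pinned regions (nothing is pinned here)."""

    capacity_bytes: int           # hard cap so one node cannot exhaust pinned memory
    cuda_available: bool = True   # assumed to come from a capability probe at startup
    used_bytes: int = 0
    regions: dict = field(default_factory=dict)  # region id -> (owner node id, size)
    _next_id: int = 0

    def register(self, owner, size):
        """Return a region id, or None to signal 'use the regular path'."""
        if not self.cuda_available or self.used_bytes + size > self.capacity_bytes:
            return None
        region_id = self._next_id
        self._next_id += 1
        self.regions[region_id] = (owner, size)
        self.used_bytes += size
        return region_id

    def free(self, region_id):
        """Explicit release by the owner (or by the daemon on its behalf)."""
        _owner, size = self.regions.pop(region_id)
        self.used_bytes -= size

    def reclaim_node(self, owner):
        """Release everything a node held, e.g. after the daemon sees it crash."""
        dead = [rid for rid, (o, _size) in self.regions.items() if o == owner]
        for rid in dead:
            self.free(rid)
        return len(dead)


FRAME = 1920 * 1080 * 4  # one 1080p RGBA frame in bytes
broker = PinnedPoolBroker(capacity_bytes=3 * FRAME)
ids = [broker.register("camera", FRAME) for _ in range(4)]
# The fourth request exceeds the cap, so ids[3] is None and the node falls back.
reclaimed = broker.reclaim_node("camera")  # simulates a producer crash
```

A real proposal would still have to say where the cap comes from (dataflow config, daemon flag, or a fixed default) and how the daemon detects a crashed owner.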

Reference material

  • PR #1623, "dora manage pinned memory about 1000MB/S" (closed): tang-canran's implementation attempt. Code is in binaries/daemon/src/memory_manager.rs (215 LOC), the new DaemonRequest::{Register,Read,Free}PinnedMemory variants in libraries/message/src/node_to_daemon.rs, and the Python bindings in apis/python/node/src/lib.rs. Worth reading for the side-channel architecture (NOT the design doc framing).
  • PR #1745, "Remove custom shared memory and drop token infrastructure": removed dora's previous custom shared memory infrastructure in favor of zenoh-shm. Context for why a new SHM transport needs to clear a high bar.
  • CUDA Programming Guide §3.2.5, "Page-Locked Host Memory" (the section number varies between guide versions): pinned memory semantics and trade-offs.

Triage

Labels: cuda, enhancement, design-needed
Owner: unassigned. Proposals welcome from any contributor.
Priority: not currently scheduled. Filed for visibility.
