Opt-in pinned-host memory pool for high-throughput CPU→GPU transfers

## Context

dora today supports two zero-copy transports between nodes:

| Path | Memory | Pinned? | GPU-direct? |
|---|---|---|---|
| Zenoh SHM (≥4 KiB messages) | Host shared memory, page-aligned | **No** | No: requires a `cudaMemcpy` (`cudaMemcpyHostToDevice`) from pageable (non-pinned) memory |
| cudaIpc handles | GPU memory | N/A | Yes, but only useful when both producer and consumer are on GPU |

There's a gap: **CPU-produced data destined for GPU consumption**. Camera frames, sensor batches, real-time perception inputs — common dora workloads where the producer is a CPU node (driver, decoder, preprocessor) and the consumer is a CUDA-using node.

Today the consumer has to either:
- Allocate its own pinned host buffer, copy from Zenoh SHM into it, then issue a host-to-device `cudaMemcpy` (two copies)
- Skip pinning and accept the slower host-to-device `cudaMemcpy` straight from pageable Zenoh SHM memory (one copy, slower)
- Implement custom pinned-memory infrastructure outside dora (complex, not portable across users)

None of these are great. A daemon-brokered, opt-in pool of pinned host buffers would let producers write directly into pinned memory that the consumer can DMA from. The expected throughput improvement on this specific CPU→GPU path is 3–5× over the non-pinned baseline (a claim from PR #1623 that still needs independent measurement).

## What this issue is, and isn't

**This issue documents the gap and invites proposals.** It does NOT commit to an architecture. dora has historically moved AWAY from custom shared-memory infrastructure (see #1745, which removed the previous SHM cache in favor of zenoh-SHM). Any new transport has to clear a high bar.

PR #1623 attempted a 2,717-line implementation of one possible architecture (daemon-brokered `/dev/shm/dora_pool_*` ring buffers with a custom protocol). It was closed because the implementation was 7× larger than the feature needed and because the design was framed as a "fork-for-paper" rather than as a focused proposal. The underlying gap it tried to address is real; this issue keeps that gap visible for a future, leaner attempt.

## What a reviewable proposal would address

Before writing code, an issue/RFC should answer:

1. **Why a new transport, not extending an existing one?**
   - Can zenoh-shm be extended to support optionally-pinned host memory? Zenoh has a SHM provider interface; could a `PinnedHostShmProviderBackend` plug in cleanly?
   - Can the existing Arrow IPC path be CUDA-aware (e.g., produce-into-pinned, consume-from-pinned, with metadata tracking)?
   - Why are these worse than a new daemon-brokered pool?

2. **What's the cross-platform story?**
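   - `cudaHostRegister` is available wherever the CUDA runtime is, which today means Linux and Windows. NVIDIA dropped macOS support after CUDA 10.2, so on macOS the feature would be unavailable. Acceptable.
   - POSIX shared memory (`shm_open`, exposed under `/dev/shm/` on Linux) covers Linux and macOS only; Windows has named shared memory (`CreateFileMapping`). Cross-platform shm is feasible but adds code.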
   - What's the graceful fallback when CUDA isn't available at all? Today most dora users don't use CUDA — the feature has to be invisible to them.

3. **Lifecycle: who owns pinned memory, and when does it get freed?**
   - Pinned memory is a finite system resource. Without bounded pool size, a misbehaving node could exhaust it.
   - When does a registered pool get freed? On producer drop? On daemon shutdown? On explicit free? PR #1623 had a "ring buffer with self-release" — needs design clarity.
   - What happens if the producer crashes mid-write? Daemon needs to detect and reclaim.

4. **API shape: how does a node opt in?**
   - Python: `node.send_output_pinned(...)` vs. flag on existing `send_output(..., pinned=True)`?
   - Rust: same question.
   - C/C++: do we need the API there too?
   - How does the consumer know it can DMA-directly vs. needs to fall back?

5. **Measurement and validation:**
   - Benchmark against zenoh-shm + caller-side pinning (the workaround that exists today), not just against non-pinned zenoh-shm. The relevant question is "is daemon-brokered pinning worth the complexity vs. caller-side pinning?"
   - Use a real workload (camera frames at 30 Hz, 1080p RGBA = ~8 MB per frame), not a synthetic micro-benchmark.
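
To make the benchmark target in item 5 concrete, the arithmetic behind the "~8 MB per frame" figure can be written out. This is a minimal sketch; the helper names are illustrative and not part of dora.

```python
# Payload size and sustained data rate for the 1080p RGBA, 30 Hz workload.
# Helper names are illustrative only; nothing here is a dora API.

def frame_bytes(width: int, height: int, bytes_per_pixel: int) -> int:
    """Size of one uncompressed frame in bytes."""
    return width * height * bytes_per_pixel


def sustained_bytes_per_second(frame_size: int, rate_hz: int) -> int:
    """Bytes that must cross the CPU→GPU path every second."""
    return frame_size * rate_hz


FRAME = frame_bytes(1920, 1080, 4)           # RGBA = 4 bytes per pixel
RATE = sustained_bytes_per_second(FRAME, 30)

print(f"{FRAME} bytes per frame ({FRAME / 1e6:.2f} MB, {FRAME / 2**20:.2f} MiB)")
print(f"{RATE / 1e6:.1f} MB/s sustained at 30 Hz")
```

Each frame is 8,294,400 bytes, well above the 4 KiB Zenoh SHM threshold, and the stream needs roughly 249 MB/s sustained. Since that rate is modest, a benchmark may want to report per-frame latency and CPU time spent copying alongside raw throughput.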

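For the lifecycle questions in item 3, the bookkeeping a bounded pool needs can be sketched without any CUDA calls. Everything below is hypothetical: `BoundedPool`, `PoolExhausted`, and the method names are assumptions for illustration and are not taken from dora or from PR #1623. The sketch only tracks byte counts per owner; actually pinning memory, and how the daemon detects a crashed producer, are left open.

```python
from dataclasses import dataclass, field


class PoolExhausted(Exception):
    """Raised when a reservation would exceed the pool's byte budget."""


@dataclass
class BoundedPool:
    """Hypothetical accounting for a size-capped pinned-buffer pool.

    Tracks bytes only; a real pool would also pin and unpin the memory.
    """
    capacity_bytes: int
    _next_id: int = 0
    _buffers: dict = field(default_factory=dict)  # buffer_id -> (owner, size)

    @property
    def used_bytes(self) -> int:
        return sum(size for _, size in self._buffers.values())

    def reserve(self, owner: str, size: int) -> int:
        """Reserve `size` bytes for `owner`, or fail if the cap is hit."""
        if size <= 0:
            raise ValueError("size must be positive")
        if self.used_bytes + size > self.capacity_bytes:
            raise PoolExhausted(f"{owner} asked for {size} bytes, "
                                f"{self.capacity_bytes - self.used_bytes} free")
        buffer_id = self._next_id
        self._next_id += 1
        self._buffers[buffer_id] = (owner, size)
        return buffer_id

    def release(self, buffer_id: int) -> None:
        """Explicit free; raises KeyError on an unknown or double free."""
        del self._buffers[buffer_id]

    def reclaim_owner(self, owner: str) -> int:
        """Free every buffer still held by `owner`, e.g. after a crash.

        Returns the number of bytes reclaimed.
        """
        dead = [bid for bid, (o, _) in self._buffers.items() if o == owner]
        return sum(self._buffers.pop(bid)[1] for bid in dead)


# Usage: a pool sized for three 1080p RGBA frames.
pool = BoundedPool(capacity_bytes=3 * 8_294_400)
first = pool.reserve("camera", 8_294_400)
second = pool.reserve("camera", 8_294_400)
pool.release(first)                      # normal path: explicit free
freed = pool.reclaim_owner("camera")     # crash path: daemon-side cleanup
print(freed, pool.used_bytes)
```

The cap turns pinned-memory exhaustion into a per-request error rather than a system-wide failure, and keying buffers by owner gives the daemon a single call to make once it decides a producer is gone. A proposal would still need to say which event triggers that call.
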
## Reference material

- **PR #1623** (closed) — tang-canran's implementation attempt. Code is in `binaries/daemon/src/memory_manager.rs` (215 LOC), the new `DaemonRequest::{Register,Read,Free}PinnedMemory` variants in `libraries/message/src/node_to_daemon.rs`, and the Python bindings in `apis/python/node/src/lib.rs`. Worth reading for the side-channel architecture (NOT the design doc framing).
- **PR #1745** — removed dora's previous custom shared memory infrastructure in favor of zenoh-shm. Context for why a new SHM transport needs to clear a high bar.
- **CUDA Programming Guide §3.2.5** — pinned memory semantics and trade-offs.

## Triage

Labels: `cuda`, `enhancement`, `design-needed`
Owner: unassigned. Proposals welcome from any contributor.
Priority: not currently scheduled. Filed for visibility.

