dora today supports two zero-copy transports between nodes:
| Path | Memory | Pinned? | GPU-direct? |
|---|---|---|---|
| Zenoh SHM (≥4 KiB messages) | Host shared memory, page-aligned | No | No; requires cudaMemcpyHostToDevice from non-pinned memory |
| cudaIpc handles | GPU memory | N/A | Yes, but only useful when both producer and consumer are on GPU |
There's a gap: CPU-produced data destined for GPU consumption. Camera frames, sensor batches, real-time perception inputs — common dora workloads where the producer is a CPU node (driver, decoder, preprocessor) and the consumer is a CUDA-using node.
Today the consumer has to either:
Allocate its own pinned host buffer, copy from zenoh-SHM into it, then cudaMemcpyHostToDevice (two copies)
Skip pinning and accept the slower non-pinned cudaMemcpyHostToDevice from zenoh-SHM (one copy, slower)
Implement custom pinned-memory infrastructure outside dora (complex, not portable across users)
None of these are great. A daemon-brokered opt-in pinned-host-buffer pool would let producers write directly into pinned memory that the consumer can DMA from. The expected throughput improvement on this specific CPU→GPU path is 3-5× over the non-pinned baseline (claim from PR #1623; needs independent measurement).
What this issue is, and isn't
This issue documents the gap and invites proposals. It does NOT commit to an architecture. dora has historically moved AWAY from custom shared-memory infrastructure (see #1745 which removed the previous SHM cache in favor of zenoh-SHM). Any new transport has to clear a high bar.
PR #1623 attempted a 2,717-line implementation of one possible architecture (daemon-brokered /dev/shm/dora_pool_* ring buffers with a custom protocol). It was closed because the implementation was 7× larger than the feature needed and the design document was framed as a fork for a paper rather than a focused proposal. The underlying gap it tried to address is real; this issue keeps that gap visible for a future, leaner attempt.
What a reviewable proposal would address
Before writing code, an issue/RFC should answer:
Why a new transport, not extending an existing one?
Can zenoh-shm be extended to support optionally-pinned host memory? Zenoh has a SHM provider interface; could a PinnedHostShmProviderBackend plug in cleanly?
Can the existing Arrow IPC path be CUDA-aware (e.g., produce-into-pinned, consume-from-pinned, with metadata tracking)?
Why are these worse than a new daemon-brokered pool?
What's the cross-platform story?
cudaHostRegister works on Linux and Windows. macOS is covered only by legacy CUDA toolkits (NVIDIA dropped macOS support after CUDA 10.2). Acceptable.
/dev/shm/-style POSIX shm is Linux/macOS only; Windows has named shared memory (CreateFileMapping). Cross-platform shm is feasible but adds code.
What's the graceful fallback when CUDA isn't available at all? Today most dora users don't use CUDA — the feature has to be invisible to them.
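One way to keep the feature invisible to non-CUDA users is to probe for the driver at runtime and silently stay on the existing path when it is missing. The sketch below is a minimal illustration, not dora code: it loads the CUDA driver library through ctypes and calls the driver API functions cuInit and cuDeviceGetCount. The function names cuda_driver_available and choose_transport, and the transport labels, are hypothetical.

```python
import ctypes
import sys
from typing import Optional


def cuda_driver_available() -> bool:
    """Return True only if the CUDA driver loads and reports at least one device."""
    # Driver library names: libcuda.so.1 on Linux, nvcuda.dll on Windows.
    lib_name = "nvcuda.dll" if sys.platform == "win32" else "libcuda.so.1"
    try:
        driver = ctypes.CDLL(lib_name)
    except OSError:
        return False  # no driver installed: stay on the regular path

    # cuInit(0) must succeed (CUDA_SUCCESS == 0) before any other driver call.
    if driver.cuInit(0) != 0:
        return False

    count = ctypes.c_int(0)
    if driver.cuDeviceGetCount(ctypes.byref(count)) != 0:
        return False
    return count.value > 0


def choose_transport(want_pinned: bool, cuda_ok: Optional[bool] = None) -> str:
    """Pick a transport label (hypothetical names); never fail when CUDA is absent."""
    if cuda_ok is None:
        cuda_ok = cuda_driver_available()
    return "pinned-pool" if (want_pinned and cuda_ok) else "zenoh-shm"


if __name__ == "__main__":
    print(choose_transport(want_pinned=True))
```

A node that never asks for pinned memory never triggers the probe's result, so users without CUDA see no behavior change.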
Lifecycle: who owns pinned memory, and when does it get freed?
Pinned memory is a finite system resource. Without bounded pool size, a misbehaving node could exhaust it.
When does a registered pool get freed? On producer drop? On daemon shutdown? On explicit free? PR #1623 had a "ring buffer with self-release", which needs design clarity.
What happens if the producer crashes mid-write? Daemon needs to detect and reclaim.
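The lifecycle questions above can be made concrete with a small bookkeeping model. The sketch below is a hypothetical illustration of the accounting a daemon would need, not a proposal for the real design: it enforces a byte budget, supports explicit free, and reclaims everything owned by a producer that has died. It does no actual pinning, and all names (PinnedPoolLedger, PoolExhausted, reclaim_owner) are invented for this example.

```python
from dataclasses import dataclass, field


class PoolExhausted(Exception):
    """Raised when a request would push the pool past its byte budget."""


@dataclass
class PinnedPoolLedger:
    """Bookkeeping only: tracks which owner holds how many pinned bytes."""

    capacity_bytes: int
    _owned: dict = field(default_factory=dict)  # buffer_id -> (owner, size)
    _next_id: int = 0

    @property
    def used_bytes(self) -> int:
        return sum(size for _, size in self._owned.values())

    def register(self, owner: str, size: int) -> int:
        """Reserve `size` bytes for `owner`, refusing to exceed the budget."""
        if size <= 0:
            raise ValueError("size must be positive")
        if self.used_bytes + size > self.capacity_bytes:
            raise PoolExhausted(f"{owner} asked for {size} bytes, pool is full")
        buffer_id = self._next_id
        self._next_id += 1
        self._owned[buffer_id] = (owner, size)
        return buffer_id

    def free(self, buffer_id: int) -> None:
        """Explicit free; freeing an unknown or already-freed id is a no-op."""
        self._owned.pop(buffer_id, None)

    def reclaim_owner(self, owner: str) -> int:
        """Release every buffer held by a dead owner; return the bytes freed."""
        dead = [bid for bid, (o, _) in self._owned.items() if o == owner]
        return sum(self._owned.pop(bid)[1] for bid in dead)


if __name__ == "__main__":
    pool = PinnedPoolLedger(capacity_bytes=32 * 1024 * 1024)
    pool.register("camera", 8_294_400)
    print(pool.reclaim_owner("camera"))  # bytes returned after a simulated crash
```

A proposal would still have to say how the daemon learns that an owner died and what happens to a consumer that is mid-read when reclamation runs; the ledger only shows that the budget and ownership questions are tractable.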
API shape: how does a node opt in?
Python: node.send_output_pinned(...) vs. flag on existing send_output(..., pinned=True)?
Rust: same question.
C/C++: do we need the API there too?
How does the consumer know it can DMA-directly vs. needs to fall back?
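To compare the two Python API shapes side by side, the sketch below uses a stub in place of a real dora node. It is an assumption-laden illustration: the pinned keyword, the send_output_pinned method, and the "pinned" metadata key do not exist in dora today. The point it shows is that the metadata should record what actually happened rather than what was requested, so the consumer can decide between direct DMA and a staged copy.

```python
from typing import Optional


class StubNode:
    """Stand-in for a dora node; records outputs instead of sending them."""

    def __init__(self, pinned_supported: bool):
        self.pinned_supported = pinned_supported
        self.sent = []

    # Shape 1 (hypothetical): a flag on the existing call.
    def send_output(self, output_id: str, data: bytes,
                    metadata: Optional[dict] = None,
                    pinned: bool = False) -> None:
        meta = dict(metadata or {})
        # Record the outcome, not the request: a fallback must be visible.
        meta["pinned"] = pinned and self.pinned_supported
        self.sent.append((output_id, data, meta))

    # Shape 2 (hypothetical): a dedicated method that delegates to shape 1.
    def send_output_pinned(self, output_id: str, data: bytes,
                           metadata: Optional[dict] = None) -> None:
        self.send_output(output_id, data, metadata, pinned=True)


def consumer_upload_plan(metadata: dict) -> str:
    """Consumer side: DMA directly only if the producer really used pinned memory."""
    return "dma-direct" if metadata.get("pinned") else "staged-copy"


if __name__ == "__main__":
    node = StubNode(pinned_supported=False)
    node.send_output("image", b"\x00" * 16, pinned=True)
    print(consumer_upload_plan(node.sent[0][2]))
```

Either shape can carry the same metadata, so the choice between them is mostly about discoverability and how the Rust and C/C++ APIs would mirror it.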
Measurement and validation:
Benchmark against zenoh-shm + caller-side pinning (the workaround that exists today), not just against non-pinned zenoh-shm. The relevant question is "is daemon-brokered pinning worth the complexity vs. caller-side pinning?"
Use a real workload (1080p RGBA camera frames at 30 Hz, ~8 MB per frame), not a synthetic micro-benchmark.
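For reference, the arithmetic behind the suggested workload is spelled out below. The only assumptions are 4 bytes per RGBA pixel (8 bits per channel) and decimal megabytes.

```python
WIDTH, HEIGHT = 1920, 1080   # 1080p
BYTES_PER_PIXEL = 4          # RGBA, 8 bits per channel (assumed)
FPS = 30

frame_bytes = WIDTH * HEIGHT * BYTES_PER_PIXEL
bytes_per_second = frame_bytes * FPS
frame_budget_ms = 1000 / FPS

if __name__ == "__main__":
    print(f"frame size:      {frame_bytes / 1e6:.2f} MB")
    print(f"sustained rate:  {bytes_per_second / 1e6:.1f} MB/s")
    print(f"per-frame budget: {frame_budget_ms:.1f} ms")
```

A single stream therefore needs roughly 250 MB/s sustained, so a benchmark should report per-frame latency against the 33 ms budget as well as raw throughput.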
Reference material
PR #1623 (closed, titled "dora manage pinned memory about 1000MB/S"): tang-canran's implementation attempt. Code is in binaries/daemon/src/memory_manager.rs (215 LOC), the new DaemonRequest::{Register,Read,Free}PinnedMemory variants in libraries/message/src/node_to_daemon.rs, and the Python bindings in apis/python/node/src/lib.rs. Worth reading for the side-channel architecture (NOT the design doc framing).
CUDA Programming Guide §3.2.5 — pinned memory semantics and trade-offs.
Triage
Labels: cuda, enhancement, design-needed
Owner: unassigned. Proposals welcome from any contributor.
Priority: not currently scheduled. Filed for visibility.