
Add per-node shared_memory_pool_size config and zero_copy_threshold getter#1741

Merged
mergify[bot] merged 1 commit into main from port-shm-pool-size-config on Apr 23, 2026

@phil-opp
Collaborator

Salvaged from the closed #1378 (zenoh-SHM refactor), which was written against pre-rewrite main and became unmergeable after the Q1 2026 consolidation. These are the portable pieces:

  • New optional shared_memory_pool_size field on dataflow nodes. Accepts an integer (raw bytes) or a string with unit suffix (KB, MB, GB). Priority: YAML > DORA_NODE_SHM_POOL_SIZE env var > built-in default.
  • ByteSize newtype in dora-message with string/integer serde, Display, FromStr, JsonSchema.
  • DoraNode::zero_copy_threshold() getter.
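
The size syntax described above (a bare integer, or a number followed by a case-insensitive `KB`/`MB`/`GB` suffix) could be parsed roughly as follows. This is a minimal standalone sketch, not the `ByteSize` implementation from `dora-message`: the struct layout, the error type, and the choice of binary multiples (1 KB = 1024 bytes) are assumptions made for illustration, and the serde and JsonSchema parts are left out.

```rust
use std::fmt;
use std::str::FromStr;

/// Hypothetical stand-in for the `ByteSize` newtype; stores a size in bytes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ByteSize(pub u64);

/// Hypothetical error type carrying a human-readable message.
#[derive(Debug, PartialEq, Eq)]
pub struct ParseByteSizeError(String);

impl FromStr for ByteSize {
    type Err = ParseByteSizeError;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        let s = s.trim();
        // Split into leading ASCII digits and an optional unit suffix.
        let split = s.find(|c: char| !c.is_ascii_digit()).unwrap_or(s.len());
        let (digits, suffix) = s.split_at(split);
        let value: u64 = digits
            .parse()
            .map_err(|_| ParseByteSizeError(format!("invalid number in {s:?}")))?;
        // ASSUMPTION: binary multiples; the PR does not say 1000 vs. 1024.
        let multiplier: u64 = match suffix.trim().to_ascii_uppercase().as_str() {
            "" => 1, // no suffix: raw bytes
            "KB" => 1024,
            "MB" => 1024 * 1024,
            "GB" => 1024 * 1024 * 1024,
            other => return Err(ParseByteSizeError(format!("unknown unit {other:?}"))),
        };
        value
            .checked_mul(multiplier)
            .map(ByteSize)
            .ok_or_else(|| ParseByteSizeError(format!("{s:?} overflows u64")))
    }
}

impl fmt::Display for ByteSize {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{} bytes", self.0)
    }
}
```

With this sketch, `"64MB".parse::<ByteSize>()` and `"64 mb".parse::<ByteSize>()` yield the same value, while a unit outside the listed set is rejected instead of being silently treated as bytes.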

…ld getter

- New optional `shared_memory_pool_size` field on dataflow nodes. Accepts an
  integer (raw bytes) or a string with unit suffix (`KB`, `MB`, `GB`,
  case-insensitive). Priority: YAML config > `DORA_NODE_SHM_POOL_SIZE` env
  var > built-in default.
- Add `ByteSize` newtype in `dora-message` with string/integer serde support,
  Display, FromStr, and JsonSchema.
- Add `DoraNode::zero_copy_threshold()` getter to surface the already-existing
  `DORA_ZERO_COPY_THRESHOLD`-backed value.
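
The precedence rule from the first bullet (YAML config, then `DORA_NODE_SHM_POOL_SIZE`, then the built-in default) can be sketched as a small pure function. The function names, the placeholder default of 16 MiB, and the decision to ignore an unparsable environment value are assumptions for illustration; the PR does not state the actual default or how invalid values are handled, and this sketch only accepts a raw byte count from the environment.

```rust
use std::env;

/// ASSUMPTION: placeholder value; the real built-in default is not given in the PR.
const DEFAULT_POOL_SIZE: u64 = 16 * 1024 * 1024;

/// Pure precedence logic: YAML value first, then the env value, then the default.
fn resolve_pool_size(yaml: Option<u64>, env_value: Option<&str>) -> u64 {
    if let Some(bytes) = yaml {
        return bytes;
    }
    // ASSUMPTION: an env value that does not parse is ignored, not an error.
    if let Some(bytes) = env_value.and_then(|raw| raw.trim().parse::<u64>().ok()) {
        return bytes;
    }
    DEFAULT_POOL_SIZE
}

/// Thin wrapper that reads the environment variable named in the PR.
fn pool_size_for_node(yaml: Option<u64>) -> u64 {
    let env_value = env::var("DORA_NODE_SHM_POOL_SIZE").ok();
    resolve_pool_size(yaml, env_value.as_deref())
}
```

Keeping the environment lookup in a separate wrapper means the precedence logic can be tested without mutating process-wide state.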

Salvaged from the closed #1378 (zenoh-SHM refactor), which was written
against pre-rewrite main and became unmergeable after the Q1 2026
consolidation. These are the portable pieces that don't conflict with
the new architecture.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@phil-opp
Collaborator Author

@Mergifyio queue

@mergify
Contributor

mergify Bot commented Apr 23, 2026

Merge Queue Status

  • 🟠 Waiting for queue conditions
  • ⏳ Enter queue
  • ⏳ Run checks
  • ⏳ Merge
Required conditions to enter a queue
  • -closed [📌 queue requirement]
  • -conflict [📌 queue requirement]
  • -draft [📌 queue requirement]
  • any of [📌 queue -> configuration change requirements]:
    • -mergify-configuration-changed
    • check-success = Configuration changed
  • any of [📌 queue requirement]:
    • check-neutral = Mergify Merge Protections
    • check-skipped = Mergify Merge Protections
    • check-success = Mergify Merge Protections
  • any of [🔀 queue conditions]:
    • all of [📌 queue conditions of queue rule default]:
      • base=main
      • check-success=Audit (cargo-audit + cargo-deny)
      • check-success=Check
      • check-success=Clippy
      • check-success=Format
      • check-success=License check
      • check-success=Typos
      • check-success=Unwrap budget
      • any of [🛡 GitHub branch protection]:
        • check-neutral = Mergify Merge Protections
        • check-skipped = Mergify Merge Protections
        • check-success = Mergify Merge Protections

@mergify mergify Bot added the queued label Apr 23, 2026
@mergify
Contributor

mergify Bot commented Apr 23, 2026

Merge Queue Status

This pull request spent 56 minutes 25 seconds in the queue, including 43 minutes 56 seconds running CI.

Required conditions to merge
  • check-success=Benchmark regression check
  • check-success=E2E Tests
  • check-success=Semantic contract tests
  • check-success=Test (ubuntu-latest)
  • check-success=pip-release all green
  • any of [🛡 GitHub branch protection]:
    • check-neutral = Mergify Merge Protections
    • check-skipped = Mergify Merge Protections
    • check-success = Mergify Merge Protections

mergify Bot added a commit that referenced this pull request Apr 23, 2026
@mergify mergify Bot merged commit 5bfc9d4 into main Apr 23, 2026
40 of 46 checks passed
@mergify mergify Bot deleted the port-shm-pool-size-config branch April 23, 2026 09:54
@mergify mergify Bot removed the queued label Apr 23, 2026
trunk-io Bot pushed a commit that referenced this pull request Apr 28, 2026
* refactor: remove DropToken system now that zenoh SHM is the data plane

Continuation of #1741, salvaging more from the closed #1378. Zenoh's SHM
provider now handles buffer lifecycle via its own reference counting, so
the custom shmem + DropToken tracking path is pure legacy.

- Node API: zenoh session + SHM provider are now mandatory in standard
  mode (no more DropStream fallback, no custom shmem allocation cache,
  no `DataSampleInner::Shmem`). `allocate_data_sample` always returns a
  heap buffer; zenoh publishes large payloads zero-copy.
- Message types: removed `DataMessage::SharedMemory`, `DropToken`,
  `NodeDropEvent`, `DaemonReply::NextDropEvents`, the
  `SubscribeDrop` / `ReportDropTokens` / `NextFinishedDropTokens`
  requests, and `drop_tokens` on `NextEvent`. `DataMessage` is now
  just `Vec`.
- DaemonCommunication::Shmem drops `daemon_drop_region_id`.
- Daemon: removed `drop_channels`, `pending_drop_tokens`,
  `DropTokenInformation`, `check_drop_token`, the drop listener loop,
  and the SHM mmap+copy fan-out path (only `DataMessage::Vec` remains).
- Dropped the `shared_memory_extended` dependency from dora-node-api.

Nodes that were implicitly relying on the non-tokio fallback will now
fail init with a clear error instead of silently degrading.

* cleanup: remove dead SHM-protocol leftovers from message enums

- DaemonReply::PreparedMessage: no construction or match sites remain
  after the SHM data-plane removal.
- node_to_daemon::InputData: single-variant enum with no callers (the
  InputData used in integration tests comes from a different module).
- Collapse the now-single-arm DataMessage match in the daemon into an
  irrefutable let.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Trigger CI

* Opt-in owned tokio runtime for nodes without ambient executor (#1748)

Support opt-in owned tokio runtime for nodes

Embedders that lack an ambient tokio runtime can now set
DORA_CREATE_OWNED_TOKIO_RUNTIME=1 to have the node build its own
multi-threaded runtime for the zenoh SHM data plane. Default remains
strict: init errors when no runtime is available.

Follow-up to #1745.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Create owned tokio runtime by default when none is ambient

Node authors shouldn't have to set up a tokio runtime just to call
`DoraNode::init_from_env()`. Drop the `DORA_CREATE_OWNED_TOKIO_RUNTIME`
opt-in gate introduced in #1748 and always build an owned multi-threaded
runtime when `Handle::try_current()` returns `Err`. Callers with an
ambient runtime keep using it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Carry ParamUpdate value as JSON bytes over the bincode wire

`NodeEvent` is serialized with bincode on the daemon→node TCP channel.
`serde_json::Value::deserialize` calls `Deserializer::deserialize_any`,
which bincode does not support: the first `ParamUpdate` would kill the
node's event stream with "Bincode does not support the
serde::Deserializer::deserialize_any method".

Change `NodeEvent::ParamUpdate.value` to `value_json: Vec<u8>`
(JSON-encoded), and serialize/deserialize at the daemon and node
boundaries. The public `Event::ParamUpdate.value: serde_json::Value`
stays unchanged for callers.

Adds a bincode round-trip regression test over representative JSON
shapes so we don't slip a `deserialize_any` field back into
`NodeEvent`.
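
The idea behind `value_json` can be illustrated without bincode or serde: if the value travels as already-encoded JSON text, the binary layer only has to move a length-prefixed byte field and never needs to know the value's shape. The sketch below uses a hand-rolled length-prefixed encoding as a stand-in for a non-self-describing format; `ParamUpdateWire`, `encode`, and `decode` are hypothetical names and this is not the actual `NodeEvent` wire layout.

```rust
/// Hypothetical wire form of a parameter update: the value is carried as
/// JSON text bytes, so the binary layer treats it as an opaque field.
#[derive(Debug, Clone, PartialEq, Eq)]
struct ParamUpdateWire {
    key: String,
    value_json: Vec<u8>,
}

/// Append a field as an 8-byte little-endian length followed by the bytes.
fn put_bytes(out: &mut Vec<u8>, bytes: &[u8]) {
    out.extend_from_slice(&(bytes.len() as u64).to_le_bytes());
    out.extend_from_slice(bytes);
}

/// Read one length-prefixed field and advance the cursor past it.
fn take_bytes<'a>(input: &mut &'a [u8]) -> Option<&'a [u8]> {
    let data: &'a [u8] = *input;
    if data.len() < 8 {
        return None;
    }
    let (len_bytes, rest) = data.split_at(8);
    let mut len_buf = [0u8; 8];
    len_buf.copy_from_slice(len_bytes);
    let len = u64::from_le_bytes(len_buf) as usize;
    if rest.len() < len {
        return None;
    }
    let (field, rest) = rest.split_at(len);
    *input = rest;
    Some(field)
}

fn encode(msg: &ParamUpdateWire) -> Vec<u8> {
    let mut out = Vec::new();
    put_bytes(&mut out, msg.key.as_bytes());
    put_bytes(&mut out, &msg.value_json);
    out
}

fn decode(mut input: &[u8]) -> Option<ParamUpdateWire> {
    let key = String::from_utf8(take_bytes(&mut input)?.to_vec()).ok()?;
    let value_json = take_bytes(&mut input)?.to_vec();
    Some(ParamUpdateWire { key, value_json })
}
```

In this arrangement the JSON is parsed only at the boundary where a self-describing value is actually needed, which is the same split the commit describes for the daemon and node sides.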

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Fall back to heap publishes when zenoh SHM provider creation fails

CI runners with a small `/dev/shm` (and tests that spawn many nodes
in sequence without segment cleanup between runs) now hit
`ShmProviderBuilder::default_backend` returning `OS error 12`
(ENOMEM), which used to abort node init outright. Treat the SHM
provider as best-effort instead: log a warning and proceed with
`zenoh_shm_provider = None`. The send path already publishes via
heap buffers when the provider is missing, so messages still flow —
just without the SHM zero-copy fast path.

Restores the contract-test suite that started failing once nodes
actually reached the SHM allocation step (previously they failed
earlier on tokio runtime init, masking the issue).
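
The best-effort pattern described here can be sketched with the standard library alone. `ShmProvider`, `create_provider`, and `choose_path` below are hypothetical stand-ins rather than zenoh or dora APIs, the simulated failure uses raw OS error 12 to mirror the ENOMEM case mentioned above, and the rule that payloads at or above the zero-copy threshold take the SHM path is an assumption.

```rust
use std::io;

/// Hypothetical stand-in for a shared-memory provider.
struct ShmProvider {
    pool_size: usize,
}

#[derive(Debug, PartialEq, Eq)]
enum PublishPath {
    SharedMemory,
    Heap,
}

/// Simulated provider creation: fails the way a too-small `/dev/shm` would.
fn create_provider(pool_size: usize, available: usize) -> io::Result<ShmProvider> {
    if pool_size > available {
        // 12 is ENOMEM on Linux.
        return Err(io::Error::from_raw_os_error(12));
    }
    Ok(ShmProvider { pool_size })
}

/// Best-effort init: warn and continue without a provider instead of aborting.
fn init_provider(pool_size: usize, available: usize) -> Option<ShmProvider> {
    match create_provider(pool_size, available) {
        Ok(provider) => Some(provider),
        Err(err) => {
            eprintln!("warning: SHM provider unavailable ({err}); using heap publishes");
            None
        }
    }
}

/// Send-path decision: SHM only when a provider exists and the payload is
/// large enough (ASSUMPTION: `>=` threshold) and fits in the pool.
fn choose_path(
    provider: Option<&ShmProvider>,
    payload_len: usize,
    zero_copy_threshold: usize,
) -> PublishPath {
    match provider {
        Some(p) if payload_len >= zero_copy_threshold && payload_len <= p.pool_size => {
            PublishPath::SharedMemory
        }
        _ => PublishPath::Heap,
    }
}
```

Because the provider is an `Option`, the send path needs no separate "degraded mode" flag: a missing provider simply routes every payload through the heap branch.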

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude <noreply@anthropic.com>
