Harden zenoh subscriber threads against panics and shutdown hangs #1746

Merged
mergify[bot] merged 1 commit into main from event-stream-hardening on Apr 23, 2026

@phil-opp
Collaborator

Three related fixes salvaged from the closed #1378 that were not covered by #1745:

  1. Wrap each zenoh subscriber thread in catch_unwind so a panic surfaces as EventItem::FatalError instead of silently killing sample delivery.
  2. Replace blocking subscriber.recv() with futures::future::select on recv_async() and a shutdown channel; dropping EventStream now wakes every subscriber immediately.
  3. Add a stop_received flag so recv_async / poll_next return None after Event::Stop — subscriber threads hold event-channel sender clones that would otherwise keep the receiver open.

Includes regression tests for the Stop → None invariant.
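
As a rough illustration of the first fix (not the code from this PR), the sketch below wraps a thread body in `catch_unwind` and forwards a panic as an error event. It uses only the standard library. `EventItem` here is a hypothetical two-variant stand-in (only the `FatalError` name comes from the description above), and `spawn_guarded` is an assumed helper name.

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};
use std::sync::mpsc::Sender;
use std::thread;

/// Hypothetical stand-in for the real event type.
#[derive(Debug, PartialEq)]
enum EventItem {
    Sample(u32),
    FatalError(String),
}

/// Runs `body` on a named thread. If `body` panics, the panic is caught and
/// reported on `tx` as `EventItem::FatalError` instead of ending the thread
/// without any signal to the consumer.
fn spawn_guarded<F>(name: &str, tx: Sender<EventItem>, body: F) -> thread::JoinHandle<()>
where
    F: FnOnce(&Sender<EventItem>) + Send + 'static,
{
    thread::Builder::new()
        .name(name.to_string())
        .spawn(move || {
            let result = catch_unwind(AssertUnwindSafe(|| body(&tx)));
            if let Err(payload) = result {
                // Panic payloads are usually `&'static str` or `String`.
                let msg = payload
                    .downcast_ref::<&str>()
                    .map(|s| s.to_string())
                    .or_else(|| payload.downcast_ref::<String>().cloned())
                    .unwrap_or_else(|| "unknown panic".to_string());
                // The consumer may already be gone; ignore a send failure.
                let _ = tx.send(EventItem::FatalError(msg));
            }
        })
        .expect("failed to spawn thread")
}
```

With this shape, a consumer reading from the channel sees any samples sent before the panic, followed by one `FatalError` carrying the panic message.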

…utdown races

Three related fixes salvaged from the closed #1378:

1. Panic handling. The subscriber threads are spawned from
   `std::thread::Builder`, so an uncaught panic would silently kill
   sample delivery for that input. Wrap the body in
   `catch_unwind(AssertUnwindSafe(...))` and surface a panic as an
   `EventItem::FatalError` so the node observes the failure.

2. Clean shutdown. The threads previously blocked on
   `subscriber.recv()` and only exited once the zenoh session itself
   tore down — dropping the `EventStream` was not enough to unblock
   them, causing test hangs. Replace the blocking `recv()` with
   `futures::future::select` on `subscriber.recv_async()` and a
   dedicated `flume` shutdown channel; dropping the new
   `_zenoh_shutdown_tx` on the `EventStream` disconnects the channel
   and wakes every subscriber immediately.

3. `stop_received` flag. Subscriber threads hold clones of the event
   channel sender, so after `AllInputsClosed`/`Stop` the daemon thread
   dropping its own sender is no longer sufficient to close the
   `tokio::sync::mpsc::Receiver` — `recv_async`/`poll_next` would hang
   waiting on those live subscriber senders. Track delivery of
   `Event::Stop` and return `None` on subsequent calls so the node
   exits and `EventStream::drop` triggers the shutdown path.
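
The drop-to-disconnect signalling in item 2 can be sketched with the standard library alone. This is an approximation under stated assumptions, not the PR's code: the commit selects on `recv_async()` and a `flume` channel, whereas the sketch below swaps in `std::sync::mpsc` and a short `recv_timeout` poll, so shutdown latency is bounded by the poll interval rather than immediate. `spawn_subscriber` and `ExitReason` are hypothetical names.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError, Sender, TryRecvError};
use std::thread;
use std::time::Duration;

/// Why the worker loop ended (hypothetical, for illustration only).
#[derive(Debug, PartialEq)]
enum ExitReason {
    Shutdown,
    SourceClosed,
    ConsumerGone,
}

/// Forwards samples to `out` until the shutdown channel disconnects.
/// Nothing is ever sent on `shutdown`; dropping its sender is the signal.
fn spawn_subscriber(
    samples: Receiver<u32>,
    shutdown: Receiver<()>,
    out: Sender<u32>,
) -> thread::JoinHandle<ExitReason> {
    thread::spawn(move || loop {
        match shutdown.try_recv() {
            // Sender still alive and silent: keep running.
            Err(TryRecvError::Empty) => {}
            // Sender dropped (or an explicit message): stop.
            Ok(()) | Err(TryRecvError::Disconnected) => break ExitReason::Shutdown,
        }
        // Polling substitute for the select used in the PR.
        match samples.recv_timeout(Duration::from_millis(10)) {
            Ok(sample) => {
                if out.send(sample).is_err() {
                    break ExitReason::ConsumerGone;
                }
            }
            Err(RecvTimeoutError::Timeout) => continue,
            Err(RecvTimeoutError::Disconnected) => break ExitReason::SourceClosed,
        }
    })
}
```

The owner of the shutdown sender never has to send anything: holding the sender keeps the worker alive, and dropping it (for example as a struct field dropped with its parent) ends the loop even while the sample source stays open.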

Add regression tests covering the `Stop` → `None` invariant via both
`recv()` and `Stream::next()`.
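
The `Stop` → `None` invariant can be illustrated with a minimal, standard-library-only sketch. It is an assumption-laden model rather than the actual `EventStream`: the real type wraps a `tokio::sync::mpsc::Receiver`, while this version uses `std::sync::mpsc`, and the `Event` enum is reduced to two hypothetical variants.

```rust
use std::sync::mpsc::Receiver;

/// Simplified stand-in for the real event type.
#[derive(Debug, PartialEq)]
enum Event {
    Input(u32),
    Stop,
}

/// Simplified stand-in for the stream wrapper.
struct EventStream {
    rx: Receiver<Event>,
    /// Set once `Event::Stop` has been handed to the caller.
    stop_received: bool,
}

impl EventStream {
    fn new(rx: Receiver<Event>) -> Self {
        EventStream { rx, stop_received: false }
    }

    /// Returns `None` after `Stop` has been delivered, without touching the
    /// channel, so live sender clones cannot make this call block.
    fn recv(&mut self) -> Option<Event> {
        if self.stop_received {
            return None;
        }
        let event = self.rx.recv().ok()?;
        if matches!(event, Event::Stop) {
            self.stop_received = true;
        }
        Some(event)
    }
}
```

Without the flag, a third `recv()` call in a scenario where another thread still holds a sender clone would block indefinitely, because the channel never reports disconnection.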

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@phil-opp
Collaborator Author

@Mergifyio queue

@mergify
Contributor

mergify Bot commented Apr 23, 2026

Merge Queue Status

This pull request spent 34 minutes 13 seconds in the queue, including 31 minutes 57 seconds running CI.

Required conditions to merge
  • check-success=Benchmark regression check
  • check-success=E2E Tests
  • check-success=Semantic contract tests
  • check-success=Test (ubuntu-latest)
  • check-success=pip-release all green
  • any of [🛡 GitHub branch protection]:
    • check-neutral = Mergify Merge Protections
    • check-skipped = Mergify Merge Protections
    • check-success = Mergify Merge Protections

mergify Bot added the queued label on Apr 23, 2026
mergify Bot added a commit that referenced this pull request on Apr 23, 2026
mergify Bot merged commit 06cc1e9 into main on Apr 23, 2026 (40 of 46 checks passed)
mergify Bot deleted the event-stream-hardening branch on April 23, 2026 at 13:59
mergify Bot removed the queued label on Apr 23, 2026