fix: recover outbound opens from stale primary handles#562

Open
peter941221 wants to merge 1 commit into paritytech:master from peter941221:codex/issue-11540-transport-fallback

Conversation

@peter941221

litep2p transport fix + force-close attribution for polkadot-sdk#11540

Summary

This PR does two narrow things that came out of the paritytech/polkadot-sdk#11540 investigation:

  1. Fix outbound substream opening when the cached primary handle is stale but a live secondary handle still exists.
  2. Add protocol-aware logging around notification-channel clogging / ForceClose so the next field run can answer which protocol is actually triggering the teardown path.

Transport fix

In TransportService, the outbound-open path previously always preferred the cached primary connection handle.

That can leave the transport stuck in a bad state when:

  • the primary handle is already stale
  • a live secondary connection still exists
  • try_get_permit() succeeds on the primary, but the send path immediately fails with ConnectionClosed

This PR changes that path so that:

  • a live secondary is promoted before opening the outbound substream if the primary handle is stale
  • if both cached handles are stale, open_substream() returns ConnectionClosed but keeps the peer context until the real ConnectionClosed propagation arrives
  • if the primary send path fails with ConnectionClosed, the transport retries once through fresh connection selection so a live secondary can recover the open
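The promotion and fallback rules above can be sketched as follows. This is a minimal, hypothetical model: `Handle`, `PeerContext`, and `select_for_open` are illustrative stand-ins, not the actual litep2p `TransportService` API.

```rust
/// A cached connection handle: either live (with a connection id) or stale.
/// (Illustrative type, not litep2p's real handle.)
#[derive(Clone, Copy, PartialEq, Debug)]
enum Handle {
    Live(u64),
    Stale,
}

#[derive(PartialEq, Debug)]
enum OpenError {
    /// Both cached handles are stale; the peer context is kept until the
    /// real `ConnectionClosed` event propagates.
    ConnectionClosed,
}

/// Per-peer cached state (illustrative).
struct PeerContext {
    primary: Handle,
    secondary: Option<Handle>,
}

impl PeerContext {
    /// Select a live connection for an outbound substream open, promoting
    /// a live secondary when the primary handle is stale. If the later send
    /// on the chosen primary fails with `ConnectionClosed`, the caller would
    /// retry once through this same selection so a live secondary can
    /// recover the open.
    fn select_for_open(&mut self) -> Result<u64, OpenError> {
        if let Handle::Live(id) = self.primary {
            return Ok(id);
        }
        // Primary is stale: promote the live secondary before opening.
        if let Some(Handle::Live(id)) = self.secondary {
            self.primary = Handle::Live(id);
            self.secondary = None;
            return Ok(id);
        }
        // Both cached handles stale: report the failure but keep the
        // peer context alive.
        Err(OpenError::ConnectionClosed)
    }
}

fn main() {
    // Stale primary, live secondary: the secondary is promoted.
    let mut ctx = PeerContext { primary: Handle::Stale, secondary: Some(Handle::Live(7)) };
    assert_eq!(ctx.select_for_open(), Ok(7));
    assert_eq!(ctx.primary, Handle::Live(7));

    // Both stale: `ConnectionClosed` is returned, context remains usable.
    let mut dead = PeerContext { primary: Handle::Stale, secondary: None };
    assert_eq!(dead.select_for_open(), Err(OpenError::ConnectionClosed));
    println!("ok");
}
```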

Force-close attribution logs

The latest #11540 thread also surfaced a different live-runtime question: when notification backpressure triggers ForceClose, which protocol is actually clogging the queue?

To make that observable, this PR carries protocol_name and sync_channel_size into NotificationHandle and logs:

  • when a clogged sync queue first queues ForceClose
  • when queuing ForceClose fails
  • when the handle remains clogged but a close is already queued
  • when the clogged state is cleared
  • when NotificationProtocol processes the resulting ForceClose command

This is instrumentation only; it does not change the current close semantics.
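A sketch of what the attribution logging looks like in state-machine form. The field names `protocol_name` and `sync_channel_size` follow the PR description; the struct layout, method names, and the example protocol string are hypothetical stand-ins for litep2p's real `NotificationHandle`.

```rust
/// Hypothetical model of the clogged-queue bookkeeping on a handle.
struct NotificationHandle {
    protocol_name: &'static str,
    sync_channel_size: usize,
    clogged: bool,
    force_close_queued: bool,
}

impl NotificationHandle {
    /// Called when the bounded sync channel is full. Returns the log line
    /// that would be emitted, so the clogging protocol is attributable.
    fn on_sync_queue_clogged(&mut self) -> String {
        self.clogged = true;
        if !self.force_close_queued {
            // First time the clogged queue queues ForceClose.
            self.force_close_queued = true;
            format!(
                "clogged sync queue, queueing ForceClose: protocol={} sync_channel_size={}",
                self.protocol_name, self.sync_channel_size
            )
        } else {
            // Handle remains clogged but a close is already queued.
            format!(
                "handle still clogged, ForceClose already queued: protocol={}",
                self.protocol_name
            )
        }
    }

    /// Called when the channel drains again and the clogged state clears.
    fn on_clogged_cleared(&mut self) -> String {
        self.clogged = false;
        format!("clogged state cleared: protocol={}", self.protocol_name)
    }
}

fn main() {
    let mut handle = NotificationHandle {
        protocol_name: "/example/notifications/1", // illustrative name
        sync_channel_size: 64,
        clogged: false,
        force_close_queued: false,
    };
    let first = handle.on_sync_queue_clogged();
    assert!(first.contains("queueing ForceClose"));
    let second = handle.on_sync_queue_clogged();
    assert!(second.contains("already queued"));
    println!("{first}");
}
```

The point of threading the protocol name through is that every one of the five log sites carries the same attribution fields, so a single grep over a field run answers "which protocol clogged first".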

Tests

Local litep2p:

cargo test protocol::transport_service::tests:: -- --nocapture
cargo test protocol::notification::tests:: -- --nocapture

Path-patched polkadot-sdk validation in WSL:

SKIP_WASM_BUILD=1 cargo test --locked -p sc-network notification_protocol_ -- --nocapture
SKIP_WASM_BUILD=1 cargo test --locked -p sc-network reserved_peer_close_triggers_immediate_reopen -- --nocapture
SKIP_WASM_BUILD=1 cargo test --locked -p sc-network closing_primary_with_only_open_desired_secondary_closes_instead_of_replacing -- --nocapture
SKIP_WASM_BUILD=1 cargo test --locked -p sc-network notification_sink_replacement_stays_internal_and_keeps_peer_live -- --nocapture
SKIP_WASM_BUILD=1 cargo test --locked -p sc-network inflight_secondary_validation_closes_peer_until_replacement_really_opens -- --nocapture

Scope note

I am not claiming this is the single complete explanation for every disconnect reported in paritytech/polkadot-sdk#11540.

My current reading is narrower:

  • this PR fixes one transport-state correctness hole around stale primary / live secondary handling
  • and it makes the notification-triggered ForceClose path much easier to attribute in the next field run

That felt like the safest honest slice to upstream first.
