fix: recover outbound opens from stale primary handles#562

Open
peter941221 wants to merge 1 commit into paritytech:master from peter941221:codex/issue-11540-transport-fallback

Conversation

@peter941221

litep2p transport fix + force-close attribution for polkadot-sdk#11540

Summary

This PR does two narrow things that came out of the paritytech/polkadot-sdk#11540 investigation:

  1. Fix outbound substream opening when the cached primary handle is stale but a live secondary handle still exists.
  2. Add protocol-aware logging around notification-channel clogging / ForceClose so the next field run can answer which protocol is actually triggering the teardown path.

Transport fix

In TransportService, the outbound-open path previously always preferred the cached primary connection handle.

That can leave the transport stuck in a bad state when:

  • the primary handle is already stale
  • a live secondary connection still exists
  • try_get_permit() succeeds on the primary, but the send path immediately fails with ConnectionClosed

This PR changes that path so that:

  • a live secondary is promoted before opening the outbound substream if the primary handle is stale
  • if both cached handles are stale, open_substream() returns ConnectionClosed but keeps the peer context until the real ConnectionClosed propagation arrives
  • if the primary send path fails with ConnectionClosed, the transport retries once through fresh connection selection so a live secondary can recover the open
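The promotion and fallback rules above can be sketched as follows. This is a minimal, hypothetical model: `Handle`, `PeerContext`, and `select_for_open` are illustrative stand-ins, not the actual litep2p `TransportService` API.

```rust
/// A cached connection handle: either live (with a connection id) or stale.
/// (Illustrative type, not litep2p's real handle.)
#[derive(Clone, Copy, PartialEq, Debug)]
enum Handle {
    Live(u64),
    Stale,
}

#[derive(PartialEq, Debug)]
enum OpenError {
    /// Both cached handles are stale; the peer context is kept until the
    /// real `ConnectionClosed` event propagates.
    ConnectionClosed,
}

/// Per-peer cached state (illustrative).
struct PeerContext {
    primary: Handle,
    secondary: Option<Handle>,
}

impl PeerContext {
    /// Select a live connection for an outbound substream open, promoting
    /// a live secondary when the primary handle is stale. If the later send
    /// on the chosen primary fails with `ConnectionClosed`, the caller would
    /// retry once through this same selection so a live secondary can
    /// recover the open.
    fn select_for_open(&mut self) -> Result<u64, OpenError> {
        if let Handle::Live(id) = self.primary {
            return Ok(id);
        }
        // Primary is stale: promote the live secondary before opening.
        if let Some(Handle::Live(id)) = self.secondary {
            self.primary = Handle::Live(id);
            self.secondary = None;
            return Ok(id);
        }
        // Both cached handles stale: report the failure but keep the
        // peer context alive.
        Err(OpenError::ConnectionClosed)
    }
}

fn main() {
    // Stale primary, live secondary: the secondary is promoted.
    let mut ctx = PeerContext { primary: Handle::Stale, secondary: Some(Handle::Live(7)) };
    assert_eq!(ctx.select_for_open(), Ok(7));
    assert_eq!(ctx.primary, Handle::Live(7));

    // Both stale: `ConnectionClosed` is returned, context remains usable.
    let mut dead = PeerContext { primary: Handle::Stale, secondary: None };
    assert_eq!(dead.select_for_open(), Err(OpenError::ConnectionClosed));
    println!("ok");
}
```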

Force-close attribution logs

The latest #11540 thread also surfaced a different live-runtime question: when notification backpressure triggers ForceClose, which protocol is actually clogging the queue?

To make that observable, this PR carries protocol_name and sync_channel_size into NotificationHandle and logs:

  • when a clogged sync queue first queues ForceClose
  • when queuing ForceClose fails
  • when the handle remains clogged but a close is already queued
  • when the clogged state is cleared
  • when NotificationProtocol processes the resulting ForceClose command

This is instrumentation only; it does not change the current close semantics.
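A sketch of what the attribution logging looks like in state-machine form. The field names `protocol_name` and `sync_channel_size` follow the PR description; the struct layout, method names, and the example protocol string are hypothetical stand-ins for litep2p's real `NotificationHandle`.

```rust
/// Hypothetical model of the clogged-queue bookkeeping on a handle.
struct NotificationHandle {
    protocol_name: &'static str,
    sync_channel_size: usize,
    clogged: bool,
    force_close_queued: bool,
}

impl NotificationHandle {
    /// Called when the bounded sync channel is full. Returns the log line
    /// that would be emitted, so the clogging protocol is attributable.
    fn on_sync_queue_clogged(&mut self) -> String {
        self.clogged = true;
        if !self.force_close_queued {
            // First time the clogged queue queues ForceClose.
            self.force_close_queued = true;
            format!(
                "clogged sync queue, queueing ForceClose: protocol={} sync_channel_size={}",
                self.protocol_name, self.sync_channel_size
            )
        } else {
            // Handle remains clogged but a close is already queued.
            format!(
                "handle still clogged, ForceClose already queued: protocol={}",
                self.protocol_name
            )
        }
    }

    /// Called when the channel drains again and the clogged state clears.
    fn on_clogged_cleared(&mut self) -> String {
        self.clogged = false;
        format!("clogged state cleared: protocol={}", self.protocol_name)
    }
}

fn main() {
    let mut handle = NotificationHandle {
        protocol_name: "/example/notifications/1", // illustrative name
        sync_channel_size: 64,
        clogged: false,
        force_close_queued: false,
    };
    let first = handle.on_sync_queue_clogged();
    assert!(first.contains("queueing ForceClose"));
    let second = handle.on_sync_queue_clogged();
    assert!(second.contains("already queued"));
    println!("{first}");
}
```

The point of threading the protocol name through is that every one of the five log sites carries the same attribution fields, so a single grep over a field run answers "which protocol clogged first".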

Tests

Local litep2p:

cargo test protocol::transport_service::tests:: -- --nocapture
cargo test protocol::notification::tests:: -- --nocapture

Path-patched polkadot-sdk validation in WSL:

SKIP_WASM_BUILD=1 cargo test --locked -p sc-network notification_protocol_ -- --nocapture
SKIP_WASM_BUILD=1 cargo test --locked -p sc-network reserved_peer_close_triggers_immediate_reopen -- --nocapture
SKIP_WASM_BUILD=1 cargo test --locked -p sc-network closing_primary_with_only_open_desired_secondary_closes_instead_of_replacing -- --nocapture
SKIP_WASM_BUILD=1 cargo test --locked -p sc-network notification_sink_replacement_stays_internal_and_keeps_peer_live -- --nocapture
SKIP_WASM_BUILD=1 cargo test --locked -p sc-network inflight_secondary_validation_closes_peer_until_replacement_really_opens -- --nocapture

Scope note

I am not claiming this is the single complete explanation for every disconnect reported in paritytech/polkadot-sdk#11540.

My current reading is narrower:

  • this PR fixes one transport-state correctness hole around stale primary / live secondary handling
  • and it makes the notification-triggered ForceClose path much easier to attribute in the next field run

That felt like the safest honest slice to upstream first.
