fix: recover outbound opens from stale primary handles #562
Open
peter941221 wants to merge 1 commit into paritytech:master
PR Body - litep2p transport fix + force-close attribution for polkadot-sdk#11540

Summary

This PR does two narrow things that came out of the paritytech/polkadot-sdk#11540 investigation:

- [...] ForceClose so the next field run can answer which protocol is actually triggering the teardown path.

Transport fix
In TransportService, outbound open previously always preferred the cached primary connection.

That can leave the transport stuck in a bad state when:

- try_get_permit() succeeds on primary but the send path immediately fails with ConnectionClosed

This PR changes that path so that:

- open_substream() returns ConnectionClosed but keeps the peer context until the real ConnectionClosed propagation arrives
- [...] ConnectionClosed, the transport retries once through fresh connection selection so a live secondary can recover the open

Force-close attribution logs
The latest #11540 thread also surfaced a different live-runtime question: when notification backpressure triggers ForceClose, which protocol is actually clogging the queue?

To make that observable, this PR carries protocol_name and sync_channel_size into NotificationHandle and logs:

- [...] ForceClose
- [...] ForceClose fails
- [...] NotificationProtocol processes the resulting ForceClose command

This is instrumentation only; it does not change the current close semantics.
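To illustrate the attribution idea, here is a minimal standalone sketch, not the code in this PR. It models a handle that carries protocol_name and sync_channel_size and produces an attribution line when its bounded queue is full. The struct layout, the try_send_notification method, the SendOutcome enum, and the log line format are all hypothetical and chosen only for illustration; the real NotificationHandle has more fields and uses the project's own logging.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

/// Hypothetical, simplified stand-in for a notification handle.
pub struct NotificationHandle {
    pub protocol_name: String,
    pub sync_channel_size: usize,
    tx: SyncSender<Vec<u8>>,
}

/// Hypothetical outcome type used only by this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum SendOutcome {
    /// The notification was queued.
    Sent,
    /// The queue was full; carries the attribution line that would be logged.
    ForceClose(String),
    /// The receiving side is gone.
    Closed,
}

impl NotificationHandle {
    /// Creates a handle backed by a bounded std channel of the given size.
    pub fn new(protocol_name: &str, sync_channel_size: usize) -> (Self, Receiver<Vec<u8>>) {
        let (tx, rx) = sync_channel(sync_channel_size);
        let handle = Self {
            protocol_name: protocol_name.to_string(),
            sync_channel_size,
            tx,
        };
        (handle, rx)
    }

    /// Tries to queue a notification without blocking.
    /// On a full queue, returns the line that attributes the force-close
    /// to this handle's protocol (illustrative format, not the PR's).
    pub fn try_send_notification(&self, notification: Vec<u8>) -> SendOutcome {
        match self.tx.try_send(notification) {
            Ok(()) => SendOutcome::Sent,
            Err(TrySendError::Full(_)) => SendOutcome::ForceClose(format!(
                "sync channel full, force-closing: protocol={} sync_channel_size={}",
                self.protocol_name, self.sync_channel_size
            )),
            Err(TrySendError::Disconnected(_)) => SendOutcome::Closed,
        }
    }
}
```

Because the handle itself carries the protocol name and queue size, the point where backpressure is detected can name the clogged protocol without any extra lookup.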
Tests
Local litep2p:

Path-patched polkadot-sdk validation in WSL:

Scope note
I am not claiming this is the single complete explanation for every disconnect reported in paritytech/polkadot-sdk#11540.

My current reading is narrower:

- [...] ForceClose path much easier to attribute in the next field run

That felt like the safest honest slice to upstream first.
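For reference, the retry-once behavior described under Transport fix can be modeled roughly as follows. This is a minimal sketch and not the actual litep2p code: Handle, PeerContext, OpenError, select, and open_outbound are hypothetical names. Treating "fresh connection selection" as "pick the first handle other than the one that just failed" is an assumption made only for this example.

```rust
use std::collections::HashMap;

/// Hypothetical error type; only the cases needed by the sketch are modeled.
#[derive(Debug, PartialEq, Eq)]
pub enum OpenError {
    ConnectionClosed,
    NoConnection,
}

/// Hypothetical stand-in for a connection handle.
#[derive(Debug, Clone)]
pub struct Handle {
    pub id: usize,
    /// `false` models a stale handle whose send path fails immediately.
    pub alive: bool,
}

impl Handle {
    /// Returns the id of the connection the substream was opened on.
    fn open_substream(&self) -> Result<usize, OpenError> {
        if self.alive {
            Ok(self.id)
        } else {
            Err(OpenError::ConnectionClosed)
        }
    }
}

/// Hypothetical per-peer state with a cached primary and optional secondary.
#[derive(Debug, Default)]
pub struct PeerContext {
    pub primary: Option<Handle>,
    pub secondary: Option<Handle>,
}

impl PeerContext {
    /// Selects the first known handle, skipping `exclude` if given (assumption).
    fn select(&self, exclude: Option<usize>) -> Option<&Handle> {
        self.primary
            .iter()
            .chain(self.secondary.iter())
            .find(|h| Some(h.id) != exclude)
    }
}

/// Opens an outbound substream, retrying exactly once if the first
/// selected handle turns out to be closed.
pub fn open_outbound(peers: &HashMap<u32, PeerContext>, peer: u32) -> Result<usize, OpenError> {
    let ctx = peers.get(&peer).ok_or(OpenError::NoConnection)?;
    let first = ctx.select(None).ok_or(OpenError::NoConnection)?;
    match first.open_substream() {
        Err(OpenError::ConnectionClosed) => {
            // The peer context is left untouched here; removing it is deferred
            // to the real connection-closed event, which this sketch omits.
            match ctx.select(Some(first.id)) {
                Some(other) => other.open_substream(),
                None => Err(OpenError::ConnectionClosed),
            }
        }
        result => result,
    }
}
```

With a stale primary and a live secondary, the open succeeds on the secondary. With only a stale primary, the caller still sees ConnectionClosed and the peer entry stays in the map.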