[dnm] net: Improve connectivity to validators by reconnecting on requests by lexnv · Pull Request #9160 · paritytech/polkadot-sdk

lexnv · 2025-07-10T11:26:25Z

This PR changes the behavior of returning immediate errors when attempting to submit requests.

In the past, substrate operated under the assumption that validator connections are stable. However, we have seen multiple cases in practice where validators will disconnect and reconnect. To minimize the downtime of the connection, the networking backend will attempt to reconnect to validators if they are disconnected.

This PR has subtle implications for the request-response protocols, which may lead to increased timeouts during the dialing process.

Needed for testing:

Poor connectivity to a validators group #8885

requests Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>

paritytech-workflow-stopper · 2025-07-10T12:02:38Z

All GitHub workflows were cancelled due to failure one of the required jobs.
Failed workflow url: https://github.com/paritytech/polkadot-sdk/actions/runs/16193835107
Failed job name: test-linux-stable-no-try-runtime

paritytech-workflow-stopper · 2025-07-10T12:02:38Z

All GitHub workflows were cancelled due to failure one of the required jobs.
Failed workflow url: https://github.com/paritytech/polkadot-sdk/actions/runs/16193835107
Failed job name: test-linux-stable

skunert · 2025-07-21T06:48:29Z

I remember that back in the day we already had something like this. @ordian changed it here: paritytech/polkadot#4253

So the question to me is: What changed that we want this now that was not true back then? Or should we never have changed it? I know the linked PR is ancient, just trying to reason why we did not want it in the first place.

lexnv · 2025-07-21T10:37:04Z

I'm not sure if we had reliable insights into the connection stability before. The previous metric was taking into account a snapshot of connectivity at the time of metric collection #8885 (comment), which led Andrei to adding a new histogram in:

Add polkadot_parachain_peer_connectivity metric #8973

That being said, I'm still not convinced this PoC is the proper fix. The PR improves connectivity here (especially if connections drop for network reasons), but at the same time, it adds unnecessary delays if dialing takes a long time (not entirely sure what assumptions higher req-resp protocols make about this).

Hopefully #9178 will improve things enough here, but there's still a chance that connections will drop.

From @ordian's PR:

As discussed, we should attempt to preconnect to every validator and trying to connect if disconnected in req/resp will only waste time.

Mostly thinking out loud: Maybe this assumes that a validator has disconnected (after the preconnect) because of a protocol violation which led to our node being banned? And therefore, dialing only wastes resources after that point?

However, if we dropped the connection because of an unexpected EOF, we should be able to re-establish the connection and allow the req-resp to go through. Maybe @ordian has a better context here (IIRC, notification protocol will backoff / chill a disconnected peer that attempts to reconnect so maybe that's why we don't try to establish a new connection even tho the req-resp would function correctly?)

ordian · 2025-07-22T14:39:20Z

However, if we dropped the connection because of an unexpected EOF, we should be able to re-establish the connection and allow the req-resp to go through. Maybe @ordian has a better context here (IIRC, notification protocol will backoff / chill a disconnected peer that attempts to reconnect so maybe that's why we don't try to establish a new connection even tho the req-resp would function correctly?)

The way it works with validator-to-validator connections is that we should be maintaining the connectivity via reserved peerset. If there are temporary disconnects, eventually (we assume) peers should establish and maintain the connection. Given that assumption, there is a little point in trying to reconnect for request response protocols if we for some reason aren't.

That being said, there's some discrepancy with the way notification and request-response protocols work when it comes to reserved peerset and banning. AFAIK request-response protocol respect banning which notif protocol don't. Maybe we want to make some adjustments there, but this is a different discussion to be had.

net: Improve connectivity to validators by establishing connections on

8fdaa29

requests Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>

lexnv changed the title ~~[dnm] net: Improve connectivity to validators by establishing connections on~~ [dnm] net: Improve connectivity to validators by reconnecting on requests Jul 10, 2025

lexnv mentioned this pull request Jul 10, 2025

Poor connectivity to a validators group #8885

Open

lexnv mentioned this pull request Jul 18, 2025

AsyncBacking: Parachain blocks produced after 1 minute #9227

Open

lexnv mentioned this pull request Jul 31, 2025

lexnv/evidence: Retain at rank I polkadot-fellows/Evaluations#186

Merged

3 tasks

lexnv closed this Oct 6, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[dnm] net: Improve connectivity to validators by reconnecting on requests#9160

[dnm] net: Improve connectivity to validators by reconnecting on requests#9160
lexnv wants to merge 1 commit intomasterfrom
lexnv/connect-if-disconnect

lexnv commented Jul 10, 2025

Uh oh!

paritytech-workflow-stopper bot commented Jul 10, 2025

Uh oh!

paritytech-workflow-stopper bot commented Jul 10, 2025

Uh oh!

skunert commented Jul 21, 2025

Uh oh!

lexnv commented Jul 21, 2025

Uh oh!

ordian commented Jul 22, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

lexnv commented Jul 10, 2025

Uh oh!

paritytech-workflow-stopper bot commented Jul 10, 2025

Uh oh!

paritytech-workflow-stopper bot commented Jul 10, 2025

Uh oh!

skunert commented Jul 21, 2025

Uh oh!

lexnv commented Jul 21, 2025

Uh oh!

ordian commented Jul 22, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants