Skip to content

[dnm] net: Improve connectivity to validators by reconnecting on requests#9160

Closed
lexnv wants to merge 1 commit intomasterfrom
lexnv/connect-if-disconnect
Closed

[dnm] net: Improve connectivity to validators by reconnecting on requests#9160
lexnv wants to merge 1 commit intomasterfrom
lexnv/connect-if-disconnect

Conversation

@lexnv
Copy link
Copy Markdown
Contributor

@lexnv lexnv commented Jul 10, 2025

This PR changes the behavior of returning immediate errors when attempting to submit requests.

In the past, substrate operated under the assumption that validator connections are stable. However, we have seen multiple cases in practice where validators will disconnect and reconnect. To minimize the downtime of the connection, the networking backend will attempt to reconnect to validators if they are disconnected.

This PR has subtle implications for the request-response protocols, which may lead to increased timeouts during the dialing process.

Needed for testing:

requests

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
@lexnv lexnv changed the title [dnm] net: Improve connectivity to validators by establishing connections on [dnm] net: Improve connectivity to validators by reconnecting on requests Jul 10, 2025
@paritytech-workflow-stopper
Copy link
Copy Markdown

All GitHub workflows were cancelled due to failure one of the required jobs.
Failed workflow url: https://github.com/paritytech/polkadot-sdk/actions/runs/16193835107
Failed job name: test-linux-stable-no-try-runtime

@paritytech-workflow-stopper
Copy link
Copy Markdown

All GitHub workflows were cancelled due to failure one of the required jobs.
Failed workflow url: https://github.com/paritytech/polkadot-sdk/actions/runs/16193835107
Failed job name: test-linux-stable

@skunert
Copy link
Copy Markdown
Contributor

skunert commented Jul 21, 2025

I remember that back in the day we already had something like this. @ordian changed it here: paritytech/polkadot#4253

So the question to me is: What changed that we want this now that was not true back then? Or should we never have changed it? I know the linked PR is ancient, just trying to reason why we did not want it in the first place.

@lexnv
Copy link
Copy Markdown
Contributor Author

lexnv commented Jul 21, 2025

I'm not sure if we had reliable insights into the connection stability before. The previous metric was taking into account a snapshot of connectivity at the time of metric collection #8885 (comment), which led Andrei to adding a new histogram in:

That being said, I'm still not convinced this PoC is the proper fix. The PR improves connectivity here (especially if connections drop for network reasons), but at the same time, it adds unnecessary delays if dialing takes a long time (not entirely sure what assumptions higher req-resp protocols make about this).

Hopefully #9178 will improve things enough here, but there's still a chance that connections will drop.


From @ordian's PR:

As discussed, we should attempt to preconnect to every validator and trying to connect if disconnected in req/resp will only waste time.

Mostly thinking out loud: Maybe this assumes that a validator has disconnected (after the preconnect) because of a protocol violation which led to our node being banned? And therefore, dialing only wastes resources after that point?

However, if we dropped the connection because of an unexpected EOF, we should be able to re-establish the connection and allow the req-resp to go through. Maybe @ordian has a better context here (IIRC, notification protocol will backoff / chill a disconnected peer that attempts to reconnect so maybe that's why we don't try to establish a new connection even tho the req-resp would function correctly?)

@ordian
Copy link
Copy Markdown
Contributor

ordian commented Jul 22, 2025

However, if we dropped the connection because of an unexpected EOF, we should be able to re-establish the connection and allow the req-resp to go through. Maybe @ordian has a better context here (IIRC, notification protocol will backoff / chill a disconnected peer that attempts to reconnect so maybe that's why we don't try to establish a new connection even tho the req-resp would function correctly?)

The way it works with validator-to-validator connections is that we should be maintaining the connectivity via reserved peerset. If there are temporary disconnects, eventually (we assume) peers should establish and maintain the connection. Given that assumption, there is a little point in trying to reconnect for request response protocols if we for some reason aren't.

That being said, there's some discrepancy with the way notification and request-response protocols work when it comes to reserved peerset and banning. AFAIK request-response protocol respect banning which notif protocol don't. Maybe we want to make some adjustments there, but this is a different discussion to be had.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants