[dnm] net: Improve connectivity to validators by reconnecting on requests#9160
[dnm] net: Improve connectivity to validators by reconnecting on requests#9160
Conversation
requests Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
|
All GitHub workflows were cancelled due to failure one of the required jobs. |
|
All GitHub workflows were cancelled due to failure one of the required jobs. |
|
I remember that back in the day we already had something like this. @ordian changed it here: paritytech/polkadot#4253 So the question to me is: What changed that we want this now that was not true back then? Or should we never have changed it? I know the linked PR is ancient, just trying to reason why we did not want it in the first place. |
|
I'm not sure if we had reliable insights into the connection stability before. The previous metric was taking into account a snapshot of connectivity at the time of metric collection #8885 (comment), which led Andrei to adding a new histogram in: That being said, I'm still not convinced this PoC is the proper fix. The PR improves connectivity here (especially if connections drop for network reasons), but at the same time, it adds unnecessary delays if dialing takes a long time (not entirely sure what assumptions higher req-resp protocols make about this). Hopefully #9178 will improve things enough here, but there's still a chance that connections will drop. From @ordian's PR:
Mostly thinking out loud: Maybe this assumes that a validator has disconnected (after the preconnect) because of a protocol violation which led to our node being banned? And therefore, dialing only wastes resources after that point? However, if we dropped the connection because of an unexpected EOF, we should be able to re-establish the connection and allow the req-resp to go through. Maybe @ordian has a better context here (IIRC, notification protocol will backoff / chill a disconnected peer that attempts to reconnect so maybe that's why we don't try to establish a new connection even tho the req-resp would function correctly?) |
The way it works with validator-to-validator connections is that we should be maintaining the connectivity via reserved peerset. If there are temporary disconnects, eventually (we assume) peers should establish and maintain the connection. Given that assumption, there is a little point in trying to reconnect for request response protocols if we for some reason aren't. That being said, there's some discrepancy with the way notification and request-response protocols work when it comes to reserved peerset and banning. AFAIK request-response protocol respect banning which notif protocol don't. Maybe we want to make some adjustments there, but this is a different discussion to be had. |
This PR changes the behavior of returning immediate errors when attempting to submit requests.
In the past, substrate operated under the assumption that validator connections are stable. However, we have seen multiple cases in practice where validators will disconnect and reconnect. To minimize the downtime of the connection, the networking backend will attempt to reconnect to validators if they are disconnected.
This PR has subtle implications for the request-response protocols, which may lead to increased timeouts during the dialing process.
Needed for testing: