Send PeerViewChange with high priority #4755
Conversation
This definitely needs to be tested under high load.
To get heavy load you can spawn a good amount of parachains and use malus nodes to create a dispute storm.
This PR should help in situations where we have a high backlog of unprocessed messages in approval-distribution: we will gossip messages to some peers as soon as we receive the PeerViewChange, instead of waiting until we have caught up with processing. The benefit is that peers will conclude faster that enough assignments have been triggered, so they won't trigger their own assignments. We can probably simulate such a scenario with the approval-voting subsystem-benchmark and measure how fast our node gossips the assignments it received to its peers.
The main benefit of this change is that we keep gossip protocols in better sync with what other peers work on. I expect this to lead to improved gossip propagation times -> lower approval checking lag and less needless work (like @alexggh said above, we won't trigger assignments while sufficient other assignments sit in our queue).
I think this is mandatory. AFAIR we had some trouble in the past when switching to async channel, maybe it's a good time to try it and maybe get some speedup there as well.
Yeah, like good old times on Versi :)
```rust
/// Message counter over subsystems.
#[derive(Default, Clone)]
pub struct MessageCounter(Arc<Mutex<MessageCounterInner>>);
```
Since all of the inner stuff is counters, we can drop the mutex and use atomics.
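For illustration, here is a minimal sketch of what the lock-free variant could look like. The field names (`normal`, `high_priority`) are hypothetical — the real `MessageCounterInner` may differ — but the pattern of replacing `Mutex<Inner>` with per-field atomics is the point:

```rust
use std::sync::{
    atomic::{AtomicUsize, Ordering},
    Arc,
};

/// Message counter over subsystems, lock-free variant.
/// Field names are illustrative; the real inner type may differ.
#[derive(Default, Clone)]
pub struct MessageCounter(Arc<MessageCounterInner>);

#[derive(Default)]
struct MessageCounterInner {
    normal: AtomicUsize,
    high_priority: AtomicUsize,
}

impl MessageCounter {
    pub fn increment(&self, high_priority: bool) {
        // Relaxed ordering is enough: these are statistics,
        // not used for synchronization between threads.
        if high_priority {
            self.0.high_priority.fetch_add(1, Ordering::Relaxed);
        } else {
            self.0.normal.fetch_add(1, Ordering::Relaxed);
        }
    }

    pub fn counts(&self) -> (usize, usize) {
        (
            self.0.normal.load(Ordering::Relaxed),
            self.0.high_priority.load(Ordering::Relaxed),
        )
    }
}
```

Clones still share the same `Arc`, so increments from any clone are visible everywhere, just as with the `Mutex` version but without lock contention.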
| macro_rules! send_message { | ||
| ($event:expr, $message:ident) => { | ||
| if let Ok(event) = $event.focus() { | ||
| let has_high_priority = matches!(event, NetworkBridgeEvent::PeerViewChange(..)); |
Thought about this a bit more: one side effect of this is that we change the order of messages. For example, if the system is loaded, I can imagine a situation where the bridge sends PeerConnected on the normal channel and then PeerViewChange on the priority channel, so PeerViewChange arrives before PeerConnected and is discarded.
I don't think it will be a frequent problem, but it is something this change might cause, so I wonder if we need to make PeerConnected high priority as well.
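To make the reordering concern concrete, here is a self-contained sketch using plain std channels as stand-ins for orchestra's prioritized queues (the `Event` type here is a hypothetical simplification of `NetworkBridgeEvent`): a consumer that always drains the high-priority channel first can observe PeerViewChange before the PeerConnected that was sent earlier.

```rust
use std::sync::mpsc::{channel, Receiver};

// Hypothetical stand-in for the relevant NetworkBridgeEvent variants.
#[derive(Debug, PartialEq)]
enum Event {
    PeerConnected,
    PeerViewChange,
}

// Drain the priority channel before the normal one, mimicking a
// consumer that always favors high-priority messages.
fn drain(prio: Receiver<Event>, normal: Receiver<Event>) -> Vec<Event> {
    let mut seen = Vec::new();
    while let Ok(ev) = prio.try_recv() {
        seen.push(ev);
    }
    while let Ok(ev) = normal.try_recv() {
        seen.push(ev);
    }
    seen
}

fn main() {
    let (normal_tx, normal_rx) = channel();
    let (prio_tx, prio_rx) = channel();

    // The bridge sends PeerConnected first (normal priority),
    // then PeerViewChange (high priority)...
    normal_tx.send(Event::PeerConnected).unwrap();
    prio_tx.send(Event::PeerViewChange).unwrap();

    // ...but the consumer observes them in the opposite order,
    // so the view change would hit an unknown-peer path.
    let seen = drain(prio_rx, normal_rx);
    assert_eq!(seen, vec![Event::PeerViewChange, Event::PeerConnected]);
}
```

This is why making PeerConnected high priority as well would restore the relative order between the two events: both would then travel on the same queue.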
Was it possible to receive PeerViewChange before PeerConnected because of some network problems?
Tested on Versi, results posted to the first message.
The CI pipeline was cancelled due to the failure of one of the required jobs.
alexggh
left a comment
Overall looks good to me. Left you some comments; once those get addressed we should be good to go.
sandreim
left a comment
LGTM! Nice work @AndreiEres , left just a few more nits.
prdoc/pr_4755.prdoc
Outdated
```yaml
title: Send PeerViewChange with high priority

doc:
  - audience: Node Operator
```
I think Node developers should be the main target audience. Node operators would likely only want to hear if and how this reduces the CPU utilization of the node.
prdoc/pr_4755.prdoc
Outdated
```yaml
doc:
  - audience: Node Operator
    description: |
      - orchestra updated to 0.4.0
```
This can be more specific and mention why we are upgrading it; in this case it is the support for subsystem message delivery prioritization.
Closes #577

### Changed

- `orchestra` updated to 0.4.0
- `PeerViewChange` sent with high priority and should be processed first in a queue.
- To count them in tests, added a tracker to TestSender and TestOverseer. It acts more like a smoke test though.

### Testing on Versi

The changes were tested on Versi with two objectives:

1. Make sure the node functionality does not change.
2. See how the changes affect performance.

Test setup:

- 2.5 hours for each case
- 100 validators
- 50 parachains
- validatorsPerCore = 2
- neededApprovals = 100
- nDelayTranches = 89
- relayVrfModuloSamples = 50

During the test period, all nodes ran without any crashes, which satisfies the first objective.

To estimate the change in performance we used ToF charts. The graphs show that the spikes seen at the top before are gone, which supports our hypothesis.

### Normalized charts with ToF

[Before](https://grafana.teleport.parity.io/goto/ZoR53ClSg?orgId=1)

[After](https://grafana.teleport.parity.io/goto/6ux5qC_IR?orgId=1)

### Conclusion

The prioritization of subsystem messages reduces the ToF of the networking subsystem, which helps faster propagation of gossip messages.