
Introduce approval-voting/distribution benchmark#2621

Merged
alexggh merged 247 commits into master from alexaggh/subsystem-bench-approvals
Feb 5, 2024
Conversation

@alexggh (Contributor) commented Dec 5, 2023

Summary

Built on top of the tooling and ideas introduced in #2528, this PR introduces a synthetic benchmark for measuring and assessing the performance characteristics of the approval-voting and approval-distribution subsystems.

Currently this allows us to simulate the behaviour of these subsystems along the following dimensions:

TestConfiguration:
# Test 1
- objective: !ApprovalsTest
    last_considered_tranche: 89
    min_coalesce: 1
    max_coalesce: 6
    enable_assignments_v2: true
    send_till_tranche: 60
    stop_when_approved: false
    coalesce_tranche_diff: 12
    workdir_prefix: "/tmp"
    num_no_shows_per_candidate: 0
    approval_distribution_expected_tof: 6.0
    approval_distribution_cpu_ms: 3.0
    approval_voting_cpu_ms: 4.30
  n_validators: 500
  n_cores: 100
  n_included_candidates: 100
  min_pov_size: 1120
  max_pov_size: 5120
  peer_bandwidth: 524288000000
  bandwidth: 524288000000
  latency:
    min_latency:
      secs: 0
      nanos: 1000000
    max_latency:
      secs: 0
      nanos: 100000000
  error: 0
  num_blocks: 10
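The raw units in a configuration like the one above can be worth normalising before a run: bandwidth is in bytes/s and latency is split into secs/nanos. A small sketch of that conversion; the helper names are illustrative, not part of polkadot-subsystem-bench:

```python
# Normalise a few raw fields from an approvals benchmark config.
# Field names mirror the YAML above; the helpers are hypothetical.

def latency_ms(secs: int, nanos: int) -> float:
    """Convert a {secs, nanos} latency entry to milliseconds."""
    return secs * 1000 + nanos / 1_000_000

def bandwidth_mib_s(bytes_per_s: int) -> float:
    """Convert bytes/s to MiB/s."""
    return bytes_per_s / (1024 * 1024)

# Values taken from the config above:
min_lat = latency_ms(0, 1_000_000)          # 1.0 ms
max_lat = latency_ms(0, 100_000_000)        # 100.0 ms
peer_bw = bandwidth_mib_s(524_288_000_000)  # 500000.0 MiB/s
```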

The approach

  1. We build a real overseer with the real implementations of the approval-voting and approval-distribution subsystems.
  2. For a given network size, we pre-compute, for each validator, all the potential assignments and approvals it would send. Because this is a computation-heavy operation, the result is cached in a file on disk and re-used as long as the generation parameters don't change.
  3. The messages are then sent according to the configured parameters, split across 3 main benchmarking scenarios.

Benchmarking scenarios

Best case scenario approvals_throughput_best_case.yaml

It sends to approval-distribution only the minimum number of tranches required to gather needed_approvals, so that a candidate is approved.

Behaviour in the presence of no-shows approvals_no_shows.yaml

It sends the tranches needed to approve a candidate when we have a maximum of num_no_shows_per_candidate tranches with no-shows for each candidate.

Maximum throughput approvals_throughput.yaml

It sends all the tranches for each block and measures the CPU usage and network bandwidth required by the approval-voting and approval-distribution subsystems.
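The three scenarios differ mainly in how many tranches are driven into approval-distribution. A simplified sketch of that selection, paraphrased from the descriptions above (the real criteria live in the benchmark's objective handling; `assignments_per_tranche` and the no-show rule here are assumptions):

```python
def tranches_to_send(scenario, needed_approvals, assignments_per_tranche,
                     num_no_shows_per_candidate=0, max_tranches=89):
    """How many tranches a scenario drives into approval-distribution."""
    # Minimum tranches whose assignments cover needed_approvals (ceil div).
    minimum = -(-needed_approvals // assignments_per_tranche)
    if scenario == "best_case":
        return minimum
    if scenario == "no_shows":
        # Assumed: each no-show tranche forces waiting for one extra tranche.
        return minimum + num_no_shows_per_candidate
    if scenario == "throughput":
        return max_tranches  # worst case: send everything
    raise ValueError(f"unknown scenario: {scenario}")
```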

How to run it

cargo run -p polkadot-subsystem-bench --release -- test-sequence --path polkadot/node/subsystem-bench/examples/approvals_throughput.yaml

Evaluating performance

Use the real subsystems metrics

If you follow the steps in https://github.com/paritytech/polkadot-sdk/tree/master/polkadot/node/subsystem-bench#install-grafana for installing Prometheus and Grafana locally, all the real metrics for approval-distribution, approval-voting and the overseer are available. E.g.:

(Screenshots: Grafana dashboards showing the approval-distribution and approval-voting metrics.)

Profile with pyroscope

  1. Setup pyroscope following the steps in https://github.com/paritytech/polkadot-sdk/tree/master/polkadot/node/subsystem-bench#install-pyroscope, then run any of the benchmark scenarios with --profile as an argument.
  2. Open the pyroscope dashboard in grafana, e.g.:

(Screenshot: pyroscope flame graph in Grafana.)

Useful logs

  1. Network bandwidth requirements:
     Payload bytes received from peers: 503993 KiB total, 50399 KiB/block
     Payload bytes sent to peers: 629971 KiB total, 62997 KiB/block
  2. CPU usage by the approval-distribution/approval-voting subsystems:
     approval-distribution CPU usage 84.061s
     approval-distribution CPU usage per block 8.406s
     approval-voting CPU usage 96.532s
     approval-voting CPU usage per block 9.653s
  3. Time passed until a given block is approved:
     Chain selection approved after 3500 ms hash=0x0101010101010101010101010101010101010101010101010101010101010101
     Chain selection approved after 4500 ms hash=0x0202020202020202020202020202020202020202020202020202020202020202
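Log lines like the ones above can be scraped into numbers for comparing runs. A small parser over the exact formats shown (assumed stable only for this PR's output):

```python
import re

# Patterns match the benchmark's log lines quoted above.
CPU_RE = re.compile(r"(\S+) CPU usage per block ([\d.]+)s")
BYTES_RE = re.compile(
    r"Payload bytes (received from|sent to) peers: (\d+) KiB total, (\d+) KiB/block"
)

def parse_bench_log(text: str) -> dict:
    """Extract per-block CPU (s) and bandwidth (KiB/block) from benchmark output."""
    out = {"cpu_per_block": {}, "kib_per_block": {}}
    for subsystem, secs in CPU_RE.findall(text):
        out["cpu_per_block"][subsystem] = float(secs)
    for direction, _total, per_block in BYTES_RE.findall(text):
        key = "received" if direction.startswith("received") else "sent"
        out["kib_per_block"][key] = int(per_block)
    return out
```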

Using benchmark to quantify improvements from #1178 + #1191

Using a versi-node, we compare the scenario where all new optimisations are disabled with a scenario where tranche0 assignments are sent in a single message, plus a conservative simulation where the coalescing of approvals gives us just a 50% reduction in the number of messages we send.

Overall, what we see is a speedup of around 30-40% in the time it takes to process the necessary messages and a 30-40% reduction in the necessary bandwidth.

Best case scenario comparison (minimum required tranches sent).

Unoptimised

    Number of blocks: 10
    Payload bytes received from peers: 53289 KiB total, 5328 KiB/block
    Payload bytes sent to peers: 52489 KiB total, 5248 KiB/block
    approval-distribution CPU usage 6.732s
    approval-distribution CPU usage per block 0.673s
    approval-voting CPU usage 9.523s
    approval-voting CPU usage per block 0.952s

vs Optimisation enabled

   Number of blocks: 10
   Payload bytes received from peers: 32141 KiB total, 3214 KiB/block
   Payload bytes sent to peers: 37314 KiB total, 3731 KiB/block
   approval-distribution CPU usage 4.658s
   approval-distribution CPU usage per block 0.466s
   approval-voting CPU usage 6.236s
   approval-voting CPU usage per block 0.624s

Worst case: all tranches sent. This is very unlikely and happens only when sharding breaks.

Unoptimised

   Number of blocks: 10
   Payload bytes received from peers: 746393 KiB total, 74639 KiB/block
   Payload bytes sent to peers: 729151 KiB total, 72915 KiB/block
   approval-distribution CPU usage 118.681s
   approval-distribution CPU usage per block 11.868s
   approval-voting CPU usage 124.118s
   approval-voting CPU usage per block 12.412s

vs optimised

    Number of blocks: 10
    Payload bytes received from peers: 503993 KiB total, 50399 KiB/block
    Payload bytes sent to peers: 629971 KiB total, 62997 KiB/block
    approval-distribution CPU usage 84.061s
    approval-distribution CPU usage per block 8.406s
    approval-voting CPU usage 96.532s
    approval-voting CPU usage per block 9.653s
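The improvements can be re-derived from the figures above with a quick arithmetic check (all inputs copied verbatim from the two comparisons; note the reductions vary per metric, roughly 14-33% in the worst case and 29-40% in the best case):

```python
def reduction(before: float, after: float) -> float:
    """Percentage reduction from `before` to `after`, rounded to 0.1."""
    return round((1 - after / before) * 100, 1)

# Worst case (all tranches sent), unoptimised vs optimised:
worst_dist_cpu = reduction(118.681, 84.061)  # approval-distribution CPU
worst_vote_cpu = reduction(124.118, 96.532)  # approval-voting CPU
worst_recv_kib = reduction(746393, 503993)   # payload bytes received
worst_sent_kib = reduction(729151, 629971)   # payload bytes sent

# Best case (minimum tranches sent), unoptimised vs optimised:
best_dist_cpu = reduction(6.732, 4.658)
best_vote_cpu = reduction(9.523, 6.236)
best_recv_kib = reduction(53289, 32141)
best_sent_kib = reduction(52489, 37314)
```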

TODOs

[x] Polish implementation.
[x] Use what we have so far to evaluate #1191 before merging.
[x] List of features and additional dimensions we want to use for benchmarking.
[x] Run benchmark on hardware similar to versi and kusama nodes.
[ ] Add benchmark to be run in CI for catching regression in performance.
[ ] Rebase on latest changes for network emulation.

sandreim and others added 30 commits August 25, 2023 19:15
... the param was incorrectly appended to v9 instead of creating a new version as v10.

@alexggh alexggh removed request for athei and koute January 22, 2024 07:39
@alindima (Contributor) commented:

Question: are we aiming to first merge #2970 and then rebase this PR, or to first merge this PR into #2970 ?

@alexggh (Contributor, Author) commented Jan 23, 2024

> Question: are we aiming to first merge #2970 and then rebase this PR, or to first merge this PR into #2970 ?

First merge #2970 and then rebase this PR.

Base automatically changed from sandreim/availability-write-bench to master January 25, 2024 17:52
@alexggh alexggh added the R0-no-crate-publish-required The change does not require any crates to be re-published. label Jan 29, 2024
@sandreim (Contributor) left a comment:

LGTM!

@alexggh (Contributor, Author) commented Feb 2, 2024

Addressed all review feedback; once the CI passes, I will merge this PR.

@alexggh alexggh added this pull request to the merge queue Feb 5, 2024
Merged via the queue into master with commit f9f8868 Feb 5, 2024
@alexggh alexggh deleted the alexaggh/subsystem-bench-approvals branch February 5, 2024 07:27
@Polkadot-Forum commented:

This pull request has been mentioned on Polkadot Forum. There might be relevant details there:

https://forum.polkadot.network/t/what-are-subsystem-benchmarks/8212/1


Labels

- R0-no-crate-publish-required: The change does not require any crates to be re-published.
- T10-tests: This PR/Issue is related to tests.
- T12-benchmarks: This PR/Issue is related to benchmarking and weights.

Projects

Status: Completed


5 participants