
Add PVF execution priority#4837

Merged
AndreiEres merged 59 commits into master from AndreiEres/pvf-execution-priority
Oct 9, 2024

Conversation


AndreiEres (Contributor) commented Jun 19, 2024

Resolves #4632

The new logic optimizes the distribution of execution jobs for disputes, approvals, and backings. Testing shows improved finality lag and candidate checking times, especially under heavy network load.

Approach

This update adds prioritization to the PVF execution queue. The logic partially implements the suggestions from #4632 (comment).

We use thresholds to determine how much a current priority can "steal" from lower ones:

  • Disputes: 70%
  • Approvals: 80%
  • Backing System Parachains: 100%
  • Backing: 100%

A threshold indicates the portion of the remaining capacity that the current priority may claim; whatever it leaves over passes down to the lower priorities.

For example:

  • Disputes take 70%, leaving 30% for approvals and all backings.
  • Approvals then take 80% of the remainder, i.e. 30% * 80% = 24% of the original 100%.
  • If we instead allocated fixed parts of the original 100%, approvals could never take more than 24%, even when there are no disputes.
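The arithmetic above can be checked with a few lines of Rust (an illustration of the description, not the PR's actual code; the function name is made up): each priority claims its threshold percentage of whatever the higher priorities left over.

```rust
/// Illustrative only: compute each priority's effective share of the original
/// 100%, where every threshold applies to the remainder left by the higher
/// priorities.
fn effective_shares(thresholds: &[(&'static str, u32)]) -> Vec<(&'static str, u32)> {
    let mut remaining = 100u32;
    thresholds
        .iter()
        .map(|&(name, pct)| {
            let share = remaining * pct / 100;
            remaining -= share;
            (name, share)
        })
        .collect()
}
```

With the thresholds listed above this yields 70% for disputes, 24% for approvals, and 6% for backing system parachains. These are worst-case floors when every queue is busy; when a priority is absent, its share is freed up for the lower ones.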

Assuming a maximum of 12 executions per block, with a 6-second window, 2 CPU cores, and a 2-second run time, we get these distributions:

  • With disputes: 8 disputes, 3 approvals, 1 backing
  • Without disputes: 9 approvals, 3 backings
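Those distributions follow directly from the thresholds. A small sketch (an assumed helper, not the PR's code) reproduces the figures, with the last entry covering both backing classes together:

```rust
/// Illustrative only: split `slots` executions per block across priorities,
/// each taking its threshold percentage of what remains (integer division
/// rounds down).
fn distribute(slots: usize, thresholds: &[usize]) -> Vec<usize> {
    let mut remaining = slots;
    thresholds
        .iter()
        .map(|&pct| {
            let take = remaining * pct / 100;
            remaining -= take;
            take
        })
        .collect()
}
```

`distribute(12, &[70, 80, 100])` gives `[8, 3, 1]` (with disputes), and `distribute(12, &[80, 100])` gives `[9, 3]` (without).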

It's worth noting that when there are no disputes and there's only one backing job, we continue processing approvals regardless of whether they have already reached their threshold.

Versi Testing 40/20

Testing showed a slight difference in finality lag and candidate-checking time between this pull request and its base on the master branch, in favor of this pull request. The more loaded the network, the greater the observed difference.

Testing Parameters:

  • 40 validators (4 malicious)
  • 20 gluttons with 2 seconds of PVF execution time
  • 6 VRF modulo samples
  • 12 required approvals

[Graphs: finality lag and candidate-checking time for the 40/20 run]

Versi Testing 80/40

For this test, we compared the master branch with the branch from #5616. The second branch is based on the current one but additionally removes backing jobs that have exceeded their time limits. We excluded malicious nodes to reduce noise from disputing and banning validators. The results show that, under the same load, nodes experience less finality lag and reduced recovery and check times. Parachains also run with a shorter block time, although it remains over 6 seconds.

Testing Parameters:

  • 80 validators (0 malicious)
  • 40 gluttons with 2 seconds of PVF execution time
  • 6 VRF modulo samples
  • 30 required approvals

[Graphs: finality lag, recovery and check time for the 80/40 run]

@AndreiEres AndreiEres changed the title [WIP] PVF execution priority [WIP] Add PVF execution priority Jun 19, 2024
@AndreiEres AndreiEres added T0-node This PR/Issue is related to the topic “node”. T8-polkadot This PR/Issue is related to/affects the Polkadot network. labels Jun 19, 2024
@AndreiEres AndreiEres changed the title [WIP] Add PVF execution priority Add PVF execution priority Jun 20, 2024
@AndreiEres AndreiEres marked this pull request as ready for review June 20, 2024 10:32
AndreiEres and others added 6 commits June 26, 2024 11:55
Co-authored-by: Andrei Sandu <54316454+sandreim@users.noreply.github.com>
AndreiEres (Author) commented:

Updated the logic according to #4632 (comment), which looks reasonable.

alexggh (Contributor) left a comment:

First pass, good job, I like the simplicity, I left you some comments, nothing major.

As we discussed online, it would be nice if we could run some integration tests to understand how the system behaves in limit conditions. I think we can simulate that by making the PVF execution longer, so that we fall behind on work.

///
/// This system might seem complex, but we operate with the remaining percentages because:
/// - Not all job types are present in each block. If we used parts of the original 100%,
/// approvals could not exceed 24%, even if there are no disputes.
Contributor:

I have trouble parsing "even if there are no disputes" here. If there are no backing jobs, then approvals take like 80% of the total, right?

AndreiEres (Author) replied:

Do you mean that we will skip jobs if we only have approvals in the queue and they have reached 80%? If every current priority has reached its limit, we will select the highest one to execute. In that case, we will continue picking approvals in the absence of other jobs.
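The fallback rule described here can be sketched as follows (a hypothetical helper, not the queue's actual code): prefer the first priority that is both non-empty and under its limit; if every non-empty priority has hit its limit, pick the highest non-empty one.

```rust
/// Illustrative only. Each tuple is (has_queued_jobs, is_under_limit),
/// ordered from highest to lowest priority; returns the index to execute next.
fn select_priority(levels: &[(bool, bool)]) -> Option<usize> {
    levels
        .iter()
        .position(|&(queued, under_limit)| queued && under_limit)
        // Every non-empty level is at its limit: take the highest non-empty
        // one anyway, so e.g. approvals keep running when nothing else waits.
        .or_else(|| levels.iter().position(|&(queued, _)| queued))
}
```

This captures the behavior from the comment: a queue holding only approvals that have reached 80% still gets served, because no higher-priority work exists to claim the slot.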

/// - Not all job types are present in each block. If we used parts of the original 100%,
/// approvals could not exceed 24%, even if there are no disputes.
/// - We cannot fully prioritize backing system parachains over backing other parachains based
/// on the distribution of the original 100%.
Contributor:

Yeah, we might have some gradations among the different system parachains, so do whatever you think is best for now. AssetHub could maybe halt without killing anything, although this depends upon how it's used. Among the chains that must run "enough" during an epoch:

  • Sassafras tickets collection, if we ever moved it onto a parachain (maybe never).
  • Validator elections should maintain a fresh validator set, even if we've lost many for some reason.
  • If validator elections runs then we must run bridgehub enough, and any future DKG chains.
  • Some governance chain without smart contracts (collectives?).

Contributor:

Yeah, this is just a first step. I agree it needs to be reconsidered once we start to move governance and staking out of the relay chain.

However, I think that a better algorithm would actually target specific block times for the system parachains in times of overload.

Contributor:

Afaik, critical system parachains should typically not require numerous blocks, specific block times, etc., but some require like one block from each validator during the epoch, including enough slack so spam cannot censor them.

This assumes the critical system parachains cannot be tricked into wasting their own blocks, which likely becomes a risk only if the d-day governance supports smart contracts.

s0me0ne-unkn0wn (Contributor) left a comment:

Looks very good, thank you! 👍

Comment on lines +686 to +691
const PRIORITY_ALLOCATION_THRESHOLDS: &'static [(PvfExecPriority, usize)] = &[
(PvfExecPriority::Dispute, 70),
(PvfExecPriority::Approval, 80),
(PvfExecPriority::BackingSystemParas, 100),
(PvfExecPriority::Backing, 100),
];
Contributor:

How cool would it be to put that into the parachain host config? (Not in this PR ofc, just an idea)

AndreiEres (Author) commented Sep 25, 2024:

That's a good question. I believe we'll have more information to make a decision after using it in production for a while. @sandreim WDYT?

/// to separate and prioritize execution jobs by request type.
/// The order is important, because we iterate through the values and assume they go from highest
/// to lowest priority.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Hash, EnumIter)]
Member:

Given that we are already leaking to validation what we want to have validated here, we should likely keep the old name PvfExecKind and infer "priority" from this, together with other information. E.g. whether a validation is still worthwhile (ttl PR based on this one).
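A rough sketch of that suggestion (all names and the concrete mapping are assumptions for illustration, not the actual follow-up work): keep the enum describing *what* is being validated, and derive a scheduling priority from it plus extra context such as whether the job is still worthwhile.

```rust
/// Illustrative only: the request type stays purely descriptive.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PvfExecKind { Dispute, Approval, BackingSystemParas, Backing }

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Priority { Low, Normal, High }

/// Derive a priority from the kind plus context; a job that is no longer
/// worthwhile (e.g. a backing job past its TTL) gets no priority at all,
/// meaning it is dropped rather than queued.
fn priority_of(kind: PvfExecKind, still_worthwhile: bool) -> Option<Priority> {
    if !still_worthwhile {
        return None;
    }
    Some(match kind {
        PvfExecKind::Dispute | PvfExecKind::Approval => Priority::High,
        PvfExecKind::BackingSystemParas => Priority::Normal,
        PvfExecKind::Backing => Priority::Low,
    })
}
```

The design point is that validation callers only state the kind of request, while the execution queue owns the policy that turns kind plus context into a priority.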

@AndreiEres AndreiEres added this pull request to the merge queue Oct 9, 2024
Merged via the queue into master with commit e294d62 Oct 9, 2024
@AndreiEres AndreiEres deleted the AndreiEres/pvf-execution-priority branch October 9, 2024 17:44
ordian added a commit that referenced this pull request Oct 11, 2024
* master: (28 commits)
  `substrate-node`: removed excessive polkadot-sdk features (#5925)
  Rename QueueEvent::StartWork (#6015)
  [ci] Remove quick-benchmarks-omni from GitLab (#6014)
  Set larger timeout for cmd.yml (#6006)
  Fix `0003-beefy-and-mmr` test (#6003)
  Remove redundant XCMs from dry run's forwarded xcms (#5913)
  Add RadiumBlock bootnodes to Coretime Polkadot Chain spec (#5967)
  Bump strum from 0.26.2 to 0.26.3 (#5943)
  Add PVF execution priority (#4837)
  Snowbridge V2 docs (#5902)
  Fix u256 conversion in BABE (#5994)
  [ci] Move test-linux-stable-no-try-runtime to GHA (#5979)
  Bump PoV request timeout (#5924)
  [Release/CI] Github flow to build `polkadot`/`polkadot-parachain` rc binaries and deb package (#5963)
  [ci] Remove short-benchmarks from Gitlab (#5988)
  Disable flaky tests reported in 5972/5973/5974 (#5976)
  Bump some dependencies (#5886)
  bump zombienet version and set request for k8s (#5968)
  [omni-bencher] Make all runtimes work (#5872)
  Omni-Node renamings (#5915)
  ...

Labels

T0-node This PR/Issue is related to the topic “node”. T8-polkadot This PR/Issue is related to/affects the Polkadot network.

Projects

Status: Completed

Development

Successfully merging this pull request may close these issues.

PVF: prioritise execution depending on context

8 participants