Conversation
What priority is this? If it's just the order in the claim queue, we can for sure just keep a copy of that.

Why keep claims for candidates that have been blocked by backing? I don't think they will ever be accepted.

To simplify, we could keep the old way of buffering on max_candidate_depth for these (migrating it to scheduling_lookahead). The caveat being that they won't be able to use elastic scaling, but if they want to, they need to upgrade anyway.

This seems to be very related to the …
```diff
 // add entries in `per_relay_parent`. for all new relay-parents.
-for maybe_new in fresh_relay_parents {
+for maybe_new in fresh_relay_parents.into_iter().rev() {
```

```diff
 // get the ancestor of the relay block
```

we could get this info from the implicit view
```diff
 background_validation_tx: &mpsc::Sender<(Hash, ValidatedCandidateCommand)>,
 attesting: AttestingData,
-) -> Result<(), Error> {
+) -> Result<bool, Error> {
```

we should document what these booleans represent
```rust
claim_slot_for_seconded_statement(&statement, rp_state, &mut state.claim_queue_state)?;
```

this should be after `if let Err(Error::RejectedByProspectiveParachains) = res {`
Implements the `CollationManager` and the new collator protocol (validator side) subsystem. Issues #8182 and #7752. These are the big remaining parts which would enable us to test the entire implementation.

TODO:
- [ ] add a couple more unit tests (see the suggestions at the bottom of the tests file)
- [x] polish the `ClaimQueueState` and verify if it's sufficiently covered by unit tests - #10334 - #10368
- [x] add metrics and polish logs - #10730
- [x] add a CLI parameter for enabling the experimental subsystem (and remove the compile-time feature) -> #10285
- [x] implement registered paras update, using #9055
- [ ] do some manual zombienet tests with v1 protocol version and with restarting validators (including syncing with warp sync)
- [x] prdoc
- [x] Rollback
  - 03e8915
  - 05e1497

These commits were added just to run the CI tests for this PR with the new experimental protocol.

After merging:
- [ ] versi testing

Uses a slightly modified version of the `ClaimQueueState` written by @tdimitrov in #7114.

---------

Co-authored-by: Tsvetomir Dimitrov <tsvetomir@parity.io>
Co-authored-by: Serban Iorga <serban@parity.io>
Co-authored-by: Serban Iorga <serban300@gmail.com>
Co-authored-by: cmd[bot] <41898282+github-actions[bot]@users.noreply.github.com>
#4880 implements collation fairness in the collator protocol subsystem, but it has some drawbacks:
To overcome these problems, a solution was proposed in #5079 (comment). In a nutshell, the idea is to move collation tracking into the backing subsystem, since it has a better view of which candidates are being fetched and seconded. How does this solve the problems outlined above? One of the backing subsystem's goals is to keep track of all candidates that are getting backed, no matter whether they originate from the validator's own backing group or not. This fact naturally solves both problems. First, we can track candidates across group rotations since we have a single view of the claim queue. Second, we receive statements from all backing groups, so we can keep the local claim queue state relatively up to date.
Implementation notes
The PR touches the collator protocol (validator side), the backing subsystem and `ClaimQueueState` (introduced with #4880).

ClaimQueueState

`ClaimQueueState` was created on the fly in #4880. This means that each time we wanted to know the state of the claim queue, we built a `ClaimQueueState` instance. In this PR we want to keep a single view of the claim queue for each leaf and each core and continuously update it by adding new leaves and candidates and dropping old ones. For this reason `ClaimQueueState` is extended to keep track of the candidate hash occupying each slot. This is needed because we want to be able to drop claims for failed fetches, invalid candidates, etc. So `claimed` is no longer a `bool` but an enum.

Note that the ordering here is not perfect. Let's say there are candidates A1->A2->A3 for a parachain, built on top of each other. If we see them in the order A2, A3, A1, we will insert them in the claim queue in the same (wrong) order. This is fine because (a) we need the candidate hash only to be able to release claims and (b) the exact order of candidates is known by the prospective parachains subsystem.
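The actual enum is not reproduced here, but a minimal sketch of the per-slot state it could represent looks like this. All names (`ClaimState`, `release_if_pending`, the `CandidateHash` stand-in) are assumptions for illustration, not the PR's actual code:

```rust
// Hypothetical sketch of the per-slot claim state described above.
// Keeping the candidate hash per slot is what lets us release a claim
// when a fetch fails or a candidate turns out to be invalid.
type CandidateHash = [u8; 32]; // stand-in for the real hash type

#[derive(Debug, Clone, PartialEq)]
enum ClaimState {
    /// No candidate occupies this claim queue slot.
    Free,
    /// A candidate claimed the slot but has not been seconded yet
    /// (e.g. the fetch or validation is still in flight).
    Pending(CandidateHash),
    /// The candidate occupying the slot has been seconded.
    Seconded(CandidateHash),
}

impl ClaimState {
    /// Release the claim if it is held by `candidate` and still pending,
    /// e.g. after a failed fetch or an invalid candidate.
    fn release_if_pending(&mut self, candidate: &CandidateHash) {
        if matches!(self, ClaimState::Pending(c) if c == candidate) {
            *self = ClaimState::Free;
        }
    }
}

fn main() {
    let cand = [1u8; 32];
    let mut slot = ClaimState::Pending(cand);
    slot.release_if_pending(&cand); // fetch failed -> claim dropped
    assert_eq!(slot, ClaimState::Free);

    // A seconded claim is not released:
    let mut sec = ClaimState::Seconded(cand);
    sec.release_if_pending(&cand);
    assert_eq!(sec, ClaimState::Seconded(cand));
    println!("ok");
}
```

The key design point is that a `bool` cannot express "claimed by whom", which is exactly what is needed to drop the right claim later.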
The second change in this module is related to leaves. The claim queue has a different state for each fork and we need to be able to track this. For this reason `PerLeafClaimQueueState` is introduced, which wraps a `HashMap` of `ClaimQueueState` per leaf. The non-trivial logic here is adding a new leaf (`add_leaf`). We need to handle three different cases.

❗ All these cases are covered with unit tests, but I'd appreciate extra attention from the reviewers. ❗
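A rough sketch of the shape of this state, assuming the wrapper simply maps leaf hashes to per-leaf queue state (the field layout and the simplified `add_leaf` below are guesses; the real implementation distinguishes three cases that are collapsed here):

```rust
use std::collections::HashMap;

type Hash = [u8; 32]; // stand-in for the relay-chain block hash type

// Hypothetical sketch: each active leaf (fork head) owns its own copy of
// the claim queue state, because forks can diverge. `S` stands in for the
// real `ClaimQueueState`.
#[derive(Default)]
struct PerLeafClaimQueueState<S> {
    per_leaf: HashMap<Hash, S>,
}

impl<S: Clone + Default> PerLeafClaimQueueState<S> {
    /// Simplified `add_leaf`: when the new leaf extends a known parent we
    /// clone the parent's state, otherwise we start fresh.
    fn add_leaf(&mut self, leaf: Hash, parent: &Hash) {
        let state = self.per_leaf.get(parent).cloned().unwrap_or_default();
        self.per_leaf.insert(leaf, state);
    }

    /// Drop the state of a leaf that went out of view.
    fn remove_leaf(&mut self, leaf: &Hash) {
        self.per_leaf.remove(leaf);
    }
}

fn main() {
    let mut s: PerLeafClaimQueueState<Vec<u32>> = PerLeafClaimQueueState::default();
    let parent = [0u8; 32];
    let leaf = [1u8; 32];
    s.per_leaf.insert(parent, vec![1, 2]);
    s.add_leaf(leaf, &parent); // the fork head inherits the parent's queue
    assert_eq!(s.per_leaf[&leaf], vec![1, 2]);
    s.remove_leaf(&parent);
    assert!(!s.per_leaf.contains_key(&parent));
    println!("ok");
}
```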
backing subsystem
The backing subsystem has an instance of `PerLeafClaimQueueState` for each core and updates it by keeping track of new candidates. We have three different sets of candidates, and each group is handled differently.
To achieve all this, three new messages are introduced:

- `CanClaim` returns `true` if a claim can be made for a specific para id. This message is needed to handle collator protocol v1 advertisements, which don't contain a candidate hash. More on this in the next section.
- `PendingSlots` returns all pending claims from the claim queue.
- `DropClaims` drops claims for one or more candidates.

All these messages are used by the collator protocol and their application will be explained there.
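As a sketch, the three messages could be shaped roughly as below. The enum name, payload types and the omission of response channels are all assumptions; the real variants presumably carry a reply sender (e.g. a oneshot) for `CanClaim` and `PendingSlots`:

```rust
type ParaId = u32;             // stand-in for the real para id type
type CandidateHash = [u8; 32]; // stand-in for the real hash type

// Hypothetical shapes of the three new backing-subsystem messages.
// Response channels are omitted to keep the sketch self-contained.
#[derive(Debug, PartialEq)]
enum ClaimQueueMessage {
    /// Can a claim be made for this para id? (reply: bool)
    CanClaim(ParaId),
    /// Return all pending claims, ordered by priority.
    PendingSlots,
    /// Release the claims held by these candidates.
    DropClaims(Vec<CandidateHash>),
}

fn main() {
    // e.g. a failed fetch makes the collator protocol release the claim:
    let failed = [7u8; 32];
    let msg = ClaimQueueMessage::DropClaims(vec![failed]);
    assert_eq!(msg, ClaimQueueMessage::DropClaims(vec![[7u8; 32]]));
    println!("ok");
}
```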
In a nutshell, the following changes are made in backing: introduce `PerLeafClaimQueueState` and keep it up to date. This includes adding new leaves, removing stale leaves, making claims for candidates and releasing claims for bad candidates. These should be easy to spot in the PR.

collator protocol
Most controversial changes are in this subsystem. Since the backing subsystem keeps track of the claim queue state, the collator protocol no longer builds `ClaimQueueState` 'on the fly'. The price to pay is additional messages exchanged with backing.

Generally speaking the collator protocol follows these steps:

- If it is over protocol version 2+, a `CanSecond` message is sent to the backing subsystem. The latter gets the leaves where the candidate can be seconded and makes a claim for each leaf. If a claim can't be made, the leaf is dropped.
- If it is over protocol version 1, we can't claim a slot in the claim queue since we don't know the candidate hash. I opted not to support generic claims since they complicate the implementation, and hopefully we will deprecate v1 at some point. So what we do is send `CanClaim` to the backing subsystem and get a confirmation that at this point there is a free spot for the candidate in the claim queue. There is a potential race here, because the spot is not claimed and can be occupied by another candidate while we do the fetch. I think this risk is acceptable; worst case we might have a few extra candidates.
- On a successful fetch the candidate is sent with `Seconded` to backing and the claim is marked as `Seconded`. If the fetch is unsuccessful we use `DropClaims` to notify backing that the claim should be released.
- `PendingSlots` returns all claims in pending state (these are all pending fetches). We see if we have got a pending candidate for this para id and fetch it. The claims in the result are ordered by priority.

The collator protocol also keeps track of candidates which are blocked by backing (`CanSecond` has returned a `BlockedByBacking` error for them). These candidates have pending claims, and if their relay parent goes out of view they are dropped.

TODOs
Testing for this PR is still a work in progress, but functionally I consider it done.