collator-protocol-revamp: CollationManager and subsystem impl#8541

Merged
tdimitrov merged 159 commits into master from alindima/collator-protocol-revamp-collation-manager
Feb 4, 2026

Conversation

@alindima
Contributor

@alindima alindima commented May 15, 2025

Implements the CollationManager and the new collator protocol (validator side) subsystem.

Issues #8182 and #7752.

These are the big remaining parts which would enable us to test the entire implementation.

TODO:

After merging:

  • versi testing

Uses a slightly modified version of the ClaimQueueState written by @tdimitrov in #7114.

alindima added 30 commits April 3, 2025 11:18
…p-peer-manager' into alindima/collator-protocol-revamp-reputation-db-draft
…p-reputation-db-draft' into alindima/collator-protocol-revamp-collation-manager
Contributor Author

@alindima alindima left a comment

LGTM, I'm happy with this. Some more metrics and tests could be added, but they can be follow-ups.

Cannot approve since I'm the original author, but consider this my approval :D

mod tests;

pub use metrics::Metrics;
pub use crate::validator_side_metrics::Metrics;
Contributor Author

We will probably need to add different metrics depending on the subsystem variant. It's good to have the common ones deduplicated, but I'd also add some more for the new version.

Contributor Author

I'd like us to have finer-grained metrics, including timers for all operations. But this can be a follow-up, to avoid keeping this monster open forever.

Contributor

I preferred not to add metrics which we won't use so I just transferred what we already have.
I agree in tough situations extra metrics will be useful so it can be a followup - I didn't close #10402 for this reason.

handle_collation_request_result: prometheus::Histogram,
collator_peer_count: prometheus::Gauge<prometheus::U64>,
collation_request_duration: prometheus::Histogram,
// TODO: Not available for the new implementation. Remove with the old implementation.
Contributor Author

Conceptually, requesting unblocked collations is present in both variants.

doc:
  - audience: Node Operator
    description: |-
      This PR adds a new collator protocol (validator side) subsystem.
Contributor Author

Probably deserves a bit more :D

Member

@eskimor eskimor left a comment

The following improvements can be a followup, but the constants should really be fixed/properly argued why they make sense.

Follow up improvements (not for this PR, but so that they are noted somewhere):

  • Parallel fetch after some small timeout: Is implemented in legacy implementation, should be brought back & improved. This is a fix for us not having proper streaming and can help a great deal to mitigate impact on fetch attacks - @tdimitrov knows details.
  • Negative reputation bump on fetch problems: We might not be able to punish hard for single issues (as network issues can happen to honest nodes - measurements would be good), but if implemented properly (e.g. above parallel fetch) any real harm will only come from coordinated attacks, thus we should look into punishing harder on coordination.
  • Race condition fix
  • Timer should start at leaf activation. Parallel fetches should likely alter that behavior even more (we can likely be more aggressive in fetching if we have parallel fetches): instead of waiting and doing nothing, we might as well fetch what is there, as long as we can still fetch more when it arrives.
  • Possibly others from my chat with @tdimitrov

Fix for this PR: Get constants in order - or have a proper argument why they are good as is.

/// saturated to this value.
pub const MAX_SCORE: u16 = 35_000;
/// Reputation bump for getting a valid candidate included in a finalized block.
pub const VALID_INCLUDED_CANDIDATE_BUMP: u16 = 100;
Member

This is not what we agreed on. OK, I just found the reasoning above for the value. It only argues in one dimension (against the inactivity decay, which seems to be causing more problems than it solves), but not along the other, more important axis: the relationship to negative reputation changes. With regard to those, this value is completely off.
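
To make the concern concrete, here is a minimal sketch of how a saturating score built on these two constants relates to a penalty. `MAX_SCORE` and `VALID_INCLUDED_CANDIDATE_BUMP` are the values quoted above; the `Score` type, its methods, `bumps_to_recover`, and the penalty of 1,000 are assumptions made purely for illustration and are not taken from the PR.

```rust
/// Values quoted in the diff under review.
pub const MAX_SCORE: u16 = 35_000;
pub const VALID_INCLUDED_CANDIDATE_BUMP: u16 = 100;

/// Hypothetical penalty, chosen only for illustration (not a value from the PR).
pub const HYPOTHETICAL_PENALTY: u16 = 1_000;

/// Hypothetical reputation score that saturates at `MAX_SCORE` and at zero.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Score(u16);

impl Score {
    /// Create a score, clamping it to `MAX_SCORE`.
    pub fn new(value: u16) -> Self {
        Score(value.min(MAX_SCORE))
    }

    /// Positive change: never exceeds `MAX_SCORE`.
    pub fn bump(self, by: u16) -> Self {
        Score(self.0.saturating_add(by).min(MAX_SCORE))
    }

    /// Negative change: never drops below zero.
    pub fn slash(self, by: u16) -> Self {
        Score(self.0.saturating_sub(by))
    }

    pub fn value(self) -> u16 {
        self.0
    }
}

/// Number of included candidates needed to offset a penalty (rounded up).
pub fn bumps_to_recover(penalty: u16) -> u16 {
    let bump = u32::from(VALID_INCLUDED_CANDIDATE_BUMP);
    // Ceiling division in u32 so the intermediate sum cannot overflow.
    ((u32::from(penalty) + bump - 1) / bump) as u16
}
```

Under these assumptions, a single 1,000-point penalty takes ten included candidates to offset. That ratio between positive and negative changes is the relationship the comment asks to be argued explicitly.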

Member

@eskimor eskimor left a comment

On second thought, let's get this PR merged finally. Any further fixes can come in a followup.

@alindima
Contributor Author

alindima commented Feb 2, 2026

Parallel fetch after some small timeout: Is implemented in legacy implementation, should be brought back & improved. This is a fix for us not having proper streaming and can help a great deal to mitigate impact on fetch attacks - @tdimitrov knows details.

Parallel fetch for the same claim queue spot? We do fetch in parallel if there are multiple claim queue slots available. This is essentially the same as in the old implementation.

@tdimitrov
Contributor

Parallel fetch for the same claim queue spot?

Yes

@paritytech-workflow-stopper

All GitHub workflows were cancelled due to the failure of one of the required jobs.
Failed workflow url: https://github.com/paritytech/polkadot-sdk/actions/runs/21588396729
Failed job name: fmt

@tdimitrov
Contributor

/cmd fmt

@tdimitrov tdimitrov enabled auto-merge February 2, 2026 13:52
@tdimitrov tdimitrov added this pull request to the merge queue Feb 4, 2026
Merged via the queue into master with commit a40ab3c Feb 4, 2026
308 of 311 checks passed
@tdimitrov tdimitrov deleted the alindima/collator-protocol-revamp-collation-manager branch February 4, 2026 10:03
github-merge-queue bot pushed a commit that referenced this pull request Feb 27, 2026
A followup from #8541
with changes requested by @eskimor:
- Adjust the protocol parameters and add comments about the picked
values
- Simpler fetch mechanism - advertisements from unknown collators (those
with 0 reputation) are delayed. Everything else is fetched immediately.

---------

Co-authored-by: cmd[bot] <41898282+github-actions[bot]@users.noreply.github.com>
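
The simpler fetch rule described in this commit message can be sketched roughly as follows. This is an illustration only: `FetchDecision`, `fetch_decision`, and the 300 ms delay are hypothetical names and values, not the ones used in the follow-up change.

```rust
use std::time::Duration;

/// Hypothetical delay for unknown collators; the real value is a protocol
/// parameter that is not quoted in this conversation.
pub const HYPOTHETICAL_UNKNOWN_COLLATOR_DELAY: Duration = Duration::from_millis(300);

/// Hypothetical outcome of handling an advertisement.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FetchDecision {
    /// Request the collation right away.
    Immediate,
    /// Hold the advertisement for the given duration before requesting it.
    Delayed(Duration),
}

/// Advertisements from collators with 0 reputation are delayed;
/// everything else is fetched immediately.
pub fn fetch_decision(reputation: u16) -> FetchDecision {
    if reputation == 0 {
        FetchDecision::Delayed(HYPOTHETICAL_UNKNOWN_COLLATOR_DELAY)
    } else {
        FetchDecision::Immediate
    }
}
```

In this sketch the decision depends only on whether the collator has any reputation at all, which matches the "unknown collators are delayed" wording above; how the delay interacts with claim queue slots is left out.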
tdimitrov added a commit that referenced this pull request Feb 27, 2026
A followup from #8541
with changes requested by @eskimor:
- Adjust the protocol parameters and add comments about the picked
values
- Simpler fetch mechanism - advertisements from unknown collators (those
with 0 reputation) are delayed. Everything else is fetched immediately.

---------

Co-authored-by: cmd[bot] <41898282+github-actions[bot]@users.noreply.github.com>
(cherry picked from commit 26bb41a)
Labels

T8-polkadot: This PR/Issue is related to/affects the Polkadot network.
T18-zombienet_tests: Trigger zombienet CI tests.

Projects

None yet

5 participants