collator-protocol-revamp: CollationManager and subsystem impl#8541
…variant for the validator side
…rotocol-revamp-peer-manager
…p-peer-manager' into alindima/collator-protocol-revamp-reputation-db-draft
…p-reputation-db-draft' into alindima/collator-protocol-revamp-collation-manager
alindima
left a comment
LGTM, I'm happy with this. Some more metrics and tests could be added, but those can be follow-ups.
Cannot approve since I'm the original author, but consider this my approval :D
| mod tests; | ||
|
|
||
| pub use metrics::Metrics; | ||
| pub use crate::validator_side_metrics::Metrics; |
We will probably need to add different metrics depending on the subsystem variant. It's good to have the common ones deduplicated, but I'd also add some more for the new version.
I'd like us to have finer-grained metrics, including timers for all operations. But this can be a follow-up, to avoid keeping this monster open forever.
I preferred not to add metrics which we won't use so I just transferred what we already have.
I agree in tough situations extra metrics will be useful so it can be a followup - I didn't close #10402 for this reason.
| handle_collation_request_result: prometheus::Histogram, | ||
| collator_peer_count: prometheus::Gauge<prometheus::U64>, | ||
| collation_request_duration: prometheus::Histogram, | ||
| // TODO: Not available for the new implementation. Remove with the old implementation. |
Conceptually, requesting unblocked collations is present in both variants.
polkadot/node/network/collator-protocol/src/validator_side_experimental/common.rs
prdoc/pr_8541.prdoc
| doc: | ||
| - audience: Node Operator | ||
| description: |- | ||
| This PR adds a new collator protocol (validator side) subsystem. |
Probably deserves a bit more :D
...adot/node/network/collator-protocol/src/validator_side_experimental/collation_manager/mod.rs
polkadot/node/network/collator-protocol/src/validator_side/mod.rs
…erimental/collation_manager/mod.rs Co-authored-by: Alin Dima <[email protected]>
eskimor
left a comment
The following improvements can be a follow-up, but the constants should really be fixed, or there should be a proper argument for why they make sense.
Follow up improvements (not for this PR, but so that they are noted somewhere):
- Parallel fetch after some small timeout: This is implemented in the legacy implementation and should be brought back and improved. It is a fix for us not having proper streaming and can help a great deal to mitigate the impact of fetch attacks - @tdimitrov knows details.
- Negative reputation bump on fetch problems: We might not be able to punish hard for single issues (as network issues can happen to honest nodes - measurements would be good), but if implemented properly (e.g. above parallel fetch) any real harm will only come from coordinated attacks, thus we should look into punishing harder on coordination.
- Race condition fix
- Timer should start at leaf activation - parallel fetches should likely alter that behavior even more (we can likely be more aggressive in fetching if we have parallel fetches) - instead of waiting and doing nothing, we might as well fetch what is there, as long as we can still fetch more when it arrives.
- Possibly others from my chat with @tdimitrov
Fix for this PR: Get constants in order - or have a proper argument why they are good as is.
| /// saturated to this value. | ||
| pub const MAX_SCORE: u16 = 35_000; | ||
| /// Reputation bump for getting a valid candidate included in a finalized block. | ||
| pub const VALID_INCLUDED_CANDIDATE_BUMP: u16 = 100; |
This is not what we agreed on. OK, I just found the reasoning above for the value: it only argues in one dimension (against the inactivity decay, which seems to be causing more problems than it solves), but not along the other, more important axis, which is the relationship to negative reputation changes. With regard to those, this value is completely off.
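To make the discussion of the two constants concrete, here is a minimal sketch of how a positive bump interacts with the cap. Only `MAX_SCORE`, `VALID_INCLUDED_CANDIDATE_BUMP` and their values come from the diff above; `apply_bump` and `bumps_to_cap` are hypothetical helpers written for illustration and are not taken from the PR.

```rust
/// Values quoted from the diff under review.
pub const MAX_SCORE: u16 = 35_000;
pub const VALID_INCLUDED_CANDIDATE_BUMP: u16 = 100;

/// Hypothetical helper (not from the PR): apply a positive bump and
/// saturate the result at `MAX_SCORE`.
pub fn apply_bump(score: u16, bump: u16) -> u16 {
    // `saturating_add` guards against u16 overflow, `min` enforces the cap.
    score.saturating_add(bump).min(MAX_SCORE)
}

/// Hypothetical helper: how many included candidates a collator that
/// starts at score 0 needs before it reaches the cap (rounded up).
pub fn bumps_to_cap() -> u16 {
    (MAX_SCORE + VALID_INCLUDED_CANDIDATE_BUMP - 1) / VALID_INCLUDED_CANDIDATE_BUMP
}
```

Under these assumptions a collator starting from zero needs 350 included candidates to reach the cap. How that compares with the size of the negative reputation changes is the open question raised in the comment above.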
eskimor
left a comment
On second thought, let's get this PR merged finally. Any further fixes can come in a followup.
Parallel fetch for the same claim queue spot? We do fetch in parallel if there are multiple claim queue slots available. This is essentially the same as in the old implementation.
Yes
All GitHub workflows were cancelled due to the failure of one of the required jobs.
/cmd fmt
A followup from #8541 with changes requested by @eskimor:
- Adjust the protocol parameters and add comments about the picked values
- Simpler fetch mechanism - advertisements from unknown collators (those with 0 reputation) are delayed. Everything else is fetched immediately.
Co-authored-by: cmd[bot] <41898282+github-actions[bot]@users.noreply.github.com>
(cherry picked from commit 26bb41a)
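The simplified fetch rule described in the followup can be sketched roughly as below. This is an illustration only: the `FetchDecision` type, the `decide_fetch` function and the `UNKNOWN_COLLATOR_DELAY` value are assumptions made for this example and are not names or parameters taken from the PR. The only rule taken from the text is that a reputation of 0 means "unknown collator, delay the fetch" and anything else is fetched immediately.

```rust
use std::time::Duration;

/// What to do with an incoming advertisement (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum FetchDecision {
    /// Start fetching right away.
    Immediately,
    /// Wait for the given duration before fetching.
    DelayedBy(Duration),
}

/// Placeholder delay chosen for this sketch; the real value is a
/// protocol parameter that is not shown in this conversation.
pub const UNKNOWN_COLLATOR_DELAY: Duration = Duration::from_millis(500);

/// Collators with a reputation of 0 are treated as unknown and their
/// advertisements are delayed; everything else is fetched immediately.
pub fn decide_fetch(reputation: u16) -> FetchDecision {
    if reputation == 0 {
        FetchDecision::DelayedBy(UNKNOWN_COLLATOR_DELAY)
    } else {
        FetchDecision::Immediately
    }
}
```

For example, `decide_fetch(0)` yields a delayed fetch, while `decide_fetch(1)` or any higher score yields `FetchDecision::Immediately`.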
Implements the CollationManager and the new collator protocol (validator side) subsystem.
Issues #8182 and #7752.
These are the big remaining parts which would enable us to test the entire implementation.
TODO:
- ClaimQueueState cosmetics #10334
- --experimental-collator-protocol cli argument to enable the new collator protocol implementation #10285
- para_ids Runtime API #9055
These commits were added just to run the CI tests for this PR with the new experimental protocol.
After merging:
Uses a slightly modified version of the ClaimQueueState written by @tdimitrov in #7114.