Conversation
What priority is this? If it's just the order in the claim queue, we can for sure just keep a copy of that.

Why keep claims for candidates that have been blocked by backing? I don't think they will ever be accepted.

To simplify, we could keep the old way of buffering on max_candidate_depth for these (migrating it to scheduling_lookahead). The caveat being that they won't be able to use elastic scaling, but if they want to, they need to upgrade anyway.

This seems to be very related to the …
```diff
 // add entries in `per_relay_parent`. for all new relay-parents.
-for maybe_new in fresh_relay_parents {
+for maybe_new in fresh_relay_parents.into_iter().rev() {
```

```diff
 // get the ancestor of the relay block
```

we could get this info from the implicit view
```diff
 background_validation_tx: &mpsc::Sender<(Hash, ValidatedCandidateCommand)>,
 attesting: AttestingData,
-) -> Result<(), Error> {
+) -> Result<bool, Error> {
```

we should document what these booleans represent
```rust
claim_slot_for_seconded_statement(&statement, rp_state, &mut state.claim_queue_state)?;
```

this should be after `if let Err(Error::RejectedByProspectiveParachains) = res {`
Implements the `CollationManager` and the new collator protocol (validator side) subsystem. Issues #8182 and #7752. These are the big remaining parts which would enable us to test the entire implementation.

TODO:
- [ ] add a couple more unit tests (see the suggestions at the bottom of the tests file)
- [x] polish the `ClaimQueueState` and verify if it's sufficiently covered by unit tests - #10334 - #10368
- [x] add metrics and polish logs - #10730
- [x] add a CLI parameter for enabling the experimental subsystem (and remove the compile-time feature) -> #10285
- [x] implement registered paras update, using #9055
- [ ] do some manual zombienet tests with v1 protocol version and with restarting validators (including syncing with warp sync)
- [x] prdoc
- [x] Rollback
  - 03e8915
  - 05e1497

These commits were added just to run the CI tests for this PR with the new experimental protocol.

After merging:
- [ ] versi testing

Uses a slightly modified version of the `ClaimQueueState` written by @tdimitrov in #7114.

---------

Co-authored-by: Tsvetomir Dimitrov <tsvetomir@parity.io>
Co-authored-by: Serban Iorga <serban@parity.io>
Co-authored-by: Serban Iorga <serban300@gmail.com>
Co-authored-by: cmd[bot] <41898282+github-actions[bot]@users.noreply.github.com>
#4880 implements collation fairness in the collator protocol subsystem, but it has some drawbacks:
To overcome these problems, a solution was proposed in #5079 (comment). In a nutshell, the idea is to move collation tracking into the backing subsystem, since it has a better view of which candidates are being fetched and seconded. How does this solve the problems outlined above? One of the backing subsystem's goals is to keep track of all candidates that are getting backed, no matter whether they originate from the validator's own backing group or not. This fact naturally solves both problems. First, we can track candidates across group rotations since we have a single view of the claim queue. Second, we receive statements from all backing groups, so we can keep the local claim queue state relatively up to date.
Implementation notes
The PR touches the collator protocol (validator side), the backing subsystem and `ClaimQueueState` (introduced with #4880).

ClaimQueueState

`ClaimQueueState` was created on the fly in #4880. This means that each time we wanted to know the state of the claim queue, we built a `ClaimQueueState` instance. In this PR we want to keep a single view of the claim queue for each leaf and each core and continuously update it by adding new leaves and candidates and dropping old ones. For this reason `ClaimQueueState` is extended to keep track of the candidate hash occupying each slot. This is needed because we want to be able to drop claims for failed fetches, invalid candidates, etc. So `claimed` is no longer a `bool` but an enum.

Note that the ordering here is not perfect. Let's say there are candidates A1->A2->A3 for a parachain, built on top of each other. If we see them in the order A2, A3, A1, we will insert them in the claim queue in the same (wrong) order. This is fine because (a) we need the candidate hash only to be able to release claims and (b) the exact order of candidates is known by the prospective parachains subsystem.
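The actual enum is not reproduced here, but a minimal sketch of the per-slot state it could represent looks like this. All names (`ClaimState`, `release_if_pending`, the `CandidateHash` stand-in) are assumptions for illustration, not the PR's actual code:

```rust
// Hypothetical sketch of the per-slot claim state described above.
// Keeping the candidate hash per slot is what lets us release a claim
// when a fetch fails or a candidate turns out to be invalid.
type CandidateHash = [u8; 32]; // stand-in for the real hash type

#[derive(Debug, Clone, PartialEq)]
enum ClaimState {
    /// No candidate occupies this claim queue slot.
    Free,
    /// A candidate claimed the slot but has not been seconded yet
    /// (e.g. the fetch or validation is still in flight).
    Pending(CandidateHash),
    /// The candidate occupying the slot has been seconded.
    Seconded(CandidateHash),
}

impl ClaimState {
    /// Release the claim if it is held by `candidate` and still pending,
    /// e.g. after a failed fetch or an invalid candidate.
    fn release_if_pending(&mut self, candidate: &CandidateHash) {
        if matches!(self, ClaimState::Pending(c) if c == candidate) {
            *self = ClaimState::Free;
        }
    }
}

fn main() {
    let cand = [1u8; 32];
    let mut slot = ClaimState::Pending(cand);
    slot.release_if_pending(&cand); // fetch failed -> claim dropped
    assert_eq!(slot, ClaimState::Free);

    // A seconded claim is not released:
    let mut sec = ClaimState::Seconded(cand);
    sec.release_if_pending(&cand);
    assert_eq!(sec, ClaimState::Seconded(cand));
    println!("ok");
}
```

The key design point is that a `bool` cannot express "claimed by whom", which is exactly what is needed to drop the right claim later.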
The second change in this module is related to leaves. The claim queue has a different state for each fork and we need to be able to track this. For this reason `PerLeafClaimQueueState` is introduced, which wraps a `HashMap` of `ClaimQueueState` per leaf. The non-trivial logic here is adding a new leaf (`add_leaf`). We need to handle three different cases.

❗ All these cases are covered with unit tests, but I'd appreciate extra attention from the reviewers. ❗
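A rough sketch of the shape of this state, assuming the wrapper simply maps leaf hashes to per-leaf queue state (the field layout and the simplified `add_leaf` below are guesses; the real implementation distinguishes three cases that are collapsed here):

```rust
use std::collections::HashMap;

type Hash = [u8; 32]; // stand-in for the relay-chain block hash type

// Hypothetical sketch: each active leaf (fork head) owns its own copy of
// the claim queue state, because forks can diverge. `S` stands in for the
// real `ClaimQueueState`.
#[derive(Default)]
struct PerLeafClaimQueueState<S> {
    per_leaf: HashMap<Hash, S>,
}

impl<S: Clone + Default> PerLeafClaimQueueState<S> {
    /// Simplified `add_leaf`: when the new leaf extends a known parent we
    /// clone the parent's state, otherwise we start fresh.
    fn add_leaf(&mut self, leaf: Hash, parent: &Hash) {
        let state = self.per_leaf.get(parent).cloned().unwrap_or_default();
        self.per_leaf.insert(leaf, state);
    }

    /// Drop the state of a leaf that went out of view.
    fn remove_leaf(&mut self, leaf: &Hash) {
        self.per_leaf.remove(leaf);
    }
}

fn main() {
    let mut s: PerLeafClaimQueueState<Vec<u32>> = PerLeafClaimQueueState::default();
    let parent = [0u8; 32];
    let leaf = [1u8; 32];
    s.per_leaf.insert(parent, vec![1, 2]);
    s.add_leaf(leaf, &parent); // the fork head inherits the parent's queue
    assert_eq!(s.per_leaf[&leaf], vec![1, 2]);
    s.remove_leaf(&parent);
    assert!(!s.per_leaf.contains_key(&parent));
    println!("ok");
}
```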
backing subsystem
The backing subsystem has an instance of `PerLeafClaimQueueState` for each core and updates it by keeping track of new candidates. We have three different sets of candidates, and each group is handled differently.
To achieve all this, three new messages are introduced:

- `CanClaim` returns `true` if a claim can be made for a specific para id. This message is needed to handle collator protocol v1 advertisements, which don't contain a candidate hash. More on this in the next section.
- `PendingSlots` returns all pending claims from the claim queue.
- `DropClaims` drops claims for one or more candidates.

All these messages are used by the collator protocol and their application will be explained there.
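As a sketch, the three messages could be shaped roughly as below. The enum name, payload types and the omission of response channels are all assumptions; the real variants presumably carry a reply sender (e.g. a oneshot) for `CanClaim` and `PendingSlots`:

```rust
type ParaId = u32;             // stand-in for the real para id type
type CandidateHash = [u8; 32]; // stand-in for the real hash type

// Hypothetical shapes of the three new backing-subsystem messages.
// Response channels are omitted to keep the sketch self-contained.
#[derive(Debug, PartialEq)]
enum ClaimQueueMessage {
    /// Can a claim be made for this para id? (reply: bool)
    CanClaim(ParaId),
    /// Return all pending claims, ordered by priority.
    PendingSlots,
    /// Release the claims held by these candidates.
    DropClaims(Vec<CandidateHash>),
}

fn main() {
    // e.g. a failed fetch makes the collator protocol release the claim:
    let failed = [7u8; 32];
    let msg = ClaimQueueMessage::DropClaims(vec![failed]);
    assert_eq!(msg, ClaimQueueMessage::DropClaims(vec![[7u8; 32]]));
    println!("ok");
}
```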
In a nutshell, the following changes are made in backing: introduce `PerLeafClaimQueueState` and keep it up to date. This includes adding new leaves, removing stale leaves, making claims for candidates and releasing claims for bad candidates. These should be easy to spot in the PR.

collator protocol
Most controversial changes are in this subsystem. Since the backing subsystem keeps track of the claim queue state, the collator protocol no longer builds `ClaimQueueState` 'on the fly'. The price to pay is additional messages exchanged with backing.

Generally speaking the collator protocol follows these steps:

- If it is over protocol version 2+, a `CanSecond` message is sent to the backing subsystem. The latter gets the leaves where the candidate can be seconded and makes a claim for each leaf. If a claim can't be made, the leaf is dropped.
- If it is over protocol version 1, we can't claim a slot in the claim queue since we don't know the candidate hash. I opted not to support generic claims since they complicate the implementation, and hopefully we will deprecate v1 at some point. So what we do is send `CanClaim` to the backing subsystem and get a confirmation that at this point there is a free spot for the candidate in the claim queue. There is a potential race here, because the spot is not claimed and can be occupied by another candidate while we do the fetch. I think this risk is acceptable; worst case we might have a few extra candidates.
- On a successful fetch the candidate is sent with `Seconded` to backing and the claim is marked as `Seconded`. If the fetch is unsuccessful we use `DropClaims` to notify backing that the claim should be released.
- `PendingSlots` returns all claims in pending state (these are all pending fetches). We see if we have got a pending candidate for this para id and fetch it. The claims in the result are ordered by priority.

The collator protocol also keeps track of candidates which are blocked by backing (`CanSecond` has returned a `BlockedByBacking` error for them). These candidates have pending claims, and if their relay parent goes out of view they are dropped.

TODOs
Testing for this PR is still a work in progress, but functionally I consider it done.