elastic scaling: add core selector to cumulus (#5372)
Conversation
/// The default claim queue offset to be used if it's not configured/accessible in the parachain
/// runtime
pub const DEFAULT_CLAIM_QUEUE_OFFSET: u8 = 1;
open question: should this be 0 or 1.
0 would preserve backwards compatibility with the current collators (they are all building right at the top of the claim queue).
1 would be what core-sharing parachains should use.
1 should be the default value. With 0 they can't fully use the 2s of execution if the core is shared, and I can't see any reason why this would be useful in practice.
I think the default needs to be configurable by the parachain, or we derive it from the async backing settings. In other words, either we can make a default that does the right thing for all chains, or we should not provide one and instead make it configurable, with a simple doc explaining when to use what (async backing/6s block times yet or not?).
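To make the trade-off concrete, here is a hedged sketch of the "configurable / derived from async backing settings" idea. Everything here is hypothetical: the function name, the `configured` override, and the rule that non-async-backing chains keep offset 0 are illustrative, not the actual cumulus configuration surface.

```rust
/// Default used by core-sharing, async-backing parachains (the quoted constant).
pub const DEFAULT_CLAIM_QUEUE_OFFSET: u8 = 1;

/// Pick a claim queue offset: an explicit runtime configuration wins;
/// otherwise chains without async backing keep building at the top of the
/// claim queue (offset 0, the backwards-compatible behaviour).
pub fn claim_queue_offset(async_backing_enabled: bool, configured: Option<u8>) -> u8 {
    configured.unwrap_or(if async_backing_enabled {
        DEFAULT_CLAIM_QUEUE_OFFSET
    } else {
        0
    })
}
```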
Can you please explain why we need this? AFAIR the parachain runtime does not really care on which core it is built; it is more about ensuring that someone is not sending a block built for core X to be included on core Y.
IMO passing a pre-digest that contains the
The RFC has more details about it. @sandreim may have more inside info, but from my understanding: the parachain runtime will indirectly specify the core on which the candidate is valid (to prevent the situation you mentioned). The way it specifies the core is by sending a UMP message with a sequence number (which can be the parachain block number) and a claim queue offset. The relay chain and the collator node will then use this info to determine the core index. This was done to avoid passing the claim queue snapshot into the parachain's inherent data. So indeed the parachain doesn't really care on which core it is built, but it needs to somehow make sure that it commits to one core. That's one reason why we don't specify the core index directly in the UMP message.
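The derivation described above can be sketched as follows, per RFC 103: the parachain commits to a core *selector* (e.g. derived from its block number) plus a claim queue offset, and the relay chain / collator resolves the concrete core index from its claim queue view. Types and the function name are illustrative, not the actual cumulus/polkadot API.

```rust
type CoreIndex = u32;

/// Selector the parachain commits to (wraps at 256).
pub struct CoreSelector(pub u8);

/// Resolve the committed core, given the para's assigned cores at the
/// committed claim queue offset.
pub fn resolve_core(
    assigned_cores_at_offset: &[CoreIndex],
    selector: CoreSelector,
) -> Option<CoreIndex> {
    if assigned_cores_at_offset.is_empty() {
        return None;
    }
    // Wrapping the selector around the number of assigned cores rotates
    // consecutive parachain blocks through all assigned cores.
    let idx = selector.0 as usize % assigned_cores_at_offset.len();
    Some(assigned_cores_at_offset[idx])
}
```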
1 should be the default value. With 0 they can't fully use the 2s of execution if the core is shared, and I can't see any reason why this would be useful in practice.
Can you detail how it would work? The only way the runtime could know the exact core index would require seeing the claim queue assignments, and that is more expensive in terms of storage proof compared to the current proposal.
The block author just provides a pre-digest, like for example the BABE randomness or the AURA author index. The runtime can read it from there, and from there it will be able to expose the information at validation. I don't see the need for having the parachain runtime know anything about the claim queue.
I am not familiar with that code, but I will look into it. If we do this, does the collator actually have any control over the core index? Also, how would the parachain runtime verify that the core index is correct?
I first thought it doesn't need to, but we need to have at least the selector. I mean the runtime doesn't need to care about the index as long as it outputs the correct index. I left some questions on the RFC; I think if they are answered, I'll have a better understanding. 👍
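For context, a standalone sketch of the pre-digest idea floated in this thread: the block author injects the core selector as a pre-runtime digest item (like the AURA author index), and the runtime reads it back. The `CORE` engine id and the one-byte payload are hypothetical, not part of the actual cumulus digest items.

```rust
#[derive(Clone, Debug)]
pub enum DigestItem {
    /// (consensus engine id, opaque payload)
    PreRuntime([u8; 4], Vec<u8>),
    Other(Vec<u8>),
}

const CORE_ENGINE_ID: [u8; 4] = *b"CORE";

/// Extract the core selector the author committed to, if present.
pub fn core_selector_pre_digest(digest: &[DigestItem]) -> Option<u8> {
    digest.iter().find_map(|item| match item {
        DigestItem::PreRuntime(id, payload) if *id == CORE_ENGINE_ID => {
            payload.first().copied()
        }
        _ => None,
    })
}
```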
The CI pipeline was cancelled due to the failure of one of the required jobs.
I am also having some problems fully wrapping my head around the impact of a claim queue offset.
Let's imagine an offset of 2. During authoring I check the claim queue at position 2 and see that I have cores 0, 1, 2. Now I submit 3 blocks at this relay parent. If, for some reason, previous collators skipped a block, would the claims for core 1 at positions 0 and 1 interfere with the candidate chain? Basically, collators before me skipped some blocks, and one of the blocks I submit on core 1 gets an earlier claim but might depend on blocks I submitted on cores 0 and 2? Or does my mental model not match what is actually happening?
Collators don't submit to a specific core. They submit to the nth assigned core in the claim queue. So if the previous collators skipped authoring, you can use their slots in the claim queue. @sandreim, correct me if I'm wrong.
That'd be a serious limit on elastic scaling. You can make a parachain with faster collators, but you cannot speed up validators so easily. I suppose the conversation moved on from there though. |
sandreim
left a comment
LGTM but left a few comments.
cumulus/client/consensus/aura/src/collators/slot_based/block_builder_task.rs
self.last_data = Some((relay_parent, data));
Ok(&mut self.last_data.as_mut().expect("last_data was just set above").1)
This feels a bit clunky. Can we just check/update if needed before, and then in this match just return a mutable reference to the inner value?
I've tried making this look nicer but this was the best I could :D
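For illustration, here is the caching pattern under discussion, reduced to a standalone sketch with simplified types. One way to avoid the `expect` right after the assignment is to invalidate first and then use `Option::get_or_insert_with`, which hands back the mutable reference directly; whether this fits the real (async) code is an assumption.

```rust
struct Cache {
    /// (relay_parent, data) for the last relay parent seen.
    last_data: Option<(u32, String)>,
}

/// Stand-in for the actual fetch of per-relay-parent data.
fn fetch(relay_parent: u32) -> String {
    format!("data-for-{relay_parent}")
}

impl Cache {
    fn data_for(&mut self, relay_parent: u32) -> &mut String {
        // Drop the cached entry if it belongs to a different relay parent...
        if !matches!(&self.last_data, Some((rp, _)) if *rp == relay_parent) {
            self.last_data = None;
        }
        // ...then repopulate (if needed) and borrow in one step, no `expect`.
        &mut self
            .last_data
            .get_or_insert_with(|| (relay_parent, fetch(relay_parent)))
            .1
    }
}
```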
- name: rococo-parachain-runtime
  bump: minor
- name: polkadot-parachain-bin
  bump: major
How is this a major bump? Is anyone using the binary broken?
It's a major bump on polkadot-parachain-lib because we add a new trait bound to a public trait.
When I initially created this prdoc, polkadot-parachain-lib was part of the bin crate, so that's why I bumped the bin crate. Fixed it now.
/// Trait for selecting the next core to build the candidate for.
pub trait SelectCore {
    fn select_core_for_child() -> (CoreSelector, ClaimQueueOffset);
}
…so that it better expresses what this API returns: it doesn't return the core selector for the current block, it returns the core selector for the next block.
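To make the naming discussion concrete, here is a hedged sketch of an implementation of the `SelectCore` trait quoted above. Deriving the selector from the *next* block number (hence "for child") is illustrative; the real cumulus implementation reads its state differently, and the `block_number` parameter is added here only to keep the sketch self-contained.

```rust
pub type CoreSelector = u8;
pub type ClaimQueueOffset = u8;

pub const DEFAULT_CLAIM_QUEUE_OFFSET: u8 = 1;

pub trait SelectCore {
    /// Core selector and claim queue offset for the next (child) block.
    fn select_core_for_child(block_number: u32) -> (CoreSelector, ClaimQueueOffset);
}

pub struct DefaultSelectCore;

impl SelectCore for DefaultSelectCore {
    fn select_core_for_child(block_number: u32) -> (CoreSelector, ClaimQueueOffset) {
        // The next block's number, truncated to a byte: consecutive blocks
        // then rotate through the para's assigned cores once the resolver
        // takes the selector modulo the number of assigned cores.
        (((block_number + 1) % 256) as u8, DEFAULT_CLAIM_QUEUE_OFFSET)
    }
}
```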
skunert
left a comment
Looks really good, only nits. I like the trait based config approach we have here now.
if !claimed_cores.insert(*core_index) {
    tracing::debug!(
        target: LOG_TARGET,
        "Core {:?} was already claimed at this relay chain slot",
This is basically the condition described above, when we don't have enough cores scheduled to support the para's slot duration (at least without merging of para blocks).
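The duplicate-claim check quoted above, as a standalone sketch: `insert` returning false means this core was already claimed at the current relay chain slot, which is exactly the "not enough cores scheduled" condition mentioned in the comment. The function name and the plain `eprintln!` (standing in for `tracing::debug!`) are illustrative.

```rust
use std::collections::HashSet;

/// Returns true if the core could be claimed, false if it was already taken.
pub fn try_claim(claimed_cores: &mut HashSet<u32>, core_index: u32) -> bool {
    if !claimed_cores.insert(core_index) {
        // Mirrors the debug log in the quoted snippet.
        eprintln!("Core {core_index:?} was already claimed at this relay chain slot");
        return false;
    }
    true
}
```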
    (**self).version(relay_parent).await
}

async fn claim_queue(
With the introduction of this, can we remove availability_cores from the interface? I did not check, but it should not be used anymore now, right?
Yes, we could. But is there a good reason for that? There's still information in the availability-cores API that you cannot get elsewhere, like whether a core is occupied or not. Maybe this will come in handy at some point in the future.
Partially implements #5048
What's left to be implemented (in a follow-up PR):
View the RFC for more context: polkadot-fellows/RFCs#103