Update Scheduler to Support Relay Chain Block Number Provider#6362
Conversation
```rust
}

let mut incomplete_since = now + One::one();
let mut when = IncompleteSince::<T>::take().unwrap_or(now);
```
Can you explain why it would not work with `IncompleteSince`, without the block `Queue`?

How do we determine the `MaxScheduledBlocks` bound?

With `IncompleteSince` we iterate over blocks that might have no task to execute, and this might make a situation with many incomplete blocks even worse. But probably not by much? One more read?

Both solutions need a strategy for the situation when there are too many tasks that cannot be completed and the task queue only grows, if such a strategy is not yet in place.
With `IncompleteSince` we iterate over blocks that might have no task to execute, and this might make a situation with many incomplete blocks even worse. But probably not by much? One more read?
Yes, but then this becomes unbounded in case too many blocks are skipped. The idea behind using the Queue is to bound this to a sufficient number.
How do we determine the `MaxScheduledBlocks` bound?

This should be determined similarly to the existing `MaxScheduledPerBlock`?

Both solutions need a strategy for the situation when there are too many tasks that cannot be completed and the task queue only grows, if such a strategy is not yet in place.
There is already a retry mechanism and the task is purged if the retry count is exceeded (even if failed).
The `Queue` not only bounds how many blocks are going to be processed from the past. It also bounds how many blocks we can schedule for. If the number is 50, we can schedule only 50 jobs with distinct schedule times.

The `MaxScheduledPerBlock` seems simpler to define to me, because the block size is an existing constraint the system has. But how many distinct schedule time points you can have is something new.

Retries work in case a certain task fails while its function call is being executed (not when the scheduler fails). I meant a case when there are many (or few but too heavy) overdue tasks (`task_block < now`), so that the scheduler never completes them (or needs too much time) and never exits such an overdue state to start processing tasks on time. Do we handle such a case?
The `Queue` not only bounds how many blocks are going to be processed from the past. It also bounds how many blocks we can schedule for. If the number is 50, we can schedule only 50 jobs with distinct schedule times.

Indeed, I do not find it quite comfortable to run a for loop with `IncompleteSince` when there could be an unknown number of blocks passed between successive runs. You could always keep `MaxScheduledBlocks` on the higher side, which would give you a similar experience?

I meant a case when there are many (or few but too heavy) overdue tasks (`task_block < now`), so that the scheduler never completes them (or needs too much time) and never exits such an overdue state to start processing tasks on time. Do we handle such a case?
But this stays as an issue even in the current implementation? The change here just makes it bounded, so that the scheduling itself is blocked in such a case.
Maybe we can put quite a big bound on `MaxScheduledBlocks`; it is just a vec of block numbers.
Playing devil's advocate here, since there could be parachains that only produce one block every two hours, which would get stuck without ever catching up the `IncompleteSince`.
I suggest focusing primarily on our use case, where Asset Hub sets up a scheduler with a Relay Chain block provider. If we can solve this problem with minimal code changes, or better, no code changes, I would prefer that approach. We can document our expectations for the `BlockProvider` type next to its declaration, and if we or someone else encounter the use case you described in the future, we can address it then.
I wouldn't complicate this pallet for theoretically possible use cases. Instead, we should target this issue for resolution in the next SDK release.
If we constrain our use case, then we can:

1. Keep it as it was, iterating on every block.
2. Go one level faster: a new storage map that maps from `block_number / 256` to a bit-array of size 256. We then iterate on this map, meaning one iteration per 256 blocks, i.e. 25.6 minutes. We only have to iterate 56 times to cover one day, ~1700 for one month. PoV cost is only 256 bits (+ keys) for our Asset Hub use case. PoV gets bad if we read many empty values, but I guess it is ok.
3. (Implement a priority queue on top of chunks in a storage map.)

I suggest 1 or 2; it is good for almost everything IMO.
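Option 2 above can be sketched in plain Rust. This is only an illustration: a `HashMap` stands in for the FRAME storage map and `u32` stands in for the pallet's generic block number; neither is the actual pallet type.

```rust
use std::collections::HashMap;

/// One chunk covers 256 consecutive block numbers as a bit-array
/// (four u64 words), keyed by `block_number / 256`.
/// `ChunkMap` is a hypothetical stand-in for a FRAME `StorageMap`.
struct ChunkMap {
    chunks: HashMap<u32, [u64; 4]>,
}

impl ChunkMap {
    fn new() -> Self {
        Self { chunks: HashMap::new() }
    }

    /// Mark `block` as having a scheduled agenda.
    fn insert(&mut self, block: u32) {
        let chunk = self.chunks.entry(block / 256).or_insert([0u64; 4]);
        let bit = (block % 256) as usize;
        chunk[bit / 64] |= 1 << (bit % 64);
    }

    /// Find the next block >= `from` with an agenda, scanning one
    /// 256-block chunk (one storage read) per outer iteration.
    fn next_scheduled(&self, from: u32) -> Option<u32> {
        let mut key = from / 256;
        let max_key = self.chunks.keys().copied().max()?;
        while key <= max_key {
            if let Some(chunk) = self.chunks.get(&key) {
                for bit in 0..256usize {
                    let block = key * 256 + bit as u32;
                    if block >= from && (chunk[bit / 64] & (1 << (bit % 64))) != 0 {
                        return Some(block);
                    }
                }
            }
            key += 1;
        }
        None
    }
}
```

Each outer iteration of `next_scheduled` corresponds to one storage read of a 256-bit value, which is where the "56 reads per day" estimate comes from.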
I suggest focusing primarily on our use case, where Asset Hub sets up a scheduler with a Relay Chain block provider.
But even in that case can the thing happen that i mentioned. There just is no guarantee as to how much history AH needs to check per block. So we either bound it with a queue or not. Maybe the 2) from Gui is a reasonable center point, but it still removes the property of the scheduler that every scheduled agenda will eventually be serviced since it could miss some.
Like if we only want to change it for AHM then maybe forking the pallet into the runtimes repo would work otherwise we may screw parachain teams by cutting corners here.
Maybe the 2) from Gui is a reasonable center point, but it still removes the property of the scheduler that every scheduled agenda will eventually be serviced since it could miss some.
I missed this point, what can be missed?
I agree we should remove MaxScheduledBlocks and MaxStaleTaskAge and require in the pallet description that performance can significantly decrease if the chain blocks are too far apart.
As a note I think a priority queue on top of chunks should not be so hard to implement, and is probably the best solution.
substrate/frame/scheduler/src/lib.rs
```rust
#[pallet::storage]
pub type IncompleteSince<T: Config> = StorageValue<_, BlockNumberFor<T>>;

/// Provider for the block number. Normally this is the `frame_system` pallet.
```
Normally in what case? Parachain or relay/solo?
substrate/frame/scheduler/src/lib.rs
```rust
/// Provider for the block number. Normally this is the `frame_system` pallet.
type BlockNumberProvider: BlockNumberProvider;

/// The maximum number of blocks that can be scheduled.
```
Any hints on how to configure this? Parachain teams will read this and not know what number to put.
substrate/frame/scheduler/src/lib.rs
Outdated
| #[pallet::constant] | ||
| type MaxScheduledBlocks: Get<u32>; | ||
|
|
||
| /// The maximum number of blocks that a task can be stale for. |
Also maybe a hint for a sane default value.
substrate/frame/scheduler/src/lib.rs
```rust
/// The queue of block numbers that have scheduled agendas.
#[pallet::storage]
pub(crate) type Queue<T: Config> =
    StorageValue<_, BoundedVec<BlockNumberFor<T>, T::MaxScheduledBlocks>, ValueQuery>;
```
Do we know if one vector is enough? I think the referenda pallet creates an alarm for each ref...
Not sure if I get this, can you elaborate more?

Is it okay if I convert it to a vector of vectors?
A vector of vectors would be the same as a vector in regards to PoV.
I think if we need to make a lot of schedules we will have to use multiple storage items.
For our use case (Governance) it is probably fine to have just one vector, I guess, since we only put a single block number in there. So we could reasonably have 1000 in there or so. But yes, if someone uses it for other things then it could cause high PoV on a parachain.
```rust
}

let mut incomplete_since = now + One::one();
let mut when = IncompleteSince::<T>::take().unwrap_or(now);
```
Yes, the referenda pallet creates an alarm for every ref to check the voting turnout.

We have a problem with the (3.2) case only. In the current version (without the `Queue`) it will eventually handle the overdue blocks (we can even calculate how many blocks it will take, say if there are no tasks scheduled in that period).
Depends on how many blocks are produced. I guess when we assume that the parachain will produce blocks at least as fast as it can advance the scheduler then yes.
Playing devil's advocate here, since there could be parachains that only produce one block every two hours, which would get stuck without ever catching up the `IncompleteSince`.
Conceptually, I believe that a priority queue is the right data structure. We try to evaluate an ordered list of tasks in order; that is exactly what a priority queue is good at. The issue with implementing this as a vector is obviously the PoV.
Maybe we can implement the `Queue` as a B-tree? Then we can get the next task in log reads and insert in log writes. And it allows us to do exactly what we want: get the next pending task. It could be PoV-optimized by using chunks as well.
To me it just seems that most of the pain here is that we are using the wrong data structure for the job.
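The B-tree idea can be sketched with `std::collections::BTreeSet`, assuming plain `u32` block numbers for illustration rather than the pallet's generic type: insertion is logarithmic, and "next pending task" is a cheap ordered lookup.

```rust
use std::collections::BTreeSet;

// Earliest queued block that is due at or before `now`.
fn next_due(queue: &BTreeSet<u32>, now: u32) -> Option<u32> {
    queue.range(..=now).next().copied()
}

// Earliest queued block overall, due or not.
fn next_scheduled(queue: &BTreeSet<u32>) -> Option<u32> {
    queue.iter().next().copied()
}
```

An on-chain version would still need PoV-aware chunking, as noted above; this only shows the access pattern the data structure gives for free.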
substrate/frame/scheduler/src/lib.rs
```rust
let mut iter = queue.iter();
let mut to_remove = Vec::new(); // Collect items to remove

// Iterate and collect items to remove
for _ in 0..index {
    iter.next();
}
for item in iter {
    to_remove.push(*item);
}

// Now remove the collected items
for item in to_remove {
    queue.remove(&item);
}
```
Not sure what is going on here, but maybe something like `queue.iter().drain().take(index).map(drop);` would work.
That gave me an error:

```
error[E0599]: no method named `drain` found for mutable reference `&mut BTreeSet<<... as BlockNumberProvider>::BlockNumber>` in the current scope
```
If the interface for bounded vec is too annoying, you can convert to a vec and convert back, because the code is difficult to read.
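For what it's worth, if the queue really is a `BTreeSet`, then `split_off` already does in one call what the collect-then-remove loop above emulates. A sketch with hypothetical `u32` keys:

```rust
use std::collections::BTreeSet;

// Everything at or after `key` is removed from `queue` and returned;
// `queue` keeps only the entries before `key`.
fn truncate_from(queue: &mut BTreeSet<u32>, key: u32) -> BTreeSet<u32> {
    queue.split_off(&key)
}
```

Since `split_off` takes a key rather than an index, the caller would first look up the value at the cut position, e.g. with `queue.iter().nth(index)`.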
substrate/frame/scheduler/src/lib.rs
```rust
incomplete_since = incomplete_since.min(when);
let mut index = 0;
let queue_len = queue.len();
for when in queue.iter().skip(index) {
```
You can store the iterator and make it mutable to call `next` on it. Something like `while let Some(value) = iter.next() {`.
The error was:

```
error[E0608]: cannot index into a value of type `BTreeSet<<<T as pallet::Config>::BlockNumberProvider as sp_runtime::traits::BlockNumberProvider>::BlockNumber>`
```

But I will try something similar again and see if that works.
substrate/frame/scheduler/src/lib.rs
```rust
let mut when = IncompleteSince::<T>::take().unwrap_or(now);
let mut executed = 0;
let queue = Queue::<T>::get();
let end_index = match queue.iter().position(|&x| x >= now) {
```
Why is this needed? There are only two cases: Either the first element is >= now or it is not.
We can just check the first element in the loop below and then take it from there.
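The suggested shape could look roughly like this sketch, where a `VecDeque<u32>` in ascending order stands in for the sorted queue (both the container and the block-number type are assumptions for illustration):

```rust
use std::collections::VecDeque;

// Pop and return every entry strictly before `now`, stopping at the
// first entry that is not yet due; no up-front `position` scan needed.
fn take_due(queue: &mut VecDeque<u32>, now: u32) -> Vec<u32> {
    let mut due = Vec::new();
    while let Some(&front) = queue.front() {
        if front >= now {
            break; // first future entry; everything behind it is later
        }
        due.push(front);
        queue.pop_front();
    }
    due
}
```

This mirrors the `x >= now` boundary in the excerpt above: because the queue is sorted, checking only the front element per iteration is enough.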
substrate/frame/scheduler/src/lib.rs
```rust
        return;
    },
};
if end_index == 0 {
```
The +1 above prevents this from ever happening.
substrate/frame/scheduler/src/lib.rs
```rust
let queue_len = queue.len();
for when in queue.iter().skip(index) {
    if *when < now.saturating_sub(T::MaxStaleTaskAge::get()) {
        Agenda::<T>::remove(*when);
```
Why do we need this MaxStaleTaskAge? I think it should maybe rather be checked on inserting into an agenda instead of when servicing it, since in the service step we have nothing to lose from executing it.
But I am not sure if we need it at all.
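Checking staleness at insertion time, as suggested, might look like this sketch; the `MAX_STALE_TASK_AGE` constant and the `u32` block numbers are hypothetical stand-ins for the pallet's configured types.

```rust
// Hypothetical bound standing in for `T::MaxStaleTaskAge::get()`.
const MAX_STALE_TASK_AGE: u32 = 100;

// Reject schedule targets that are already too far in the past,
// instead of pruning them while servicing the queue.
fn validate_when(when: u32, now: u32) -> Result<(), &'static str> {
    if when < now.saturating_sub(MAX_STALE_TASK_AGE) {
        return Err("scheduled block is too stale");
    }
    Ok(())
}
```

The `saturating_sub` keeps the check well-defined early in the chain's life, when `now` is smaller than the age bound.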
Am a bit torn here. I often think that just forking the pallets would make our lives easier since it is not a breaking change anymore... but it could mean increased maintenance. @seadanda @kianenigma any opinion?
```rust
const SEED: u32 = 0;
const BLOCK_NUMBER: u32 = 2;

fn block_number<T: Config>() -> u32 {
```
Naming it `max_scheduled_blocks` sounds more appropriate.
From this thread: #6362 (comment)

I think this structure shouldn't add too much complexity and will solve most parachain use cases. WDYT?
With async backing/elastic scaling, we break the assumption made in the scheduler and elsewhere that the parachain block number represents a regular monotonic clock which can be used as a proxy for a wall clock. IMO the pallet should have had this feature from the start, but since this is the contract we released under, I understand why making changes which decrease the benchmarked performance seems like a regression. In terms of functionality, nothing changes for people who set the block provider to `frame_system`.

Maybe we could introduce a feature flag which removes the extra data structure and introduced complexity; not sure how disgusting that would be. Since the benchmarking depends on how it's configured, are we even sure there will be a weight impact if it's configured to use `frame_system`?
No feature flag plz. It is guaranteed to introduce disaster. What are the downsides of making a new pallet? I see nothing.
I propose that we iterate on this issue and split it into a few separate ones, prioritizing the most time-sensitive part, which I believe we can close soon enough to include in the next SDK release.

Next, let's first see if we agree on (1). If so, we could ask @seemantaggarwal to create a separate PR for it. For (2), we could open a separate issue and discuss the solution for both (2) and (3), whether that ends up being one or two different solutions.

Why not complete the current solution? There's a lot of disagreement, and we might not make it into the next release. Personally, I'm not happy with the solution either. The `Queue`, with its overhead, impacts scheduler setups using a local block provider, even those that don't have issue (2). Additionally, it's unclear how to determine a good `MaxScheduledBlocks` value. We can discuss this in a separate issue.

@ggwpez @gui1117 @kianenigma @seadanda @xlc @seemantaggarwal
All GitHub workflows were cancelled due to the failure of one of the required jobs.
@muharem I agree with the disagreement parts. What I will do is have a word with @gui1117 and potentially @seadanda; I am not sure where the line can be drawn for each follow-up. In the meantime, I will leave the current PR in slightly better shape for anyone to be able to pick it up later easily.
How about just copy & paste the pallet scheduler in this PR and call it something else? And if time allows, abstract duplicated logic into some helper functions.

If we do (1) as Muharem suggests, it only modifies an associated type and nothing else; I think it is fine.
From discussion with @gui1117: I will create a separate issue for (1) for now, and then we can circle back to (2) at a later stage.
Issue created here: #7434
Follow-up from #6362 (comment).

The goal of this PR is to have the scheduler pallet work on a parachain which does not produce blocks on a regular schedule, and thus can use the relay chain as a block provider. Because blocks are not produced regularly, we cannot make the assumption that the block number increases monotonically, and thus have new logic to handle multiple spend periods passing between blocks.

Requirement: instead of using the hard-coded system block number, we add an associated type `BlockNumberProvider`.

---------

Signed-off-by: Oliver Tale-Yazdi <oliver.tale-yazdi@parity.io>
Co-authored-by: Oliver Tale-Yazdi <oliver.tale-yazdi@parity.io>
I think this is already done in #7441. Should we close it?
Yes, it is resolved by the other PR.
Step in #6297
This PR adds the ability for the Scheduler pallet to specify its source of the block number. This is needed for the scheduler pallet to work on a parachain which does not produce blocks on a regular schedule, and thus can use the relay chain as a block provider. Because blocks are not produced regularly, we cannot make the assumption that the block number increases monotonically, and thus have new logic via a `Queue` to handle multiple blocks with valid agendas passing between them.

This change only needs a migration for the `Queue` if the `BlockNumberProvider`:

- continues to use the system pallet's block number, or
- uses the relay chain's block number.

However, we would need separate migrations if the deployed pallets are upgraded on an existing parachain, and the `BlockNumberProvider` uses the relay chain block number.

Todo
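For illustration, here is a simplified sketch of what the `BlockNumberProvider` abstraction looks like. The real trait lives in `sp_runtime::traits` and has more methods; both provider structs below and their return values are hypothetical stand-ins for reading the system pallet or the relay state proof.

```rust
// Simplified version of the trait the scheduler's associated type uses:
// the pallet asks the provider for "now" instead of reading
// `frame_system` directly.
trait BlockNumberProvider {
    type BlockNumber;
    fn current_block_number() -> Self::BlockNumber;
}

// Local clock: the chain's own block number, advancing by one per block.
struct SystemProvider;
impl BlockNumberProvider for SystemProvider {
    type BlockNumber = u32;
    fn current_block_number() -> u32 {
        // In a runtime this would read frame_system's block number.
        7
    }
}

// Relay-chain clock: may jump by more than one between two successive
// parachain blocks, which is why the scheduler needs the `Queue` logic.
struct RelayProvider;
impl BlockNumberProvider for RelayProvider {
    type BlockNumber = u32;
    fn current_block_number() -> u32 {
        // In a runtime this would read the relay-chain state proof.
        7_000
    }
}
```

Swapping the associated type between the two providers changes only where "now" comes from; the rest of the scheduler logic is unchanged.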