Separate preparation timeouts for PVF prechecking and execution#6139
Not sure how to label this for release, i.e. what is worthy of release notes?
@m-cat - the default should be 'silent', but major features, API changes, DB locations, and other things that impact the process of running a validator should be present in release notes.
eskimor left a comment
Excellent! Really nice work! One thing with regard to leniency: for execution we have a factor of 6, so having the same factor here seems to make sense and would make it a bit "safer".
With regard to tests, there is hardly anything we could test for this particular change.
Ok, actually the situation is a bit different. For backing/approval timeouts we only have 2-3 validators with the short timeout, while here we have a supermajority, so a smaller leniency can be justified. Still, node load can vary over time, and the validator set may change over time as well ... better safe than sorry.
pepyakin left a comment
Good job. Don't get discouraged by the number of comments. They are all nits, which (at least for my reviews) means they are not blocking the merge and are not necessary to address.
Organizational note: I personally prefer landing code as soon as it's ready. If you see a part that is not ready and that you want to address before the merge, I am up for splitting it into another PR. That will give us a platform for discussing the changes with more focus. It's also possible to land this as-is and send the fixes in a follow-up PR (as long as everything is linked).
priority: Priority,
pvf: Pvf,
/// The timeout for the preparation job.
compilation_timeout: Duration,
Nit: I think here (and elsewhere) it's more correct to say preparation. The worker will perform preparation, which is the combination of prevalidation and compilation (production of the compiled artifact).
};

/// The time period after which the precheck preparation worker is considered unresponsive and will
/// be killed.
Nit: those docs distinguish between a pre-check preparation worker and an execute preparation worker. I am not sure that's the right way of thinking about it. After all, it's the very same worker doing the same job. It does not even have any parametrization.
/// The time period after which the execute preparation worker is considered unresponsive and will
/// be killed.
// NOTE: If you change this make sure to fix the buckets of `pvf_preparation_time` metric.
pub const EXECUTE_COMPILATION_TIMEOUT: Duration = Duration::from_secs(180);
Nit: It would be great if there were a doc line explaining the relationship between the two. Perhaps, moving them into a module (inline or separate file) and in the module doc explaining the stuff we discussed in DMs?
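One possible shape for such a module, as a rough sketch only. The module name `preparation_timeout`, the constant names, the `PreparationKind` enum, and `timeout_for` are all hypothetical and not taken from the PR. The 60-second precheck value is inferred from the 180-second execute timeout and the factor of 3 stated in the PR description.

```rust
use std::time::Duration;

/// Hypothetical module that keeps both preparation timeouts together, so the
/// relationship between them is documented in one place.
///
/// Prechecking is strict: a supermajority of validators has to prepare the PVF
/// within the short timeout. Preparation for execution is more lenient, so that
/// a PVF that passed prechecking still prepares under load.
pub mod preparation_timeout {
    use std::time::Duration;

    /// How much more lenient execution-triggered preparation is (assumed from the PR text).
    pub const LENIENCY_FACTOR: u32 = 3;

    /// Timeout for preparation triggered by prechecking (inferred: 180 s / 3).
    pub const PRECHECK: Duration = Duration::from_secs(60);

    /// Timeout for preparation triggered by execution.
    pub const EXECUTE: Duration = Duration::from_secs(180);
}

/// What triggered the preparation job (illustrative only).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum PreparationKind {
    Precheck,
    Execute,
}

/// Picks the timeout for a preparation job based on what triggered it.
pub fn timeout_for(kind: PreparationKind) -> Duration {
    match kind {
        PreparationKind::Precheck => preparation_timeout::PRECHECK,
        PreparationKind::Execute => preparation_timeout::EXECUTE,
    }
}
```

Keeping the factor as its own constant makes the invariant between the two values easy to assert in a unit test.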
20.0,
30.0,
60.0,
180.0,
Nit: Do you think that's a good resolution for this metric? IOW, ask the question, looking at a metrics dashboard do you think it's possible that you would think for yourself "I wish there were more buckets available that are higher/lesser than 180"? If you are inclined to say yes just plop more bands. It's very cold code.
What do you mean by cold code?
Generally, cold code is code that is not called frequently. Here, I meant that there are no performance reasons to save on the bands. Even assuming that more bands mean less performance, I did not think about this too much: how many preparations per second can we reasonably do in the worst case? 100? So the node performance argument is irrelevant, and by extension the memory argument as well. I don't see other arguments against it.
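To illustrate the resolution question, here is a small sketch of how an observation lands in a bucket. It does not use the real metrics crate. `bucket_index`, `ORIGINAL_BUCKETS`, and `EXTENDED_BUCKETS` are hypothetical names, and the extra 120 s and 240 s bands are made up for the example. Only the four bounds visible in the snippet above are taken from the diff.

```rust
/// The four upper bounds (in seconds) visible in the diff snippet.
pub const ORIGINAL_BUCKETS: [f64; 4] = [20.0, 30.0, 60.0, 180.0];

/// The same bounds with two extra, made-up bands around the 180 s timeout.
pub const EXTENDED_BUCKETS: [f64; 6] = [20.0, 30.0, 60.0, 120.0, 180.0, 240.0];

/// Returns the index of the first bucket whose upper bound covers `secs`,
/// or `None` if the value only fits into the implicit overflow bucket.
pub fn bucket_index(buckets: &[f64], secs: f64) -> Option<usize> {
    buckets.iter().position(|&upper| secs <= upper)
}
```

With only the original bounds, a preparation that took 100 s is indistinguishable from one that took 179 s; the extended bounds separate them, at the cost of two more counters.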
- **Prevalidation:** Right now this just tries to deserialize the binary with
  parity-wasm. It is a part of *preparation*.
- **Compilation:** This is the process of compiling a PVF from wasm code to
  machine code. It is a part of *preparation*.
Nit: This book already has a glossary. Do you think it would be better to move those there? Here, we can leave a note saying that this document is loaded with terms and refer the reader to the glossary.
Alternatively (or better, additionally) we could embed those into the text as explanations. I think as a bonus this would allow us to structure the explanation hierarchically, which IMO is better. Something like the following outline:
- In order to make the PVF usable for candidate validation it has to be registered on-chain
- As part of the registration process, it has to go through pre-checking.
- Pre-checking is a game of attempting preparation and reporting the results back on-chain.
- We define preparation as a process that validates the consistency of the wasm binary (aka prevalidation) and compiles the wasm module into machine code (referred to as the artifact).
- Besides pre-checking, preparation can also be triggered by execution, since compiled artifact is needed for the execution
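The outline above can be sketched in code as preparation being prevalidation followed by compilation. This is an illustrative sketch only: all names here are hypothetical. The prevalidation stand-in merely checks the wasm magic bytes (the real step deserializes the binary with parity-wasm), and the compilation stand-in just copies the bytes instead of producing machine code.

```rust
/// Why preparation failed (hypothetical error type).
#[derive(Debug, PartialEq, Eq)]
pub enum PrepareError {
    Prevalidation(String),
    Compilation(String),
}

/// Stand-in for the compiled artifact.
#[derive(Debug, PartialEq, Eq)]
pub struct Artifact(pub Vec<u8>);

/// Every wasm binary starts with these four bytes.
const WASM_MAGIC: [u8; 4] = [0x00, b'a', b's', b'm'];

/// Stand-in for prevalidation: only checks the magic bytes.
fn prevalidate(code: &[u8]) -> Result<(), PrepareError> {
    if code.starts_with(&WASM_MAGIC) {
        Ok(())
    } else {
        Err(PrepareError::Prevalidation("missing wasm magic bytes".into()))
    }
}

/// Stand-in for compilation: copies the input instead of emitting machine code.
fn compile(code: &[u8]) -> Result<Artifact, PrepareError> {
    Ok(Artifact(code.to_vec()))
}

/// Preparation = prevalidation + compilation. Both prechecking and execution
/// would go through this same function; only the timeout applied around it differs.
pub fn prepare(code: &[u8]) -> Result<Artifact, PrepareError> {
    prevalidate(code)?;
    compile(code)
}
```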
/// The time period after which the execute preparation worker is considered unresponsive and will
/// be killed.
// NOTE: If you change this make sure to fix the buckets of `pvf_preparation_time` metric.
pub const EXECUTE_COMPILATION_TIMEOUT: Duration = Duration::from_secs(180);
Nit: I wonder if this would be better named as lazy or lenient. After all we use it for the heads up signal which also requires a more permissive timeout.
Ok(())
}

/// Handles PVF prechecking.
Nit: ... prechecking requests
} else {
// Artifact is unknown: register it and enqueue a job with the corresponding priority and
//
// PVF.
lol, finally this is fixed 🎉
Thanks for the reviews! 👍 Just stuck on a couple of CI checks:
@eskimor Sounds good, I'll address it in the followup PR.
All good. I was expecting more comments than that for my first PR! I agree about addressing the nits in a followup PR.
jFYI:
Agreed with @pepyakin's comments; other than that, well done.
Note: once a PVF is queued for preparation with some timeout, any subsequent request discards its compilation_timeout parameter and simply enqueues its response_receiver (the way it's implemented now).
This is OK because we can't receive an execute request for unprepared code until it's enacted, which happens once the prechecking process concludes, and code that was already prechecked should pass preparation with a greater timeout.
However, this is an external guarantee so wanted to make sure you keep it in mind.
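The behaviour described in this note can be modelled roughly as follows. This is a simplified sketch, not the host's actual data structures: `Host`, `Preparing`, `request_preparation`, and the use of plain string labels in place of real response channels are all assumptions made for illustration.

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;
use std::time::Duration;

/// Hypothetical identifier of a PVF artifact.
pub type ArtifactId = u64;
/// Stand-in for a `response_receiver`; a real host would hold a channel sender.
pub type Waiter = &'static str;

/// A preparation job that is already queued or running.
pub struct Preparing {
    /// Timeout fixed by the request that first enqueued the job.
    pub timeout: Duration,
    /// Everyone waiting for the result of this job.
    pub waiters: Vec<Waiter>,
}

#[derive(Default)]
pub struct Host {
    preparing: HashMap<ArtifactId, Preparing>,
}

impl Host {
    /// Returns `true` if a new job was enqueued, `false` if the request was
    /// attached to an existing job (in which case `timeout` is discarded).
    pub fn request_preparation(&mut self, id: ArtifactId, timeout: Duration, waiter: Waiter) -> bool {
        match self.preparing.entry(id) {
            Entry::Occupied(mut job) => {
                // Job already exists: keep its original timeout, just add the waiter.
                job.get_mut().waiters.push(waiter);
                false
            }
            Entry::Vacant(slot) => {
                slot.insert(Preparing { timeout, waiters: vec![waiter] });
                true
            }
        }
    }

    /// Read-only view of a queued job, if any.
    pub fn job(&self, id: ArtifactId) -> Option<&Preparing> {
        self.preparing.get(&id)
    }
}
```

In this model, an execute request that arrives while a precheck job is in flight would inherit the stricter precheck timeout, which is why the external guarantee mentioned above matters.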
UPD: nvm didn't make it in time 😪
PULL REQUEST
Overview
Per the linked issue, we make the required changes so that preparation for
execution is more lenient (by a factor of 3) than preparation for prechecking.
We add a compilation_timeout parameter for the PVF preparation job
and also split the COMPILATION_TIMEOUT constant into two new consts.
Todo
What kind of tests should be added?
Issues Closed
Closes #4132