PVF: re-preparing artifact on failed runtime construction #3187
|
@s0me0ne-unkn0wn before I continue, could you please have a look? Just to confirm whether this is the right direction.
|
Looks good, thank you! TBH, I had a feeling that it would require a more complex design, but it looks like you managed to keep it simple. Considering the complexity of PVF host internals, we could definitely use a test to check that the behavior is correct, but creating it is probably not as easy as it sounds 🤔
|
@s0me0ne-unkn0wn hi! :) I believe I was wrong about the ordering of the reply on the result channel of `ValidationBackend` and the artifact removal in the case of `RuntimeConstruction`. They are concurrent, but as the artifact name is random, it's not a race issue (and in practice, the artifact is removed first in the majority of cases; I did some tests).
|
Still struggling to find time to review it myself and run some local tests, maybe other guys will be faster than me. Overall, looks very good! I just need to dive a bit deeper into it. |
s0me0ne-unkn0wn left a comment:
Looks really good! I've done some local tests, and it works even better than I expected; it catches some cases of on-disk artifact corruption that were only supposed to be addressed by checking hashes.
Left some nits, comments, and questions below, but none are blockers.
/// A possibly transient runtime instantion error happend during the execution; maybe retried
/// with preparation
Suggested change:
- /// A possibly transient runtime instantion error happend during the execution; maybe retried
- /// with preparation
+ /// A possibly transient runtime instantiation error happened during the execution; may be retried
+ /// with re-preparation
polkadot/node/core/pvf/src/host.rs
Outdated
) -> Result<(), Fatal> {
	let (artifact_id, path) = artifacts
		.remove(artifact_id)
		.expect("artifact sent by the execute queue for removal exists; qed");
Is it necessarily true? Could the pruning procedure remove it in the meantime? I'm just thinking out loud here, not asserting. It's probably not a valid concern, as the execution should mark the artifact as recently used, so pruning shouldn't touch it. But how bad would it be if we silently ignored an attempt to remove an artifact that doesn't even exist in the cache? Sounds safe to me.
Agreed that it is safe and there is no need for the `qed` assertion :). Fixed in a10f16d
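A minimal sketch of the point agreed on above, assuming a simplified in-memory table: removal returns an `Option`, and a missing entry is treated as a no-op instead of being asserted with `expect`. `Artifacts`, `ArtifactId`, and `handle_artifact_removal` here are hypothetical stand-ins written for illustration; they are not the actual types or functions in `host.rs`.

```rust
use std::collections::HashMap;
use std::path::PathBuf;

/// Hypothetical stand-in for the real artifact identifier type.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct ArtifactId(String);

/// Hypothetical in-memory artifact table. Only the cache entry is tracked;
/// nothing on disk is touched by this sketch.
#[derive(Default)]
struct Artifacts {
    inner: HashMap<ArtifactId, PathBuf>,
}

impl Artifacts {
    fn insert(&mut self, id: ArtifactId, path: PathBuf) {
        self.inner.insert(id, path);
    }

    /// Remove artifact by its id. Returns `None` if it is not in the table.
    fn remove(&mut self, id: ArtifactId) -> Option<(ArtifactId, PathBuf)> {
        self.inner.remove_entry(&id)
    }
}

/// Handle a removal request without assuming the artifact still exists.
/// Returns `true` if an entry was actually removed.
fn handle_artifact_removal(artifacts: &mut Artifacts, id: ArtifactId) -> bool {
    match artifacts.remove(id) {
        Some((_id, _path)) => true,
        // Already gone (for example, pruned in the meantime): nothing to do.
        None => false,
    }
}
```

With this shape, a second removal request for the same id simply returns `false` rather than panicking.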
}

/// Remove artifact by its id.
pub fn remove(&mut self, artifact_id: ArtifactId) -> Option<(ArtifactId, PathBuf)> {
Do I get it right, we don't currently remove anything from disk, either here or in the unused artifact pruning procedure? Are they always removed from the memory cache table only, and the disk is cleaned up only on node startup? It's totally okay for now, but we shouldn't forget that if we ever decide to re-enable artifact persistence.
Oh, it seems like I indeed missed the moment when artifact names became really random, so right now, it's not a concern at all. Never mind.
Yeah, the logic of `remove` is the same as with pruning, i.e. it affects only the cache.
@s0me0ne-unkn0wn thank you for the catch with the artifacts' stale cache and the potential panic because of it!
	break
}

let mut wait_retry_delay = true;
What do you think about naming it `retry_immediately`? I would also appreciate a brief comment on why we need it.
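To illustrate why such a flag might be needed, here is a hedged sketch of a retry loop in which some errors are retried after a delay and others right away. It assumes that a `RuntimeConstruction` failure leads to the artifact being dropped, so a retry re-prepares it and waiting adds nothing. `ExecError`, its `TransientWorkerFailure` and `Invalid` variants, `validate_with_retry`, and the `max_retries` bound are hypothetical names for illustration, not the code under review.

```rust
use std::time::Duration;

#[derive(Debug, Clone, PartialEq)]
enum ExecError {
    /// Hypothetical: a possibly transient worker failure, retried after a delay.
    TransientWorkerFailure,
    /// Runtime construction failed. Assumed here: the artifact is dropped and
    /// re-prepared on retry, so there is no reason to wait.
    RuntimeConstruction,
    /// Hypothetical: a deterministic failure that is never retried.
    Invalid,
}

/// Run `attempt`, retrying retriable errors up to `max_retries` times.
/// `sleep` is injected so the delay policy can be observed in tests.
fn validate_with_retry(
    mut attempt: impl FnMut() -> Result<(), ExecError>,
    mut sleep: impl FnMut(Duration),
    retry_delay: Duration,
    max_retries: u32,
) -> Result<(), ExecError> {
    let mut result = attempt();
    let mut retries = 0;
    while retries < max_retries {
        // Decide whether to retry at all, and whether to wait first.
        let wait_retry_delay = match &result {
            Err(ExecError::TransientWorkerFailure) => true,
            Err(ExecError::RuntimeConstruction) => false,
            // Success or a deterministic error: stop retrying.
            _ => break,
        };
        if wait_retry_delay {
            sleep(retry_delay);
        }
        retries += 1;
        result = attempt();
    }
    result
}
```

Whether the flag is expressed as `wait_retry_delay` or inverted as `retry_immediately` is purely a naming choice; the branching stays the same.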
Resolves #3139
- `execute_artifact`
- `RuntimeConstruction` error during the execution
- `validate_candidate_with_retry` of `ValidationBackend` with the case of retriable `RuntimeConstruction` error during the execution
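The overall idea in the PR title, re-preparing an artifact after a failed runtime construction, can be sketched roughly as follows. This is a toy model under stated assumptions: `Host`, `ExecOutcome`, and the string code hash are hypothetical, preparation is reduced to a counter, and the real host's queues, workers, and on-disk artifacts are left out entirely.

```rust
use std::collections::HashSet;

/// Outcome of one execution attempt (hypothetical, simplified).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ExecOutcome {
    Valid,
    RuntimeConstruction,
}

/// Toy host: tracks which code hashes have a prepared artifact in the cache.
#[derive(Default)]
struct Host {
    prepared: HashSet<String>,
    /// Number of times preparation ran, for observation only.
    preparations: u32,
}

impl Host {
    /// Prepare the artifact if it is not cached, then run the execution.
    /// On a `RuntimeConstruction` failure, drop the cache entry so that the
    /// next request for the same code prepares it again.
    fn execute(&mut self, code_hash: &str, run: impl FnOnce() -> ExecOutcome) -> ExecOutcome {
        if !self.prepared.contains(code_hash) {
            self.preparations += 1;
            self.prepared.insert(code_hash.to_string());
        }
        let outcome = run();
        if outcome == ExecOutcome::RuntimeConstruction {
            self.prepared.remove(code_hash);
        }
        outcome
    }
}
```

In this model a failed construction followed by a retry results in two preparations, while further successful executions reuse the cached entry.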