PVF: re-preparing artifact on failed runtime construction by maksimryndin · Pull Request #3187 · paritytech/polkadot-sdk

maksimryndin · 2024-02-02T11:57:31Z

resolve #3139

use a distinguishable error for execute_artifact
remove artifact in case of a RuntimeConstruction error during the execution
augment the validate_candidate_with_retry of ValidationBackend with the case of retriable RuntimeConstruction error during the execution
update the book (https://paritytech.github.io/polkadot-sdk/book/node/utility/pvf-host-and-workers.html#retrying-execution-requests)
add a test
run zombienet tests
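
The flow in the list above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code from this PR: `ExecuteError`, `ArtifactCache`, and `validate_with_repreparation` are hypothetical names, and the execution step is modeled as a plain closure instead of the real PVF host and workers.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified stand-in for the error returned by artifact execution.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ExecuteError {
    /// The candidate is invalid; retrying cannot help.
    Invalid(String),
    /// The runtime could not be constructed from the artifact (possibly transient,
    /// e.g. a corrupted artifact); may be retried with re-preparation.
    RuntimeConstruction(String),
}

/// Hypothetical in-memory artifact cache keyed by a string id.
#[derive(Default)]
pub struct ArtifactCache {
    prepared: HashSet<String>,
    /// Counts how many times an artifact had to be (re-)prepared.
    pub preparations: usize,
}

impl ArtifactCache {
    /// Prepares the artifact only if it is not cached yet.
    pub fn ensure_prepared(&mut self, id: &str) {
        if self.prepared.insert(id.to_owned()) {
            self.preparations += 1;
        }
    }

    /// Drops the artifact from the cache; returns whether it was present.
    pub fn remove(&mut self, id: &str) -> bool {
        self.prepared.remove(id)
    }
}

/// Executes the artifact; on a `RuntimeConstruction` error the artifact is removed
/// from the cache so that the next attempt prepares it again.
pub fn validate_with_repreparation<F>(
    cache: &mut ArtifactCache,
    id: &str,
    max_retries: usize,
    mut execute: F,
) -> Result<(), ExecuteError>
where
    F: FnMut() -> Result<(), ExecuteError>,
{
    let mut retries_left = max_retries;
    loop {
        cache.ensure_prepared(id);
        match execute() {
            Err(ExecuteError::RuntimeConstruction(_)) if retries_left > 0 => {
                // The distinguishable error tells us the artifact itself is suspect.
                cache.remove(id);
                retries_left -= 1;
            }
            // Success, a non-retriable error, or retries exhausted.
            other => return other,
        }
    }
}
```

With `max_retries` set to 1, a single `RuntimeConstruction` failure leads to exactly one re-preparation and one more execution attempt, while an `Invalid` result is returned without any retry.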

maksimryndin · 2024-02-02T12:13:00Z

@s0me0ne-unkn0wn before I continue, could you please have a look? Just to confirm whether this is the right direction.

s0me0ne-unkn0wn · 2024-02-03T10:56:13Z

Looks good, thank you! TBH, I had a feeling that it would require a more complex design, but it looks like you managed to keep it simple.

Considering the complexity of PVF host internals, we could definitely use a test to check the behavior is correct, but probably creating it is not as easy as it sounds 🤔

polkadot/node/core/pvf/src/execute/worker_interface.rs

maksimryndin · 2024-02-05T16:53:45Z

@s0me0ne-unkn0wn hi!:) I believe T13-documentation should be added

I was wrong about the ordering of the reply to the result channel of ValidationBackend and the artifact removal in case of RuntimeConstruction. They are concurrent, but as the artifact name is random, it's not a race issue (and in practice, the artifact is removed first in the majority of cases; I did some tests).

s0me0ne-unkn0wn · 2024-02-09T12:59:23Z

Still struggling to find time to review it myself and run some local tests, maybe other guys will be faster than me. Overall, looks very good! I just need to dive a bit deeper into it.

s0me0ne-unkn0wn

Looks really good! I've done some local tests, and it works even better than I expected; it catches some cases of on-disk artifact corruption that were only supposed to be addressed by checking hashes.

Left some nits, comments, and questions below, but none are blockers.

s0me0ne-unkn0wn · 2024-02-11T17:32:51Z

polkadot/node/core/pvf/common/src/execute.rs

+	/// A possibly transient runtime instantion error happend during the execution; maybe retried
+	/// with preparation


Suggested change

-	/// A possibly transient runtime instantion error happend during the execution; maybe retried
-	/// with preparation
+	/// A possibly transient runtime instantiation error happened during the execution; may be retried
+	/// with re-preparation

My bad :) thank you! Fixed at a10f16d

s0me0ne-unkn0wn · 2024-02-11T17:50:15Z

polkadot/node/core/pvf/src/host.rs

+) -> Result<(), Fatal> {
+	let (artifact_id, path) = artifacts
+		.remove(artifact_id)
+		.expect("artifact sent by the execute queue for removal exists; qed");


Is it necessarily true? Could the pruning procedure remove it in the meantime? I'm just discoursing on it now, not asserting. Probably, it's not a valid concern, as the execution should mark the artifact as recently used, and pruning shouldn't touch it. But how bad could it be if we silently ignored the fact that we tried to remove an artifact that didn't even exist in the cache? Sounds safe to me.

Agree that it is safe and there is no need for the qed assertion :). Fixed at a10f16d
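
For illustration, here is a minimal sketch of what "silently ignoring" a missing artifact could look like instead of the `expect`. The `ArtifactId` newtype, the `Artifacts` table, and `handle_artifact_removal` are simplified, hypothetical stand-ins; only the shape of `remove` follows the signature shown in this PR.

```rust
use std::collections::HashMap;
use std::path::PathBuf;

/// Hypothetical stand-in for the real artifact identifier type.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct ArtifactId(pub String);

/// Simplified in-memory artifact table; nothing here touches the disk.
#[derive(Default)]
pub struct Artifacts {
    inner: HashMap<ArtifactId, PathBuf>,
}

impl Artifacts {
    /// Registers a prepared artifact under its id.
    pub fn insert(&mut self, artifact_id: ArtifactId, path: PathBuf) {
        self.inner.insert(artifact_id, path);
    }

    /// Remove artifact by its id, returning the entry if it was cached.
    pub fn remove(&mut self, artifact_id: ArtifactId) -> Option<(ArtifactId, PathBuf)> {
        self.inner.remove_entry(&artifact_id)
    }
}

/// Handles a removal request without asserting that the entry exists.
/// Returns the path of the removed artifact, or `None` if the entry was
/// already gone (for example, because pruning got there first).
pub fn handle_artifact_removal(
    artifacts: &mut Artifacts,
    artifact_id: ArtifactId,
) -> Option<PathBuf> {
    // A missing entry is not an error: there is simply nothing left to remove.
    artifacts.remove(artifact_id).map(|(_id, path)| path)
}
```

Calling `handle_artifact_removal` twice for the same id yields `Some(path)` the first time and `None` the second time, with no panic in either case.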

s0me0ne-unkn0wn · 2024-02-11T17:57:07Z

polkadot/node/core/pvf/src/artifacts.rs

 	}

+	/// Remove artifact by its id.
+	pub fn remove(&mut self, artifact_id: ArtifactId) -> Option<(ArtifactId, PathBuf)> {


Do I get it right, we don't currently remove anything from disk, either here or in the unused artifact pruning procedure? Are they always removed from the memory cache table only, and the disk is cleaned up only on node startup? It's totally okay for now, but we shouldn't forget that if we ever decide to re-enable artifact persistence.

Oh, it seems like I indeed missed the moment when artifact names became really random, so right now, it's not a concern at all. Never mind.

Yeah, the logic of remove is the same as with pruning, i.e. it affects only the cache

maksimryndin · 2024-02-12T09:11:06Z

@s0me0ne-unkn0wn thank you for the catch with artifacts' stale cache and a potential panic because of it!

AndreiEres

Good job, thank you!

AndreiEres · 2024-02-28T11:09:06Z

polkadot/node/core/candidate-validation/src/lib.rs

 				break
 			}
-
+			let mut wait_retry_delay = true;


What do you think about retry_immediately? I would appreciate a brief comment on why we need it

Sure, thank you! Fixed at d722ccf
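
To make the purpose of the flag concrete, here is a rough sketch of a retry loop that waits between attempts except after a runtime construction error. It assumes that the delay is skipped in that case because the retry goes through re-preparation anyway; `AttemptError`, `run_with_retries`, and the injected `sleep` closure are hypothetical and not taken from the PR.

```rust
use std::time::Duration;

/// Hypothetical error kinds for a single validation attempt.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum AttemptError {
    /// A possibly transient failure, e.g. the worker died unexpectedly.
    WorkerDied,
    /// The runtime could not be constructed from the artifact.
    RuntimeConstruction,
}

/// Runs `attempt` up to `max_attempts` times. Between attempts it calls
/// `sleep(delay)`, unless the failure was `RuntimeConstruction`, in which
/// case the next attempt starts immediately.
pub fn run_with_retries<A, S>(
    max_attempts: usize,
    delay: Duration,
    mut attempt: A,
    mut sleep: S,
) -> Result<(), AttemptError>
where
    A: FnMut() -> Result<(), AttemptError>,
    S: FnMut(Duration),
{
    let mut attempts_left = max_attempts;
    loop {
        let error = match attempt() {
            Ok(()) => return Ok(()),
            Err(error) => error,
        };
        attempts_left = attempts_left.saturating_sub(1);
        if attempts_left == 0 {
            return Err(error);
        }
        // Only wait when the failure is not a runtime construction error.
        let wait_retry_delay = !matches!(error, AttemptError::RuntimeConstruction);
        if wait_retry_delay {
            sleep(delay);
        }
    }
}
```

Injecting `sleep` as a closure keeps the sketch synchronous and easy to test; a caller could pass `std::thread::sleep` or record the requested delays instead.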

maksimryndin added 2 commits February 1, 2024 17:06

pvf execute artifact return distinguishable error

e409203

pvf retriable candidate validation in case of a runtime construction error during execution

a2bd8de

maksimryndin mentioned this pull request Feb 2, 2024

PVF: Consider re-preparing artifact on failed runtime construction #3139

Closed

s0me0ne-unkn0wn reviewed Feb 3, 2024

polkadot/node/core/pvf/src/execute/worker_interface.rs (outdated)

s0me0ne-unkn0wn added the T0-node This PR/Issue is related to the topic “node”. label Feb 3, 2024

maksimryndin added 3 commits February 3, 2024 11:44

pvf fix comments

3f3e132

pvf test for artifact corruption

b4c8361

update the book and prdoc on runtime construction retry

30b3e8a

maksimryndin added 2 commits February 5, 2024 16:54

update prdoc

70cc13f

fix markdown lint

2e0f930

maksimryndin marked this pull request as ready for review February 5, 2024 17:12

maksimryndin added 2 commits February 7, 2024 13:58

pvf sync between exec result and artifact removal

c1bd7bf

pvf corrupted artifact test

ac7d3d2

s0me0ne-unkn0wn requested review from alexggh, alindima and koute February 9, 2024 12:57

s0me0ne-unkn0wn approved these changes Feb 11, 2024

pvf runtime construction error handling during execution: review comments

a10f16d

s0me0ne-unkn0wn requested a review from eskimor February 13, 2024 09:32

maksimryndin added 4 commits February 20, 2024 14:51

pvf reprare on runtime construction: cargo fmt

ef88ee5

Merge branch 'master' into pvf-execute-artifact-specific-error

ed77274

Merge branch 'master' into pvf-execute-artifact-specific-error

ab4f40c

Merge branch 'master' into pvf-execute-artifact-specific-error

5bf6daa

AndreiEres approved these changes Feb 28, 2024

pvf runtime construction retry: pr feedback

d722ccf

Merge branch 'master' into pvf-execute-artifact-specific-error

2a83c1d

s0me0ne-unkn0wn enabled auto-merge February 28, 2024 16:05

s0me0ne-unkn0wn added this pull request to the merge queue Feb 28, 2024

Merged via the queue into paritytech:master with commit 4261366 Feb 28, 2024

s0me0ne-unkn0wn mentioned this pull request Mar 28, 2024

PVF: Incorporate wasmtime version in worker version checks #2742

Closed
