This repository was archived by the owner on Nov 15, 2023. It is now read-only.
Merged
Commits
29 commits
866a690
PVF: Remove `rayon` and some uses of `tokio`
mrcnski May 1, 2023
aff4779
PVF: Vote invalid on panics in execution thread (after a retry)
mrcnski May 1, 2023
fc481bc
Merge branch 'mrcnski/pvf-remove-rayon-tokio' into mrcnski/pvf-catch-…
mrcnski May 1, 2023
b03e1b0
Address a couple of TODOs
mrcnski May 1, 2023
0fcedbe
Add some documentation to implementer's guide
mrcnski May 1, 2023
c7e44d1
Fix compile error
mrcnski May 1, 2023
dc0e9c9
Fix compile errors
mrcnski May 1, 2023
67994f5
Fix compile error
mrcnski May 1, 2023
181e89f
Update roadmap/implementers-guide/src/node/utility/candidate-validati…
mrcnski May 2, 2023
df22727
Address comments + couple other changes (see message)
mrcnski May 2, 2023
5223995
Implement proper thread synchronization
mrcnski May 2, 2023
d4eb740
Catch panics in threads so we always notify condvar
mrcnski May 2, 2023
850d2c0
Use `WaitOutcome` enum instead of bool condition variable
mrcnski May 2, 2023
5885222
Merge branch 'mrcnski/pvf-remove-rayon-tokio' into mrcnski/pvf-catch-…
mrcnski May 2, 2023
4f8f1cc
Fix retry timeouts to depend on exec timeout kind
mrcnski May 2, 2023
87437b8
Merge remote-tracking branch 'origin/mrcnski/pvf-catch-panics-in-exec…
mrcnski May 2, 2023
f4c0b4b
Address review comments
mrcnski May 2, 2023
883952c
Make the API for condvars in workers nicer
mrcnski May 2, 2023
cc89ab8
Add a doc
mrcnski May 3, 2023
1967ddb
Use condvar for memory stats thread
mrcnski May 3, 2023
6e7a13c
Small refactor
mrcnski May 3, 2023
566a438
Merge branch 'mrcnski/pvf-remove-rayon-tokio' into mrcnski/pvf-catch-…
mrcnski May 3, 2023
660dadd
Enumerate internal validation errors in an enum
mrcnski May 3, 2023
187528c
Fix comment
mrcnski May 3, 2023
6d7cbf0
Add a log
mrcnski May 3, 2023
a2d0c72
Fix test
mrcnski May 3, 2023
6c54e42
Update variant naming
mrcnski May 4, 2023
7da8cdf
Merge branch 'master' into mrcnski/pvf-catch-panics-in-execution
mrcnski May 16, 2023
da7d191
Address a missed TODO
mrcnski May 16, 2023
9 changes: 9 additions & 0 deletions node/core/candidate-validation/src/lib.rs
@@ -636,6 +636,8 @@ where
Ok(ValidationResult::Invalid(InvalidCandidate::ExecutionError(
"ambiguous worker death".to_string(),
))),
Err(ValidationError::InvalidCandidate(WasmInvalidCandidate::Panic(err))) =>
Ok(ValidationResult::Invalid(InvalidCandidate::ExecutionError(err))),
Err(ValidationError::InvalidCandidate(WasmInvalidCandidate::PrepareError(e))) => {
// In principle if preparation of the `WASM` fails, the current candidate can not be the
// reason for that. So we can't say whether it is invalid or not. In addition, with
@@ -709,13 +711,20 @@ trait ValidationBackend {
self.validate_candidate(pvf.clone(), exec_timeout, params.encode()).await;

// Allow limited retries for each kind of error.
//
// TODO: Should we stop retrying after some time has passed?
// We would need to know if we came from backing or approval.
Contributor Author

@s0me0ne-unkn0wn IIRC you've looked into this, WDYT?

Contributor

Why would a panic resolve itself after a retry? Maybe in case of memory pressure -> OOM? But still, if we retry and wait too much time, malicious validators might have enough time to vote valid and the including block will be finalized - we currently have a two-relay-chain-block delay on finality, which should help, but we'd have to be careful here.

Contributor

You mean that in the case of backing you want the strict timeout to be applied to both retries cumulatively, not to a single one? Looks like a valid idea. There's no dedicated field yet to tell which subsystem the request is coming from, but you can check the execution timeout kind in the incoming CandidateValidationMessage::ValidateFromExhaustive or CandidateValidationMessage::ValidateFromChainState; it will be either PvfExecTimeoutKind::Backing or PvfExecTimeoutKind::Approval, so you can tell backing from the other stuff. Sounds a bit hacky, probably, but it will work.
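The suggestion above can be sketched as follows. This is illustrative only: `PvfExecTimeoutKind` mirrors the real enum, but the budget values and the `retry_budget` helper are made up for the example, not taken from the PR.

```rust
// Infer "backing vs. approval" from the execution timeout kind, since the
// validation request carries no dedicated origin field. Budget values are
// illustrative, not the PR's actual numbers.
use std::time::Duration;

#[derive(Clone, Copy, Debug, PartialEq)]
enum PvfExecTimeoutKind {
    Backing,
    Approval,
}

/// Pick a cumulative retry budget based on where the request came from.
fn retry_budget(kind: PvfExecTimeoutKind) -> Duration {
    match kind {
        // Backing is latency-sensitive: keep retries on a short leash.
        PvfExecTimeoutKind::Backing => Duration::from_secs(1),
        // Approval can tolerate a longer wait before giving up on retries.
        PvfExecTimeoutKind::Approval => Duration::from_secs(6),
    }
}

fn main() {
    assert!(retry_budget(PvfExecTimeoutKind::Backing)
        < retry_budget(PvfExecTimeoutKind::Approval));
}
```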

Contributor Author

> Why would a panic resolve itself after a retry? Maybe in case of memory pressure -> OOM?

Maybe, or also just a hardware glitch, a spurious OS glitch, etc. We can't really tell why a panic happened, or whether it was an issue with the candidate or not.[1] Retrying lets us be more sure about our vote.

[1]: Well, we do get a stringified error, but we can't match on it or anything.

Contributor Author

Thanks @s0me0ne-unkn0wn! Pushed some changes.

Member

> Why would a panic resolve itself after a retry? Maybe in case of memory pressure -> OOM? But still, if we retry and wait too much time, malicious validators might have enough time to vote valid and the including block will be finalized - we currently have a two-relay-chain-block delay on finality, which should help, but we'd have to be careful here.

No, this should not be possible. If it takes too long, we no-show and more validators will cover for us.

Contributor

But still, if those new guys also retry, they will also no-show, and malicious votes come in?

Member

They will then be replaced by even more checkers. One no-show will be replaced by more than one new checker, which will again be 2/3 honest. Anyhow, this perfect DoS of all honest nodes is certainly the absolute worst case, so yes, we have to be careful about when not to raise a dispute. In general, if in doubt, raise one.

For hardware faults being the cause, I think we should look into having error checking. E.g. we already suggest/require ECC memory - we should have something similar for disk/db. The node should be able to detect a corrupted db and not just read garbage data and then dispute stuff.

Contributor

CPU hardware faults are extremely rare (I'm closely familiar with a single case, though a really interesting one), and memory faults are not so rare (I came across two cases in my life) but still not something that should be seriously taken into account (that's something that could be handled by governance reverting the slashes on a case-by-case basis). So do I understand correctly that we are left with a single really concerning case, OOM? Can we tell OOM apart from the other cases? AFAIR Linux's oom_kill_task() sends SIGKILL to the process instead of SIGSEGV; could we use that to detect the condition?
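The SIGKILL-detection idea can be sketched with the standard library's Unix exit-status extension. Note the hedge: a SIGKILL'd child does not prove the OOM killer fired (anyone can `kill -9`), so this is only a heuristic, and the helper name is invented for the example.

```rust
// Detect whether a worker child died from SIGKILL, which is what the
// Linux OOM killer delivers. Only a heuristic: SIGKILL can come from
// other sources too.
use std::os::unix::process::ExitStatusExt;
use std::process::{Command, ExitStatus};

const SIGKILL: i32 = 9;

/// Returns true if the child was terminated by SIGKILL (possible OOM kill).
fn was_sigkilled(status: ExitStatus) -> bool {
    // signal() is Some(sig) only when the process died from a signal.
    status.signal() == Some(SIGKILL)
}

fn main() {
    let mut child = Command::new("sleep").arg("10").spawn().expect("spawn sleep");
    // `Child::kill` delivers SIGKILL on Unix, simulating an OOM kill.
    child.kill().expect("kill child");
    let status = child.wait().expect("wait for child");
    assert!(was_sigkilled(status));
}
```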

Contributor Author

> Can we tell the OOM from the other cases?

Someone (I believe @burdges?) had the idea of setting a flag before each allocation and unsetting it after; this would be easy to do if we had the allocator wrapper that's been discussed. And I agree that we should focus on OOM, since we've seen it happen in the wild.
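The "flag around allocations" idea could look roughly like the following. This is entirely illustrative: the PR only discusses such an allocator wrapper, and all names here are invented.

```rust
// A global-allocator wrapper that records when an allocation is in
// flight, so a crash/panic handler could distinguish an allocation-time
// failure (likely memory pressure) from other faults.
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicBool, Ordering};

/// Set while an allocation is in progress; a fault handler could read it.
static ALLOCATING: AtomicBool = AtomicBool::new(false);

struct FlaggingAllocator;

unsafe impl GlobalAlloc for FlaggingAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCATING.store(true, Ordering::SeqCst);
        let ptr = System.alloc(layout);
        ALLOCATING.store(false, Ordering::SeqCst);
        ptr
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        System.dealloc(ptr, layout)
    }
}

#[global_allocator]
static ALLOC: FlaggingAllocator = FlaggingAllocator;

fn main() {
    let v = vec![42u8; 1024]; // goes through FlaggingAllocator
    assert!(!ALLOCATING.load(Ordering::SeqCst)); // flag cleared after alloc
    assert_eq!(v.len(), 1024);
}
```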

let mut num_internal_retries_left = 1;
let mut num_awd_retries_left = 1;
let mut num_panic_retries_left = 1;
loop {
match validation_result {
Err(ValidationError::InvalidCandidate(
WasmInvalidCandidate::AmbiguousWorkerDeath,
)) if num_awd_retries_left > 0 => num_awd_retries_left -= 1,
Err(ValidationError::InvalidCandidate(WasmInvalidCandidate::Panic(_)))
if num_panic_retries_left > 0 =>
num_panic_retries_left -= 1,
Err(ValidationError::InternalError(_)) if num_internal_retries_left > 0 =>
num_internal_retries_left -= 1,
_ => break,
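The per-error-kind retry budget from the hunk above can be exercised standalone. This sketch mirrors the diff's counters loosely; the error enum is trimmed and the attempt closure is a mock, not the real validation backend.

```rust
// Each error class gets one retry; whichever error survives its budget is
// returned. Mirrors the retry loop in the candidate-validation diff.
#[derive(Debug, Clone, PartialEq)]
enum ValidationError {
    AmbiguousWorkerDeath,
    Panic(String),
    Internal(String),
}

fn validate_with_retries(
    mut attempt: impl FnMut() -> Result<(), ValidationError>,
) -> Result<(), ValidationError> {
    // One retry per error kind, matching the counters in the diff.
    let (mut awd_left, mut panic_left, mut internal_left) = (1u32, 1u32, 1u32);
    loop {
        let result = attempt();
        match &result {
            Err(ValidationError::AmbiguousWorkerDeath) if awd_left > 0 => awd_left -= 1,
            Err(ValidationError::Panic(_)) if panic_left > 0 => panic_left -= 1,
            Err(ValidationError::Internal(_)) if internal_left > 0 => internal_left -= 1,
            _ => return result,
        }
    }
}

fn main() {
    // Same sequence as the `candidate_validation_retry_panic_errors` test:
    // a panic, then a worker death, then a second panic exhausts the budget.
    let mut results = vec![
        Err(ValidationError::Panic("foo".into())),
        Err(ValidationError::AmbiguousWorkerDeath),
        Err(ValidationError::Panic("bar".into())),
    ]
    .into_iter();
    let outcome = validate_with_retries(|| results.next().unwrap());
    assert_eq!(outcome, Err(ValidationError::Panic("bar".into())));
}
```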
57 changes: 57 additions & 0 deletions node/core/candidate-validation/src/tests.rs
@@ -540,6 +540,7 @@ fn candidate_validation_bad_return_is_invalid() {
assert_matches!(v, Ok(ValidationResult::Invalid(InvalidCandidate::Timeout)));
}

// Test that we vote valid if we get `AmbiguousWorkerDeath`, retry, and then succeed.
#[test]
fn candidate_validation_one_ambiguous_error_is_valid() {
let validation_data = PersistedValidationData { max_pov_size: 1024, ..Default::default() };
@@ -728,6 +729,62 @@ fn candidate_validation_retry_internal_errors() {
assert_matches!(v, Err(ValidationFailed(s)) if s == "bar".to_string());
}

// Test that we retry on panic errors.
#[test]
fn candidate_validation_retry_panic_errors() {
let validation_data = PersistedValidationData { max_pov_size: 1024, ..Default::default() };

let pov = PoV { block_data: BlockData(vec![1; 32]) };
let validation_code = ValidationCode(vec![2; 16]);

let descriptor = make_valid_candidate_descriptor(
ParaId::from(1_u32),
dummy_hash(),
validation_data.hash(),
pov.hash(),
validation_code.hash(),
dummy_hash(),
dummy_hash(),
Sr25519Keyring::Alice,
);

let check = perform_basic_checks(
&descriptor,
validation_data.max_pov_size,
&pov,
&validation_code.hash(),
);
assert!(check.is_ok());

let candidate_receipt = CandidateReceipt { descriptor, commitments_hash: Hash::zero() };

let pool = TaskExecutor::new();
let (mut ctx, ctx_handle) =
test_helpers::make_subsystem_context::<AllMessages, _>(pool.clone());
let metrics = Metrics::default();

let v = test_with_executor_params(ctx_handle, || {
validate_candidate_exhaustive(
ctx.sender(),
MockValidateCandidateBackend::with_hardcoded_result_list(vec![
Err(ValidationError::InvalidCandidate(WasmInvalidCandidate::Panic("foo".into()))),
// Throw an AWD error, we should still retry again.
Err(ValidationError::InvalidCandidate(WasmInvalidCandidate::AmbiguousWorkerDeath)),
// Throw another panic error.
Err(ValidationError::InvalidCandidate(WasmInvalidCandidate::Panic("bar".into()))),
]),
validation_data,
validation_code,
candidate_receipt,
Arc::new(pov),
PvfExecTimeoutKind::Backing,
&metrics,
)
});

assert_matches!(v, Err(ValidationFailed(s)) if s == "bar".to_string());
}

#[test]
fn candidate_validation_timeout_is_internal_error() {
let validation_data = PersistedValidationData { max_pov_size: 1024, ..Default::default() };
7 changes: 6 additions & 1 deletion node/core/pvf/src/error.rs
@@ -103,7 +103,7 @@ pub enum InvalidCandidate {
/// an `rlimit` (if set) or, again, invited OOM killer. Another possibility is a bug in
/// wasmtime allowed the PVF to gain control over the execution worker.
///
/// We attribute such an event to an invalid candidate in either case.
/// We attribute such an event to an *invalid candidate* in either case.
///
/// The rationale for this is that a glitch may lead to unfair rejecting candidate by a single
/// validator. If the glitch is somewhat more persistent the validator will reject all candidate
@@ -113,6 +113,11 @@
AmbiguousWorkerDeath,
/// PVF execution (compilation is not included) took more time than was allotted.
HardTimeout,
/// A panic occurred and we can't be sure whether the candidate is really invalid or some internal glitch occurred.
/// Whenever we are unsure, we can never treat an error as internal as we would abstain from voting. This is bad
/// because if the issue was due to the candidate, then all validators would abstain, stalling finality on the
/// chain. So we will first retry the candidate, and if the issue persists we are forced to vote invalid.
Panic(String),
}

impl From<PrepareError> for ValidationError {
8 changes: 5 additions & 3 deletions node/core/pvf/src/execute/queue.rs
@@ -334,15 +334,17 @@ fn handle_job_finish(
Err(ValidationError::InvalidCandidate(InvalidCandidate::WorkerReportedError(err))),
None,
),
Outcome::InternalError { err, idle_worker } =>
(Some(idle_worker), Err(ValidationError::InternalError(err)), None),
Outcome::InternalError { err } => (None, Err(ValidationError::InternalError(err)), None),
Outcome::HardTimeout =>
(None, Err(ValidationError::InvalidCandidate(InvalidCandidate::HardTimeout)), None),
// "Maybe invalid" errors (will retry).
Outcome::IoErr => (
None,
Err(ValidationError::InvalidCandidate(InvalidCandidate::AmbiguousWorkerDeath)),
None,
),
Outcome::Panic { err } =>
(None, Err(ValidationError::InvalidCandidate(InvalidCandidate::Panic(err))), None),
};

queue.metrics.execute_finished();
@@ -356,7 +358,7 @@ fn handle_job_finish(
err
);
} else {
gum::debug!(
gum::trace!(
target: LOG_TARGET,
?artifact_id,
?worker,
21 changes: 16 additions & 5 deletions node/core/pvf/src/execute/worker_intf.rs
@@ -64,6 +64,8 @@ pub async fn spawn(
}

/// Outcome of PVF execution.
///
/// If the idle worker token is not returned, it means the worker must be terminated.
pub enum Outcome {
/// PVF execution completed successfully and the result is returned. The worker is ready for
/// another job.
@@ -73,18 +75,23 @@ pub enum Outcome {
InvalidCandidate { err: String, idle_worker: IdleWorker },
/// An internal error happened during the validation. Such an error is most likely related to
/// some transient glitch.
InternalError { err: String, idle_worker: IdleWorker },
///
/// Should only ever be used for errors independent of the candidate. Therefore it may be a problem with the worker,
/// so we terminate it.
InternalError { err: String },
/// The execution time exceeded the hard limit. The worker is terminated.
HardTimeout,
/// An I/O error happened during communication with the worker. This may mean that the worker
/// process already died. The token is not returned in any case.
IoErr,
/// An unexpected panic has occurred in the execution worker.
Panic { err: String },
}

/// Given the idle token of a worker and parameters of work, communicates with the worker and
/// returns the outcome.
///
/// NOTE: Returning the `HardTimeout` or `IoErr` errors will trigger the child process being killed.
/// NOTE: Not returning the idle worker token in `Outcome` will trigger the child process being killed.
pub async fn start_work(
worker: IdleWorker,
artifact: ArtifactPathId,
@@ -171,8 +178,8 @@ pub async fn start_work(
Response::InvalidCandidate(err) =>
Outcome::InvalidCandidate { err, idle_worker: IdleWorker { stream, pid } },
Response::TimedOut => Outcome::HardTimeout,
Response::InternalError(err) =>
Outcome::InternalError { err, idle_worker: IdleWorker { stream, pid } },
Response::Panic(err) => Outcome::Panic { err },
Response::InternalError(err) => Outcome::InternalError { err },
}
}

@@ -223,7 +230,11 @@ pub enum Response {
InvalidCandidate(String),
/// The job timed out.
TimedOut,
/// Some internal error occurred. Should only be used for errors independent of the candidate.
/// An unexpected panic has occurred in the execution worker.
Panic(String),
/// Some internal error occurred.
///
/// Should only ever be used for errors independent of the candidate.
InternalError(String),
}

26 changes: 18 additions & 8 deletions node/core/pvf/worker/src/common.rs
@@ -18,15 +18,12 @@ use crate::LOG_TARGET;
use cpu_time::ProcessTime;
use futures::never::Never;
use std::{
any::Any,
path::PathBuf,
sync::mpsc::{Receiver, RecvTimeoutError},
time::Duration,
};
use tokio::{
io,
net::UnixStream,
runtime::{Handle, Runtime},
};
use tokio::{io, net::UnixStream, runtime::Runtime};

/// Some allowed overhead that we account for in the "CPU time monitor" thread's sleeps, on the
/// child process.
@@ -44,7 +41,7 @@ pub fn worker_event_loop<F, Fut>(
node_version: Option<&str>,
mut event_loop: F,
) where
F: FnMut(Handle, UnixStream) -> Fut,
F: FnMut(UnixStream) -> Fut,
Fut: futures::Future<Output = io::Result<Never>>,
{
let worker_pid = std::process::id();
@@ -68,13 +65,12 @@

// Run the main worker loop.
let rt = Runtime::new().expect("Creates tokio runtime. If this panics the worker will die and the host will detect that and deal with it.");
let handle = rt.handle();
let err = rt
.block_on(async move {
let stream = UnixStream::connect(socket_path).await?;
let _ = tokio::fs::remove_file(socket_path).await;

let result = event_loop(handle.clone(), stream).await;
let result = event_loop(stream).await;

result
})
@@ -124,6 +120,20 @@ pub fn cpu_time_monitor_loop(
}
}

/// Attempt to convert an opaque panic payload to a string.
///
/// This is a best effort, and is not guaranteed to provide the most accurate value.
pub fn stringify_panic_payload(payload: Box<dyn Any + Send + 'static>) -> String {
match payload.downcast::<&'static str>() {
Ok(msg) => msg.to_string(),
Err(payload) => match payload.downcast::<String>() {
Ok(msg) => *msg,
// At least we tried...
Err(_) => "unknown panic payload".to_string(),
},
}
}

/// In case of node and worker version mismatch (as a result of in-place upgrade), send `SIGTERM`
/// to the node to tear it down and prevent it from raising disputes on valid candidates. Node
/// restart should be handled by the node owner. As node exits, unix sockets opened to workers
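A usage sketch for the `stringify_panic_payload` helper added in the common.rs diff: paired with `std::panic::catch_unwind`, worker code can turn an opaque panic payload into a reportable string. The helper body is copied from the diff so the example is self-contained; the `main` driver is illustrative.

```rust
// Turn a caught panic payload into a human-readable string, handling the
// two payload types `panic!` actually produces.
use std::any::Any;
use std::panic;

fn stringify_panic_payload(payload: Box<dyn Any + Send + 'static>) -> String {
    match payload.downcast::<&'static str>() {
        Ok(msg) => msg.to_string(),
        Err(payload) => match payload.downcast::<String>() {
            Ok(msg) => *msg,
            // At least we tried...
            Err(_) => "unknown panic payload".to_string(),
        },
    }
}

fn main() {
    // A literal panic message arrives as `&'static str`...
    let err = panic::catch_unwind(|| panic!("boom")).unwrap_err();
    assert_eq!(stringify_panic_payload(err), "boom");

    // ...while a formatted one arrives as `String`.
    let err = panic::catch_unwind(|| panic!("code {}", 42)).unwrap_err();
    assert_eq!(stringify_panic_payload(err), "code 42");
}
```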