This repository was archived by the owner on Nov 15, 2023. It is now read-only.
PVF: Vote invalid on panics in execution thread (after a retry) #7155
Merged
paritytech-processbot merged 29 commits into master from mrcnski/pvf-catch-panics-in-execution on May 16, 2023
29 commits
- `866a690` PVF: Remove `rayon` and some uses of `tokio`
- `aff4779` PVF: Vote invalid on panics in execution thread (after a retry)
- `fc481bc` Merge branch 'mrcnski/pvf-remove-rayon-tokio' into mrcnski/pvf-catch-…
- `b03e1b0` Address a couple of TODOs
- `0fcedbe` Add some documentation to implementer's guide
- `c7e44d1` Fix compile error
- `dc0e9c9` Fix compile errors
- `67994f5` Fix compile error
- `181e89f` Update roadmap/implementers-guide/src/node/utility/candidate-validati…
- `df22727` Address comments + couple other changes (see message)
- `5223995` Implement proper thread synchronization
- `d4eb740` Catch panics in threads so we always notify condvar
- `850d2c0` Use `WaitOutcome` enum instead of bool condition variable
- `5885222` Merge branch 'mrcnski/pvf-remove-rayon-tokio' into mrcnski/pvf-catch-…
- `4f8f1cc` Fix retry timeouts to depend on exec timeout kind
- `87437b8` Merge remote-tracking branch 'origin/mrcnski/pvf-catch-panics-in-exec…
- `f4c0b4b` Address review comments
- `883952c` Make the API for condvars in workers nicer
- `cc89ab8` Add a doc
- `1967ddb` Use condvar for memory stats thread
- `6e7a13c` Small refactor
- `566a438` Merge branch 'mrcnski/pvf-remove-rayon-tokio' into mrcnski/pvf-catch-…
- `660dadd` Enumerate internal validation errors in an enum
- `187528c` Fix comment
- `6d7cbf0` Add a log
- `a2d0c72` Fix test
- `6c54e42` Update variant naming
- `7da8cdf` Merge branch 'master' into mrcnski/pvf-catch-panics-in-execution
- `da7d191` Address a missed TODO

(all commits authored by mrcnski)
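A hedged sketch of the pattern named in the commits above ("Catch panics in threads so we always notify condvar" and "Use `WaitOutcome` enum instead of bool condition variable"): the worker thread always signals an outcome through a condvar, even when the job panics, so the waiting side can never block forever. The names and structure here are illustrative assumptions, not the PR's actual code.

```rust
use std::panic;
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// Illustrative stand-in for the PR's `WaitOutcome`: a three-state enum
/// replaces a bare bool so "not signaled yet" is distinguishable.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum WaitOutcome {
    Pending,
    Finished,
    Panicked,
}

/// Run `job` on a worker thread; `catch_unwind` ensures we reach the
/// `notify_all` even if the job panics, so the caller always wakes up.
fn run_job<F>(job: F) -> WaitOutcome
where
    F: FnOnce() + panic::UnwindSafe + Send + 'static,
{
    let pair = Arc::new((Mutex::new(WaitOutcome::Pending), Condvar::new()));
    let pair2 = Arc::clone(&pair);

    thread::spawn(move || {
        let result = panic::catch_unwind(job);
        let (lock, cvar) = &*pair2;
        *lock.lock().unwrap() = match result {
            Ok(()) => WaitOutcome::Finished,
            Err(_) => WaitOutcome::Panicked,
        };
        cvar.notify_all();
    });

    // Wait in a loop to guard against spurious wakeups.
    let (lock, cvar) = &*pair;
    let mut outcome = lock.lock().unwrap();
    while *outcome == WaitOutcome::Pending {
        outcome = cvar.wait(outcome).unwrap();
    }
    *outcome
}

fn main() {
    // Silence the default panic hook so the simulated panic stays quiet.
    panic::set_hook(Box::new(|_| {}));
    assert_eq!(run_job(|| ()), WaitOutcome::Finished);
    assert_eq!(run_job(|| panic!("job blew up")), WaitOutcome::Panicked);
    println!("both outcomes signaled: ok");
}
```

The key design point is that the panic is converted into a value (`WaitOutcome::Panicked`) rather than silently tearing down the thread, which is what lets the supervisor side decide whether to retry or vote invalid.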
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@s0me0ne-unkn0wn IIRC you've looked into this, WDYT?
Why would a panic resolve itself after a retry? Maybe in case of memory pressure -> OOM? But still, if we retry and wait too much time, malicious validators might have enough time to vote valid, and the including block will be finalized. We currently have a 2-relay-chain-block delay on finality, which should help, but we'd have to be careful here.
You mean in the case of backing you want the strict timeout to be applied to both retries cumulatively, not to a single one? Looks like a valid idea. There's no dedicated field to tell which subsystem the request is coming from yet, but you can check the execution timeout kind in the incoming `CandidateValidationMessage::ValidateFromExhaustive` or `CandidateValidationMessage::ValidateFromChainState`; it will be either `PvfExecTimeoutKind::Backing` or `PvfExecTimeoutKind::Approval`, so you can tell backing from the other stuff. Sounds a bit hacky probably, but it will work.
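The suggestion above could be sketched roughly as follows. `PvfExecTimeoutKind::Backing` / `Approval` are real variants named in the comment, but the enum definition, the `retry_delay` helper, and the concrete durations here are illustrative assumptions, not the values the PR actually uses.

```rust
use std::time::Duration;

/// Local mirror of the timeout-kind distinction described in the comment:
/// backing requests carry a strict timeout, approval requests a lenient one.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum PvfExecTimeoutKind {
    Backing,
    Approval,
}

/// Hypothetical helper: pick a retry delay from the timeout kind. Backing is
/// time-critical (the candidate must be backed within the slot), so it gets
/// a short retry; approval can afford to wait longer before retrying.
fn retry_delay(kind: PvfExecTimeoutKind) -> Duration {
    match kind {
        PvfExecTimeoutKind::Backing => Duration::from_millis(500),
        PvfExecTimeoutKind::Approval => Duration::from_secs(3),
    }
}

fn main() {
    assert!(retry_delay(PvfExecTimeoutKind::Backing)
        < retry_delay(PvfExecTimeoutKind::Approval));
    println!(
        "backing retry: {:?}, approval retry: {:?}",
        retry_delay(PvfExecTimeoutKind::Backing),
        retry_delay(PvfExecTimeoutKind::Approval)
    );
}
```

This matches the "hacky but works" framing: the timeout kind is being used as a proxy for the originating subsystem until a dedicated field exists.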
Maybe, or also just a hardware glitch, spurious OS glitch, etc. We can't really tell why a panic happened, or whether it was an issue with the candidate or not.[1] Retrying lets us be more sure about our vote.

Footnotes

1. Well, we do get a stringified error, but we can't match on it or anything. ↩
Thanks @s0me0ne-unkn0wn! Pushed some changes.
No, this should not be possible. If it takes too long, we no-show and more validators will cover for us.
But still, if those new guys also retry, they will also no-show, and malicious votes come in?
They will then be replaced by even more checkers: one no-show will be replaced by more than one new checker, and those will again be 2/3 honest. Anyhow, this perfect DoS of all honest nodes is certainly the absolute worst case, so yes, we have to be careful about when not to raise a dispute. In general, if in doubt, raise one.
For hardware faults being the cause, I think we should look into having error checking. E.g. we already suggest/require ECC memory; we should have something similar for disk/db. The node should be able to detect a corrupted db and not just read garbage data and then dispute stuff.
CPU hardware faults are extremely rare (I'm closely familiar with a single case, though a really interesting one), and memory faults are not so rare (I came across two cases in my life) but still not something that should be seriously taken into account (that's something that could be handled by governance reverting the slashes on a case-by-case basis). So do I understand correctly that we are left with a single really concerning case, OOM? Can we tell OOM apart from the other cases? AFAIR Linux's `oom_kill_task()` sends `SIGKILL` to the process instead of `SIGSEGV`; could we use that to detect the condition?
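The SIGKILL observation above can be checked from the supervising process: on Unix, a child's `ExitStatus` exposes the terminating signal, so a worker killed by the OOM killer (which sends `SIGKILL`) is distinguishable from one that segfaulted or exited on its own. This is a sketch of the detection idea only, an assumption rather than the PR's implementation; the `sh -c 'kill -KILL $$'` child merely simulates an OOM-killed worker.

```rust
use std::os::unix::process::ExitStatusExt; // Unix-only: exposes `signal()`
use std::process::{Command, ExitStatus};

/// True if the child was terminated by SIGKILL (signal 9), as the Linux
/// OOM killer does. Note SIGKILL can also come from other sources, so this
/// is evidence of OOM, not proof.
fn was_sigkilled(status: &ExitStatus) -> bool {
    status.signal() == Some(9)
}

fn main() {
    // Simulate an OOM-killed worker: the child SIGKILLs itself.
    let status = Command::new("sh")
        .arg("-c")
        .arg("kill -KILL $$")
        .status()
        .expect("failed to spawn worker");

    if was_sigkilled(&status) {
        println!("worker was SIGKILLed - likely the OOM killer");
    } else {
        println!("worker ended some other way: {:?}", status);
    }
}
```

A real deployment would still need to rule out other SIGKILL senders (e.g. an operator or a supervisor timeout) before concluding OOM.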
Someone (I believe @burdges?) had the idea of setting a flag before each allocation and unsetting it after; that would be easy to do if we had the allocator wrapper that's been discussed. And I agree that we should focus on OOM, since we've seen it happen in the wild.
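A minimal sketch of that flag-around-allocations idea, under the assumption that the allocator wrapper exists as a `GlobalAlloc` shim: the flag is set for the duration of every allocation, so a crash handler that finds it set can suspect the process died while allocating (an OOM suspect). All names here are illustrative; a real version would want a per-thread flag, since one global `AtomicBool` is racy across concurrently allocating threads.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicBool, Ordering};

/// Set while an allocation is in flight; inspected after a crash.
static IN_ALLOC: AtomicBool = AtomicBool::new(false);

/// Wrapper around the system allocator that toggles `IN_ALLOC` around
/// every allocation. Deallocation is passed through untouched.
struct FlaggingAllocator;

unsafe impl GlobalAlloc for FlaggingAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        IN_ALLOC.store(true, Ordering::SeqCst);
        let ptr = System.alloc(layout);
        IN_ALLOC.store(false, Ordering::SeqCst);
        ptr
    }
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        System.dealloc(ptr, layout)
    }
}

#[global_allocator]
static ALLOC: FlaggingAllocator = FlaggingAllocator;

fn main() {
    let v = vec![0u8; 4096]; // goes through FlaggingAllocator::alloc
    // Once the allocation returns, the flag is clear again; it would only
    // be observed set if we died mid-allocation.
    assert!(!IN_ALLOC.load(Ordering::SeqCst));
    println!("allocated {} bytes, flag clear", v.len());
}
```

The point of the flag is that it survives into a signal handler or crash dump: if the worker is killed while `IN_ALLOC` is set, OOM becomes the prime suspect without needing to parse a stringified error.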