go/storage/committee: Optimize state sync #6242
Draft: martintomazic wants to merge 13 commits into master
Conversation
martintomazic commented on Jun 29, 2025
Force-pushed from 4fffe1c to 2d6adba.
Codecov Report

❌ Patch coverage is

@@            Coverage Diff             @@
##           master    #6242      +/-   ##
==========================================
+ Coverage   64.62%   64.70%    +0.08%
==========================================
  Files         696      699        +3
  Lines       67803    67767       -36
==========================================
+ Hits        43817    43852       +35
+ Misses      19018    18871      -147
- Partials     4968     5044       +76
Force-pushed from 7e8acf2 to f21200e.
Also rename `node` to `worker` to avoid confusion. Ideally, the parent package (storage) would have `runtime` as a prefix to make it clearer that this is a runtime worker.
The logic was preserved; the only changes are that the context is now passed explicitly and the worker for creating checkpoints was renamed.

In addition, the state sync worker should return an error, and it should be the caller's responsibility to act accordingly (see e.g. new workers such as the stateless client). Note that the semantics changed slightly: previously the storage worker would wait for all state sync workers to finish, whereas now it terminates as soon as the first one finishes. Strictly speaking, the old behavior was not guaranteed either, since a state sync worker could panic, which would shut down the whole node.

The timeout should probably be the client's responsibility.

Additionally, observe that the parent (the storage worker) is registered as a background service, so upon an error inside the state sync worker there is no need to manually request a node shutdown.
The code was broken into smaller functions, and the scope of variables (including channels) has been reduced. Semantics as well as performance should stay the same.
The logic was preserved. Ideally, diff sync would accept only a context, the local storage backend, and a client/interface for fetching diffs; this would make it testable in isolation. Finally, the use of the undefined round should be moved out of it.
Previously, if the worker returned an error, it would exit the main for loop and wait for the wait group to be emptied. However, this could never complete, as no one was left reading the fetched diffs.

In case of termination, whether due to an error exiting the main for loop or a canceled context, there is no point in waiting for goroutines to finish fetching or doing cleanup. As long as we cancel their context and use it properly in the select statements, this is safe and better.
Force-pushed from f21200e to c7f230b.
Closes #6241
WIP
TODO:
Motivation:
TODO
Benchmarks:
Preliminary benchmarks of master: I think the behavior we observed in #6238 was due to very fast fetchers, or likely badger doing compaction, which caused the rounds pending finalization to go above 5k (fixed by capping the maximum number of non-finalized rounds).
Next steps:
- Clean up the `inFlight` structs, or possibly simplify the fetcher worker, which is doing too much.
- Fix existing issues found in the code:
  - (master) `nudgeAvailability` may act on the first block already.