go/storage/committee: Optimize state sync#6242

Draft
martintomazic wants to merge 13 commits into master from
martin/feature/optimize-state-sync

Conversation

@martintomazic
Contributor

@martintomazic martintomazic commented Jun 29, 2025

Closes #6241

WIP

TODO:

  • Benchmark the new performance.
  • Make ready for review: minor refactors/simplifications + fix docs.

Motivation:
TODO

Benchmarks:
Preliminary benchmarks of master:

  1. Finalization is around 2-5 times faster than application.
  2. Fetching is also around 2-5 times faster than syncing on my local machine with very average Wi-Fi.
    • I will repeat the benchmarks in the cloud.
  3. Applying and finalizing are an order of magnitude slower when syncing many blocks, compared to syncing with an already synced worker.
    • This happens when checkpoint sync is disabled or just after it. Makes sense, as badger has more to clean/load.

I think the behavior we observed in #6238 was due to very fast fetchers or, likely, badger doing compaction, which caused the rounds pending finalization to go above 5k (fixed by capping the max number of non-finalized rounds).

Next Steps

  1. NIT: Consider refactoring further to get rid of the inFlight structs, or possibly simplify the fetcher worker, which is doing too much.

Fix existing issues you found in the code:

  • Unlikely issue of potentially registering availability too early (master).
    • After syncing a checkpoint there could be many block headers in the infinite channels pending diff sync, but nudgeAvailability may already act on the first block.
  • Existing issue in master where the genesis checkpoint may not be checkpointed if syncing from it (it will, however, be checkpointed on the first restart of the node).

@netlify

netlify bot commented Jun 29, 2025

Deploy Preview for oasisprotocol-oasis-core canceled.

Name Link
🔨 Latest commit c7f230b
🔍 Latest deploy log https://app.netlify.com/projects/oasisprotocol-oasis-core/deploys/68a5b8018b623b0008242c90

@martintomazic martintomazic force-pushed the martin/feature/optimize-state-sync branch 9 times, most recently from 4fffe1c to 2d6adba Compare August 7, 2025 23:30
@codecov

codecov bot commented Aug 7, 2025

Codecov Report

❌ Patch coverage is 79.02622% with 168 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.70%. Comparing base (d8d0e69) to head (2d6adba).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| go/worker/storage/statesync/worker.go | 81.75% | 40 Missing and 16 partials ⚠️ |
| go/worker/storage/statesync/diff_sync.go | 83.88% | 37 Missing and 7 partials ⚠️ |
| go/worker/storage/statesync/checkpointer.go | 72.09% | 27 Missing and 9 partials ⚠️ |
| go/worker/storage/statesync/checkpoint_sync.go | 59.52% | 12 Missing and 5 partials ⚠️ |
| go/worker/storage/statesync/prune.go | 36.36% | 13 Missing and 1 partial ⚠️ |
| go/oasis-node/cmd/node/node_control.go | 66.66% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6242      +/-   ##
==========================================
+ Coverage   64.62%   64.70%   +0.08%     
==========================================
  Files         696      699       +3     
  Lines       67803    67767      -36     
==========================================
+ Hits        43817    43852      +35     
+ Misses      19018    18871     -147     
- Partials     4968     5044      +76     

@martintomazic martintomazic force-pushed the martin/feature/optimize-state-sync branch 2 times, most recently from 7e8acf2 to f21200e Compare August 8, 2025 15:09
Also rename node to worker to avoid confusion.

Ideally, the parent package (storage) would have runtime
as a prefix to make it clearer that this is a runtime worker.

The logic was preserved; the only changes are that the context
is passed explicitly and that the worker for creating checkpoints
was renamed.

In addition, the state sync worker should return an error, and it
should be the caller's responsibility to act accordingly. See e.g.
new workers such as the stateless client.

Note that the semantics changed slightly: previously, the storage
worker would wait for all state sync workers to finish. Now it will
terminate when the first one finishes. Notice that this is not
100% true, as previously a state sync worker could panic (which
would in that case shut down the whole node).

Probably the timeout should be the client's responsibility.

Additionally, observe that the parent (storage worker) is
registered as a background service, thus upon an error inside the
state sync worker there is no need to manually request the node
shutdown.
The code was broken into smaller functions. Also, the
scope of variables (including channels) has been reduced.

Semantics as well as performance should stay the same.

The logic was preserved. Ideally, diff sync would only accept
a context, a local storage backend, and a client/interface to fetch
diffs. This would make it testable in isolation.

Finally, the use of the undefined round should be moved out of it.
Previously, if the worker returned an error, it would exit the main
for loop and wait for the waitgroup to be emptied. However,
this is not possible, as there is no one reading
the fetched diffs.

In case of termination due to an error exiting the main for loop or
a canceled context, there is no point in waiting for goroutines
to finish fetching/doing the cleanup. As long as we cancel the
context for them and use it properly in the select statements,
this should be safe and better.
@martintomazic martintomazic force-pushed the martin/feature/optimize-state-sync branch from f21200e to c7f230b Compare August 20, 2025 11:56
@martintomazic martintomazic changed the title go/storage/committee: Optimize state sync POC go/storage/committee: Optimize state sync Aug 20, 2025
Development

Successfully merging this pull request may close these issues.

Optimize runtime iterative state sync

1 participant