go/storage/committee: Optimize state sync by martintomazic · Pull Request #6242 · oasisprotocol/oasis-core

martintomazic · 2025-06-29T22:18:22Z

WIP

TODO:

Benchmark the new performance.
Make ready for review: minor refactors/simplifications + fix docs.

Motivation:
TODO

Benchmarks:
Preliminary benchmarks of master:

Finalization is around 2-5 times faster than application.
Fetching is also around 2-5 times faster than syncing on my local machine with very average Wi-Fi.
- I will repeat the benchmarks in the cloud.
Applying and finalizing are an order of magnitude slower when syncing many blocks than when syncing with an already synced worker.
- This happens when checkpoint sync is disabled or just after it. This makes sense, as badger has more to clean and load.

I think the behavior we observed in #6238 was due to very fast fetchers or, likely, badger doing compaction, which caused the number of rounds pending finalization to go above 5k (fixed by capping the maximum number of non-finalized rounds).
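One way to picture such a cap is a counting semaphore that fetchers must acquire before starting a round and that is released once the round is finalized. The following is only a minimal sketch under that assumption; `pendingLimiter`, `maxPending`, and `demoLimiter` are hypothetical names and do not come from this PR.

```go
package main

import (
	"context"
	"fmt"
)

// pendingLimiter is a hypothetical helper (not taken from oasis-core) that
// bounds how many rounds may be in progress but not yet finalized.
type pendingLimiter struct {
	slots chan struct{}
}

func newPendingLimiter(maxPending int) *pendingLimiter {
	return &pendingLimiter{slots: make(chan struct{}, maxPending)}
}

// acquire blocks until a slot is free or ctx is canceled.
func (l *pendingLimiter) acquire(ctx context.Context) error {
	select {
	case l.slots <- struct{}{}:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// release frees a slot; call it once after a round has been finalized.
func (l *pendingLimiter) release() {
	<-l.slots
}

// pending reports how many rounds currently hold a slot.
func (l *pendingLimiter) pending() int {
	return len(l.slots)
}

// demoLimiter fills the limiter to its cap and then shows that a further
// acquire does not proceed: it only returns once the context is canceled.
func demoLimiter() (int, error) {
	ctx, cancel := context.WithCancel(context.Background())
	limiter := newPendingLimiter(2)

	// Two rounds fit under the cap.
	_ = limiter.acquire(ctx)
	_ = limiter.acquire(ctx)
	pending := limiter.pending()

	// A third round would block until a finalization frees a slot. Here the
	// context is canceled instead, so acquire returns the context error.
	cancel()
	return pending, limiter.acquire(ctx)
}

func main() {
	pending, err := demoLimiter()
	fmt.Println("pending rounds:", pending)
	fmt.Println("third acquire:", err)
}
```

With a cap in place, fast fetchers simply wait instead of letting the backlog of non-finalized rounds grow without bound while the storage backend is busy.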

Next Steps

NIT: Consider refactoring further to get rid of the inFlight structs, or possibly simplify the fetcher worker, which is doing too much.

Fix existing issues you found in the code:

Unlikely issue of potentially registering availability too early (master).
- After syncing a checkpoint there could be many block headers in the infinite channels pending diff sync, but nudgeAvailability may already act on the first block.
Existing issue in master where the genesis checkpoint may not be checkpointed when syncing from it (it will, however, be checkpointed on the first restart of the node).

netlify · 2025-06-29T22:18:33Z

✅ Deploy Preview for oasisprotocol-oasis-core canceled.

Name	Link
🔨 Latest commit	`c7f230b`
🔍 Latest deploy log	https://app.netlify.com/projects/oasisprotocol-oasis-core/deploys/68a5b8018b623b0008242c90

codecov · 2025-08-07T23:55:25Z

Codecov Report

❌ Patch coverage is 79.02622% with 168 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.70%. Comparing base (d8d0e69) to head (2d6adba).

Files with missing lines	Patch %	Lines
go/worker/storage/statesync/worker.go	81.75%	40 Missing and 16 partials ⚠️
go/worker/storage/statesync/diff_sync.go	83.88%	37 Missing and 7 partials ⚠️
go/worker/storage/statesync/checkpointer.go	72.09%	27 Missing and 9 partials ⚠️
go/worker/storage/statesync/checkpoint_sync.go	59.52%	12 Missing and 5 partials ⚠️
go/worker/storage/statesync/prune.go	36.36%	13 Missing and 1 partial ⚠️
go/oasis-node/cmd/node/node_control.go	66.66%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #6242      +/-   ##
==========================================
+ Coverage   64.62%   64.70%   +0.08%     
==========================================
  Files         696      699       +3     
  Lines       67803    67767      -36     
==========================================
+ Hits        43817    43852      +35     
+ Misses      19018    18871     -147     
- Partials     4968     5044      +76


In addition, the state sync worker should return an error, and it should be the caller's responsibility to act accordingly; see, e.g., newer workers such as the stateless client. Note that the semantics changed slightly: previously, the storage worker would wait for all state sync workers to finish, whereas now it terminates as soon as the first one finishes. This is not 100% accurate, as previously a state sync worker could panic, which would shut down the whole node.
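The "terminate when the first one finishes" semantics can be illustrated with a small supervisor that collects worker results on a channel. This is a sketch only, using the standard library; `worker`, `runUntilFirstExit`, and `demoFirstExit` are hypothetical names, not the actual storage worker API.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// worker is a hypothetical stand-in for a per-runtime state sync worker that
// reports failure by returning an error instead of panicking.
type worker func(ctx context.Context) error

// runUntilFirstExit starts all workers and returns as soon as the first one
// finishes. The remaining workers are canceled but not waited for.
func runUntilFirstExit(ctx context.Context, workers []worker) error {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel()

	// Buffered so that workers finishing later never block on the send.
	results := make(chan error, len(workers))
	for _, w := range workers {
		go func(w worker) { results <- w(ctx) }(w)
	}

	select {
	case err := <-results:
		return err
	case <-ctx.Done():
		return ctx.Err()
	}
}

// demoFirstExit runs one failing worker next to one that only stops when its
// context is canceled, and returns what the supervisor reports.
func demoFirstExit() error {
	failing := func(ctx context.Context) error {
		return errors.New("sync failed")
	}
	longRunning := func(ctx context.Context) error {
		<-ctx.Done()
		return ctx.Err()
	}
	return runUntilFirstExit(context.Background(), []worker{longRunning, failing})
}

func main() {
	fmt.Println("supervisor returned:", demoFirstExit())
}
```

The caller decides what to do with the returned error, which mirrors the point above that acting on a failure is the caller's responsibility.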

The logic was preserved. Ideally, diff sync would accept only a context, a local storage backend, and a client/interface for fetching diffs, which would make it testable in isolation. Finally, the use of the undefined round should be moved out of it.
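To make the "testable in isolation" idea concrete, here is a hedged sketch of what such narrow dependencies could look like. The interfaces `diffFetcher` and `localStore`, the function `syncRounds`, and the in-memory fakes are all assumptions made for illustration; they are not the types used in oasis-core.

```go
package main

import (
	"context"
	"fmt"
)

// diffFetcher is a hypothetical interface for obtaining the diff of a round.
type diffFetcher interface {
	FetchDiff(ctx context.Context, round uint64) ([]byte, error)
}

// localStore is a hypothetical interface for the local storage backend.
type localStore interface {
	Apply(ctx context.Context, round uint64, diff []byte) error
}

// syncRounds fetches and applies diffs for rounds from..to (inclusive). It
// depends only on the context and the two interfaces above.
func syncRounds(ctx context.Context, store localStore, fetcher diffFetcher, from, to uint64) error {
	for round := from; round <= to; round++ {
		diff, err := fetcher.FetchDiff(ctx, round)
		if err != nil {
			return fmt.Errorf("fetch round %d: %w", round, err)
		}
		if err := store.Apply(ctx, round, diff); err != nil {
			return fmt.Errorf("apply round %d: %w", round, err)
		}
	}
	return nil
}

// fakeFetcher returns a one-byte diff derived from the round number.
type fakeFetcher struct{}

func (fakeFetcher) FetchDiff(_ context.Context, round uint64) ([]byte, error) {
	return []byte{byte(round)}, nil
}

// memStore records which rounds were applied, in order.
type memStore struct {
	applied []uint64
}

func (s *memStore) Apply(_ context.Context, round uint64, _ []byte) error {
	s.applied = append(s.applied, round)
	return nil
}

// demoSync runs syncRounds against the fakes and returns the applied rounds.
func demoSync() ([]uint64, error) {
	store := &memStore{}
	err := syncRounds(context.Background(), store, fakeFetcher{}, 1, 3)
	return store.applied, err
}

func main() {
	applied, err := demoSync()
	fmt.Println("applied rounds:", applied, "error:", err)
}
```

Because the fakes need no network or database, a unit test can exercise the sync loop without starting a node.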

When terminating because an error caused the main for loop to exit, or because the context was canceled, there is no point in waiting for the goroutines to finish fetching or cleaning up. As long as we cancel their context and use it properly in the select statements, this should be safe and better.
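The pattern of using the context "properly in the select statements" can be sketched as follows. This is an illustration under assumptions, not the PR's code: `sendDiff` and `demoCancel` are hypothetical names, and diffs are reduced to plain integers. The `wg.Wait()` call is there only to demonstrate that the goroutines are no longer stuck after cancellation; as argued above, waiting is not actually required.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// sendDiff delivers a fetched diff to the consumer unless the context is
// canceled first. It reports whether the diff was delivered.
func sendDiff(ctx context.Context, out chan<- int, diff int) bool {
	select {
	case out <- diff:
		return true
	case <-ctx.Done():
		return false
	}
}

// demoCancel starts three fetcher goroutines, consumes a single diff, and
// then cancels, as if the consumer had stopped because of an error.
func demoCancel() (delivered, dropped int) {
	ctx, cancel := context.WithCancel(context.Background())
	out := make(chan int) // Unbuffered: every send needs a reader.

	var (
		wg sync.WaitGroup
		mu sync.Mutex
	)
	for round := 1; round <= 3; round++ {
		wg.Add(1)
		go func(round int) {
			defer wg.Done()
			if !sendDiff(ctx, out, round) {
				mu.Lock()
				dropped++
				mu.Unlock()
			}
		}(round)
	}

	// The consumer reads one diff and then stops reading.
	<-out
	delivered = 1

	// Without the ctx.Done() case in sendDiff, the two remaining senders
	// would block forever and wg.Wait() would never return.
	cancel()
	wg.Wait()
	return delivered, dropped
}

func main() {
	delivered, dropped := demoCancel()
	fmt.Println("delivered:", delivered, "dropped:", dropped)
}
```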

martintomazic force-pushed the martin/feature/optimize-state-sync branch 9 times, most recently from 4fffe1c to 2d6adba Compare August 7, 2025 23:30

martintomazic force-pushed the martin/feature/optimize-state-sync branch 2 times, most recently from 7e8acf2 to f21200e Compare August 8, 2025 15:09

martintomazic mentioned this pull request Aug 19, 2025

go/worker/storage: Refactor state sync worker #6299

Draft

martintomazic added 13 commits August 19, 2025 14:31

go/worker/storage: Rename committee package to statesync

0b3d6e2

Also rename node to worker, to avoid confusion. Ideally, the parent package (storage) would have runtime as a prefix to make it clearer this is a runtime worker.

go/worker/storage/statesync: Move pruning to separate file

10b4705

go/worker/storage/statesync: Move checkpointert to separate file

88cf5f1

The logic was preserved; the only changes are that the context is passed explicitly and the worker for creating checkpoints was renamed.

go/worker/storage/statesync: Remove redundant context

d76c553

The timeout should probably be the client's responsibility.

go/worker/storage/statesync: Do not panic

06bafd1

Additionally, observe that the parent (storage worker) is registered as a background service; thus, upon an error inside the state sync worker, there is no need to manually request a node shutdown.

go/worker/storage/statesync: Move syncing methods at the bottom

c72ffb7

go/worker/storage/statesync: Refactor the code

170ffe8

The code was broken into smaller functions. Also, the scope of variables (including channels) has been reduced. Semantics as well as performance should stay the same.

go/worker/storage/statesync: Prevent deadlock when terminating

05faf63

Previously, if the worker returned an error, it would exit the main for loop and wait for the wait group to be emptied. However, the wait group can never empty in that case, as nothing is reading the fetched diffs anymore.

go/worker/storage/statesync: Make diffsync independent worker

9a14979

go/worker/storage/statesync: Decouple diff sync in 3 workers

c7f230b

martintomazic force-pushed the martin/feature/optimize-state-sync branch from f21200e to c7f230b Compare August 20, 2025 11:56

martintomazic changed the title ~~go/storage/committee: Optimize state sync POC~~ go/storage/committee: Optimize state sync Aug 20, 2025

go/storage/committee: Optimize state sync #6242
martintomazic wants to merge 13 commits into master from martin/feature/optimize-state-sync
