Dynamic state snapshots #326

DarianShawn · 2023-03-20T12:25:03Z

Description

Mostly derived from geth PR 20152:

This PR creates a secondary data structure for storing the Dogechain state, called a snapshot. This snapshot is special as it dynamically follows the chain:

At the very bottom, the snapshot consists of a disk layer, which is essentially a semi-recent full flat dump of the account and storage contents. This is stored in LevelDB as a <hash> -> <account> mapping for the account trie and <account-hash><slot-hash> -> <slot-value> mapping for the storage tries. The layout permits fast iteration over the accounts and storage, which will be used for a new sync algorithm (not done yet).
Above the disk layer there is a tree of in-memory diff layers that each represent one block's worth of state mutations. Every time a new block is processed, it is linked on top of the existing diff tree, and the bottom layers are flattened together to keep the maximum tree depth reasonable. At the very bottom, the first diff layer acts as an accumulator which only gets flattened into the disk layer when it outgrows its memory allowance. This is done mostly to avoid thrashing LevelDB.
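
The layering described above can be illustrated with a minimal Go sketch. It is not the code from this PR: the layer interface, the diskLayer and diffLayer types, and the flatten helper are hypothetical names, plain maps stand in for LevelDB, and an empty value is assumed to mark a deleted account.

```go
package main

import "fmt"

// layer is a hypothetical interface for anything that can answer an
// account lookup by hash. The names are illustrative, not from the PR.
type layer interface {
	account(hash string) (value string, found bool)
}

// diskLayer stands in for the flat LevelDB dump: <hash> -> <account>.
type diskLayer struct {
	accounts map[string]string
}

func (d *diskLayer) account(hash string) (string, bool) {
	v, ok := d.accounts[hash]
	return v, ok
}

// diffLayer holds one block's worth of account mutations on top of a parent.
// Assumption: an empty value marks a deleted account.
type diffLayer struct {
	parent  layer
	changes map[string]string
}

// account checks this block's mutations first and falls through to the
// parent layer only when the block did not touch the account.
func (d *diffLayer) account(hash string) (string, bool) {
	if v, ok := d.changes[hash]; ok {
		return v, v != ""
	}
	return d.parent.account(hash)
}

// flatten writes the bottom diff layer's mutations into the disk layer.
// The caller must then re-parent the layer above onto the disk layer.
func flatten(bottom *diffLayer, disk *diskLayer) {
	for h, v := range bottom.changes {
		if v == "" {
			delete(disk.accounts, h)
		} else {
			disk.accounts[h] = v
		}
	}
}

func main() {
	disk := &diskLayer{accounts: map[string]string{"a": "balance=1", "b": "balance=2"}}
	block1 := &diffLayer{parent: disk, changes: map[string]string{"a": "balance=5"}}
	block2 := &diffLayer{parent: block1, changes: map[string]string{"b": ""}}

	v, ok := block2.account("a") // resolved by block1
	fmt.Println(v, ok)
	_, ok = block2.account("b") // deleted in block2
	fmt.Println(ok)

	// Flatten the bottom diff layer and re-link the tree.
	flatten(block1, disk)
	block2.parent = disk
	v, ok = block2.account("a") // now resolved by the disk layer
	fmt.Println(v, ok)
}
```

A lookup walks from the newest diff layer downwards, so the cost grows with the tree depth. That is why the bottom layers are flattened as new blocks arrive.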

The snapshot can be built fully online, during the live operation of a Dogechain node. This is harder than it seems because rebuilding the snapshot for mainnet takes days, and by then in-memory garbage collection has long since deleted the state needed for a single capture. So we will have to provide the first canonical, initialized snapshot to make the later steps simpler.

The PR achieves this by gradually iterating the state tries and maintaining a marker to the account/storage slot position until which the snapshot was already generated. Every time a new block is executed, state mutations prior to the marker get applied directly (the ones afterwards get discarded) and the snapshot builder switches to iterating the new root hash.
There shouldn't be any reorgs, but validators still need to accept a new block at the same block height. To achieve this, the builder operates on HEAD-128 and is capable of suspending/resuming if a state is missing (a restart will only write out some tries, not all cached in memory).
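
The marker rule described above can be sketched in Go as follows. This is a simplified illustration, not the PR's implementation: the generator type and its methods are hypothetical, a map stands in for the on-disk snapshot, and the marker is assumed to be the last key already written (inclusive), with a nil value meaning deletion.

```go
package main

import (
	"bytes"
	"fmt"
)

// generator is a hypothetical, simplified snapshot builder.
type generator struct {
	marker []byte            // last key already written; nil means nothing generated yet
	snap   map[string][]byte // the partially generated snapshot
}

// covered reports whether key lies in the already generated range.
func (g *generator) covered(key []byte) bool {
	return g.marker != nil && bytes.Compare(key, g.marker) <= 0
}

// advance records one entry read from the state trie and moves the marker.
// Keys are expected to arrive in ascending order.
func (g *generator) advance(key, value []byte) {
	g.snap[string(key)] = value
	g.marker = key
}

// applyBlock applies a block's mutations to the partial snapshot.
// Mutations inside the generated range are written directly. The rest are
// discarded, because the iterator will read them from the new root later.
func (g *generator) applyBlock(mutations map[string][]byte) (applied, discarded int) {
	for k, v := range mutations {
		if !g.covered([]byte(k)) {
			discarded++
			continue
		}
		if v == nil {
			delete(g.snap, k)
		} else {
			g.snap[k] = v
		}
		applied++
	}
	return applied, discarded
}

func main() {
	g := &generator{snap: map[string][]byte{}}
	g.advance([]byte("a"), []byte("1"))
	g.advance([]byte("b"), []byte("2")) // marker is now "b"

	applied, discarded := g.applyBlock(map[string][]byte{
		"a": []byte("9"), // before the marker: applied
		"c": []byte("3"), // after the marker: discarded
	})
	fmt.Println(applied, discarded, string(g.snap["a"])) // prints "1 1 9"
}
```

Persisting the marker together with the snapshot is what would let generation resume after a restart instead of starting over.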

The benefit of the snapshot is that it acts as an acceleration structure for state accesses:

Instead of doing O(log N) disk reads (+leveldb overhead) to access an account / storage slot, the snapshot can provide direct, O(1) access time. This should be a small improvement in block processing and a huge improvement in eth_call evaluations.
The snapshot supports account and storage iteration at O(1) complexity per entry + sequential disk access, which should enable remote nodes to retrieve state data significantly cheaper than before (the sort order is the state trie leaf order, so responses can directly be assembled into tries too).
The presence of the snapshot can also enable more exotic use cases such as deleting and rebuilding the entire state trie (guerrilla pruning) as well as building alternative state tries (e.g. binary vs. hexary), which might be needed in the future.
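
To make the direct lookup and ordered iteration concrete, here is a minimal Go sketch of a flat key layout. It is an assumption-laden illustration: the "a" and "o" prefixes, the accountKey and storageKey helpers, and the flatStore type are hypothetical, a sorted slice plus a map stands in for LevelDB, and one-byte hashes replace the real fixed-length hashes.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// Hypothetical key prefixes; the actual schema in the PR may differ.
var (
	accountPrefix = []byte("a")
	storagePrefix = []byte("o")
)

// accountKey builds <prefix><account-hash>.
func accountKey(accountHash []byte) []byte {
	return append(append([]byte{}, accountPrefix...), accountHash...)
}

// storageKey builds <prefix><account-hash><slot-hash>.
func storageKey(accountHash, slotHash []byte) []byte {
	k := append([]byte{}, storagePrefix...)
	k = append(k, accountHash...)
	return append(k, slotHash...)
}

// flatStore is an in-memory stand-in for an ordered key-value store.
type flatStore struct {
	keys [][]byte // kept sorted
	vals map[string][]byte
}

func newFlatStore() *flatStore {
	return &flatStore{vals: map[string][]byte{}}
}

// put inserts or overwrites an entry, keeping keys sorted.
func (s *flatStore) put(key, val []byte) {
	i := sort.Search(len(s.keys), func(i int) bool { return bytes.Compare(s.keys[i], key) >= 0 })
	if i == len(s.keys) || !bytes.Equal(s.keys[i], key) {
		s.keys = append(s.keys, nil)
		copy(s.keys[i+1:], s.keys[i:])
		s.keys[i] = key
	}
	s.vals[string(key)] = val
}

// get is a single direct lookup; no trie nodes are traversed.
func (s *flatStore) get(key []byte) ([]byte, bool) {
	v, ok := s.vals[string(key)]
	return v, ok
}

// iterate visits every entry whose key starts with prefix, in key order.
func (s *flatStore) iterate(prefix []byte, fn func(key, val []byte)) {
	i := sort.Search(len(s.keys), func(i int) bool { return bytes.Compare(s.keys[i], prefix) >= 0 })
	for ; i < len(s.keys) && bytes.HasPrefix(s.keys[i], prefix); i++ {
		fn(s.keys[i], s.vals[string(s.keys[i])])
	}
}

func main() {
	s := newFlatStore()
	acct1, acct2 := []byte{0x01}, []byte{0x02}
	s.put(accountKey(acct2), []byte("account-2"))
	s.put(accountKey(acct1), []byte("account-1"))
	s.put(storageKey(acct1, []byte{0xbb}), []byte("slot-bb"))
	s.put(storageKey(acct1, []byte{0xaa}), []byte("slot-aa"))
	s.put(storageKey(acct2, []byte{0xcc}), []byte("slot-cc"))

	v, _ := s.get(accountKey(acct1))
	fmt.Println(string(v))

	// All storage slots of account 1, in slot-hash order.
	s.iterate(storageKey(acct1, nil), func(_, val []byte) {
		fmt.Println(string(val))
	})
}
```

Because every hash has the same length, a prefix scan over the storage prefix plus an account hash returns exactly that account's slots, already sorted by slot hash.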

The downside of the snapshot is that the raw account and storage data is essentially duplicated. In the case of mainnet, this means an extra 8-12GB of SSD space used (an estimate; not measured yet).

Changes include

Bugfix (non-breaking change that solves an issue)
New feature (non-breaking change that adds functionality)

Testing

I have tested this code with the official test suite
I have tested this code manually

Manual tests

Backward compatibility

Start up a 4-validator network with 1 new-version node and 3 target-version nodes.
Send several transactions, including multicall contract transactions.

It works as expected, and block execution on the new version is slightly faster than on the target version.

Snapshot generation

Upgrade a devnet full node to the current version.
Try these methods:
- Enable the snapshot on startup with an already up-to-date database.
- Start up from a block recovery file.
- Begin syncing from the genesis block on startup.

All work as expected, and each method produces a snapshot of almost the same size. The fastest one is "using a block recovery file".

Snapshot regeneration

Upgrade a devnet validator so that it has an initialized snapshot.
Disable the snapshot feature (it is not enabled by default, and runs in synchronize mode by default) for 10-30 minutes.
Restart the node with the snapshot feature enabled.
Stop and restart the node several times during snapshot regeneration.

The regeneration works fine and resumes after a restart. It takes only minutes, compared with the first initialization.

Documentation update

Will update the CLI documentation once the version is bumped.

DarianShawn · 2023-04-07T09:29:15Z

The snapshot is only half done without the syncing protocol.
Syncing tests on mainnet also showed even worse performance.
So I'll just close this PR, since we're adopting another repo for the next version.

DarianShawn added 30 commits December 17, 2022 12:01

Move state account and object to stypes package

b877b39

More types and tests

f59fd56

Extract address hash key

b605bf0

More comments on kvdb interface

2623c99

Remove iterator not use methods

07b8f71

Refactor leveldb test cases

742fef3

Extract datadir path joiner

26b7d42

Fix e2e reverify test

3998922

Fix lint error

d8e9872

Basic framework of snapshot layers

e611671

Add io reader/writer rlp mashaler/unmarshaler

4c9cdfb

Use schema for state db entry prefix

b1c94c9

Minimize statedb transanction lock range

724dd0c

Fix rlp lint error

2c4f578

Fix rlp test crash

765c96b

Fix rlp test failure

4f537ca

refactor kvdb module with interface and different implementations

a7a1885

Fix rlp output test failure

7257547

iterator of snapshot

d6b4701

migrate difflayer logic from geth

56562b4

migrate difflayer tests from geth

9ee9430

Pass difflayer tests

6211046

panic when data was corrupted

a26ce3d

migrate disklayer logic from geth

12b930f

Refactor storage with option mode instead of hard-to-read builder mode

d3632ad

Fix test file lint error

5297321

Remove conflict snapshot key prefix

464bde3

Strict iterator key match

624d218

Extract logger interface to kvdb package

aa80354

Replace snapshot hclog.Logger with kvdb.Logger

6fb0927

DarianShawn added 20 commits March 7, 2023 20:12

Wait group done no matter ibft validator or not

c55ce24

Refactor ibft consensus running context

46476fc

Seperate generate metrics from snapshot metrics

cd46f81

Fix panic of snapshot tests

749c808

Fix snapshot tests compile failed

2aab8f1

Use file mask constant directly in creating dir

2664b1b

Remove repeat account flush item count

50e6902

Remove not used metric structs

1c25645

Revert ibft consensus logic to make validator mining back

4440033

Fix lint

be0d698

Remove update submodule log due to too much printing

6f11a34

Merge branch 'dev' into feat-state-snapshot

3c49d51

Use Lock instead of TryLock due to golang 1.17 not exists method

766006a

Disable parallel tests on evm and state tests due to eating out memory

38a70e4

Larger leveldb block size for less CRC checksum

53a2688

Use prometheus metrics instead of custom ones in generating snapshot

0544d02

Parallel run evm and state tests on file instead of folder

b74dae5

Collect storage clean metrics

da27165

Collect dangling storage metrics

44633af

Collect snapshot generate used and eta metrics

b8b3d57

DarianShawn added the feature (New update to Dogechain), bug fix (Functionality that fixes a bug), and help-wanted (I need technical help) labels Mar 20, 2023

DarianShawn added this to the Release 1.3.0 milestone Mar 20, 2023

DarianShawn requested a review from abrahamcruise321 as a code owner March 20, 2023 12:25

DarianShawn self-assigned this Mar 20, 2023

DarianShawn requested a review from 0xcb9ff9 as a code owner March 20, 2023 12:25

DarianShawn closed this Apr 7, 2023

github-actions bot locked and limited conversation to collaborators Apr 7, 2023

3 participants
