
Conversation

@fmoletta
Contributor

@fmoletta fmoletta commented Mar 28, 2025

Motivation
During state sync, we store the account hashes of the storages we failed to fetch, along with their root paths, in the store so the storage healer can later read them and heal them. For this we used the SnapState table, where the whole pending storage paths map was stored as a single value. This worked fine at a smaller scale, but once the map gets too big, reading and writing it becomes very expensive and can disrupt other processes.
This PR moves the pending storage paths to their own table and changes how we interact with them:

  • The storage healer no longer fetches the whole map; instead it reads a bounded number of storages from it whenever its queue is not full.
  • The storage healer no longer uses a channel; it reads incoming requests directly from the store.
  • Fetchers that need to communicate with the storage healer now do so by adding paths to the store.

Description

  • Remove storage heal paths from snap state
  • Add a new DB table for storage heal paths
  • Remove the channel from the storage healer and instead manage incoming and outgoing storage heal paths through the store; this also solves the issues of the rebuilder not being able to submit storage heal requests and of the storage healer being kept alive indefinitely on forced shutdown (a rough sketch of the resulting store-based flow follows below)

Closes #issue_number
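
To make the new flow concrete, here is a minimal, self-contained sketch of the store-backed heal-path queue described above. It is only an illustration under assumptions, not the ethrex implementation: Store, AccountHash, and Path are stand-ins for the real store engine, H256, and Nibbles types, and take_storage_heal_paths follows the consuming-read naming discussed in the review thread further down.

use std::collections::BTreeMap;
use std::sync::Mutex;

// Stand-ins for the real H256 account hash and Nibbles trie-path types.
type AccountHash = [u8; 32];
type Path = Vec<u8>;

#[derive(Default)]
struct Store {
    // Stand-in for the dedicated storage heal paths table added in this PR.
    storage_heal_paths: Mutex<BTreeMap<AccountHash, Vec<Path>>>,
}

impl Store {
    // Fetchers (and the trie rebuilder) enqueue heal requests by writing
    // paths into the table instead of sending them over a channel.
    fn set_storage_heal_paths(&self, paths: Vec<(AccountHash, Vec<Path>)>) {
        self.storage_heal_paths.lock().unwrap().extend(paths);
    }

    // The storage healer tops up its queue by taking at most `limit`
    // entries, removing them from the table as it reads them.
    fn take_storage_heal_paths(&self, limit: usize) -> Vec<(AccountHash, Vec<Path>)> {
        let mut table = self.storage_heal_paths.lock().unwrap();
        let keys: Vec<AccountHash> = table.keys().take(limit).copied().collect();
        keys.into_iter()
            .filter_map(|hash| table.remove(&hash).map(|paths| (hash, paths)))
            .collect()
    }
}

The real tables live in the libmdbx/redb/in-memory engines rather than a Mutex-wrapped map, but the access pattern is the same idea: bounded, consuming reads on the healer side and plain inserts on the fetcher side.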

@github-actions

github-actions bot commented Mar 28, 2025

Lines of code report

Total lines added: 88
Total lines removed: 30
Total lines changed: 118

Detailed view
+------------------------------------------------------+-------+------+
| File                                                 | Lines | Diff |
+------------------------------------------------------+-------+------+
| ethrex/crates/networking/p2p/sync.rs                 | 566   | -10  |
+------------------------------------------------------+-------+------+
| ethrex/crates/networking/p2p/sync/state_healing.rs   | 123   | +4   |
+------------------------------------------------------+-------+------+
| ethrex/crates/networking/p2p/sync/state_sync.rs      | 238   | -4   |
+------------------------------------------------------+-------+------+
| ethrex/crates/networking/p2p/sync/storage_fetcher.rs | 247   | +3   |
+------------------------------------------------------+-------+------+
| ethrex/crates/networking/p2p/sync/storage_healing.rs | 87    | -14  |
+------------------------------------------------------+-------+------+
| ethrex/crates/networking/p2p/sync/trie_rebuild.rs    | 242   | +3   |
+------------------------------------------------------+-------+------+
| ethrex/crates/storage/api.rs                         | 231   | +3   |
+------------------------------------------------------+-------+------+
| ethrex/crates/storage/rlp.rs                         | 102   | +2   |
+------------------------------------------------------+-------+------+
| ethrex/crates/storage/store.rs                       | 1219  | +3   |
+------------------------------------------------------+-------+------+
| ethrex/crates/storage/store_db/in_memory.rs          | 572   | +13  |
+------------------------------------------------------+-------+------+
| ethrex/crates/storage/store_db/libmdbx.rs            | 1279  | +30  |
+------------------------------------------------------+-------+------+
| ethrex/crates/storage/store_db/redb.rs               | 1104  | +27  |
+------------------------------------------------------+-------+------+
| ethrex/crates/storage/utils.rs                       | 50    | -2   |
+------------------------------------------------------+-------+------+

@fmoletta fmoletta marked this pull request as ready for review March 31, 2025 22:22
@fmoletta fmoletta requested a review from a team as a code owner March 31, 2025 22:22
Comment on lines +803 to +811
// Delete read values
let txn = self.db.begin_write()?;
{
    let mut table = txn.open_table(STORAGE_HEAL_PATHS_TABLE)?;
    for (hash, _) in res.iter() {
        table.remove(<H256 as Into<AccountHashRLP>>::into(*hash))?;
    }
}
txn.commit()?;
Contributor

Having the get_* method delete keys sounds confusing. Would it be too bad to split this?

Contributor

Maybe a rename? (get_and_remove_ or retrieve_, or something that gives the idea of consuming the elements?)

Contributor Author

I agree, I think take would be suitable here

Contributor Author

I don't much like the idea of splitting it, as this has only one use case, in which we want to delete as soon as we read

Contributor Author

Updated ba937d5
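
For reference, a standalone redb sketch of the consuming read agreed on here. It is hypothetical: it uses plain byte-slice keys and values instead of the AccountHashRLP/TriePathsRLP wrappers, a made-up table name, and the take_ naming from this thread.

use redb::{Database, ReadableTable, TableDefinition};

const HEAL_PATHS: TableDefinition<&[u8], &[u8]> =
    TableDefinition::new("storage_heal_paths");

// Read up to `limit` entries and delete them within the same write
// transaction, so a heal request is never handed out twice.
fn take_heal_paths(
    db: &Database,
    limit: usize,
) -> Result<Vec<(Vec<u8>, Vec<u8>)>, redb::Error> {
    let txn = db.begin_write()?;
    let taken = {
        let mut table = txn.open_table(HEAL_PATHS)?;
        // Collect first: entries cannot be removed while the iterator
        // still borrows the table.
        let entries: Vec<(Vec<u8>, Vec<u8>)> = table
            .iter()?
            .take(limit)
            .map(|e| e.map(|(k, v)| (k.value().to_vec(), v.value().to_vec())))
            .collect::<Result<_, _>>()?;
        for (key, _) in &entries {
            table.remove(key.as_slice())?;
        }
        entries
    };
    txn.commit()?;
    Ok(taken)
}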

Comment on lines 774 to 784
fn set_storage_heal_paths(&self, paths: Vec<(H256, Vec<Nibbles>)>) -> Result<(), StoreError> {
    let key_values = paths
        .into_iter()
        .map(|(hash, paths)| {
            (
                <H256 as Into<AccountHashRLP>>::into(hash),
                <Vec<Nibbles> as Into<TriePathsRLP>>::into(paths),
            )
        })
        .collect();
    self.write_batch(STORAGE_HEAL_PATHS_TABLE, key_values)
Contributor

After merging #2336 this needs the following change:

Suggested change
fn set_storage_heal_paths(&self, paths: Vec<(H256, Vec<Nibbles>)>) -> Result<(), StoreError> {
    let key_values = paths
        .into_iter()
        .map(|(hash, paths)| {
            (
                <H256 as Into<AccountHashRLP>>::into(hash),
                <Vec<Nibbles> as Into<TriePathsRLP>>::into(paths),
            )
        })
        .collect();
    self.write_batch(STORAGE_HEAL_PATHS_TABLE, key_values)
async fn set_storage_heal_paths(&self, paths: Vec<(H256, Vec<Nibbles>)>) -> Result<(), StoreError> {
    let key_values = paths
        .into_iter()
        .map(|(hash, paths)| {
            (
                <H256 as Into<AccountHashRLP>>::into(hash),
                <Vec<Nibbles> as Into<TriePathsRLP>>::into(paths),
            )
        })
        .collect();
    self.write_batch(STORAGE_HEAL_PATHS_TABLE, key_values).await

Similar changes will be needed at the API level and for libmdbx.

Contributor Author

Thanks!

Contributor Author

Updated with merge!
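
As a rough illustration of the API-level change mentioned above, a hypothetical trait-level signature with stand-in types; the actual ethrex trait, error type, and async strategy (native async fns in traits, async_trait, or boxed futures) may differ.

// Stand-in types for illustration only.
type H256 = [u8; 32];
type Nibbles = Vec<u8>;
#[derive(Debug)]
struct StoreError;

#[allow(async_fn_in_trait)]
trait StoreEngine {
    // The batched write becomes async so each engine (libmdbx, redb,
    // in-memory) can await its own write path internally.
    async fn set_storage_heal_paths(
        &self,
        paths: Vec<(H256, Vec<Nibbles>)>,
    ) -> Result<(), StoreError>;
}

// Call sites only gain an `.await`:
async fn queue_heal_requests<S: StoreEngine>(
    store: &S,
    paths: Vec<(H256, Vec<Nibbles>)>,
) -> Result<(), StoreError> {
    store.set_storage_heal_paths(paths).await
}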

Contributor

@Oppen Oppen left a comment

Left a few comments.

Contributor

@ElFantasma ElFantasma left a comment

It LGTM

@fmoletta fmoletta added this pull request to the merge queue Apr 9, 2025
Merged via the queue into main with commit a3dc64e Apr 9, 2025
19 checks passed
@fmoletta fmoletta deleted the move-storage-heal-paths-to-own-table branch April 9, 2025 17:57
github-merge-queue bot pushed a commit that referenced this pull request Apr 10, 2025
**Motivation**
During snap sync, we download account ranges and then for each
downloaded account we request its storage and bytecodes. For these
requests we use fetcher processes that receive incoming messages from a
channel (storage roots, bytecode hashes, etc), place them on a queue,
and then group them in batches and spawn parallel processes to fetch
them. All fetchers share a common behaviour of reading requests,
batching, and fetching, with differences concerning only the content of
the queue. We have had many bugs due to how these fetchers worked, as we
may update one of them and forget about the rest.
This PR aims to reduce the sources of bugs and keep a unified behaviour
across fetchers by adding generic functions that represent the fetcher
behaviour.
In this PR we add the generic function `run_queue`, which receives a
generic queue (a Vec<T>) and an async function that operates over a
batch from said queue.

**Description**
* Add generic function `run_queue` to abstract queue logic from fetcher
processes
* Use `run_queue` in `bytecode_fetcher`, `large_storage_fetcher`, and
`storage_fetcher`

*Considerations*
* As this PR was developed alongside #2359, this won't be applied to the
`storage_healer`, which will stop reading messages
* While the batch size could be a const generic instead of a regular
argument, doing so would force us to make the other generic arguments in
`run_queue` explicit, which looks pretty bad

Closes #issue_number

---------

Co-authored-by: Mario Rugiero <[email protected]>
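
A minimal sketch of the `run_queue` idea from the commit message above. The signature, the tokio runtime, and the sequential awaiting are assumptions; the real fetchers batch work and spawn parallel tasks rather than awaiting each batch inline.

use std::future::Future;

// Drains `queue` in batches of at most `batch_size`, handing each batch to
// `fetch`; whatever `fetch` returns is treated as not-yet-handled and is
// pushed back onto the queue for a retry.
async fn run_queue<T, F, Fut>(mut queue: Vec<T>, batch_size: usize, mut fetch: F)
where
    F: FnMut(Vec<T>) -> Fut,
    Fut: Future<Output = Vec<T>>,
{
    while !queue.is_empty() {
        let split = queue.len().saturating_sub(batch_size);
        let batch = queue.split_off(split);
        let leftover = fetch(batch).await;
        queue.extend(leftover);
    }
}

#[tokio::main]
async fn main() {
    // Toy usage: "fetch" just prints the batch and reports nothing left over.
    run_queue((0u32..10).collect(), 4, |batch: Vec<u32>| async move {
        println!("fetching batch {batch:?}");
        Vec::<u32>::new()
    })
    .await;
}
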
pedrobergamini pushed a commit to pedrobergamini/ethrex that referenced this pull request Aug 24, 2025
pedrobergamini pushed a commit to pedrobergamini/ethrex that referenced this pull request Aug 24, 2025