
Conversation

@xqft (Contributor) commented Oct 1, 2025

Motivation

Our trie does a preorder traversal for recursive hashing, which allocates many buffers (from the encoder and the keccak hasher) all the way down before it actually starts hashing a leaf. A better approach is a postorder traversal: allocation starts only when we reach the lowest node that needs to be hashed, after which that memory is deallocated and the next (parent or sibling) node allocates again. This approach was inspired by risc0's trie, although we already had preorder-traversal hashing in the commit() function.

This PR also optimizes BranchNode encoding by skipping our Encoder type and writing directly into a preallocated buffer, instead of using two buffers and copying data from one to the other (the current Encoder can't know the length of the encoded data beforehand, but we can calculate it when we know what we are encoding). The same could be done for the other node types, but they are a minority of nodes and the performance gains would be negligible.
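The length-first encoding idea can be illustrated with a minimal sketch. The helper names and the flat item list here are hypothetical (not ethrex's actual API), and only short RLP payloads (< 56 bytes, with no single-byte special case) are handled for brevity:

```rust
// Sketch: compute the encoded length up front, then encode into one
// preallocated buffer, instead of encoding into a temporary and copying.
// Simplified RLP: every item and the list itself are assumed < 56 bytes,
// and the single-byte (< 0x80) short form is ignored for brevity.
fn encoded_len(item: &[u8]) -> usize {
    // 1 prefix byte + payload.
    1 + item.len()
}

fn encode_items(items: &[&[u8]]) -> Vec<u8> {
    // Knowing the total size means the buffer never reallocates.
    let payload_len: usize = items.iter().map(|i| encoded_len(i)).sum();
    let mut buf = Vec::with_capacity(1 + payload_len);
    buf.push(0xc0 + payload_len as u8); // list prefix (payload < 56 bytes)
    for item in items {
        buf.push(0x80 + item.len() as u8); // string prefix
        buf.extend_from_slice(item);
    }
    buf
}
```

The two-buffer approach this replaces has to encode first and learn the length afterwards; precomputing the length trades a cheap arithmetic pass for the copy.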

A third, smaller optimization is to prevent cloning the cached/computed hashes from every node.

Description

  • adds memoize_hashes to both Node and NodeRef to implement postorder traversal
  • adds utility functions to calculate encoded lengths of a NodeHash and a string of bytes
  • changes BranchNode::encode_raw() to encode into a single buffer
  • adds compute_hash_ref to return a reference to the cached hash
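The postorder idea in the bullets above can be sketched roughly as follows. This is a toy model: the node layout and a u64 std hasher stand in for ethrex's real types and keccak, and `memoize_hashes` / `compute_hash_ref` only mirror the names from the description:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

enum Node {
    Leaf(Vec<u8>),
    Branch(Vec<NodeRef>),
}

struct NodeRef {
    node: Node,
    cached_hash: Option<u64>, // the real code caches a keccak digest
}

impl NodeRef {
    // Postorder traversal: each child is fully hashed (and its scratch
    // state dropped) before the parent allocates its own hasher, instead
    // of holding one live buffer per level of the trie.
    fn memoize_hashes(&mut self) -> u64 {
        if let Some(h) = self.cached_hash {
            return h;
        }
        let mut hasher = DefaultHasher::new();
        match &mut self.node {
            Node::Leaf(data) => data.hash(&mut hasher),
            Node::Branch(children) => {
                for child in children.iter_mut() {
                    child.memoize_hashes().hash(&mut hasher);
                }
            }
        }
        let h = hasher.finish();
        self.cached_hash = Some(h);
        h
    }

    // Hands out a reference to the memoized hash instead of cloning it.
    fn compute_hash_ref(&self) -> Option<&u64> {
        self.cached_hash.as_ref()
    }
}
```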

Testing
This branch was used in a client snap-synced to Mainnet, running successfully from Nov 4 to Nov 5.

Flamegraphs (block 23385900, Mainnet)

This reduces the cycles spent in hash_no_commit (trie hashing) by 40%, which is 5% of total cycles.

before: [flamegraph image]

after: [flamegraph image]

Proving times (RTX 4090)

10m 07s -> 09m 52s (block 23426995 Mainnet)
16m 42s -> 16m 22s (block 23426996 Mainnet)

@github-actions bot

Benchmark for d9f57da

| Test | Base | PR | % |
| --- | --- | --- | --- |
| Trie/cita-trie insert 10k | 34.6±0.35ms | 34.4±0.24ms | -0.58% |
| Trie/cita-trie insert 1k | 3.5±0.01ms | 3.5±0.01ms | 0.00% |
| Trie/ethrex-trie insert 10k | 48.3±0.85ms | 45.8±1.04ms | -5.18% |
| Trie/ethrex-trie insert 1k | 6.2±0.06ms | 6.3±0.11ms | +1.61% |

@github-actions bot

Benchmark for 74c4f47

| Test | Base | PR | % |
| --- | --- | --- | --- |
| Trie/cita-trie insert 10k | 40.0±2.23ms | 38.8±2.10ms | -3.00% |
| Trie/cita-trie insert 1k | 3.5±0.09ms | 3.6±0.26ms | +2.86% |
| Trie/ethrex-trie insert 10k | 32.6±1.41ms | 31.5±0.74ms | -3.37% |
| Trie/ethrex-trie insert 1k | 5.3±0.04ms | 5.1±0.02ms | -3.77% |

@github-actions bot

Benchmark for 77c8cd7

| Test | Base | PR | % |
| --- | --- | --- | --- |
| Trie/cita-trie insert 10k | 36.1±1.05ms | 35.1±0.48ms | -2.77% |
| Trie/cita-trie insert 1k | 3.5±0.07ms | 3.6±0.12ms | +2.86% |
| Trie/ethrex-trie insert 10k | 31.3±0.91ms | 30.7±0.74ms | -1.92% |
| Trie/ethrex-trie insert 1k | 5.2±0.08ms | 5.1±0.06ms | -1.92% |

@github-actions bot

Benchmark for ec5b844

| Test | Base | PR | % |
| --- | --- | --- | --- |
| Trie/cita-trie insert 10k | 36.4±1.64ms | 36.7±1.64ms | +0.82% |
| Trie/cita-trie insert 1k | 3.6±0.05ms | 3.6±0.17ms | 0.00% |
| Trie/ethrex-trie insert 10k | 31.7±0.64ms | 31.1±1.71ms | -1.89% |
| Trie/ethrex-trie insert 1k | 5.3±0.03ms | 5.2±0.06ms | -1.89% |

@github-actions bot

Benchmark for c6d25c5

| Test | Base | PR | % |
| --- | --- | --- | --- |
| Trie/cita-trie insert 10k | 34.7±0.51ms | 34.7±0.35ms | 0.00% |
| Trie/cita-trie insert 1k | 3.6±0.07ms | 3.5±0.02ms | -2.78% |
| Trie/ethrex-trie insert 10k | 31.0±0.92ms | 30.3±0.16ms | -2.26% |
| Trie/ethrex-trie insert 1k | 5.3±0.02ms | 5.0±0.03ms | -5.66% |

@github-actions bot commented Nov 5, 2025

Benchmark for 77fe7f9

| Test | Base | PR | % |
| --- | --- | --- | --- |
| Trie/cita-trie insert 10k | 34.5±0.20ms | 34.6±0.37ms | +0.29% |
| Trie/cita-trie insert 1k | 3.5±0.01ms | 3.5±0.03ms | 0.00% |
| Trie/ethrex-trie insert 10k | 30.7±0.68ms | 29.6±0.34ms | -3.58% |
| Trie/ethrex-trie insert 1k | 2.8±0.01ms | 2.7±0.01ms | -3.57% |

@github-actions bot

Benchmark for c85ba9d

| Test | Base | PR | % |
| --- | --- | --- | --- |
| Trie/cita-trie insert 10k | 29.7±2.48ms | 30.4±2.85ms | +2.36% |
| Trie/cita-trie insert 1k | 2.9±0.06ms | 2.9±0.13ms | 0.00% |
| Trie/ethrex-trie insert 10k | 29.0±2.47ms | 26.9±1.67ms | -7.24% |
| Trie/ethrex-trie insert 1k | 2.3±0.05ms | 2.2±0.02ms | -4.35% |

Review thread on NodeHash's RLPEncode impl:

```rust
// Encoded as Vec<u8>
impl RLPEncode for NodeHash {
    fn encode(&self, buf: &mut dyn bytes::BufMut) {
        RLPEncode::encode(&Into::<Vec<u8>>::into(self), buf)
    }
}
```

Reviewer (Contributor):
Why do we build a vec to encode here?

@xqft (Contributor, author):
Hmm, not sure; maybe to take advantage of NodeHash::as_ref()? I might try changing it now that you mention it.

@xqft (Contributor, author):
I think I'll do it in a different PR.
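For illustration, a sketch of the alternative the reviewer hints at: writing the hash bytes straight into the output buffer rather than building an intermediate Vec. This is hypothetical, not the PR's code; NodeHash is modeled as a bare 32-byte wrapper, and a plain Vec<u8> stands in for bytes::BufMut:

```rust
// Hypothetical sketch: NodeHash modeled as a plain 32-byte wrapper.
struct NodeHash([u8; 32]);

impl NodeHash {
    fn as_ref(&self) -> &[u8] {
        &self.0
    }

    // Writes the RLP string prefix and the bytes directly into `buf`,
    // skipping the Into::<Vec<u8>> conversion and its allocation.
    fn encode(&self, buf: &mut Vec<u8>) {
        let bytes = self.as_ref();
        buf.push(0x80 + bytes.len() as u8); // short-string prefix, len < 56
        buf.extend_from_slice(bytes);
    }
}
```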

Review thread on:

```rust
// Duplicated to prealloc the buffer and avoid calculating the payload length twice
fn encode_to_vec(&self) -> Vec<u8> {
```

Reviewer (Contributor):
Maybe we should make encode_to_vec in the generic implementation call the length method instead, wdyt?

@xqft (Contributor, author) commented Nov 10, 2025:
The thing with that is that the generic length encodes into a buffer (a zero-capacity Vec) and then returns the length of that buffer. Ideally RLPEncode should have a way to hint what the encoded size would be so the buffer can be preallocated (this is done manually in BranchNode::encode_to_vec), defaulting to a non-preallocated buffer.

If we call the current length we end up encoding twice.
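One possible shape of the size-hint idea from this exchange, sketched with hypothetical names (not ethrex's actual trait): the default `length` encodes into a throwaway buffer, matching the behavior described above, while types that know their size cheaply override it so `encode_to_vec` preallocates once and encodes once:

```rust
// Hypothetical trait; a plain Vec<u8> stands in for bytes::BufMut.
trait RlpEncode {
    fn encode(&self, buf: &mut Vec<u8>);

    // Default: encode into a throwaway buffer and measure it. This is the
    // "encodes twice" path if encode_to_vec calls it for such types.
    fn length(&self) -> usize {
        let mut buf = Vec::new();
        self.encode(&mut buf);
        buf.len()
    }

    // Types that override `length` get a single exact-size allocation
    // and a single encoding pass here.
    fn encode_to_vec(&self) -> Vec<u8> {
        let mut buf = Vec::with_capacity(self.length());
        self.encode(&mut buf);
        buf
    }
}

struct Hash32([u8; 32]);

impl RlpEncode for Hash32 {
    fn encode(&self, buf: &mut Vec<u8>) {
        buf.push(0x80 + 32); // RLP short-string prefix for 32 bytes
        buf.extend_from_slice(&self.0);
    }

    fn length(&self) -> usize {
        33 // known without encoding: 1 prefix byte + 32 payload bytes
    }
}
```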

@github-actions bot

Benchmark for e1fc597

| Test | Base | PR | % |
| --- | --- | --- | --- |
| Trie/cita-trie insert 10k | 29.9±1.92ms | 37.0±2.16ms | +23.75% |
| Trie/cita-trie insert 1k | 3.0±0.06ms | 2.9±0.03ms | -3.33% |
| Trie/ethrex-trie insert 10k | 30.5±0.70ms | 29.9±1.14ms | -1.97% |
| Trie/ethrex-trie insert 1k | 2.3±0.04ms | 2.3±0.07ms | 0.00% |

@xqft enabled auto-merge November 10, 2025 20:07
@github-actions bot

Benchmark for a4026d4

| Test | Base | PR | % |
| --- | --- | --- | --- |
| Trie/cita-trie insert 10k | 27.7±2.43ms | 36.5±2.86ms | +31.77% |
| Trie/cita-trie insert 1k | 2.9±0.02ms | 2.9±0.13ms | 0.00% |
| Trie/ethrex-trie insert 10k | 30.7±0.90ms | 28.3±1.91ms | -7.82% |
| Trie/ethrex-trie insert 1k | 2.2±0.01ms | 2.2±0.02ms | 0.00% |

@xqft added this pull request to the merge queue Nov 10, 2025
Merged via the queue into main with commit 6367f53 Nov 10, 2025
51 of 53 checks passed
@xqft deleted the l2/opt_rlp_buffer branch November 10, 2025 21:03
xqft added a commit that referenced this pull request Nov 11, 2025

Co-authored-by: Copilot <[email protected]>
Co-authored-by: Edgar <[email protected]>
Co-authored-by: Ivan Litteri <[email protected]>
Co-authored-by: Mario Rugiero <[email protected]>

Labels: L2 Rollup client
Projects: Status: Done
Development: Successfully merging this pull request may close these issues.
5 participants