chore(l2): optimize trie hashing and BranchNode encoding for zkVMs
#4723
Conversation
Benchmarks were run for commits d9f57da, 74c4f47, 77c8cd7, ec5b844, c6d25c5, 77fe7f9, and c85ba9d (results collapsed).
```rust
// Encoded as Vec<u8>
impl RLPEncode for NodeHash {
    fn encode(&self, buf: &mut dyn bytes::BufMut) {
        RLPEncode::encode(&Into::<Vec<u8>>::into(self), buf)
    }
}
```
Why do we build a vec to encode here?
Hmm, not sure; maybe to take advantage of `NodeHash::as_ref()`? I might try changing it now that you mentioned it.
I think I'll do it in a different PR
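For illustration, a vec-free version might look like the following minimal sketch. It assumes `NodeHash::as_ref()` exposes the same bytes the `Vec<u8>` conversion produces, and that byte slices implement `RLPEncode`; neither is confirmed here.

```rust
impl RLPEncode for NodeHash {
    fn encode(&self, buf: &mut dyn bytes::BufMut) {
        // Hypothetical: write the hash bytes directly, skipping the
        // intermediate Vec<u8> allocation. Assumes `[u8]` has an
        // RLPEncode impl that writes an RLP string.
        self.as_ref().encode(buf)
    }
}
```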
```rust
// Duplicated to prealloc the buffer and avoid calculating the payload length twice
fn encode_to_vec(&self) -> Vec<u8> {
    // ...
}
```
Maybe we should make `encode_to_vec` in the generic implementation call the `length` method instead, wdyt?
The thing with that is that the generic `length` encodes into a buffer (a zero-capacity Vec) and then returns that buffer's length. Ideally `RLPEncode` should have a way to hint what the encoded size will be so the buffer can be preallocated (this is done manually in `BranchNode::encode_to_vec`), defaulting to a non-preallocated buffer. If we call the current `length` we end up encoding twice.
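For what it's worth, a minimal sketch of that size-hint idea; the trait shape and the `length_hint` name are hypothetical, not the actual ethrex API:

```rust
use bytes::BufMut;

// Hypothetical trait shape mirroring the discussion above; the real
// RLPEncode trait may differ.
trait RlpEncodeSketch {
    fn encode(&self, buf: &mut dyn BufMut);

    /// Size hint: types that can compute their encoded length cheaply
    /// (like BranchNode) override this; everyone else returns None.
    fn length_hint(&self) -> Option<usize> {
        None
    }

    /// Preallocate only when a hint is available, so we never encode
    /// twice just to learn the length.
    fn encode_to_vec(&self) -> Vec<u8> {
        let mut buf = match self.length_hint() {
            Some(len) => Vec::with_capacity(len),
            None => Vec::new(),
        };
        self.encode(&mut buf);
        buf
    }
}
```

A `BranchNode`-like type would override `length_hint` with its precomputed payload length, while every other type keeps the non-preallocated default.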
Benchmarks were also run for commits e1fc597 and a4026d4 (results collapsed).
Motivation
Our trie does a preorder traversal for recursive hashing, which allocates a lot of buffers (from the encoder and the keccak hasher) all the way down until it starts actually hashing a leaf. A much better approach is a postorder traversal, in which we only start allocating memory when we reach the lowest node that needs to be hashed; that memory is then deallocated, and the next (parent or sibling) node allocates again. This approach was inspired by risc0's trie, although we already had preorder-traversal hashing for the `commit()` function.

This PR also optimizes `BranchNode` encoding by skipping our `Encoder` type and writing directly to a preallocated buffer, instead of using two buffers and copying data from one to the other (the current `Encoder` can't know the length of the encoded data beforehand, but we can calculate it if we know what we are encoding). This could also be implemented for the other node types, but they are a minority of nodes and the perf gains are negligible.

A third, smaller optimization is to avoid cloning the cached/computed hash of every node.
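To make the traversal change concrete, here is a minimal sketch of postorder hashing with memoized hashes. The types and the `memoize_hashes` body are hypothetical simplifications (the real `Node`/`NodeRef` in ethrex differ), and it uses the `sha3` crate for keccak:

```rust
use std::cell::OnceCell;

use sha3::{Digest, Keccak256};

// Hypothetical, simplified trie node; the real ethrex types differ.
enum NodeKind {
    Leaf { value: Vec<u8> },
    Branch { children: Vec<Node> },
}

struct Node {
    kind: NodeKind,
    cached_hash: OnceCell<[u8; 32]>, // memoized on first computation
}

impl Node {
    /// Postorder traversal: children are hashed (and their temporary
    /// buffers freed) before the parent allocates anything for its own
    /// encoding, so at most one node's buffers are live at a time.
    fn memoize_hashes(&self) -> &[u8; 32] {
        self.cached_hash.get_or_init(|| {
            let mut hasher = Keccak256::new();
            match &self.kind {
                NodeKind::Leaf { value } => hasher.update(value),
                NodeKind::Branch { children } => {
                    for child in children {
                        // Recurse first: the child's hash is computed and
                        // cached, and only the 32-byte digest flows up.
                        hasher.update(child.memoize_hashes());
                    }
                }
            }
            hasher.finalize().into()
        })
    }
}
```

Returning a reference to the cached digest also mirrors the third optimization (`compute_hash_ref`): parents read the memoized hash instead of cloning it.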
Description
- adds `memoize_hashes` to both `Node` and `NodeRef` to implement postorder traversal
- adds utility functions to calculate the encoded lengths of a `NodeHash` and of a string of bytes
- changes `BranchNode::encode_raw()` to encode into a single buffer
- adds `compute_hash_ref` to return a reference to the cached hash

Testing
This branch was used in a client snap-synced to Mainnet, running successfully from Nov. 4 to Nov. 5.
Flamegraphs
block 23385900 (Mainnet)
This reduces the cycles of `hash_no_commit` (trie hashing) by 40%, which is 5% of total cycles.

before:

after:

Proving times
RTX 4090

```
10m 07s -> 09m 52s (block 23426995 Mainnet)
16m 42s -> 16m 22s (block 23426996 Mainnet)
```