This repository was archived by the owner on Jan 19, 2021. It is now read-only.

Use Map in place of leveldb#113

Closed
ryanio wants to merge 1 commit into master from useMaps

Conversation

@ryanio
Contributor

@ryanio ryanio commented Apr 22, 2020

This PR tests removing leveldb dependencies in merkle-patricia-tree.

By using native Maps as suggested in EthWorks' performance review, we get the benefit of extremely fast, synchronous key-value lookups. Keys are encoded as hex strings.

Changes

  • Uses Map in place of leveldb
    • Removes need for readStream, scratchReadStream
  • Converts trie methods to synchronous
    • Removes need for semaphore, prioritizedTaskExecutor, walkController
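A minimal sketch of the Map-backed store described above (class and method names are illustrative, not the PR's actual db.ts):

```typescript
// Minimal sketch of a Map-backed key-value store replacing leveldb.
// Names are illustrative, not the PR's actual db.ts implementation.
// Keys (Buffers) are encoded as hex strings, as described above.
class MapDB {
  private _map: Map<string, Buffer> = new Map()

  // Synchronous lookup: no callback or Promise needed
  get(key: Buffer): Buffer | null {
    return this._map.get(key.toString('hex')) ?? null
  }

  put(key: Buffer, value: Buffer): void {
    this._map.set(key.toString('hex'), value)
  }

  del(key: Buffer): void {
    this._map.delete(key.toString('hex'))
  }
}
```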

ToDo

  • Resolve trie init race condition with leveldb
  • Add tests for leveldb functionality
  • Update usage examples in README.md

Diff files

db.ts, baseTrie.ts

Performance

Benchmark result run for v3.0.0 using memdown:

// benchmarks/random.js
sys: 588.170ms

Benchmark result run for this PR:

// benchmarks/random.ts
sys: 227.521ms

@github-actions

Coverage Status

Coverage decreased (-4.4%) to 90.641% when pulling 50fd0fa on useMaps into 7583aa9 on master.

@ryanio ryanio marked this pull request as draft April 22, 2020 18:58
@ryanio ryanio changed the title Use in-memory JS maps in place of leveldb Use maps in place of leveldb Apr 22, 2020
@holgerd77
Member

👍

TBH, I'm having trouble getting the whole picture. Does this still work with leveldb now that it's synchronous? Why was it asynchronous in the first place? Is it now slower when used with leveldb and faster with Maps?

@ryanio
Contributor Author

ryanio commented Apr 22, 2020

👍

TBH, I'm having trouble getting the whole picture. Does this still work with leveldb now that it's synchronous? Why was it asynchronous in the first place? Is it now slower when used with leveldb and faster with Maps?

I’m still determining how we want to handle leveldb features and functionality, hence my prior questions around them.

Previously, every trie query used leveldb.get and .set, which are relatively slow async operations compared to a native Map, especially when walking down a trie.
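To illustrate the point (function and parameter names are hypothetical, not the actual trie code): a trie walk previously awaited one async store read per level, while with a Map each step is a plain synchronous lookup.

```typescript
// Each trie level needs one store read; previously that read was awaited.
async function walkAsync(
  db: { get(key: string): Promise<Buffer | undefined> },
  path: string[]
): Promise<Buffer | undefined> {
  let node: Buffer | undefined
  for (const key of path) {
    node = await db.get(key) // one awaited round-trip per level
  }
  return node
}

// With a Map the same walk is synchronous: no promises, no event loop hops.
function walkSync(map: Map<string, Buffer>, path: string[]): Buffer | undefined {
  let node: Buffer | undefined
  for (const key of path) {
    node = map.get(key) // immediate in-memory lookup
  }
  return node
}
```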

These changes fully convert the lib to use Maps as its trie data store, except for some code that preserves initializing a Trie from a leveldb file. To get closer to prior feature parity, we could save to the passed-in leveldb on every op to keep it in sync; however, I think a user of the lib could do that themselves if it was important to them. Is anyone using it like that right now? (We wouldn't even need to accept a leveldb as an init parameter if we provided a helper func to convert a leveldb to a Map, which would also solve the init race condition detailed in the todo above.)
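A sketch of that hypothetical conversion helper (the entries interface is an assumption; a real leveldb would be read asynchronously, e.g. via a read stream, before handing the entries over):

```typescript
// Hypothetical: convert dumped leveldb entries into the Map format,
// keyed by hex strings, so a Trie could be initialized without holding
// a leveldb handle and without the init race condition mentioned above.
function entriesToMap(entries: Iterable<[Buffer, Buffer]>): Map<string, Buffer> {
  const map = new Map<string, Buffer>()
  for (const [key, value] of entries) {
    map.set(key.toString('hex'), value)
  }
  return map
}
```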

Now that the code runs in a sync environment, I expect I'll be able to find more optimizations, but this PR already does a good job of quickly converting the lib while reducing complexity (e.g. I was able to fully refactor out the walkController).

@ryanio ryanio force-pushed the useMaps branch 2 times, most recently from 3bbb6d7 to 1381cf1 Compare April 23, 2020 18:25
@ryanio ryanio changed the title Use maps in place of leveldb Use Map in place of leveldb Apr 23, 2020
@ryanio ryanio mentioned this pull request Apr 23, 2020
Member

@kumavis kumavis left a comment


This assumes that the tree at runtime is only backed by an in-memory, synchronously readable store, and removes the possibility of lazily reading from disk/db. That is not an appropriate assumption for the state tree.

This breaks compatibility with existing features in Ganache. It will prevent ethereumjs-vm from being used with any chain with a non-trivial amount of data (a long-running ganache instance, a public testnet, mainnet).

@ryanio I appreciate your effort here trying to simplify, improve perf, and remove the aging leveldb

but I OPPOSE this change; do not change the interface to make get/sets synchronous

@kumavis
Member

kumavis commented Apr 24, 2020

In my opinion, the correct change here is to identify a different generic async datastore interface, like https://github.com/ipfs/interface-datastore/
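For illustration, such a generic interface could look roughly like this (method names are assumptions loosely modeled on a datastore abstraction, not interface-datastore's actual API). A Map-backed store can still satisfy the async contract, so consumers keep one code path whether the backing is memory or disk:

```typescript
// Sketch of a generic async datastore interface (names are assumptions).
interface DataStore {
  get(key: Buffer): Promise<Buffer | null>
  put(key: Buffer, value: Buffer): Promise<void>
  del(key: Buffer): Promise<void>
}

// An in-memory implementation that satisfies the same async contract.
class MemoryStore implements DataStore {
  private map = new Map<string, Buffer>()

  async get(key: Buffer): Promise<Buffer | null> {
    return this.map.get(key.toString('hex')) ?? null
  }

  async put(key: Buffer, value: Buffer): Promise<void> {
    this.map.set(key.toString('hex'), value)
  }

  async del(key: Buffer): Promise<void> {
    this.map.delete(key.toString('hex'))
  }
}
```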

@kumavis
Member

kumavis commented Apr 24, 2020

@ryanio how does perf differ with a leveldb interface wrapper over Map?

@ryanio
Contributor Author

ryanio commented Apr 24, 2020

@kumavis thanks for your comments/feedback. Could you link me to how ganache uses MPT or ethereumjs-vm in a long running context? Are you passing in a leveldb?

I believe the ethereumjs-vm StateManager would remain async and could still interface with external db stores.

You are right that these changes don't allow for lazy reading of a large trie, which was one of my main questions before getting started. But it doesn't seem to be a primary use case of MPT, so is the performance enhancement preferred?
I haven’t tried loading up a giant trie in a map yet to see how it performs. At what trie size does a lazy-reading leveldb instance outperform an in-memory map?

@ryanio how does perf differ with a leveldb interface wrapper over Map?

Hm what do you mean “with a leveldb interface wrapper over Map”?

@kumavis
Member

kumavis commented Apr 24, 2020

Hm what do you mean “with a leveldb interface wrapper over Map”?

An implementation of the DB class (export class DB {) but backed by a Map, for a perf comparison vs an in-memory leveldb instance.

@kumavis
Member

kumavis commented Apr 24, 2020

You are right that these changes don't allow for lazy reading of a large trie, which was one of my main questions before getting started. But it doesn't seem to be a primary use case of MPT, so is the performance enhancement preferred?

I've pinged a ganache dev to comment

I haven’t tried loading up a giant trie in a map yet to see how it performs. At what trie size does a lazy-reading leveldb instance outperform an in-memory map?

Databases in general don't load everything into memory before use. If this is a toy, or only used for small tries, that's fine, but it was not originally built for that purpose.

Is this refactor primarily guided by optimization? If so, was MPT found to be the primary point of bad perf in some larger system, like the VM? I'd caution against letting microbenchmarks guide our architecture.

As an aside, I still think there are potentially big perf gains to be had by making the hashing lazy, as there are cases where we do a lot of hashing for values that become stale before we commit or query the hash. This is mostly speculation, but may offer practical perf gains for the VM.

@sz-piotr

Hey @ryanio, great PR. As it turns out, @msieczko and I were working on the same thing:
https://github.com/EthWorks/merkle-patricia-tree/tree/remove-async

Our code includes some improvements on the test and benchmark side, as well as the async-to-sync change.

Maybe we can discuss this together. I sent you an email (using the address from your commits).

@davidmurdoch

Ganache needs to reliably persist the data to disk somehow; it is entirely possible for users to create a multi-gigabyte trie database while using Ganache, and this is a use case we'd like to continue to support. I don't think you'll be able to fit a 4 GB+ trie in a Map (at least not without Node flags). On many developers' machines you'd hit swap if you tried to store that much data in memory anyway, negating most of the performance benefit of storing it in memory in the first place.

That said, multi-GB databases are likely a very extreme edge case.

Possibly relevant: in the future Ganache will actually have two modes of db storage: disk and in-memory.

  • For disk: we'll put a machine-dependent in-memory LRU cache in front of the disk db. On memory-constrained machines we'll set a reasonable upper limit for this cache. Persistence will be guaranteed (no fire-and-forget lazy writes).
  • For in memory-mode, the new default: we'll set a reasonable upper limit for pure memory storage, only writing to disk when the limit is exceeded. We may implement lazy writes (TBD).
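A minimal sketch of the read-cache idea from the first bullet (the class, its API, and the size limit are hypothetical, not Ganache's implementation), using a Map's insertion order to track recency:

```typescript
// Illustrative LRU read-cache to sit in front of a slower disk db.
// Map preserves insertion order, so the first key is the least recent.
class LRUCache<V> {
  private map = new Map<string, V>()

  constructor(private limit: number) {}

  get(key: string): V | undefined {
    const v = this.map.get(key)
    if (v !== undefined) {
      // Refresh recency by re-inserting at the end of the order.
      this.map.delete(key)
      this.map.set(key, v)
    }
    return v
  }

  set(key: string, value: V): void {
    if (this.map.has(key)) this.map.delete(key)
    this.map.set(key, value)
    if (this.map.size > this.limit) {
      // Evict the least-recently-used entry (first in insertion order).
      this.map.delete(this.map.keys().next().value as string)
    }
  }
}
```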

For perspective, users have requested the ability to use a geth database with ganache. IIRC, Mainnet's pruned state trie is well over 20GB. I'd love for Ganache to be able to support a full (non-archive) state in the future.

@marekkirejczyk

marekkirejczyk commented Apr 27, 2020

@kumavis @davidmurdoch
Thanks for feedback.

Just to give you a bit more context:
Tests on Ganache and similar tools (e.g. buidler.dev & waffle) are incredibly slow. A simple unit test suite for an average-size smart contract can take a couple of minutes, and this is a problem across almost all projects. That doesn't even take into account integration & e2e tests with ethereumjs-vm on the backend of more sophisticated test setups.
This slows down development, makes TDD hard, slows down CI, etc.
In the big picture, it makes writing tests more expensive and leads to lower test coverage.

In our understanding, Ganache and ethereumjs-vm are used mostly for this use case, and we would like to optimize for that.

Do you know any actual projects that use ethereumjs-vm for storing gigabytes of data?

@holgerd77
Member

Hi @marekkirejczyk, on the EthereumJS side we also have our ethereumjs-client project, where we want to keep the flexibility of storing larger amounts of data on disk, and beyond that @kumavis and @davidmurdoch have mentioned significant current use cases and usage scenarios.

So removing leveldb support from this library won't be an option, and we have to think of other possible ways of moving forward here, if we decide it's worth it. The PR from @ryanio is nevertheless super useful as a basis for discussion.

Eventually possible:

  • Support both async and a sync option in this library
  • Maintain two forks

@s1na also mentioned in our internal chat that supporting the flat DB structure from turbo-geth might be a mid-term goal worth pursuing; I'll quote him here directly (Sina, hope that's ok):

btw when reading Richard's message I remembered we can probably significantly speed up the VM if we adopt turbo-geth's flat database structure. Should've thought of this earlier..

How MPT works now is that you store every node (even intermediate ones from root to leaves) in the db. When you do a look-up, you have to do as many db look-ups as the depth of the trie. What turbo-geth (and now geth itself) does is store the leaves (accounts and storage items) directly in the db with their addresses as keys, so you can retrieve them with one look-up. It gets a bit more complicated, however, because you then need to compute the trie root after the block ends to verify stateRoot, and they have employed a caching mechanism for this.
So it'd be a medium-term project, but I think most clients are thinking about going in this direction.
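A toy contrast of the two layouts quoted above (entirely illustrative; real implementations key nodes by hash, use nibble paths, and add caching for root computation):

```typescript
// Trie layout: one db read per level of depth.
// Flat layout: one db read total, keyed directly by address.
type TrieNode = { children?: Record<string, string>; value?: string }
type NodeDB = Map<string, TrieNode>

function trieGet(db: NodeDB, rootKey: string, path: string): { value?: string; reads: number } {
  let key = rootKey
  let reads = 0
  for (const nibble of path) {
    const node = db.get(key)! // one db read per trie level
    reads++
    key = node.children![nibble]
  }
  const leaf = db.get(key) // final read for the leaf itself
  reads++
  return { value: leaf?.value, reads }
}

function flatGet(db: Map<string, string>, address: string): { value?: string; reads: number } {
  return { value: db.get(address), reads: 1 } // single keyed lookup
}
```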

Maybe that's also something to consider seriously; it would likely happen at the StateManager level (to be extracted from the VM into its own package along with the next v5 release).

@ryanio
Contributor Author

ryanio commented May 1, 2020

Closing this for now.

I think there are strong use cases for both async and sync uses of this library, so I will investigate how easily we may be able to add a Sync API to the existing implementation without creating a maintenance burden.

The flat DB structure from turbo-geth also seems like a great optimization to keep in mind as high priority for the async side of things.
