@holiman
Contributor

@holiman holiman commented Nov 13, 2017

Currently, dumping the entire state produces a JSON file of roughly 9 GB, which is very heavy to process.

This PR makes it possible to dump the state in a more machine-friendly manner: the output is a stream of JSON objects, one per line. The first line contains the state root, and each subsequent line represents an account in the trie. A consumer of the output therefore does not need to hold the full 9 GB in memory during processing, and neither does the exporting geth node.

Examples

Existing functionality:

build/bin/geth --datadir /tmp/foo  dump  0  | head
WARN [11-13|21:37:49] No etherbase set and no accounts found as default 
INFO [11-13|21:37:49] Allocated cache and file handles         database=/tmp/foo/geth/chaindata cache=128 handles=1024
INFO [11-13|21:37:49] Disk storage enabled for ethash caches   dir=/tmp/foo/geth/ethash count=3
INFO [11-13|21:37:49] Disk storage enabled for ethash DAGs     dir=/home/martin/.ethash count=2
INFO [11-13|21:37:49] Loaded most recent local header          number=0 hash=d4e567…cb8fa3 td=17179869184
INFO [11-13|21:37:49] Loaded most recent local full block      number=0 hash=d4e567…cb8fa3 td=17179869184
INFO [11-13|21:37:49] Loaded most recent local fast block      number=0 hash=d4e567…cb8fa3 td=17179869184
{
    "root": "d7f8974fb5ac78d9ac099b9ad5018bedc2ce0a72dad1827a1709da30580f0544",
    "accounts": {
        "000d836201318ec6899a67540690382780743280": {
            "balance": "200000000000000000000",
            "nonce": 0,
            "root": "56e81f171bcc55a6ff8345e692c0f86e5b48e01b996cadc001622fb5e363b421",
            "codeHash": "c5d2460186f7233c927e7db2dcc703c0e500b653ca82273b7bfad8045d85a470",
            "code": "",
            "storage": {}

New functionality:

build/bin/geth --datadir /tmp/foo --iterative  dump  0  | head
WARN [11-13|21:37:55] No etherbase set and no accounts found as default 
INFO [11-13|21:37:55] Allocated cache and file handles         database=/tmp/foo/geth/chaindata cache=128 handles=1024
INFO [11-13|21:37:55] Disk storage enabled for ethash caches   dir=/tmp/foo/geth/ethash count=3
INFO [11-13|21:37:55] Disk storage enabled for ethash DAGs     dir=/home/martin/.ethash count=2
INFO [11-13|21:37:55] Loaded most recent local header          number=0 hash=d4e567…cb8fa3 td=17179869184
INFO [11-13|21:37:55] Loaded most recent local full block      number=0 hash=d4e567…cb8fa3 td=17179869184
INFO [11-13|21:37:55] Loaded most recent local fast block      number=0 hash=d4e567…cb8fa3 td=17179869184
{"root":"d7f8974fb5ac78d9ac099b9ad5018bedc2ce0a72dad1827a1709da30580f0544"}
{"address":"ae34861d342253194ffc6652dfde51ab44cad3fe","balance":"466215000000000000000","nonce":0,"root":"56e81f171bcc55a6ff8345e692c0f86e5b48e01b996cadc001622fb5e363b421","codeHash":"c5d2460186f7233c927e7db2dcc703c0e500b653ca82273b7bfad8045d85a470","code":"","storage":{}}
{"address":"e6115b13f9795f7e956502d5074567dab945ce6b","balance":"100000000000000000000000","nonce":0,"root":"56e81f171bcc55a6ff8345e692c0f86e5b48e01b996cadc001622fb5e363b421","codeHash":"c5d2460186f7233c927e7db2dcc703c0e500b653ca82273b7bfad8045d85a470","code":"","storage":{}}
{"address":"9d069197d1de50045a186f5ec744ac40e8af91c6","balance":"2000000000000000000000","nonce":0,"root":"56e81f171bcc55a6ff8345e692c0f86e5b48e01b996cadc001622fb5e363b421","codeHash":"c5d2460186f7233c927e7db2dcc703c0e500b653ca82273b7bfad8045d85a470","code":"","storage":{}}
{"address":"1895a0eb4a4372722fcbc5afe6936f289c88a419","balance":"910000000000000000000","nonce":0,"root":"56e81f171bcc55a6ff8345e692c0f86e5b48e01b996cadc001622fb5e363b421","codeHash":"c5d2460186f7233c927e7db2dcc703c0e500b653ca82273b7bfad8045d85a470","code":"","storage":{}}
{"address":"2d5b42fc59ebda0dfd66ae914bc28c1b0a6ef83a","balance":"206764195000000000000000","nonce":0,"root":"56e81f171bcc55a6ff8345e692c0f86e5b48e01b996cadc001622fb5e363b421","codeHash":"c5d2460186f7233c927e7db2dcc703c0e500b653ca82273b7bfad8045d85a470","code":"","storage":{}}
{"address":"4a81abe4984c7c6bef63d69820e55743c61f201c","balance":"16011846000000000000000","nonce":0,"root":"56e81f171bcc55a6ff8345e692c0f86e5b48e01b996cadc001622fb5e363b421","codeHash":"c5d2460186f7233c927e7db2dcc703c0e500b653ca82273b7bfad8045d85a470","code":"","storage":{}}
{"address":"4989e1ab5e7cd00746b3938ef0f0d064a2025ba5","balance":"2000000000000000000000","nonce":0,"root":"56e81f171bcc55a6ff8345e692c0f86e5b48e01b996cadc001622fb5e363b421","codeHash":"c5d2460186f7233c927e7db2dcc703c0e500b653ca82273b7bfad8045d85a470","code":"","storage":{}}
{"address":"f114ff0d0f24eff896edde5471dea484824a99b3","balance":"13700000000000000000","nonce":0,"root":"56e81f171bcc55a6ff8345e692c0f86e5b48e01b996cadc001622fb5e363b421","codeHash":"c5d2460186f7233c927e7db2dcc703c0e500b653ca82273b7bfad8045d85a470","code":"","storage":{}}
{"address":"92c13fe0d6ce87fd50e03def9fa6400509bd7073","balance":"40000000000000000000","nonce":0,"root":"56e81f171bcc55a6ff8345e692c0f86e5b48e01b996cadc001622fb5e363b421","codeHash":"c5d2460186f7233c927e7db2dcc703c0e500b653ca82273b7bfad8045d85a470","code":"","storage":{}}

@karalabe
Member

Just as a curiosity, Go is able to generate huge JSON documents in a streaming way and to parse them the same way (https://golang.org/pkg/encoding/json/#example_Decoder_Decode_stream). It needs a bit of manual work, but at least on the Go side we could keep the dump valid JSON without needing to hold it in memory. Wouldn't that perhaps be a better choice?

@Arachnid
Contributor

@karalabe JSON-object-per-line is a pretty common and easily parsed format, though.

@caner1234 caner1234 left a comment

ok

@holiman holiman force-pushed the dump_streaming_json branch from ff4d6c6 to 3e6a0fa Compare March 7, 2018 22:08
@holiman holiman requested a review from karalabe as a code owner March 7, 2018 22:08
@holiman
Contributor Author

holiman commented Mar 7, 2018

I have now modified this PR to add --nocode and --nostorage. Since the state is so humongous nowadays, having the ability to not do those extra lookups probably saves a day or so in doing a full dump.

I would prefer to keep the JSONL format, since it is easier to handle with simpler tools, such as bash or small Python scripts. If you do the export on a server, it may be nice to filter the output using bash before downloading it for analysis in a more advanced framework.

cc @Arachnid

@yondonfu
Contributor

yondonfu commented Apr 9, 2018

Perhaps since the --nocode and --nostorage state dump config options are being included in this PR, it might be worth considering adding the --minbalance and --maxbalance config options as well. Parity offers these options in its export state command and I could see the config options being helpful for additional filtering of the dumped state before analysis.

@holiman
Contributor Author

holiman commented Apr 9, 2018

Well, that kind of makes sense. However, there are quite a lot of options that could potentially be interesting to add, depending on what you want to use the dump for at a particular time, such as "only accounts with this segment of code" or the two you mentioned.

However, the idea behind --nocode and --nostorage is that having those options there makes the dump a lot faster, simply because we don't have to do additional lookups into databases for code and storage-entries.

Aside from that, there's really not much we can do that will speed up the dump -- remaining things like filtering for low balance can be done in a post-processing step pretty easily by a bash script, and it will be roughly the same speed as if it was done within geth anyway.

fmt.Printf("%s\n", state.Dump())
excludeCode := ctx.GlobalIsSet(utils.ExcludeCodeFlag.Name)
excludeStorage := ctx.GlobalIsSet(utils.ExcludeStorageFlag.Name)
if ctx.GlobalIsSet(utils.IterativeOutputFlag.Name) {
Member

GlobalIsSet only returns whether the flag is set, not whether the boolean value is true. If I set --iterative=false, then GlobalIsSet will return true. You're looking for GlobalBool.

}
IterativeOutputFlag = cli.BoolFlag{
Name: "iterative",
Usage: "Print streaming json iteratively as json objects, delimited by lines",
Member

Perhaps json -> JSON

delimited by lines -> delimited by newlines

}
ExcludeStorageFlag = cli.BoolFlag{
Name: "nostorage",
Usage: "When dumping state, do not include storage (saves db lookups)",
Member

I don't think "When dumping state" is needed, since the flag is only used during dump.

Member

@karalabe karalabe left a comment

I've enumerated some changes. I can imagine there will be some more, but let's get these fixed to get an overview of what the code looks like afterwards.

}
ExcludeCodeFlag = cli.BoolFlag{
Name: "nocode",
Usage: "When dumping state, do not include code (saves db lookups)",
Member

I don't think "When dumping state" is needed, since the flag is only used during dump.

"github.com/ethereum/go-ethereum/rlp"
"github.com/ethereum/go-ethereum/trie"
"io"
"os"
Member

Please do the same goimports fixup as previously.

json, err := json.MarshalIndent(self.RawDump(), "", " ")
// RawDump returns the entire state an a single large object
func (self *StateDB) RawDump(excludeCode, excludeStorage bool) Dump {

Member

Please drop this newline.

Storage map[string]string `json:"storage"`
}

// For output in a collected format, as one large map
Member

Please use the format Dump is/represents ...

Storage map[string]string `json:"storage"`
}

// For line-by-line json output
Member

Please use the format "IterativeDump is ..."


// IterativeDump dumps out accounts as json-objects, delimited by linebreaks on stdout
func (self *StateDB) IterativeDump(excludeCode, excludeStorage bool) {
self.performDump(newIterativeDump(os.Stdout), excludeCode, excludeStorage)
Member

I don't think it's a good design to hard code the output stream here. Please pass it as a parameter. Bonus points if you specify a *json.Encoder as a parameter to make it more obvious what it does.

CodeHash string `json:"codeHash"`
Code string `json:"code"`
Storage map[string]string `json:"storage"`
}
Member

Instead of duplicating everything, you could achieve a similar effect by adding

Address *string `json:"address,omitempty"`

to DumpAccount.
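The effect of the suggested field can be seen in a small sketch. The struct below is a cut-down, hypothetical version of `DumpAccount`, and `encode` is a helper invented for the example. With `omitempty` on a pointer, the key is dropped only when the pointer is nil.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// dumpAccount is a reduced version of the struct under review.
type dumpAccount struct {
	Balance string  `json:"balance"`
	Nonce   uint64  `json:"nonce"`
	Address *string `json:"address,omitempty"` // nil pointer: key is omitted
}

// encode marshals one account with an optional address.
func encode(balance string, addr *string) string {
	out, err := json.Marshal(dumpAccount{Balance: balance, Address: addr})
	if err != nil {
		panic(err)
	}
	return string(out)
}

func main() {
	addr := "ae34861d342253194ffc6652dfde51ab44cad3fe"
	fmt.Println(encode("1", nil))   // map-style entry: no address key
	fmt.Println(encode("1", &addr)) // line-style entry: address included
}
```

Note that a non-nil pointer is always emitted, even if it points to an empty string, so the field has to be left unset when no address is available.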

return &Dump{
Accounts: make(map[string]DumpAccount),
}
}
Member

Given that we only ever use this method once, I don't think it's worth the namespace pollution of adding one extra method. Let's just inline it.

})
}

func (self *StateDB) performDump(c collector, excludeCode, excludeStorage bool) {
Member

I'd just call this dump instead.

@holiman
Contributor Author

holiman commented May 7, 2018

Thanks for the pointers @karalabe , I think I've addressed all those concerns now.

Member

I don't really like these defensive code paths. They introduce weird unexpected behavior and promote bad use. I don't think we should check if output is nil. If the programmer called the method with invalid parameters, crash the thing.

Member

Nitpick, I think we should only use an indent of 2 characters. 4 seems a bit excessive tbh.

Member

This is an interesting construct here. If we don't have the preimage, won't this return common.Address{}? In that case, in the old dump, we override a ton of accounts with one another for which we don't have the preimage. In the new case, we'll have a ton of accounts with 0x0..0 address.

Member

@karalabe karalabe May 9, 2018

I think it's cleaner if here you only pass the address, and don't try to convert it. It will simplify code later.

Member

Let's pass common.Address instead of string. It's a cleaner API.

Member

Shouldn't we use map[common.Address]DumpAccount instead of map[string]DumpAccount?

Member

Shouldn't we use map[common.Hash][]byte instead of map[string]string?

Member

Addr will always be set if you do it like this, so it defeats the purpose of omitempty. A better solution is to create the DumpAccount first without this field set, and then set it only if addr != 0x0..0.

Member

(*json.Encoder)(&self).Encode(struct {
	Root common.Hash `json:"root"`
}{root})

@holiman holiman force-pushed the dump_streaming_json branch from 84e8934 to 91e2e5e Compare June 9, 2018 18:24
@holiman
Contributor Author

holiman commented Jun 9, 2018

Sorry for taking so long, I have addressed the concerns now and rebased on master.

@fjl
Contributor

fjl commented Jul 24, 2018

...and now it's out of sync again ;). Do you still want this?

@holiman
Contributor Author

holiman commented Aug 10, 2018

Yes, I do. I'll rebase it again one of these days...

@karalabe
Member

We've been debating whether the dump command is even useful any more. If we're pushing out 15GB of text onto the stdout, is that realistic to process afterwards? Shouldn't we just delete the dump functionality and perhaps provide a sample _example code where we describe how to iterate over the state and access it? I.e. make dump a sample .go file, make it part of the test suite to ensure it works, but remove from geth core functionality.

@holiman
Contributor Author

holiman commented Mar 4, 2019

Makes sense, however, isn't that basically what the dump.go is?
To be clear: you're talking about removing the geth dump flag/binding and making dump.go a separate compilable example that is not included in alltools?

@holiman
Contributor Author

holiman commented May 17, 2019

I just rebased and squashed. Regarding

If we're pushing out 15GB of text onto the stdout, is that realistic to process afterwards?

It's pretty simple to just pipe the output to whatever handler you want, e.g. a simple Python script that does whatever it is you want to do, whether that's counting all ether or sending it to another database.

Example 1

[user@work go-ethereum]$ build/bin/geth dump --dump.iterative --dump.nocode 0  | head -n2
INFO [05-17|11:54:40.668] Bumping default cache on mainnet         provided=1024 updated=4096
WARN [05-17|11:54:40.668] Sanitizing cache to Go's GC limits       provided=4096 updated=2589
INFO [05-17|11:54:40.669] Maximum peer count                       ETH=50 LES=0 total=50
WARN [05-17|11:54:40.670] Failed to start smart card hub, disabling: dial unix /run/pcscd/pcscd.comm: connect: no such file or directory 
INFO [05-17|11:54:40.670] Allocated cache and file handles         database=/home/user/.ethereum/geth/chaindata cache=1294.00MiB handles=2048
INFO [05-17|11:54:40.761] Disk storage enabled for ethash caches   dir=/home/user/.ethereum/geth/ethash count=3
INFO [05-17|11:54:40.761] Disk storage enabled for ethash DAGs     dir=/home/user/.ethash               count=2
INFO [05-17|11:54:40.862] Loaded most recent local header          number=0 hash=d4e567…cb8fa3 td=17179869184 age=50y1mo3d
INFO [05-17|11:54:40.862] Loaded most recent local full block      number=0 hash=d4e567…cb8fa3 td=17179869184 age=50y1mo3d
INFO [05-17|11:54:40.862] Loaded most recent local fast block      number=0 hash=d4e567…cb8fa3 td=17179869184 age=50y1mo3d
{"root":"0xd7f8974fb5ac78d9ac099b9ad5018bedc2ce0a72dad1827a1709da30580f0544"}
{"balance":"466215000000000000000","nonce":0,"root":"56e81f171bcc55a6ff8345e692c0f86e5b48e01b996cadc001622fb5e363b421","codeHash":"c5d2460186f7233c927e7db2dcc703c0e500b653ca82273b7bfad8045d85a470","code":"","storage":{},"address":"0xae34861d342253194ffc6652dfde51ab44cad3fe"}

Example 2

[user@work go-ethereum]$ build/bin/geth dump --dump.iterative --dump.nocode 0  2>/dev/null | head -n4
{"root":"0xd7f8974fb5ac78d9ac099b9ad5018bedc2ce0a72dad1827a1709da30580f0544"}
{"balance":"466215000000000000000","nonce":0,"root":"56e81f171bcc55a6ff8345e692c0f86e5b48e01b996cadc001622fb5e363b421","codeHash":"c5d2460186f7233c927e7db2dcc703c0e500b653ca82273b7bfad8045d85a470","code":"","storage":{},"address":"0xae34861d342253194ffc6652dfde51ab44cad3fe"}
{"balance":"100000000000000000000000","nonce":0,"root":"56e81f171bcc55a6ff8345e692c0f86e5b48e01b996cadc001622fb5e363b421","codeHash":"c5d2460186f7233c927e7db2dcc703c0e500b653ca82273b7bfad8045d85a470","code":"","storage":{},"address":"0xe6115b13f9795f7e956502d5074567dab945ce6b"}
{"balance":"2000000000000000000000","nonce":0,"root":"56e81f171bcc55a6ff8345e692c0f86e5b48e01b996cadc001622fb5e363b421","codeHash":"c5d2460186f7233c927e7db2dcc703c0e500b653ca82273b7bfad8045d85a470","code":"","storage":{},"address":"0x9d069197d1de50045a186f5ec744ac40e8af91c6"}

@holiman
Contributor Author

holiman commented May 21, 2019

Pushed a new change, so that tests pass (?) and so that stream-consumers can get the hashed key if the address (preimage) is missing. In case ppl want to e.g. investigate code, it might be ok to not have the preimage but still see the code.

@karalabe karalabe added this to the 1.9.0 milestone Jun 6, 2019
@jsvisa
Contributor

jsvisa commented Jun 23, 2019

Is there any update on this PR?

@karalabe karalabe force-pushed the dump_streaming_json branch from a60ccb3 to 4528c8f Compare June 24, 2019 12:49
@holiman holiman requested a review from rjl493456442 as a code owner June 24, 2019 12:49
Member

@karalabe karalabe left a comment

LGTM

@karalabe karalabe merged commit 1da5e0e into ethereum:master Jun 24, 2019
gzliudan added a commit to gzliudan/XDPoSChain that referenced this pull request Jul 2, 2025
gzliudan added a commit to XinFinOrg/XDPoSChain that referenced this pull request Jul 11, 2025

8 participants