Closed
Description
As discussed with Daniel on Discord:
- I initially found this issue while testing 2.60 on sepolia before the release and couldn't figure out what was happening
- it started happening again a few days ago on my e3/main sepolia node
- the main issue is: start erigon and it stays stuck doing nothing. You eventually see the "no good peers" logs, but it never recovers.
At that time, by bisecting the git history (using "start erigon on a previously stuck node and it immediately starts downloading blocks" as the good/bad criterion), I found that a possible regression was introduced by this PR: #9747
After more debugging, I eventually figured out what is happening under the hood:
- my home network is behind a shared firewall, so I can't get inbound connections; I rely entirely on erigon dialing out to peers.
- when this bug manifests, the sentries are created, but not started.
- looking at the code, the sentries are lazily started the first time the callback is triggered in: https://github.com/ledgerwatch/erigon/blob/main/p2p/sentry/sentry_grpc_server.go#L1043
- however, before that call, the status data is built by: https://github.com/ledgerwatch/erigon/blob/main/p2p/sentry/status_data_provider.go#L72
- notice there is a DB call to get the chain header
- digging into the chain header call stack, we eventually end up in this function: https://github.com/ledgerwatch/erigon/blob/main/core/rawdb/accessors_chain.go#L271
- I found that when this bug manifests, the DB contains the hash and block number (I checked this part against a synced chain: both exist and match the canonical hash/block number), but the RLP doesn't (`ReadHeaderRLP` returns `nil`).
- that's the part I couldn't figure out: is there some situation where the DB ends up corrupted and triggers this bug, or is there an expected scenario where the canonical hash/block number exists in the DB but the RLP has not been saved yet, in which case the code shouldn't rely on the RLP already being saved?
- worth noting that this other PR silenced the `ErrNoHead` error, which would otherwise make this evident in the error logs; not sure it should always be silenced: Unnecessary Logs in sentry removed #10190