Bor sync wedges on empty block after producer restart — insertSideChain same-stateRoot false-positive #2224

@praetoriansentry

Summary

After restarting a Bor validator whose local head is an empty (no-transaction) block produced just before the restart, Bor can refuse to sync past that block, dropping every peer with:

WARN Sidechain ghost-state attack detected   number=N sideroot=X canonroot=X
WARN Synchronisation failed, dropping peer   err="retrieved hash chain is invalid: sidechain ghost-state attack" mode=full

sideroot and canonroot are the same value — the check at core/blockchain.go:3562 is firing on identical state roots. The upstream-inherited logic assumes same-root + different-hash implies a shadow-state attack, which is only true when the block carried a state transition. For empty blocks (gasUsed=0, transactions=[]) the state root is just the parent's, so two distinct empty headers at the same height legitimately share a state root while having different hashes/seals.

A node in this state cannot recover by reconnecting, restarting Bor, or by debug_setHead to a block immediately behind the offending height — the side-block record persists in the database and re-triggers the same check on the next sync attempt.

Environment

  • Bor v2.8.0-beta
  • Devnet spawned via kurtosis-pos, 9 L2 nodes:
    • v1–v5: bor + heimdall-v2 validators
    • v6–v8: bor RPC nodes
    • v9: erigon RPC node
  • Heimdall span at the time of the incident (span id 11, blocks 1408–1535) had a single entry in selected_producers (val_id 3). With a small validator set and weighted span selection, one validator can end up as the sole producer for a span.

Reproduction

  1. Spawn a devnet in which some spans have a single selected_producers entry (the kurtosis-pos default tends to produce this with small validator counts).
  2. Wait for an active span where one validator (call it vP) is the sole producer.
  3. Stop vP's bor EL and a second validator's bor EL (vS) within a second of each other, with vS's last canonical head being an empty block recently produced by vP.
  4. After a few seconds, start both back up.

Expected: vS resyncs to chain head.

Observed: vS stays stuck at its last pre-stop block (block number N). docker logs for vS's bor container streams the warn-drop pair above on every peer it talks to, plus continuous "Whitelisting milestone deferred err=chain out of sync" from the heimdall ws subscription.

Concrete trace from our run

Captured artifacts: commands.out, chaos.out.txt, reports/test-20260513-131241-test-1778692361/. Key timeline:

Time (EDT)       Event
13:19:04         Stop l2-el-2-bor-heimdall-v2-validator (v2). v2's local head: block 1441, produced by v3, stateRoot 0x8f58ec…2dcc11, transactions: [], gasUsed: 0.
13:19:11         Stop l2-el-3-bor-heimdall-v2-validator (v3). v3 was the sole producer for span 11.
13:22:31         Start v2.
13:22:37         Start v3. v3 resumes producing 1442, 1443, …
13:34:41 onward  v2 logs Sidechain ghost-state attack detected number=1441 sideroot=8f58ec..2dcc11 canonroot=8f58ec..2dcc11 against every peer; drops them; stays stuck at 1441 while v1/v3/v4/v5 advance past 2100.

Block 1441 on v2 (its local canonical):

number       1441
hash         0xa8397a7be178763f882deb7f893b28c326cc4ddf038296a073cc5e6ea597826e
parentHash   0xb1ba2c9854ea1d575173714ebcb730e0e4c6f385702b6cd81a00250d38a69f69
stateRoot    0x8f58ec83616416cdae4e71306e5ef9f7482facf9328957b8c28d6468332dcc11
transactions []
gasUsed      0
extraData    0x626f722d3300…   ("bor-3" vanity, then seal)

The block 1441 held by the rest of the cluster shares parentHash, stateRoot, transactionsRoot, and receiptsRoot with v2's copy, because the state didn't change. It has a different timestamp and extraData seal (likely produced by a succession-1+ backup after v3 stopped), so its block hash differs.

Root cause

core/blockchain.go, insertSideChain (current develop):

// blockchain.go:3534
func (bc *BlockChain) insertSideChain(block *types.Block, it *insertIterator, makeWitness bool) (*stateless.Witness, int, error) {
    ...
    for ; block != nil && errors.Is(err, consensus.ErrPrunedAncestor); block, err = it.next() {
        headers = append(headers, block.Header())
        if number := block.NumberU64(); current.Number.Uint64() >= number {
            canonical := bc.GetBlockByNumber(number)
            if canonical != nil && canonical.Hash() == block.Hash() {
                // re-import of a canon block, fine
                continue
            }
            if canonical != nil && canonical.Root() == block.Root() {       // <-- false positive
                log.Warn("Sidechain ghost-state attack detected", "number", block.NumberU64(),
                    "sideroot", block.Root(), "canonroot", canonical.Root())
                return nil, it.index, errors.New("sidechain ghost-state attack")
            }
        }
        ...
    }

This is upstream go-ethereum's pre-merge defense against attackers side-mining to a height where state was pruned and substituting their block to bypass full state verification. The premise is that a benign sidechain block at height N would produce a different state root from the canonical chain (because the txs at heights <= N differ across the two chains).

That premise doesn't hold for Bor:

  • Bor produces empty blocks (no txs) during quiet periods. For an empty block, the post-state equals the pre-state — block.Root() == parent.Root().
  • Bor's sprint/succession model lets a backup producer sign a block at the same height as the primary if the primary is missing. Two empty headers sharing a parent and produced by different signers — or even the same signer with a different timestamp — share a state root while having different hashes/seals.

So same-root + different-hash is a normal, expected condition in Bor, not an attack signal. The check turns it into a hard sync failure that also blacklists the offering peer for that sync attempt; with every honest peer in the network offering the canonical chain that disagrees with the node's local stale head, the node has no way out.
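The condition can be reproduced in miniature. The following is a minimal sketch, not Bor code: the header type and emptyChild helper are hypothetical, and SHA-256 over a naive field concatenation stands in for the real Keccak-256 over the RLP-encoded header. It only illustrates that two empty children of the same parent share a state root while their hashes differ.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// header is a simplified stand-in for a block header (hypothetical type).
type header struct {
	ParentHash [32]byte
	StateRoot  [32]byte
	Time       uint64
	Extra      []byte // vanity plus seal in Bor
}

// hash digests every header field. Assumption: SHA-256 over a plain
// concatenation replaces the real Keccak-256 over the RLP encoding.
func (h header) hash() [32]byte {
	var t [8]byte
	binary.BigEndian.PutUint64(t[:], h.Time)
	buf := make([]byte, 0, 72+len(h.Extra))
	buf = append(buf, h.ParentHash[:]...)
	buf = append(buf, h.StateRoot[:]...)
	buf = append(buf, t[:]...)
	buf = append(buf, h.Extra...)
	return sha256.Sum256(buf)
}

// emptyChild builds the header of a block with no transactions: the state
// root is copied from the parent because nothing was executed.
func emptyChild(parent header, time uint64, extra string) header {
	return header{
		ParentHash: parent.hash(),
		StateRoot:  parent.StateRoot,
		Time:       time,
		Extra:      []byte(extra),
	}
}

func main() {
	parent := header{StateRoot: sha256.Sum256([]byte("state at N-1")), Time: 100}
	// Illustrative labels only; real extraData carries a signature.
	primary := emptyChild(parent, 102, "bor-3|seal-a")
	backup := emptyChild(parent, 104, "bor-1|seal-b")

	fmt.Println("same state root:", primary.StateRoot == backup.StateRoot)
	fmt.Println("same hash:", primary.hash() == backup.hash())
}
```

Any field that feeds the header hash but not the state (timestamp, seal) is enough to split the hashes, which is why the root comparison alone cannot separate a backup producer's block from an attack.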

Why debug_setHead to height N-2 did not recover the node

We tried debug_setHead 0x59f (=1439) on v2. Logs confirm the rewind:

WARN  Rewinding blockchain to block            target=1439
INFO  Truncating from head                     type=state ohead=1442 tail=427 nhead=1441
INFO  Rewound to block with state              number=1440 hash=b1ba2c..a69f69
INFO  Truncating from head                     type=state ohead=1441 tail=427 nhead=1440
INFO  Rewound to block with state              number=1439 hash=280067..874280
INFO  Loaded most recent local block           number=1439 hash=280067..874280 td=2316 age=47m16s

But the warn-drop loop continued. Best guess: SetHead only rewinds the canonical pointer and truncates state, so the side-block record for the old 1441 hash remains in the block database. On the next sync attempt the downloader-side handling re-imports it as a side chain before reconciling with peers, and insertSideChain hits the same condition again. We had to rewind well past the divergence to escape; a much larger setHead eventually let the node resync cleanly.

Suggested directions

Cheapest fix: scope the check to non-trivial blocks. If the candidate block carries no state delta vs its parent, this isn't a shadow-state attack pattern.

if canonical != nil && canonical.Root() == block.Root() {
    // Genuine shadow-state attacks substitute a block whose state diverges
    // from the canonical state at the same height. Two empty blocks with
    // the same parent legitimately share a state root. That isn't an
    // attack, just a different seal/timestamp.
    emptySibling := len(block.Transactions()) == 0 && block.GasUsed() == 0 &&
        block.ParentHash() == canonical.ParentHash() &&
        block.TxHash() == canonical.TxHash() &&
        block.ReceiptHash() == canonical.ReceiptHash()
    if !emptySibling {
        log.Warn("Sidechain ghost-state attack detected", "number", block.NumberU64(),
            "sideroot", block.Root(), "canonroot", canonical.Root())
        return nil, it.index, errors.New("sidechain ghost-state attack")
    }
    // Empty sibling: fall through to normal side-chain insertion.
}
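To check the guard's behaviour without a full node, the condition can be lifted into a standalone predicate. This is a sketch under assumptions: blockInfo, isBenignEmptySibling, and flagsGhostState are hypothetical names, and string fields stand in for the hashes that the real code would read from *types.Block.

```go
package main

import "fmt"

// blockInfo carries only the fields the guard inspects (hypothetical type).
type blockInfo struct {
	ParentHash  string
	Root        string
	TxHash      string
	ReceiptHash string
	TxCount     int
	GasUsed     uint64
}

// isBenignEmptySibling reports whether side is an empty block that shares
// its parent, tx root, and receipt root with the canonical block.
func isBenignEmptySibling(side, canon blockInfo) bool {
	return side.TxCount == 0 && side.GasUsed == 0 &&
		side.ParentHash == canon.ParentHash &&
		side.TxHash == canon.TxHash &&
		side.ReceiptHash == canon.ReceiptHash
}

// flagsGhostState mirrors the proposed check: a matching root is rejected
// unless the side block is a benign empty sibling.
func flagsGhostState(side, canon blockInfo) bool {
	return side.Root == canon.Root && !isBenignEmptySibling(side, canon)
}

func main() {
	canon := blockInfo{ParentHash: "p", Root: "r", TxHash: "empty", ReceiptHash: "empty"}
	emptySibling := canon // only seal/timestamp would differ in practice
	withTxs := blockInfo{ParentHash: "p", Root: "r", TxHash: "t1", ReceiptHash: "rc1", TxCount: 2, GasUsed: 42000}
	otherParent := blockInfo{ParentHash: "q", Root: "r", TxHash: "empty", ReceiptHash: "empty"}

	fmt.Println(flagsGhostState(emptySibling, canon)) // false
	fmt.Println(flagsGhostState(withTxs, canon))      // true
	fmt.Println(flagsGhostState(otherParent, canon))  // true
}
```

The third case shows the guard stays conservative: an empty block on a different parent with a matching root is still rejected, so the exemption only covers siblings of the canonical block.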

Stronger fix: the upstream check is a pre-merge defense. Bor's consensus model (validator-signed headers, Heimdall-anchored milestones, checkpoint-based finality on L1) doesn't rely on this in-line root match to defend against state-pruning shadow attacks. It may be reasonable to drop the check entirely in insertSideChain for Bor and lean on existing Bor-specific verification (sealer authorization, sprint rules, milestone whitelist) instead. Worth a security review before doing this — flagging as a path the reviewer should weigh in on, not a recommendation.

Either way, the recovery path needs work too: even after debug_setHead puts the head behind the offending block, the side-chain record can re-trigger the same condition on the next sync. Investigating whether SetHead should also evict diverged side-blocks at heights > target would help operators recover without having to rewind a "large number" of blocks past the divergence.
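The eviction idea can be pictured with a toy model. This is purely illustrative and rests on the unconfirmed guess above that side-block records survive a rewind: the store type and its layout are hypothetical and do not reflect Bor's actual database schema or SetHead implementation.

```go
package main

import "fmt"

// store is a toy block database: every known block per height, plus the
// canonical block per height (hypothetical; not Bor's rawdb layout).
type store struct {
	byHeight  map[uint64][]string
	canonical map[uint64]string
	head      uint64
}

// setHead rewinds the canonical head to target. With evict set, it also
// deletes every stored block above target, canonical or not.
func (s *store) setHead(target uint64, evict bool) {
	if target >= s.head {
		return // nothing to rewind
	}
	for h := s.head; h > target; h-- {
		delete(s.canonical, h)
		if evict {
			delete(s.byHeight, h)
		}
	}
	s.head = target
}

func newStore() *store {
	return &store{
		byHeight: map[uint64][]string{
			1439: {"280067"},
			1440: {"b1ba2c"},
			1441: {"a8397a"}, // v2's stale local block
		},
		canonical: map[uint64]string{1439: "280067", 1440: "b1ba2c", 1441: "a8397a"},
		head:      1441,
	}
}

func main() {
	plain := newStore()
	plain.setHead(1439, false)
	fmt.Println("without eviction, records left at 1441:", len(plain.byHeight[1441]))

	evicting := newStore()
	evicting.setHead(1439, true)
	fmt.Println("with eviction, records left at 1441:", len(evicting.byHeight[1441]))
}
```

In this model the non-evicting rewind leaves the stale 1441 record in place, which is the state the node would re-import from; the evicting variant removes it along with the canonical pointer.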

Workaround for operators

debug_setHead to a height well before the divergence (not just one or two blocks behind). RPC timeout on the call is normal for large rewinds; the operation continues server-side. After the rewind, expect a full resync from that height.
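debug_setHead takes the target height as a hex string, so the rewind target has to be converted first. A small sketch of building the JSON-RPC body follows; setHeadRequest and its margin parameter are hypothetical, and the margin of 1000 is an arbitrary example, since this report does not record how far the successful rewind went.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// rpcRequest is a minimal JSON-RPC 2.0 request envelope.
type rpcRequest struct {
	JSONRPC string   `json:"jsonrpc"`
	Method  string   `json:"method"`
	Params  []string `json:"params"`
	ID      int      `json:"id"`
}

// setHeadRequest builds the body for debug_setHead, targeting `margin`
// blocks behind the stuck height and clamping at genesis.
func setHeadRequest(stuck, margin uint64) ([]byte, error) {
	target := uint64(0)
	if stuck > margin {
		target = stuck - margin
	}
	return json.Marshal(rpcRequest{
		JSONRPC: "2.0",
		Method:  "debug_setHead",
		Params:  []string{fmt.Sprintf("0x%x", target)},
		ID:      1,
	})
}

func main() {
	// Arbitrary example margin; choose one well past the divergence.
	body, err := setHeadRequest(1441, 1000)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```

With a margin of 2 the same helper yields 0x59f, the target used in the failed attempt described earlier. The body can be posted to the node's HTTP RPC endpoint with any client, provided the debug namespace is enabled there.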

Artifacts

  • commands.out — full operator session capture (script(1) format, ANSI included)
  • chaos.out.txt — chaos-runner stdout/stderr from the partition scenario
  • reports/test-20260513-131241-test-1778692361/ — scenario.yaml, report.json, container logs, prom snapshots
  • pos--683f2274028c459982665e777b2bcdc9/ — full Kurtosis enclave dump
