# Bor sync wedges on empty block after producer restart — `insertSideChain` same-stateRoot false-positive

## Summary

After restarting a Bor validator whose local head is an empty (no-transaction) block produced just before the restart, Bor can refuse to sync past that block, dropping every peer with:

    WARN Sidechain ghost-state attack detected   number=N sideroot=X canonroot=X
    WARN Synchronisation failed, dropping peer   err="retrieved hash chain is invalid: sidechain ghost-state attack" mode=full

`sideroot` and `canonroot` are **the same value**: the check at `core/blockchain.go:3562` rejects the block precisely because the two state roots match. The upstream-inherited logic assumes that the same root with a different hash implies a shadow-state attack, which holds only when the block carried a state transition. For empty blocks (`gasUsed=0`, `transactions=[]`) the state root is just the parent's, so two distinct empty headers at the same height legitimately share a state root while having different hashes and seals.
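The point about empty blocks can be illustrated with a toy model. This is only a sketch under stated assumptions: `toyHeader`, `emptyChild`, and the SHA-256 hashing are hypothetical stand-ins (real Bor headers are RLP-encoded and hashed with Keccak-256, and carry many more fields). It shows two empty siblings on the same parent ending up with one state root and two block hashes.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// toyHeader is a hypothetical, heavily simplified stand-in for a block header.
type toyHeader struct {
	ParentHash [32]byte
	StateRoot  [32]byte
	Time       uint64
	Extra      []byte // vanity + seal in Bor; opaque bytes in this sketch
}

// hash digests every field, so a change to Time or Extra changes the result.
// SHA-256 over a fixed layout is an assumption made to stay stdlib-only.
func (h toyHeader) hash() [32]byte {
	var ts [8]byte
	binary.BigEndian.PutUint64(ts[:], h.Time)
	buf := append([]byte{}, h.ParentHash[:]...)
	buf = append(buf, h.StateRoot[:]...)
	buf = append(buf, ts[:]...)
	buf = append(buf, h.Extra...)
	return sha256.Sum256(buf)
}

// emptyChild builds a block with no transactions on top of parent. The state
// is untouched, so the child's state root is the parent's state root.
func emptyChild(parent toyHeader, time uint64, extra []byte) toyHeader {
	return toyHeader{
		ParentHash: parent.hash(),
		StateRoot:  parent.StateRoot,
		Time:       time,
		Extra:      extra,
	}
}

func main() {
	parent := toyHeader{StateRoot: sha256.Sum256([]byte("state at N-1")), Time: 100}
	a := emptyChild(parent, 102, []byte("seal by primary"))
	b := emptyChild(parent, 106, []byte("seal by backup"))

	fmt.Println("same state root:", a.StateRoot == b.StateRoot) // true
	fmt.Println("same block hash:", a.hash() == b.hash())       // false
}
```

Any check that treats "roots equal, hashes differ" as hostile will flag this pair, even though neither block changed any state.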

A node in this state cannot recover by reconnecting, by restarting Bor, or by calling `debug_setHead` with a block immediately behind the offending height. The side-block record appears to persist in the database and re-triggers the same check on the next sync attempt.

## Environment

- Bor `v2.8.0-beta`
- Devnet spawned via `kurtosis-pos`, 9 L2 nodes:
  - v1–v5: bor + heimdall-v2 validators
  - v6–v8: bor RPC nodes
  - v9: erigon RPC node
- Heimdall span at the time of the incident (span id 11, blocks 1408–1535) had a single entry in `selected_producers` (val_id 3). With a small validator set and weighted span selection, one validator can end up as the sole producer for a span.

## Reproduction

1.  Spawn a devnet in which some spans have a single `selected_producers` entry (the kurtosis-pos default tends to produce this with small validator counts).
2.  Wait for an active span where one validator (call it `vP`) is the sole producer.
3.  Stop `vP`'s bor EL and a second validator's bor EL (`vS`) within a second of each other, with `vS`'s last canonical head being an empty block recently produced by `vP`.
4.  After a few seconds, start both back up.

Expected: `vS` resyncs to chain head.

Observed: `vS` stays stuck at its last pre-stop block (block number `N`). `docker logs` for `vS`'s bor container streams the warn-drop pair above on every peer it talks to, plus continuous `"Whitelisting milestone deferred err=chain out of sync"` from the heimdall ws subscription.
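To confirm that a node is in this state and to find the height `N` it is stuck at, the warning lines can be scanned mechanically. The following is a minimal sketch, assuming the `key=value` layout shown in the log excerpts in this report. `stuckHeight` is a hypothetical helper, not part of Bor, and the comma stripping is a precaution in case the log formatter inserts thousands separators at larger heights.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

const ghostStateMsg = "Sidechain ghost-state attack detected"

// stuckHeight returns the value of the number= field from a ghost-state
// warning line. ok is false for any other line or an unparsable number.
func stuckHeight(line string) (height uint64, ok bool) {
	if !strings.Contains(line, ghostStateMsg) {
		return 0, false
	}
	for _, tok := range strings.Fields(line) {
		key, val, found := strings.Cut(tok, "=")
		if !found || key != "number" {
			continue
		}
		// Precaution (assumption): drop thousands separators if present.
		n, err := strconv.ParseUint(strings.ReplaceAll(val, ",", ""), 10, 64)
		return n, err == nil
	}
	return 0, false
}

func main() {
	line := "WARN Sidechain ghost-state attack detected number=1441 " +
		"sideroot=8f58ec..2dcc11 canonroot=8f58ec..2dcc11"
	if n, ok := stuckHeight(line); ok {
		fmt.Println("node is wedged at block", n)
	}
}
```

Feeding it the container log line by line (for example through `bufio.Scanner` on stdin) gives the height to plan a rewind around.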

## Concrete trace from our run

Captured artifacts: `commands.out`, `chaos.out.txt`, `reports/test-20260513-131241-test-1778692361/`. Key timeline:

| Time (EDT) | Event |
|----|----|
| 13:19:04 | Stop `l2-el-2-bor-heimdall-v2-validator` (v2). v2's local head: block 1441, produced by v3, stateRoot `0x8f58ec…2dcc11`, `transactions: []`, `gasUsed: 0`. |
| 13:19:11 | Stop `l2-el-3-bor-heimdall-v2-validator` (v3). v3 was the sole producer for span 11. |
| 13:22:31 | Start v2. |
| 13:22:37 | Start v3. v3 resumes producing 1442, 1443, … |
| 13:34:41 onward | v2 logs `Sidechain ghost-state attack detected number=1441 sideroot=8f58ec..2dcc11 canonroot=8f58ec..2dcc11` against every peer; drops them; stays stuck at 1441 while v1/v3/v4/v5 advance past 2100. |

Block 1441 on v2 (its local canonical):

    number       1441
    hash         0xa8397a7be178763f882deb7f893b28c326cc4ddf038296a073cc5e6ea597826e
    parentHash   0xb1ba2c9854ea1d575173714ebcb730e0e4c6f385702b6cd81a00250d38a69f69
    stateRoot    0x8f58ec83616416cdae4e71306e5ef9f7482facf9328957b8c28d6468332dcc11
    transactions []
    gasUsed      0
    extraData    0x626f722d3300…   ("bor-3" vanity, then seal)

The 1441 the rest of the cluster has at the same height shares `parentHash`, `stateRoot`, `transactionsRoot`, `receiptsRoot` — the state didn't change — but has a different `timestamp` and `extraData` seal (likely produced by a succession-1+ backup after v3 stopped), so its block hash differs.
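When comparing the two competing headers, the readable part of `extraData` is a quick way to see which node sealed each one. A small sketch of decoding the visible prefix from the dump above follows. `vanityText` is a hypothetical helper, and it only handles the leading bytes shown in this report; the seal that follows the vanity is not parsed.

```go
package main

import (
	"bytes"
	"encoding/hex"
	"fmt"
	"strings"
)

// vanityText decodes a hex extraData prefix and returns the bytes before the
// first zero byte as text.
func vanityText(extraPrefix string) (string, error) {
	raw, err := hex.DecodeString(strings.TrimPrefix(extraPrefix, "0x"))
	if err != nil {
		return "", err
	}
	if i := bytes.IndexByte(raw, 0); i >= 0 {
		raw = raw[:i] // the vanity is zero-padded; keep only the text
	}
	return string(raw), nil
}

func main() {
	text, err := vanityText("0x626f722d3300")
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Printf("%q\n", text) // prints "bor-3"
}
```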

## Root cause

`core/blockchain.go`, `insertSideChain` (current `develop`):

``` go
// blockchain.go:3534
func (bc *BlockChain) insertSideChain(block *types.Block, it *insertIterator, makeWitness bool) (*stateless.Witness, int, error) {
    ...
    for ; block != nil && errors.Is(err, consensus.ErrPrunedAncestor); block, err = it.next() {
        headers = append(headers, block.Header())
        if number := block.NumberU64(); current.Number.Uint64() >= number {
            canonical := bc.GetBlockByNumber(number)
            if canonical != nil && canonical.Hash() == block.Hash() {
                // re-import of a canon block, fine
                continue
            }
            if canonical != nil && canonical.Root() == block.Root() {       // <-- false positive
                log.Warn("Sidechain ghost-state attack detected", "number", block.NumberU64(),
                    "sideroot", block.Root(), "canonroot", canonical.Root())
                return nil, it.index, errors.New("sidechain ghost-state attack")
            }
        }
        ...
    }
    ...
}
```

This is upstream go-ethereum's pre-merge defense against attackers side-mining to a height where state was pruned and substituting their block to bypass full state verification. The premise is that a benign sidechain block at height `N` would produce a *different* state root from the canonical chain (because the txs at heights `<= N` differ across the two chains).

That premise doesn't hold for Bor:

- Bor produces empty blocks (no txs) during quiet periods. For an empty block, the post-state equals the pre-state — `block.Root() == parent.Root()`.
- Bor's sprint/succession model lets a backup producer sign a block at the same height as the primary if the primary is missing. Two empty headers sharing a parent and produced by different signers — or even the same signer with a different timestamp — share a state root while having different hashes/seals.

So `same-root + different-hash` is a normal, expected condition in Bor, not an attack signal. The check turns it into a hard sync failure that also blacklists the offering peer for that sync attempt; with every honest peer in the network offering the canonical chain that disagrees with the node's local stale head, the node has no way out.

## Why `debug_setHead` to height `N-2` did not recover the node

We tried `debug_setHead 0x59f` (=1439) on v2. Logs confirm the rewind:

    WARN  Rewinding blockchain to block            target=1439
    INFO  Truncating from head                     type=state ohead=1442 tail=427 nhead=1441
    INFO  Rewound to block with state              number=1440 hash=b1ba2c..a69f69
    INFO  Truncating from head                     type=state ohead=1441 tail=427 nhead=1440
    INFO  Rewound to block with state              number=1439 hash=280067..874280
    INFO  Loaded most recent local block           number=1439 hash=280067..874280 td=2316 age=47m16s

But the warn-drop loop continued. Best guess: the side-block record for the old 1441 hash remains in the block database after `SetHead` (only the canonical pointer was rewound and state was truncated), and the downloader-side handling re-imports it as a side chain on the next sync attempt before reconciling with peers, so `insertSideChain` keeps hitting the same condition. We had to rewind well past the divergence to escape — a much larger setHead eventually let the node resync cleanly.

## Suggested directions

Cheapest fix: scope the check to non-trivial blocks. If the candidate block carries no state delta vs its parent, this isn't a shadow-state attack pattern.

``` go
if canonical != nil && canonical.Root() == block.Root() {
    // Genuine shadow-state attacks substitute a block whose state diverges
    // from the canonical state at the same height. Two empty blocks with
    // the same parent legitimately share a state root: that isn't an
    // attack, just a different seal/timestamp.
    benignEmptySibling := len(block.Transactions()) == 0 && block.GasUsed() == 0 &&
        block.ParentHash() == canonical.ParentHash() &&
        block.TxHash() == canonical.TxHash() &&
        block.ReceiptHash() == canonical.ReceiptHash()
    if !benignEmptySibling {
        log.Warn("Sidechain ghost-state attack detected", "number", block.NumberU64(),
            "sideroot", block.Root(), "canonroot", canonical.Root())
        return nil, it.index, errors.New("sidechain ghost-state attack")
    }
    // Otherwise fall through to normal side-chain insertion.
}
```
```
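To make the intended behavior of that predicate easier to review, here is a stand-alone sketch that exercises the same conditions without pulling in Bor. Everything in it is hypothetical scaffolding: `blockView` flattens the header fields the check reads into strings, and `classify` and the `verdict` names are illustrative, not existing Bor identifiers.

```go
package main

import "fmt"

// blockView is a hypothetical flattened view of the fields the check reads.
type blockView struct {
	Hash, ParentHash, Root, TxHash, ReceiptHash string
	TxCount                                     int
	GasUsed                                     uint64
}

type verdict string

const (
	reimport            verdict = "re-import of canonical block"
	ordinarySideBlock   verdict = "ordinary side block (roots differ)"
	benignEmptySibling  verdict = "benign empty sibling"
	suspectedGhostState verdict = "suspected ghost-state attack"
)

// classify mirrors the order of the checks in the proposed patch.
func classify(canonical, side blockView) verdict {
	switch {
	case canonical.Hash == side.Hash:
		return reimport
	case canonical.Root != side.Root:
		return ordinarySideBlock
	case side.TxCount == 0 && side.GasUsed == 0 &&
		side.ParentHash == canonical.ParentHash &&
		side.TxHash == canonical.TxHash &&
		side.ReceiptHash == canonical.ReceiptHash:
		return benignEmptySibling
	default:
		return suspectedGhostState
	}
}

func main() {
	// Placeholder labels, not real hashes.
	canonical := blockView{Hash: "hashA", ParentHash: "parentP", Root: "rootR",
		TxHash: "emptyTx", ReceiptHash: "emptyRcpt"}

	sibling := canonical
	sibling.Hash = "hashB" // different seal/timestamp, nothing else changed

	withTx := sibling
	withTx.TxCount, withTx.GasUsed, withTx.TxHash = 1, 21000, "otherTx"

	fmt.Println(classify(canonical, canonical))
	fmt.Println(classify(canonical, sibling))
	fmt.Println(classify(canonical, withTx))
}
```

The case from this report (same parent, same roots, no transactions, different hash) lands in the benign bucket, while a same-root block that carries transactions still trips the original warning.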

Stronger fix: the upstream check is a pre-merge defense. Bor's consensus model (validator-signed headers, Heimdall-anchored milestones, checkpoint-based finality on L1) doesn't rely on this in-line root match to defend against state-pruning shadow attacks. It may be reasonable to drop the check entirely in `insertSideChain` for Bor and lean on existing Bor-specific verification (sealer authorization, sprint rules, milestone whitelist) instead. Worth a security review before doing this — flagging as a path the reviewer should weigh in on, not a recommendation.

Either way, the recovery path needs work too: even after `debug_setHead` puts the head behind the offending block, the side-chain record can re-trigger the same condition on the next sync. It is worth investigating whether `SetHead` should also evict diverged side-blocks at heights `> target`, so that operators can recover without rewinding far past the divergence.
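The eviction idea can be sketched on a toy in-memory store. This models the hypothesis from this report only, not Bor's actual `SetHead` or database layout: `toyStore`, `setHead`, and `setHeadAndEvict` are hypothetical names, and the assumption that a rewind leaves block records behind is exactly the unverified guess described earlier.

```go
package main

import "fmt"

type block struct {
	Number uint64
	Hash   string
}

// toyStore keeps block records by hash and canonical pointers by height.
type toyStore struct {
	blocks map[string]block
	canon  map[uint64]string
	head   uint64
}

func newToyStore() *toyStore {
	return &toyStore{blocks: map[string]block{}, canon: map[uint64]string{}}
}

func (s *toyStore) insertCanonical(b block) {
	s.blocks[b.Hash] = b
	s.canon[b.Number] = b.Hash
	if b.Number > s.head {
		s.head = b.Number
	}
}

// setHead models the suspected behavior: only canonical pointers above
// target are dropped; the block records themselves stay in the store.
func (s *toyStore) setHead(target uint64) {
	if target >= s.head {
		return
	}
	for n := s.head; n > target; n-- {
		delete(s.canon, n)
	}
	s.head = target
}

// setHeadAndEvict additionally removes every block record above target.
func (s *toyStore) setHeadAndEvict(target uint64) {
	s.setHead(target)
	for hash, b := range s.blocks {
		if b.Number > target {
			delete(s.blocks, hash)
		}
	}
}

// leftoversAbove counts block records that survive above the given height.
func (s *toyStore) leftoversAbove(height uint64) int {
	n := 0
	for _, b := range s.blocks {
		if b.Number > height {
			n++
		}
	}
	return n
}

func main() {
	plain, evicting := newToyStore(), newToyStore()
	for _, s := range []*toyStore{plain, evicting} {
		for n := uint64(1439); n <= 1441; n++ {
			s.insertCanonical(block{Number: n, Hash: fmt.Sprintf("local-%d", n)})
		}
	}
	plain.setHead(1439)
	evicting.setHeadAndEvict(1439)
	fmt.Println("records above target after setHead:        ", plain.leftoversAbove(1439))
	fmt.Println("records above target after setHeadAndEvict:", evicting.leftoversAbove(1439))
}
```

In the plain variant the stale record for the old block at height 1441 is still present after the rewind, which is the situation that would let the side-chain check fire again.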

## Workaround for operators

Call `debug_setHead` with a height well before the divergence (not just one or two blocks behind it). An RPC timeout on the call is normal for large rewinds; the operation continues server-side. After the rewind, expect a full resync from that height.
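For scripting the workaround, the request body can be built as below. This is a sketch that only constructs the JSON-RPC payload; it assumes the node exposes the `debug` namespace on some HTTP endpoint (not shown here), and `setHeadRequest` is a hypothetical helper. The height 1439 is used purely because its hex form `0x59f` appears in this report; as described above, that particular rewind was not deep enough.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// rpcRequest is a minimal JSON-RPC 2.0 request envelope.
type rpcRequest struct {
	JSONRPC string   `json:"jsonrpc"`
	ID      int      `json:"id"`
	Method  string   `json:"method"`
	Params  []string `json:"params"`
}

// setHeadRequest encodes a debug_setHead call; the height is passed as a
// 0x-prefixed hex string, matching the `debug_setHead 0x59f` call above.
func setHeadRequest(height uint64) ([]byte, error) {
	return json.Marshal(rpcRequest{
		JSONRPC: "2.0",
		ID:      1,
		Method:  "debug_setHead",
		Params:  []string{fmt.Sprintf("0x%x", height)},
	})
}

func main() {
	body, err := setHeadRequest(1439)
	if err != nil {
		fmt.Println("encode error:", err)
		return
	}
	// prints {"jsonrpc":"2.0","id":1,"method":"debug_setHead","params":["0x59f"]}
	fmt.Println(string(body))
}
```

When posting this body, use a generous client timeout or treat a timeout as non-fatal, since the rewind keeps running on the node.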

## Artifacts

- `commands.out` — full operator session capture (script(1) format, ANSI included)
- `chaos.out.txt` — chaos-runner stdout/stderr from the partition scenario
- `reports/test-20260513-131241-test-1778692361/` — scenario.yaml, report.json, container logs, prom snapshots
- `pos--683f2274028c459982665e777b2bcdc9/` — full Kurtosis enclave dump

Bor sync wedges on empty block after producer restart — insertSideChain same-stateRoot false-positive #2224
