Pruned FGW nodes can accumulate class_definitions.definition IS NULL rows and never self-heal #3287

@vladimir-tikhonov

Description

WARNING:

The issue text below is LLM-generated. This is, however, a real issue that I observed while upgrading my nodes, and the LLM was able to fix it, so the report seems legit.

Summary

Two pruned nodes ended up with thousands of rows in class_definitions where definition IS NULL.

Once those rows exist, the feeder-gateway sync path appears to treat the classes as already present and never retries fetching them, so the database stays permanently inconsistent unless repaired manually.

Impact

On affected nodes, execution starts failing for some contracts with errors like:

Internal error in execution state reader error=Querying for class definition

We also saw starknet_getClass fail for affected class hashes.

Restarting the node does not fix it. Upgrading does not fix it. The rows just stay there.

Environment

Observed on:

  • v0.22.0
  • pruned nodes
  • --storage.blockchain-history=1024
  • --storage.state-tries=0
  • feeder-gateway sync path

What we found

On one node:

select count(*) from class_definitions where definition is null;
-- 8100

select min(block_number), max(block_number)
from class_definitions
where definition is null;
-- 146, 261858

On the other node:

select count(*) from class_definitions where definition is null;
-- 8101

So this is not just a tip/pruning-edge problem. The broken rows were spread across a large historical range.

Why this does not look like normal pruning

From the code, pruning appears to preserve the class blob and only detach the historical block number.

Relevant path:

  • crates/storage/src/connection/block.rs

That means a healthy pruned row should look like:

  • definition IS NOT NULL
  • block_number IS NULL

Our broken rows looked like:

  • definition IS NULL
  • block_number IS NOT NULL

So pruning itself does not seem to explain the missing blobs.
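The distinction above can be checked directly with a query that separates the two row shapes. A minimal sketch, using an in-memory SQLite table with the column names assumed from the queries in this report (not pathfinder's actual schema):

```python
import sqlite3

# Assumed shape of class_definitions: (hash, definition, block_number).
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE class_definitions (hash TEXT, definition BLOB, block_number INTEGER)"
)
db.executemany(
    "INSERT INTO class_definitions VALUES (?, ?, ?)",
    [
        ("0xaaa", b"blob", 100),   # normal, unpruned row
        ("0xbbb", b"blob", None),  # healthy pruned row: blob kept, block number detached
        ("0xccc", None, 146),      # broken row: blob missing, block number still set
    ],
)

healthy_pruned = db.execute(
    "SELECT count(*) FROM class_definitions "
    "WHERE definition IS NOT NULL AND block_number IS NULL"
).fetchone()[0]
broken = db.execute(
    "SELECT count(*) FROM class_definitions "
    "WHERE definition IS NULL AND block_number IS NOT NULL"
).fetchone()[0]
print(healthy_pruned, broken)  # 1 1
```

The second query is what we ran on the affected nodes to count the broken rows.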

Likely failure mode

From reading the code, the sequence looks like this:

  1. A state update inserts a placeholder row into class_definitions with hash and block_number, but no definition blob yet.
  2. Later, class download/persist should fill in definition.
  3. If that second step does not happen for any reason, the placeholder row remains in the database.
  4. After that, normal feeder-gateway sync no longer retries that class, because it appears to only check whether a row exists, not whether the definition blob is populated.

Relevant paths:

  • crates/storage/src/connection/state_update.rs
  • crates/pathfinder/src/state/sync/l2.rs (download_new_classes)
  • crates/storage/src/connection/class.rs (class_definitions_exist)

The check in the FGW sync path is effectively:

SELECT 1 FROM class_definitions WHERE hash = ?

So a placeholder row with definition IS NULL is treated the same as a fully populated class definition row.

There does appear to be logic in the checkpoint sync path to look for missing class definitions, but the normal FGW/state sync path does not seem to repair already-broken definition IS NULL rows.
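A toy reproduction of why the check never retries, again on the assumed schema rather than pathfinder's real code. The current-style check sees the placeholder row and reports the class as present; adding `AND definition IS NOT NULL` would report it as missing, so sync would refetch it:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE class_definitions (hash TEXT, definition BLOB, block_number INTEGER)"
)
# Placeholder row: the state update inserted hash + block_number,
# but the definition blob never arrived.
db.execute("INSERT INTO class_definitions VALUES ('0xdead', NULL, 146)")

def exists_current(class_hash):
    # Effectively what the FGW sync path checks today.
    row = db.execute(
        "SELECT 1 FROM class_definitions WHERE hash = ?", (class_hash,)
    ).fetchone()
    return row is not None

def exists_fixed(class_hash):
    # Proposed: a NULL definition counts as missing, so sync retries it.
    row = db.execute(
        "SELECT 1 FROM class_definitions WHERE hash = ? AND definition IS NOT NULL",
        (class_hash,),
    ).fetchone()
    return row is not None

print(exists_current("0xdead"), exists_fixed("0xdead"))  # True False
```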

Workaround we tested

We hotfixed pathfinder locally to do two things:

  • treat definition IS NULL as missing in FGW sync
  • add a repair pass that scans class_definitions WHERE definition IS NULL and backfills them from feeder-gateway

That immediately started draining the broken rows and the nodes recovered.

One extra wrinkle we hit: for old Cairo 0 classes, the JSON returned by feeder-gateway can re-hash differently today, so a repair path cannot rely on the recomputed hash to choose which row to update. It has to write the fetched definition back into the originally requested DB row.
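The repair pass we hotfixed boils down to the loop below. This is a sketch of the logic only, not pathfinder code: `fetch_definition` is a hypothetical stand-in for the feeder-gateway fetch, and the schema is the assumed one from this report. The key point is the UPDATE keyed on the originally requested hash:

```python
import sqlite3

def fetch_definition(class_hash):
    # Hypothetical stand-in for a feeder-gateway fetch; returns the raw
    # definition bytes for the requested class hash.
    return b'{"program": "..."}'

def repair_missing_definitions(db):
    # Scan for placeholder rows and backfill each one. Crucially, write the
    # fetched blob back under the hash we asked for: for old Cairo 0 classes
    # the returned JSON can re-hash differently today, so the recomputed hash
    # must not be used to pick the row to update.
    rows = db.execute(
        "SELECT hash FROM class_definitions WHERE definition IS NULL"
    ).fetchall()
    for (class_hash,) in rows:
        definition = fetch_definition(class_hash)
        db.execute(
            "UPDATE class_definitions SET definition = ? WHERE hash = ?",
            (definition, class_hash),
        )
    return len(rows)

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE class_definitions (hash TEXT, definition BLOB, block_number INTEGER)"
)
db.execute("INSERT INTO class_definitions VALUES ('0xdead', NULL, 146)")
db.execute("INSERT INTO class_definitions VALUES ('0xbeef', X'00', 200)")
print(repair_missing_definitions(db))  # 1
remaining = db.execute(
    "SELECT count(*) FROM class_definitions WHERE definition IS NULL"
).fetchone()[0]
print(remaining)  # 0
```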

What seems worth fixing

  • In FGW sync, treat definition IS NULL the same as "missing"
  • Add an automatic repair path for already-broken rows
  • Ideally run that repair in the background so startup is not blocked
  • Add a metric/log for the number of missing class definitions, because this is otherwise very hard to see until execution starts failing

Why I'm opening this

This seems like a real consistency bug for pruned nodes, and it looks permanent once the DB reaches that state. We worked around it locally, but it would be good to have a proper fix upstream.

Happy to provide more details if useful.
