Pruned FGW nodes can accumulate class_definitions.definition IS NULL rows and never self-heal #3287

@vladimir-tikhonov

Description

WARNING:

The issue text below is LLM-generated. This is, however, a real issue that I observed while upgrading my nodes, and the LLM was able to fix it, so the report seems legit.

Summary

Two pruned nodes ended up with thousands of rows in class_definitions where definition IS NULL.

Once those rows exist, the feeder-gateway sync path appears to treat the classes as already present and never retries fetching them, so the database stays permanently inconsistent unless repaired manually.

Impact

On affected nodes, execution starts failing for some contracts with errors like:

Internal error in execution state reader error=Querying for class definition

We also saw starknet_getClass fail for affected class hashes.

Restarting the node does not fix it. Upgrading does not fix it. The rows just stay there.

Environment

Observed on:

  • v0.22.0
  • pruned nodes
  • --storage.blockchain-history=1024
  • --storage.state-tries=0
  • feeder-gateway sync path

What we found

On one node:

select count(*) from class_definitions where definition is null;
-- 8100

select min(block_number), max(block_number)
from class_definitions
where definition is null;
-- 146, 261858

On the other node:

select count(*) from class_definitions where definition is null;
-- 8101

So this is not just a tip/pruning-edge problem. The broken rows were spread across a large historical range.

Why this does not look like normal pruning

From the code, pruning appears to preserve the class blob and only detach the historical block number.

Relevant path:

  • crates/storage/src/connection/block.rs

That means a healthy pruned row should look like:

  • definition IS NOT NULL
  • block_number IS NULL

Our broken rows looked like:

  • definition IS NULL
  • block_number IS NOT NULL

So pruning itself does not seem to explain the missing blobs.
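The distinction above can be checked directly with a query that separates the two row shapes. A minimal sketch, using an in-memory SQLite table with the column names assumed from the queries in this report (not pathfinder's actual schema):

```python
import sqlite3

# Assumed shape of class_definitions: (hash, definition, block_number).
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE class_definitions (hash TEXT, definition BLOB, block_number INTEGER)"
)
db.executemany(
    "INSERT INTO class_definitions VALUES (?, ?, ?)",
    [
        ("0xaaa", b"blob", 100),   # normal, unpruned row
        ("0xbbb", b"blob", None),  # healthy pruned row: blob kept, block number detached
        ("0xccc", None, 146),      # broken row: blob missing, block number still set
    ],
)

healthy_pruned = db.execute(
    "SELECT count(*) FROM class_definitions "
    "WHERE definition IS NOT NULL AND block_number IS NULL"
).fetchone()[0]
broken = db.execute(
    "SELECT count(*) FROM class_definitions "
    "WHERE definition IS NULL AND block_number IS NOT NULL"
).fetchone()[0]
print(healthy_pruned, broken)  # 1 1
```

The second query is what we ran on the affected nodes to count the broken rows.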

Likely failure mode

From reading the code, the sequence looks like this:

  1. A state update inserts a placeholder row into class_definitions with hash and block_number, but no definition blob yet.
  2. Later, class download/persist should fill in definition.
  3. If that second step does not happen for any reason, the placeholder row remains in the database.
  4. After that, normal feeder-gateway sync no longer retries that class, because it appears to only check whether a row exists, not whether the definition blob is populated.

Relevant paths:

  • crates/storage/src/connection/state_update.rs
  • crates/pathfinder/src/state/sync/l2.rs (download_new_classes)
  • crates/storage/src/connection/class.rs (class_definitions_exist)

The check in the FGW sync path is effectively:

SELECT 1 FROM class_definitions WHERE hash = ?

So a placeholder row with definition IS NULL is treated the same as a fully populated class definition row.

There does appear to be logic in the checkpoint sync path to look for missing class definitions, but the normal FGW/state sync path does not seem to repair already-broken definition IS NULL rows.
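A toy reproduction of why the check never retries, again on the assumed schema rather than pathfinder's real code. The current-style check sees the placeholder row and reports the class as present; adding `AND definition IS NOT NULL` would report it as missing, so sync would refetch it:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE class_definitions (hash TEXT, definition BLOB, block_number INTEGER)"
)
# Placeholder row: the state update inserted hash + block_number,
# but the definition blob never arrived.
db.execute("INSERT INTO class_definitions VALUES ('0xdead', NULL, 146)")

def exists_current(class_hash):
    # Effectively what the FGW sync path checks today.
    row = db.execute(
        "SELECT 1 FROM class_definitions WHERE hash = ?", (class_hash,)
    ).fetchone()
    return row is not None

def exists_fixed(class_hash):
    # Proposed: a NULL definition counts as missing, so sync retries it.
    row = db.execute(
        "SELECT 1 FROM class_definitions WHERE hash = ? AND definition IS NOT NULL",
        (class_hash,),
    ).fetchone()
    return row is not None

print(exists_current("0xdead"), exists_fixed("0xdead"))  # True False
```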

Workaround we tested

We hotfixed pathfinder locally to do two things:

  • treat definition IS NULL as missing in FGW sync
  • add a repair pass that scans class_definitions WHERE definition IS NULL and backfills them from feeder-gateway

That immediately started draining the broken rows and the nodes recovered.

One extra wrinkle we hit: for old Cairo 0 classes, the JSON returned by feeder-gateway can re-hash differently today, so a repair path cannot rely on the recomputed hash to choose which row to update. It has to write the fetched definition back into the originally requested DB row.
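The repair pass we hotfixed boils down to the loop below. This is a sketch of the logic only, not pathfinder code: `fetch_definition` is a hypothetical stand-in for the feeder-gateway fetch, and the schema is the assumed one from this report. The key point is the UPDATE keyed on the originally requested hash:

```python
import sqlite3

def fetch_definition(class_hash):
    # Hypothetical stand-in for a feeder-gateway fetch; returns the raw
    # definition bytes for the requested class hash.
    return b'{"program": "..."}'

def repair_missing_definitions(db):
    # Scan for placeholder rows and backfill each one. Crucially, write the
    # fetched blob back under the hash we asked for: for old Cairo 0 classes
    # the returned JSON can re-hash differently today, so the recomputed hash
    # must not be used to pick the row to update.
    rows = db.execute(
        "SELECT hash FROM class_definitions WHERE definition IS NULL"
    ).fetchall()
    for (class_hash,) in rows:
        definition = fetch_definition(class_hash)
        db.execute(
            "UPDATE class_definitions SET definition = ? WHERE hash = ?",
            (definition, class_hash),
        )
    return len(rows)

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE class_definitions (hash TEXT, definition BLOB, block_number INTEGER)"
)
db.execute("INSERT INTO class_definitions VALUES ('0xdead', NULL, 146)")
db.execute("INSERT INTO class_definitions VALUES ('0xbeef', X'00', 200)")
print(repair_missing_definitions(db))  # 1
remaining = db.execute(
    "SELECT count(*) FROM class_definitions WHERE definition IS NULL"
).fetchone()[0]
print(remaining)  # 0
```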

What seems worth fixing

  • In FGW sync, treat definition IS NULL the same as "missing"
  • Add an automatic repair path for already-broken rows
  • Ideally run that repair in the background so startup is not blocked
  • Add a metric/log for the number of missing class definitions, because this is otherwise very hard to see until execution starts failing

Why I'm opening this

This seems like a real consistency bug for pruned nodes, and it looks permanent once the DB reaches that state. We worked around it locally, but it would be good to have a proper fix upstream.

Happy to provide more details if useful.
