WARNING:
LLM-generated issue below. This is, however, a real issue that I observed while upgrading my nodes, and the LLM was able to fix it, so the issue seems legit.
Summary
Two pruned nodes ended up with thousands of rows in class_definitions where definition IS NULL.
Once those rows exist, the feeder-gateway sync path appears to treat the classes as already present and never retries fetching them, so the database stays permanently inconsistent unless repaired manually.
Impact
On affected nodes, execution starts failing for some contracts with errors like:
Internal error in execution state reader error=Querying for class definition
We also saw starknet_getClass fail for affected class hashes.
Restarting the node does not fix it. Upgrading does not fix it. The rows just stay there.
Environment
Observed on:
- v0.22.0
- pruned nodes
- --storage.blockchain-history=1024
- --storage.state-tries=0
- feeder-gateway sync path
What we found
On one node:
select count(*) from class_definitions where definition is null;
-- 8100
select min(block_number), max(block_number)
from class_definitions
where definition is null;
-- 146, 261858
On the other node:
select count(*) from class_definitions where definition is null;
-- 8101
So this is not just a tip/pruning-edge problem. The broken rows were spread across a large historical range.
Why this does not look like normal pruning
From the code, pruning appears to preserve the class blob and only detach the historical block number.
Relevant path:
crates/storage/src/connection/block.rs
That means a healthy pruned row should look like:
definition IS NOT NULL
block_number IS NULL
Our broken rows looked like:
definition IS NULL
block_number IS NOT NULL
So pruning itself does not seem to explain the missing blobs.
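To tell the two row shapes apart on a live database, a query along these lines may help. This is a minimal sketch using Python's sqlite3 module against a simplified stand-in table: the three columns are the ones named in this issue, but the real schema has more to it, and the helper name classify_rows and the sample rows are made up for illustration.

```python
import sqlite3

# Simplified stand-in for the real table (assumption: only the columns
# mentioned in this issue are modelled here).
SCHEMA = """
CREATE TABLE class_definitions (
    hash BLOB PRIMARY KEY,
    definition BLOB,
    block_number INTEGER
)
"""


def classify_rows(conn):
    """Count rows by the shapes described above (hypothetical helper)."""
    query = """
    SELECT
        COALESCE(SUM(definition IS NOT NULL AND block_number IS NOT NULL), 0),
        COALESCE(SUM(definition IS NOT NULL AND block_number IS NULL), 0),
        COALESCE(SUM(definition IS NULL AND block_number IS NOT NULL), 0)
    FROM class_definitions
    """
    live, pruned, broken = conn.execute(query).fetchone()
    return {"live": live, "pruned": pruned, "broken": broken}


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
conn.executemany(
    "INSERT INTO class_definitions VALUES (?, ?, ?)",
    [
        (b"\x01", b"{}", 100),   # blob present, block still in history
        (b"\x02", b"{}", None),  # healthy pruned row: blob kept, block detached
        (b"\x03", None, 146),    # broken placeholder: block set, blob missing
    ],
)
print(classify_rows(conn))
```

On a healthy pruned node the "broken" count would be expected to stay at zero.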
Likely failure mode
From reading the code, the sequence looks like this:
- A state update inserts a placeholder row into class_definitions with hash and block_number, but no definition blob yet.
- Later, class download/persist should fill in definition.
- If that second step does not happen for any reason, the placeholder row remains in the database.
- After that, normal feeder-gateway sync no longer retries that class, because it appears to only check whether a row exists, not whether the definition blob is populated.
Relevant paths:
crates/storage/src/connection/state_update.rs
crates/pathfinder/src/state/sync/l2.rs (download_new_classes)
crates/storage/src/connection/class.rs (class_definitions_exist)
The check in the FGW sync path is effectively:
SELECT 1 FROM class_definitions WHERE hash = ?
So a placeholder row with definition IS NULL is treated the same as a fully populated class definition row.
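The difference between the two checks can be reproduced in isolation. The sketch below is not pathfinder code: it uses Python's sqlite3 module and a simplified table, and the function names class_row_exists and class_definition_present are hypothetical. The first function mirrors the query quoted above, the second adds the definition IS NOT NULL condition.

```python
import sqlite3


def class_row_exists(conn, class_hash):
    """Existence check as quoted above: any row for the hash counts."""
    row = conn.execute(
        "SELECT 1 FROM class_definitions WHERE hash = ?", (class_hash,)
    ).fetchone()
    return row is not None


def class_definition_present(conn, class_hash):
    """Stricter check (hypothetical): the blob must actually be populated."""
    row = conn.execute(
        "SELECT 1 FROM class_definitions "
        "WHERE hash = ? AND definition IS NOT NULL",
        (class_hash,),
    ).fetchone()
    return row is not None


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE class_definitions ("
    "hash BLOB PRIMARY KEY, definition BLOB, block_number INTEGER)"
)

# Step 1 of the suspected sequence: the state update writes a placeholder.
conn.execute(
    "INSERT INTO class_definitions (hash, block_number) VALUES (?, ?)",
    (b"\xaa", 146),
)
# Step 2 (filling in the definition) never happens in this scenario.

placeholder_seen_as_present = class_row_exists(conn, b"\xaa")
placeholder_really_present = class_definition_present(conn, b"\xaa")
print(placeholder_seen_as_present, placeholder_really_present)
```

With the first check the placeholder is indistinguishable from a downloaded class, so nothing ever schedules a retry. With the second it is reported as missing.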
There does appear to be logic in the checkpoint sync path to look for missing class definitions, but the normal FGW/state sync path does not seem to repair already-broken definition IS NULL rows.
Workaround we tested
We hotfixed pathfinder locally to do two things:
- treat definition IS NULL as missing in FGW sync
- add a repair pass that scans class_definitions WHERE definition IS NULL and backfills them from feeder-gateway
That immediately started draining the broken rows and the nodes recovered.
One extra wrinkle we hit: for old Cairo 0 classes, the JSON returned by feeder-gateway can re-hash differently today, so a repair path cannot rely on the recomputed hash to choose which row to update. It has to write the fetched definition back into the originally requested DB row.
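The shape of such a repair pass, including the write-back keyed on the requested hash, could look roughly like this. It is an illustrative sketch in Python with sqlite3, not our actual hotfix: repair_missing_definitions, its batch_size parameter, and the fetch_definition callback (standing in for a feeder-gateway request) are all hypothetical names.

```python
import sqlite3


def repair_missing_definitions(conn, fetch_definition, batch_size=100):
    """Backfill up to batch_size placeholder rows; return how many were fixed.

    fetch_definition is a hypothetical callback that takes a class hash and
    returns the definition bytes, or None if the fetch failed.
    """
    rows = conn.execute(
        "SELECT hash FROM class_definitions WHERE definition IS NULL LIMIT ?",
        (batch_size,),
    ).fetchall()
    repaired = 0
    for (requested_hash,) in rows:
        definition = fetch_definition(requested_hash)
        if definition is None:
            continue  # leave the placeholder in place for a later pass
        # Key the update on the hash we asked for. The hash is deliberately
        # not recomputed from the fetched JSON, since old Cairo 0 classes
        # may re-hash to a different value.
        cursor = conn.execute(
            "UPDATE class_definitions SET definition = ? "
            "WHERE hash = ? AND definition IS NULL",
            (definition, requested_hash),
        )
        repaired += cursor.rowcount
    conn.commit()
    return repaired


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE class_definitions ("
    "hash BLOB PRIMARY KEY, definition BLOB, block_number INTEGER)"
)
conn.executemany(
    "INSERT INTO class_definitions VALUES (?, NULL, ?)",
    [(b"\x0a", 146), (b"\x0b", 261858)],
)

# Fake gateway that only knows one of the two classes.
fake_gateway = {b"\x0a": b'{"program": "..."}'}
repaired = repair_missing_definitions(conn, fake_gateway.get)
remaining = conn.execute(
    "SELECT COUNT(*) FROM class_definitions WHERE definition IS NULL"
).fetchone()[0]
print(repaired, remaining)
```

Working in bounded batches keeps each pass short, which fits running the repair in the background rather than blocking startup.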
What seems worth fixing
- In FGW sync, treat definition IS NULL the same as "missing"
- Add an automatic repair path for already-broken rows
- Ideally run that repair in the background so startup is not blocked
- Add a metric/log for the number of missing class definitions, because this is otherwise very hard to see until execution starts failing
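For the last point, even a periodic count would make the problem visible early. A minimal sketch, again in Python with sqlite3 and a simplified table; the logger name and both function names are made up for illustration and do not correspond to anything in pathfinder.

```python
import logging
import sqlite3

# Hypothetical logger name, chosen for this example only.
logger = logging.getLogger("class_definitions_health")


def count_missing_class_definitions(conn):
    """Return the number of rows whose definition blob is NULL."""
    (missing,) = conn.execute(
        "SELECT COUNT(*) FROM class_definitions WHERE definition IS NULL"
    ).fetchone()
    return missing


def report_missing_class_definitions(conn):
    """Log a warning when any placeholder rows exist; return the count."""
    missing = count_missing_class_definitions(conn)
    if missing:
        logger.warning("class_definitions rows with NULL definition: %d", missing)
    return missing


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE class_definitions ("
    "hash BLOB PRIMARY KEY, definition BLOB, block_number INTEGER)"
)
conn.executemany(
    "INSERT INTO class_definitions VALUES (?, ?, ?)",
    [(b"\x01", b"{}", None), (b"\x02", None, 146), (b"\x03", None, 261858)],
)
missing_now = report_missing_class_definitions(conn)
print(missing_now)
```

The same count could be exported as a gauge instead of a log line; the point is only that a non-zero value is surfaced before execution starts failing.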
Why I'm opening this
This seems like a real consistency bug for pruned nodes, and it looks permanent once the DB reaches that state. We worked around it locally, but it would be good to have a proper fix upstream.
Happy to provide more details if useful.