feat: auto-recover from pruned node errors during extraction #80
Conversation
When connecting to a pruned CosmosSDK node, the extractor now automatically detects "lowest height is X" errors and restarts extraction from the lowest available height instead of failing.

Key changes:
- Add ErrHeightNotAvailable error type to signal pruned block detection
- Add prunedNodeSignal for thread-safe signaling without errgroup race
- Parse lowest available height from gRPC error messages
- Retry loop in extractBlocksAndTransactions auto-restarts from new height
- Unit tests for regex parsing, error type, and concurrent signaling

The implementation uses a separate signaling mechanism instead of returning errors from goroutines, to avoid errgroup's context cancellation race condition where multiple workers could trigger "Processing cancelled by user" before the restart logic could capture the lowest available height.
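For context, a minimal sketch of what the error-message parsing could look like; the regex and the ParseLowestHeightFromError signature here are assumptions, not a copy of the PR's code:

```go
package utils

import (
	"fmt"
	"regexp"
	"strconv"
)

// lowestHeightRe matches gRPC errors returned by pruned CosmosSDK nodes,
// e.g. "... is not available, lowest height is 12345".
var lowestHeightRe = regexp.MustCompile(`lowest height is (\d+)`)

// ParseLowestHeightFromError extracts the lowest available height from a
// pruned-node error message, or fails if the message has no such hint.
func ParseLowestHeightFromError(err error) (uint64, error) {
	if err == nil {
		return 0, fmt.Errorf("nil error")
	}
	m := lowestHeightRe.FindStringSubmatch(err.Error())
	if len(m) != 2 {
		return 0, fmt.Errorf("no pruning height found in error: %w", err)
	}
	return strconv.ParseUint(m[1], 10, 64)
}
```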
fmorency left a comment
Thanks @Cordtus!
As discussed on TG, you could do it in multiple stages.
1. Try extracting at block 1. If it works, start the full extraction routine from block 1.
2. If it fails, detect the block from the error message (N).
3. Try extracting at block N. If it works, start the full extraction routine from block N.
4. If it fails, detect the block from the error message again (N'). If N' == N, exit with an error.
5. Try extracting at block N'. If it works, start the full extraction routine from block N'. Set N = N'.

Repeat steps 3-5. You could even add a heuristic if it fails too many times, like N' = N + 100.
E.g.
> Can I extract from block 1?
No, the lowest is block 12345
> Can I extract from block 12345?
No, the lowest is block 15555 (we hit another node, that happens)
> Can I extract from block 15555?
No, the lowest is block 23456 (we hit yet another node, that happens)
> Can I extract from block 23456?
No, the lowest is block 23666 (we hit yet another node, that happens)
> Aight, can I extract from block 23766 (heuristic N + 100)
Yes, that works, let's go!
Phase 1 is "detect what the minimum block is". It can fail even if you get the minimum block from the error message, because we're never sure which node we'll hit and they could all be at a different height; thus, repeat at most M times and/or use a heuristic if we can't determine the minimum block after 3 tries.
You wouldn't need the new signal at all and should be able to re-use the existing code without (much) modification.
Phase 2 is the full extraction loop.
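For illustration, a minimal Go sketch of that phase 1 probe loop; probeBlock and parseLowest are stand-ins for a single gRPC block query and the error-message parsing, not existing code in this repo:

```go
package extractor

import (
	"context"
	"errors"
)

// findStartHeight probes heights until one is available, following the
// multi-stage approach above: start at 1, jump to the reported lowest
// height, and nudge forward by a heuristic if nodes keep disagreeing.
func findStartHeight(
	ctx context.Context,
	probeBlock func(context.Context, uint64) error, // single block query
	parseLowest func(error) (uint64, bool), // extracts "lowest height is X"
) (uint64, error) {
	const maxTries = 3
	const bump = 100 // heuristic skip when the reported boundary doesn't move

	height := uint64(1)
	for try := 1; ; try++ {
		err := probeBlock(ctx, height)
		if err == nil {
			return height, nil // available; start the full extraction here
		}
		lowest, ok := parseLowest(err)
		if !ok {
			return 0, err // not a pruning error
		}
		if try >= maxTries {
			return 0, errors.New("could not determine the minimum block; giving up")
		}
		if lowest <= height {
			lowest = height + bump // same boundary reported again; nudge forward
		}
		height = lowest
	}
}
```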
You could reuse the approach above once Cosmos merges cosmos/cosmos-sdk#25647 and cosmos/cosmos-sdk#25648. Instead of parsing the error message to find the minimum block, you'd poke the gRPC endpoint. Same retry logic.
WDYT?
Replace reactive recovery during extraction with a single probe at startup. The extractor now always verifies the start height is available before spawning workers, automatically adjusting if the node is pruned.

Changes:
- Remove prunedNodeSignal struct and retry loop from block.go
- Add GetEarliestBlockHeight() to utils/block.go
- Always probe earliest available height in setBlockRange()
- Remove block_test.go (tested removed code)

This approach is simpler (~215 fewer lines), handles all scenarios (fresh start, resume, node re-synced higher), and avoids thread synchronization complexity.
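A minimal sketch of how the startup probe might plug into setBlockRange, assuming a helper equivalent to GetEarliestBlockHeight; the resolveStartHeight name and shape are illustrative only:

```go
package extractor

import "fmt"

// resolveStartHeight probes the node once at startup and bumps the requested
// start height when the node is pruned above it, instead of failing later.
func resolveStartHeight(requested uint64, getEarliest func() (uint64, error)) (uint64, error) {
	earliest, err := getEarliest() // e.g. utils.GetEarliestBlockHeight
	if err != nil {
		return 0, fmt.Errorf("probing earliest available block: %w", err)
	}
	if requested < earliest {
		// The node is pruned below the requested start; adjust instead of failing.
		return earliest, nil
	}
	return requested, nil
}
```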
Force-pushed from 5babe96 to d922277.
Reworked this one, and updated the description accordingly.
fmorency left a comment
Thank you! Much better! A single minor comment.
nit: Add more tests but we can do that in another PR. Would love e2e tests to cover some edge-cases like the ones I described in my other review.
```go
// It probes block 1 to check if the node is an archive node or pruned.
// For archive nodes, returns 1. For pruned nodes, parses the error message
// to extract the lowest available height.
func GetEarliestBlockHeight(gRPCClient *client.GRPCClient, maxRetries uint) (uint64, error) {
```
Can we make this function a little more robust like suggested in the first review?
I saw cases where the lowest height from the error wasn't working because the query hit another node that didn't have that height, i.e., the other node has a lowest height higher than the one previously reported.
Ah I didn't consider the case of load balancers using nodes with varying heights.. is this common?
> Ah I didn't consider the case of load balancers using nodes with varying heights.. is this common?
I encountered this issue multiple times while building this project, primarily with Osmosis and the Hub. I'm not sure if it's common, but I believe it's common enough to address. I'm surprised you didn't encounter this issue during your tests.
I typically use my own nodes for dev/testing (mainly because of issues like this with public ones).
I will have something for this early tomorrow.
Force-pushed from 3fa0312 to c5502c4.
Adds fallback to startup probe: if extraction hits a higher pruning boundary than initially detected, adjust start height and retry. Handles load-balanced endpoints where backend nodes vary.
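For illustration, a rough sketch of that fallback; extractBatch and parseLowest are stand-ins for this PR's batch extraction and error parsing, not its actual code:

```go
package extractor

import "context"

// extractWithFallback retries batch extraction when a backend node behind a
// load balancer reports a pruning boundary higher than the startup probe saw.
func extractWithFallback(
	ctx context.Context,
	start, end uint64,
	extractBatch func(ctx context.Context, from, to uint64) error,
	parseLowest func(error) (uint64, bool),
) error {
	for {
		err := extractBatch(ctx, start, end)
		if err == nil {
			return nil
		}
		lowest, ok := parseLowest(err)
		if !ok || lowest <= start {
			return err // not a pruning error, or no forward progress possible
		}
		// Restart from the newly reported boundary and try again.
		start = lowest
	}
}
```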
Force-pushed from c5502c4 to 2b1bea4.
Pull request overview
This PR implements automatic recovery from pruned node errors during blockchain data extraction by probing for the earliest available block at startup and adjusting extraction parameters accordingly.
Changes:
- Added utility functions to detect earliest available block height by probing block 1 and parsing pruning error messages
- Modified extraction startup logic to query earliest available block when starting with an empty database
- Implemented retry loop in batch extraction to handle pruned node errors during extraction
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 5 comments.
Summary per file:
| File | Description |
|---|---|
| internal/utils/block.go | Added GetEarliestBlockHeight and ParseLowestHeightFromError functions to detect and parse pruning boundaries |
| internal/utils/block_test.go | Added unit tests for ParseLowestHeightFromError error message parsing |
| internal/extractor/extractor.go | Updated setBlockRange to probe earliest block for empty databases; added retry loop for pruned node recovery during batch extraction |
| internal/extractor/block.go | Refactored error handling for context cancellation; standardized import ordering |
| go.mod | Updated Go version to 1.25.5 |
| README.md | Updated Go version requirement to 1.25.5 |
| .github/workflows/release.yml | Updated GO_VERSION to 1.25.5 |
| .github/workflows/ci.yml | Updated GO_VERSION to 1.25.5 |
| internal/utils/grpc.go | Standardized import ordering |
| internal/metrics/server.go | Standardized import ordering |
| internal/metrics/server_test.go | Standardized import ordering |
| internal/client/client.go | Standardized import ordering |
| cmd/yaci/postgres.go | Standardized import ordering |
| cmd/yaci/extract.go | Standardized import ordering |
Summary
Simplifies pruned node handling by probing at startup instead of reactive recovery during extraction.
Changes
Details
There is currently no gRPC method that exposes the required value directly; that will only exist once cosmos/cosmos-sdk#25647 is merged.
In the meantime, we can fetch that same value by forcing a particular error as a silly workaround:
- Resuming with existing data: start from the latest indexed height + 1 (no probe).
- Starting with an empty database: query block 1, which, if it does not exist, will return the earliest available height via the error message.
Test
Unit test for error message parsing in internal/utils/block_test.go. Tested manually against a pruned Juno node; behaves as expected.