Split export_geth_traces into 24 tasks
#105
Conversation
Force-pushed from fb1f643 to 333fd7e (…t/split-task-export-geth-traces)
lgtm, but I have a couple of questions.
- Would we need to scale the prod composer to prevent tasks from being killed randomly?
- I notice that we sometimes hit the rate limit for traces (for example here). It's not a blocker since it seems to work after retrying, but I remember we had an issue in evmchain-etl where the code did not handle retries correctly and we ended up with duplicate blocks. Should we reduce the number of active tasks?
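The retry concern above can be sketched generically. This is an illustrative exponential-backoff helper, not the evmchain-etl or polygon-etl code; the key point for avoiding the duplicate-block issue is that each retry must re-run the whole unit of work idempotently rather than append partial results.

```python
import time


def with_retries(fn, attempts=5, base_delay=1.0):
    """Call fn(); on failure, retry with exponential backoff.

    fn should be idempotent (e.g. re-export the full block range),
    so a retry after a partial failure cannot produce duplicates.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts, surface the error
            time.sleep(base_delay * 2 ** attempt)


# Example: a call that is "rate limited" twice, then succeeds.
calls = {"n": 0}

def flaky_export():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("429 rate limited")
    return "exported"

print(with_retries(flaky_export, base_delay=0))  # prints "exported"
```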
@gulshngill Thanks for the review.
I take your point, though the main purpose here is reliability, not performance. That said, the performance increase is pretty huge, so we can definitely afford to slow it down a bit if need be. If you're okay with it, I suggest we leave it as is, then revisit if necessary.
Yup, it's something we can definitely adjust later on too 👍
Increase reliability for `polygon_export_dag`

Task `export_geth_traces` takes up to 8 hours. If it fails just before it completes, data is very late. This PR helps by splitting the task into 24 tasks, each covering 1 hour of block time. For simplicity, I also split the downstream tasks (`extract_contracts` & `extract_tokens`) into 24 tasks each as well. In the `load_dag`, rather than 24 wait operators for each of these 3 entities, I consolidated all the wait operators into one.

Greater efficiency could be achieved by calling `get_block_range()` just once and chunking the block range into 24 chunks (equal in terms of `block_number`, rather than `block_timestamp`). But in this case, I've chosen a naive approach, making this call 24 times, since it's very fast already, and simpler.

Testing
- `tokens` with `load_all_partitions=True` (to check that nothing is broken there)
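The `block_number`-based alternative mentioned in the description could be sketched as follows. `chunk_block_range` is a hypothetical helper for illustration, not part of polygon-etl; it splits one block range into 24 contiguous, equally sized sub-ranges.

```python
def chunk_block_range(start_block: int, end_block: int, n_chunks: int = 24):
    """Split [start_block, end_block] into n_chunks contiguous sub-ranges.

    Sub-ranges are equal in block_number (give or take one block when the
    total is not evenly divisible), rather than equal in block_timestamp.
    """
    total = end_block - start_block + 1
    size, remainder = divmod(total, n_chunks)
    ranges = []
    cursor = start_block
    for i in range(n_chunks):
        # Spread the remainder over the first few chunks.
        chunk = size + (1 if i < remainder else 0)
        ranges.append((cursor, cursor + chunk - 1))
        cursor += chunk
    return ranges


# Example: 100 blocks into 4 chunks -> contiguous, non-overlapping cover.
ranges = chunk_block_range(0, 99, 4)
print(ranges)  # prints [(0, 24), (25, 49), (50, 74), (75, 99)]
```

Each sub-range would then feed one export task, so a single `get_block_range()` call covers all 24 tasks.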