[DO NOT MERGE] Make janitor transaction submission non-blocking #1120
Changed clear_old_round_data() to use fire-and-forget transaction submission instead of waiting for finalization. This addresses a correlation between janitor execution and listener task stalls.
Root cause analysis:
- Each task (listener, miner, janitor) gets a cloned Client instance
- However, these clones share the same underlying backend through Arc
- Client contains ChainClient (OnlineClient<PolkadotConfig>)
- OnlineClient stores backend: Arc<dyn Backend<T>>
- Client::new() creates the backend wrapped in Arc::new(backend)
- When Client is cloned, it clones the Arc reference, not the backend
This architecture means all tasks share:
- The same ReconnectingRpcClient
- The same ChainHeadBackend
- The same WebSocket connection
The synchronous wait for transaction finalization was blocking shared resources at the connection/backend level, preventing the listener from receiving new block notifications.
Benefits:
- Removes resource contention between janitor and listener tasks
- Makes janitor task fully non-blocking
- Better overall system responsiveness
- Appropriate for cleanup operations that don't require confirmation
This change works in conjunction with the timeout-based listener implementation to prevent subscription stalls.
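The sharing described in the root cause analysis can be illustrated with a minimal, std-only sketch. The `Backend` trait, `ChainHeadLike` struct and `Client` wrapper below are hypothetical stand-ins, not the miner's or subxt's actual types; the only point is that cloning a struct holding an `Arc` copies the pointer, so every clone talks to the same backend object.

```rust
use std::sync::Arc;

// Hypothetical stand-in for a backend trait (not subxt's real trait).
trait Backend {
    fn endpoint(&self) -> &str;
}

// Hypothetical backend that owns one connection, identified by its URL.
struct ChainHeadLike {
    url: String,
}

impl Backend for ChainHeadLike {
    fn endpoint(&self) -> &str {
        &self.url
    }
}

// Hypothetical stand-in for the Client wrapper: it only stores the Arc.
#[derive(Clone)]
struct Client {
    backend: Arc<dyn Backend>,
}

impl Client {
    fn new(backend: impl Backend + 'static) -> Self {
        // The backend is created once and wrapped in an Arc.
        Self { backend: Arc::new(backend) }
    }
}

// True when both clients point at the very same backend allocation.
fn shares_backend(a: &Client, b: &Client) -> bool {
    Arc::ptr_eq(&a.backend, &b.backend)
}

fn main() {
    let listener = Client::new(ChainHeadLike { url: "ws://node.example:9944".into() });
    // Cloning the client clones the Arc reference, not the backend.
    let janitor = listener.clone();
    // Only constructing a new client yields an independent backend.
    let separate = Client::new(ChainHeadLike { url: "ws://node.example:9944".into() });

    println!("endpoint: {}", janitor.backend.endpoint());
    println!("listener/janitor share backend: {}", shares_backend(&listener, &janitor));
    println!("listener/separate share backend: {}", shares_backend(&listener, &separate));
    println!("references to shared backend: {}", Arc::strong_count(&listener.backend));
}
```

In this model the only way to stop sharing is to build a second client from scratch, which opens a second backend.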
This is all correct
"synchronous" is perhaps the wrong word to use here; everything is asynchronous and non-blocking :)
The As a result of this, if that underlying chainHead subscription stalls for some reason, then you will also no longer get the I would definitely be very interested if you find that there is something interrupting/stalling the underlying subscription on the
Resources are still shared between tasks. If you want to avoid this, the only thing you can do is construct a new
Not really true since
I agree that if you don't care about guaranteeing that the cleanup thing makes it into a block, then
Nothing is blocking in the sense that this is all async code :)
My expectation is that you won't see any difference w.r.t. the stalling of tasks with this PR, since it doesn't fundamentally change anything. That said, if it does make a difference it will be very interesting! If you get to the point where you can see the stalling issue happen reasonably reliably then we can def see if this sort of change has any impact on it or not! My guess at the moment is that it's more likely that the janitor task would time out as an effect of the stalling rather than the stalling being an effect of the janitor task.
That all said, I think we just need to play around a bit. One way to check on the stalling might just be to have a small runner on the same machine connecting to the same node(s) and all it does is subscribe to finalized blocks and nothing else. I suspect we would see this also stall at the same times as the worker, but it would be an interesting test to do :)
Otherwise, we may simply need more logging (or to enable more logging) to slowly hunt down the source of this issue!
Thanks @jsdw !!!! Great comment 🙇
I think it makes sense then to park this PR. And get deployed on WAH a version with #1119 which
But shouldn't the
Note also that the janitor task is only triggered once we detect a round increment (for which we need the listener task not to be stalled...).
Logs also showing the updater running fine after listener is stalled:
2025-07-30T23:05:27.335644Z TRACE polkadot-staking-miner: Block #12389927, Phase Export(2) - nothing to do
2025-07-30T23:05:35.334052Z TRACE polkadot-staking-miner: Block #12389928, Phase Export(1) - nothing to do
2025-07-30T23:05:39.355584Z TRACE polkadot-staking-miner: Block #12389929, Phase Export(0) - nothing to do
2025-07-30T23:05:47.351553Z DEBUG polkadot-staking-miner: Detected round increment 226 -> 227
2025-07-30T23:05:47.351572Z TRACE polkadot-staking-miner: Sent janitor tick for round 227
2025-07-30T23:05:47.351576Z DEBUG polkadot-staking-miner: Round increment in Off phase, signaling snapshot cleanup
2025-07-30T23:05:47.351599Z TRACE polkadot-staking-miner: Running janitor cleanup for round 227
2025-07-30T23:05:47.351653Z TRACE polkadot-staking-miner: Scanning round 226 for old submissions
2025-07-30T23:05:47.411724Z TRACE polkadot-staking-miner: Janitor found no old submissions to clean up
2025-07-30T23:49:00.748383Z TRACE polkadot-staking-miner: upgrade to version: 1018013 failed: SameVersion
2025-07-31T00:06:52.969237Z TRACE polkadot-staking-miner: upgrade to version: 1018013 failed: SameVersion
2025-07-31T00:49:12.974870Z TRACE polkadot-staking-miner: upgrade to version: 1018013 failed: SameVersion
2025-07-31T01:20:47.192416Z TRACE polkadot-staking-miner: upgrade to version: 1018013 failed: SameVersion
2025-07-31T01:20:52.438180Z TRACE polkadot-staking-miner: upgrade to version: 1018013 failed: SameVersion
2025-07-31T01:21:00.574383Z TRACE polkadot-staking-miner: upgrade to version: 1018013 failed: SameVersion
2025-07-31T02:03:38.221614Z TRACE polkadot-staking-miner: upgrade to version: 1018013 failed: SameVersion
2025-07-31T02:46:12.979884Z TRACE polkadot-staking-miner: upgrade to version: 1018013 failed: SameVersion
2025-07-31T03:27:11.507256Z TRACE polkadot-staking-miner: upgrade to version: 1018013 failed: SameVersion
2025-07-31T03:39:37.079436Z TRACE polkadot-staking-miner: upgrade to version: 1018013 failed: SameVersion
2025-07-31T04:03:55.030211Z TRACE polkadot-staking-miner: upgrade to version: 1018013 failed: SameVersion
2025-07-31T04:45:07.343189Z TRACE polkadot-staking-miner: upgrade to version: 1018013 failed: SameVersion
2025-07-31T05:27:42.786072Z TRACE polkadot-staking-miner: upgrade to version: 1018013 failed: SameVersion
2025-07-31T05:29:12.829266Z TRACE polkadot-staking-miner: upgrade to version: 1018013 failed: SameVersion
2025-07-31T06:11:12.705342Z TRACE polkadot-staking-miner: upgrade to version: 1018013 failed: SameVersion
2025-07-31T06:52:36.782570Z TRACE polkadot-staking-miner: upgrade to version: 1018013 failed: SameVersion
2025-07-31T07:35:30.571017Z TRACE polkadot-staking-miner: upgrade to version: 1018013 failed: SameVersion
2025-07-31T08:18:54.667867Z TRACE polkadot-staking-miner: upgrade to version: 1018013 failed: SameVersion
A subscriber-only miner is here: #1121. Ideally it should be deployed and tested on the same runner where we run the official miner on WAH, to check whether it also has the stalling issue.
Changed clear_old_round_data() to use fire-and-forget transaction submission instead of waiting for finalization. This addresses a correlation between janitor execution and listener task stalls.
Root cause analysis:
This architecture means all tasks share:
The synchronous wait for transaction finalization was blocking shared resources at the connection/backend level, potentially preventing the listener from receiving new block notifications (this has to be proved though!).
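The difference between waiting for finalization and fire-and-forget submission can be sketched with a small std-only model. Everything here is an assumption made for illustration: `submit`, `submit_and_wait` and `submit_and_forget` are hypothetical names, the "chain" is a thread that sleeps, and the model uses OS threads whereas the miner itself is async (there, "waiting" means the janitor's future stays pending, not that a thread is blocked).

```rust
use std::sync::mpsc::{self, Receiver};
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for submitting a transaction: returns a progress
/// handle right away and reports "finalized" after `finality_delay`.
fn submit(finality_delay: Duration) -> Receiver<&'static str> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        thread::sleep(finality_delay);
        // The receiver may already be gone (fire-and-forget); ignore that.
        let _ = tx.send("finalized");
    });
    rx
}

/// Old behaviour in this model: do not return until finalization is reported.
fn submit_and_wait(finality_delay: Duration) -> &'static str {
    submit(finality_delay)
        .recv()
        .expect("the simulated chain always reports a status")
}

/// New behaviour in this model: submit, drop the progress handle, return.
fn submit_and_forget(finality_delay: Duration) {
    let _progress = submit(finality_delay);
}

fn main() {
    let delay = Duration::from_millis(200);

    let start = Instant::now();
    let status = submit_and_wait(delay);
    println!("waited for '{status}': {:?}", start.elapsed());

    let start = Instant::now();
    submit_and_forget(delay);
    println!("fire-and-forget returned after {:?}", start.elapsed());
}
```

The trade-off is visible in the signatures: `submit_and_forget` returns nothing, so the caller never learns whether the cleanup made it into a block.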
Even forgetting the stalling issue, the fire-and-forget approach is beneficial in any case for the janitor task so this PR still offers the following benefits:
This change works in conjunction with the timeout-based listener implementation introduced by #1119 to reduce / prevent subscription stalls.
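As a rough illustration of what a timeout-based listener does (not the code from #1119), the sketch below models the block subscription as a std channel and treats "no block within `max_gap`" as a stall. `ListenerEvent`, `next_event` and `max_gap` are hypothetical names, and the real listener is async rather than built on `std::sync::mpsc`.

```rust
use std::sync::mpsc::{self, Receiver, RecvTimeoutError};
use std::time::Duration;

/// What the listener loop learns on each iteration (hypothetical type).
#[derive(Debug, PartialEq)]
enum ListenerEvent {
    /// A new block number arrived in time.
    Block(u32),
    /// Nothing arrived within the allowed gap: treat as a stalled subscription.
    Stalled,
    /// The subscription itself ended.
    Closed,
}

/// Wait for the next block, but give up after `max_gap` instead of hanging.
fn next_event(blocks: &Receiver<u32>, max_gap: Duration) -> ListenerEvent {
    match blocks.recv_timeout(max_gap) {
        Ok(number) => ListenerEvent::Block(number),
        Err(RecvTimeoutError::Timeout) => ListenerEvent::Stalled,
        Err(RecvTimeoutError::Disconnected) => ListenerEvent::Closed,
    }
}

fn main() {
    let (tx, rx) = mpsc::channel();
    let max_gap = Duration::from_millis(50);

    tx.send(12_389_929u32).unwrap();
    println!("{:?}", next_event(&rx, max_gap)); // a block was waiting
    println!("{:?}", next_event(&rx, max_gap)); // sender alive but silent
    drop(tx);
    println!("{:?}", next_event(&rx, max_gap)); // subscription closed
}
```

On `Stalled`, a caller could log the gap and resubscribe; what to do at that point is a design choice this sketch leaves open.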
NOTE: while the updater task shares the same client/backend with the listener and the janitor, it doesn't appear to have the same blocking behavior that was potentially causing issues with the janitor. The key difference is: