
[DO NOT MERGE] Make janitor transaction submission non-blocking#1120

Closed
sigurpol wants to merge 2 commits into main from stalling-subscription-janitor-non-blocking

Conversation

@sigurpol
Contributor

@sigurpol sigurpol commented Aug 5, 2025

Changed clear_old_round_data() to use fire-and-forget transaction submission instead of waiting for finalization. This addresses a correlation between janitor execution and listener task stalls.

Root cause analysis:

  • Each task (listener, miner, janitor) gets a cloned Client instance
  • However, these clones share the same underlying backend through Arc
  • Client contains ChainClient (OnlineClient)
  • OnlineClient stores backend: Arc<dyn Backend>
  • Client::new() creates the backend wrapped in Arc::new(backend)
  • When Client is cloned, it clones the Arc reference, not the backend
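The cloning behaviour in the bullets above can be sketched with plain `Arc`. This is a minimal illustration, assuming stand-in types: `Client`, `Backend`, and `RpcBackend` below are hypothetical placeholders, not the real subxt/staking-miner types.

```rust
use std::sync::Arc;

// Stand-in for the real backend trait object stored as Arc<dyn Backend>.
trait Backend {}
struct RpcBackend;
impl Backend for RpcBackend {}

#[derive(Clone)]
struct Client {
    backend: Arc<dyn Backend>,
}

impl Client {
    fn new() -> Self {
        // Mirrors the pattern of Client::new() wrapping the backend in Arc::new(backend).
        Client { backend: Arc::new(RpcBackend) }
    }
    fn backend_refcount(&self) -> usize {
        Arc::strong_count(&self.backend)
    }
}

fn main() {
    let client = Client::new();
    // Handing a clone to each task copies the Arc pointer, not the backend:
    let _listener = client.clone();
    let _janitor = client.clone();
    // Three handles, one backend allocation.
    assert_eq!(client.backend_refcount(), 3);
}
```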

This architecture means all tasks share:

  • The same ReconnectingRpcClient
  • The same ChainHeadBackend
  • The same WebSocket connection

The synchronous wait for transaction finalization was blocking shared resources at the connection/backend level, potentially preventing the listener from receiving new block notifications (this has to be proved though!).

Even forgetting the stalling issue, the fire-and-forget approach is beneficial in any case for the janitor task so this PR still offers the following benefits:

  • Removes resource contention between janitor and listener tasks
  • Makes janitor task fully non-blocking
  • Better overall system responsiveness
  • Appropriate for cleanup operations that don't require confirmation

This change works in conjunction with the timeout-based listener implementation introduced by #1119 to reduce / prevent subscription stalls.

NOTE: while the updater task shares the same client/backend with the listener and the janitor, it doesn't appear to have the same blocking behavior that was potentially causing issues with the janitor. The key difference is:

  • Janitor: Was submitting transactions and waiting for finalization (blocking operation)
  • Updater: Only performs read operations and local updates (non-blocking)

@sigurpol sigurpol requested review from Ank4n, jsdw and kianenigma August 5, 2025 08:46
@jsdw
Contributor

jsdw commented Aug 5, 2025

Changed clear_old_round_data() to use fire-and-forget transaction submission instead of waiting for finalization. This addresses a correlation between janitor execution and listener task stalls.

Root cause analysis:

* Each task (listener, miner, janitor) gets a cloned Client instance

* However, these clones share the same underlying backend through Arc

* Client contains ChainClient (OnlineClient)

* OnlineClient stores backend: Arc<dyn Backend>

* Client::new() creates the backend wrapped in Arc::new(backend)

* When Client is cloned, it clones the Arc reference, not the backend

This architecture means all tasks share:

* The same ReconnectingRpcClient

* The same ChainHeadBackend

* The same WebSocket connection

This is all correct

The synchronous wait for transaction finalization was blocking shared resources at the connection/backend level, potentially preventing the listener from receiving new block notifications (this has to be proved though!).

"synchronous" is perhaps the wrong word to use here; everything is asynchronous and non-blocking :)

submit is similar to submit_and_watch; both submit the transaction and receive events back. submit just "returns" once the events indicate the tx was validated (or more) or else errored, whereas submit_and_watch allows you to continue watching the events.

The Finalized event is a little special in that we do hold off on giving this event back until we have seen the same block announced via the underlying block subscription (why: because once we have seen the block announced via the subscription, it is "pinned" and we have a guarantee that we can access details about it etc. Before this happens, we may get a block back which we then can't query for events or whatever).

As a result of this, if that underlying chainHead subscription stalls for some reason, then you will also no longer get the Finalized event for a submitted transaction by default. You can actually override this behaviour by setting ChainHeadBackendBuilder::submit_transactions_ignoring_follow_events, when you instantiate the backend, but this is generally not recommended as it can lead to errors owing to losing the guarantee that we can access the block hashes we see in Finalized events.
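The semantics described above can be illustrated with a toy model: both paths consume the same event stream, `submit` returns once the transaction is validated, and `submit_and_watch` keeps consuming events, e.g. until `Finalized` — which never arrives if the underlying subscription stalls. The `TxStatus` enum and function signatures below are illustrative assumptions, not subxt's real API.

```rust
// Toy model of the two submission flavours (illustrative types, not subxt's).
#[derive(Debug, PartialEq, Clone, Copy)]
enum TxStatus {
    Validated,
    InBlock,
    Finalized,
}

// Fire-and-forget flavour: succeed as soon as the tx is validated.
fn submit(events: &[TxStatus]) -> Result<(), &'static str> {
    events
        .iter()
        .find(|e| **e == TxStatus::Validated)
        .map(|_| ())
        .ok_or("never validated")
}

// Watching flavour: wait (asynchronously, in the real code) for Finalized.
fn submit_and_watch(events: &[TxStatus]) -> Result<TxStatus, &'static str> {
    events
        .iter()
        .copied()
        .find(|e| *e == TxStatus::Finalized)
        .ok_or("stream ended before finalization")
}

fn main() {
    let events = [TxStatus::Validated, TxStatus::InBlock, TxStatus::Finalized];
    assert!(submit(&events).is_ok());
    assert_eq!(submit_and_watch(&events), Ok(TxStatus::Finalized));

    // If the chainHead subscription stalls, Finalized is held back:
    let stalled = [TxStatus::Validated, TxStatus::InBlock];
    assert!(submit(&stalled).is_ok()); // fire-and-forget still returns
    assert!(submit_and_watch(&stalled).is_err()); // the watcher never sees Finalized
}
```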

I would definitely be very interested if you find that there is something interrupting/stalling the underlying subscription on the ChainHeadBackend; offhand I can't see what this would be though, as this stream is driven by a spawned task and does not wait or block on any other task.

Even forgetting the stalling issue, the fire-and-forget approach is beneficial in any case for the janitor task so this PR still offers the following benefits:

* Removes resource contention between janitor and listener tasks

Resources are still shared between tasks. If you want to avoid this, the only thing you can do is construct a new OnlineClient for each task, to avoid any underlying resources being shared at all.
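A sketch of that suggestion, under the assumption of stand-in types (`Client`/`Backend` below are hypothetical placeholders, not the real subxt types): build a fresh client, and therefore a fresh backend/connection, per task instead of cloning one.

```rust
use std::sync::Arc;

// Stands in for the shared resources (ReconnectingRpcClient + ChainHeadBackend).
struct Backend;

struct Client {
    backend: Arc<Backend>,
}

impl Client {
    fn new() -> Self {
        Client { backend: Arc::new(Backend) }
    }
}

fn main() {
    // One client per task instead of cloning a single one:
    let listener = Client::new();
    let janitor = Client::new();
    // Distinct allocations, each with a refcount of 1: nothing is shared.
    assert!(!Arc::ptr_eq(&listener.backend, &janitor.backend));
    assert_eq!(Arc::strong_count(&listener.backend), 1);
    assert_eq!(Arc::strong_count(&janitor.backend), 1);
}
```

The trade-off is one WebSocket connection (and one chainHead subscription) per task rather than one for the whole process.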

* Makes janitor task fully non-blocking

Not really true since submit and submit_and_watch both basically do the same thing; the latter just waits for more events :)

* Better overall system responsiveness

* Appropriate for cleanup operations that don't require confirmation

I agree that if you don't care about guaranteeing that the cleanup thing makes it into a block, then submit is better than submit_and_watch. If you care about this then prefer the latter.

NOTE: while the updater task shares the same client/backend with the listener and the janitor, it doesn't appear to have the same blocking behavior that was potentially causing issues with the janitor. The key difference is:

* Janitor: Was submitting transactions and waiting for finalization (blocking operation)

* Updater: Only performs read operations and local updates (non-blocking)

Nothing is blocking in the sense that this is all async code :)

My expectation is that you won't see any difference w.r.t the stalling of tasks with this PR, since it doesn't fundamentally change anything. That said, if it does make a difference it will be very interesting!

If you get to the point where you can see the stalling issue happen reasonably reliably then we can def see if this sort of change has any impact on it or not! My guess at the moment is that it's more likely that the janitor task would timeout as an effect of the stalling rather than the stalling being an effect of the janitor task.

That all said, I think we just need to play around a bit. One way to check on the stalling might just be to have a small runner on the same machine connecting to the same node(s) and all it does is subscribe to finalized blocks and nothing else. I suspect we would see this also stall at the same times as the worker, but would be an interesting test to do :)

Otherwise, we may simply need more logging (or to enable more logging) to slowly hunt down the source of this issue!

@sigurpol
Contributor Author

sigurpol commented Aug 5, 2025

Thanks @jsdw !!!! Great comment 🙇

If you get to the point where you can see the stalling issue happen reasonably reliably then we can def see if this sort of change has any impact on it or not! My guess at the moment is that it's more likely that the janitor task would timeout as an effect of the stalling rather than the stalling being an effect of the janitor task.

I think it makes sense then to park this PR, and to deploy on WAH a version with #1119, which might make a difference 🤞. We can test with that for a few days and see if the stalling issue happens again. If it does, we can add more verbose logging, maybe use a client per task, etc. I'll try to see if I can deploy a dummy subscriber-only runner on the same machine; I need to check with the devops folks.

@sigurpol sigurpol changed the title Make janitor transaction submission non-blocking [DO NOT MERGE] Make janitor transaction submission non-blocking Aug 5, 2025
@sigurpol sigurpol marked this pull request as draft August 5, 2025 09:46
@sigurpol
Contributor Author

sigurpol commented Aug 5, 2025

My guess at the moment is that it's more likely that the janitor task would timeout as an effect of the stalling rather than the stalling being an effect of the janitor task.

But shouldn't the updater task stall too in this case? That is not what happens, though: from the logs I can see the updater task up and running normally, with only the listener stalled.
And if the janitor completed successfully, why does the stalling happen right after janitor execution?

Note also that the janitor task is only triggered once we detect a round increment (for which we need the listener task not to be stalled...).
And once we saw the listener stall, we could see that the janitor had actually completed its work:

2025-07-30T23:05:27.335644Z TRACE polkadot-staking-miner: Block #12389927, Phase Export(2) - nothing to do
2025-07-30T23:05:35.334052Z TRACE polkadot-staking-miner: Block #12389928, Phase Export(1) - nothing to do
2025-07-30T23:05:39.355584Z TRACE polkadot-staking-miner: Block #12389929, Phase Export(0) - nothing to do
2025-07-30T23:05:47.351553Z DEBUG polkadot-staking-miner: Detected round increment 226 -> 227
2025-07-30T23:05:47.351572Z TRACE polkadot-staking-miner: Sent janitor tick for round 227
2025-07-30T23:05:47.351576Z DEBUG polkadot-staking-miner: Round increment in Off phase, signaling snapshot cleanup
2025-07-30T23:05:47.351599Z TRACE polkadot-staking-miner: Running janitor cleanup for round 227
2025-07-30T23:05:47.351653Z TRACE polkadot-staking-miner: Scanning round 226 for old submissions
2025-07-30T23:05:47.411724Z TRACE polkadot-staking-miner: Janitor found no old submissions to clean up
2025-07-30T23:49:00.748383Z TRACE polkadot-staking-miner: upgrade to version: 1018013 failed: SameVersion
2025-07-31T00:06:52.969237Z TRACE polkadot-staking-miner: upgrade to version: 1018013 failed: SameVersion
2025-07-31T00:49:12.974870Z TRACE polkadot-staking-miner: upgrade to version: 1018013 failed: SameVersion
2025-07-31T01:20:47.192416Z TRACE polkadot-staking-miner: upgrade to version: 1018013 failed: SameVersion
2025-07-31T01:20:52.438180Z TRACE polkadot-staking-miner: upgrade to version: 1018013 failed: SameVersion
2025-07-31T01:21:00.574383Z TRACE polkadot-staking-miner: upgrade to version: 1018013 failed: SameVersion
2025-07-31T02:03:38.221614Z TRACE polkadot-staking-miner: upgrade to version: 1018013 failed: SameVersion
2025-07-31T02:46:12.979884Z TRACE polkadot-staking-miner: upgrade to version: 1018013 failed: SameVersion
2025-07-31T03:27:11.507256Z TRACE polkadot-staking-miner: upgrade to version: 1018013 failed: SameVersion
2025-07-31T03:39:37.079436Z TRACE polkadot-staking-miner: upgrade to version: 1018013 failed: SameVersion
2025-07-31T04:03:55.030211Z TRACE polkadot-staking-miner: upgrade to version: 1018013 failed: SameVersion
2025-07-31T04:45:07.343189Z TRACE polkadot-staking-miner: upgrade to version: 1018013 failed: SameVersion
2025-07-31T05:27:42.786072Z TRACE polkadot-staking-miner: upgrade to version: 1018013 failed: SameVersion
2025-07-31T05:29:12.829266Z TRACE polkadot-staking-miner: upgrade to version: 1018013 failed: SameVersion
2025-07-31T06:11:12.705342Z TRACE polkadot-staking-miner: upgrade to version: 1018013 failed: SameVersion
2025-07-31T06:52:36.782570Z TRACE polkadot-staking-miner: upgrade to version: 1018013 failed: SameVersion
2025-07-31T07:35:30.571017Z TRACE polkadot-staking-miner: upgrade to version: 1018013 failed: SameVersion
2025-07-31T08:18:54.667867Z TRACE polkadot-staking-miner: upgrade to version: 1018013 failed: SameVersion

The same logs also show the updater running fine after the listener stalled: the `upgrade to version: 1018013 failed: SameVersion` lines above are emitted by the updater task, and they keep appearing long after the last listener block notification.

@sigurpol
Contributor Author

sigurpol commented Aug 5, 2025

A subscriber-only miner, ideally to be deployed and tested on the same runner where we run the official miner on WAH so we can check whether it also hits the stalling issue, is here: #1121

@sigurpol sigurpol closed this Sep 8, 2025