
[DO NOT MERGE] Make janitor transaction submission non-blocking#1120

Closed
sigurpol wants to merge 2 commits into main from stalling-subscription-janitor-non-blocking

Conversation

@sigurpol
Contributor

@sigurpol sigurpol commented Aug 5, 2025

Changed clear_old_round_data() to use fire-and-forget transaction submission instead of waiting for finalization. This addresses a correlation between janitor execution and listener task stalls.

Root cause analysis:

  • Each task (listener, miner, janitor) gets a cloned Client instance
  • However, these clones share the same underlying backend through Arc
  • Client contains ChainClient (OnlineClient)
  • OnlineClient stores backend: Arc<dyn Backend>
  • Client::new() creates the backend wrapped in Arc::new(backend)
  • When Client is cloned, it clones the Arc reference, not the backend
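The cloning behaviour in the bullets above can be sketched with plain `Arc`. This is a minimal illustration, assuming stand-in types: `Client`, `Backend`, and `RpcBackend` below are hypothetical placeholders, not the real subxt/staking-miner types.

```rust
use std::sync::Arc;

// Stand-in for the real backend trait object stored as Arc<dyn Backend>.
trait Backend {}
struct RpcBackend;
impl Backend for RpcBackend {}

#[derive(Clone)]
struct Client {
    backend: Arc<dyn Backend>,
}

impl Client {
    fn new() -> Self {
        // Mirrors the pattern of Client::new() wrapping the backend in Arc::new(backend).
        Client { backend: Arc::new(RpcBackend) }
    }
    fn backend_refcount(&self) -> usize {
        Arc::strong_count(&self.backend)
    }
}

fn main() {
    let client = Client::new();
    // Handing a clone to each task copies the Arc pointer, not the backend:
    let _listener = client.clone();
    let _janitor = client.clone();
    // Three handles, one backend allocation.
    assert_eq!(client.backend_refcount(), 3);
}
```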

This architecture means all tasks share:

  • The same ReconnectingRpcClient
  • The same ChainHeadBackend
  • The same WebSocket connection

The synchronous wait for transaction finalization was blocking shared resources at the connection/backend level, potentially preventing the listener from receiving new block notifications (this has to be proved though!).

Even forgetting the stalling issue, the fire-and-forget approach is beneficial in any case for the janitor task so this PR still offers the following benefits:

  • Removes resource contention between janitor and listener tasks
  • Makes janitor task fully non-blocking
  • Better overall system responsiveness
  • Appropriate for cleanup operations that don't require confirmation

This change works in conjunction with the timeout-based listener implementation introduced by #1119 to reduce / prevent subscription stalls.

NOTE: while the updater task shares the same client/backend with the listener and the janitor, it doesn't appear to have the same blocking behavior that was potentially causing issues with the janitor. The key difference is:

  • Janitor: Was submitting transactions and waiting for finalization (blocking operation)
  • Updater: Only performs read operations and local updates (non-blocking)

@sigurpol sigurpol requested review from Ank4n, jsdw and kianenigma August 5, 2025 08:46
@jsdw
Contributor

jsdw commented Aug 5, 2025

Changed clear_old_round_data() to use fire-and-forget transaction submission instead of waiting for finalization. This addresses a correlation between janitor execution and listener task stalls.

Root cause analysis:

* Each task (listener, miner, janitor) gets a cloned Client instance

* However, these clones share the same underlying backend through Arc

* Client contains ChainClient (OnlineClient)

* OnlineClient stores backend: Arc<dyn Backend>

* Client::new() creates the backend wrapped in Arc::new(backend)

* When Client is cloned, it clones the Arc reference, not the backend

This architecture means all tasks share:

* The same ReconnectingRpcClient

* The same ChainHeadBackend

* The same WebSocket connection

This is all correct

The synchronous wait for transaction finalization was blocking shared resources at the connection/backend level, potentially preventing the listener from receiving new block notifications (this has to be proved though!).

"synchronous" is perhaps the wrong word to use here; everything is asynchronous and non-blocking :)

submit is similar to submit_and_watch; both submit the transaction and receive events back. submit just "returns" once the events indicate the tx was validated (or more) or else errored, whereas submit_and_watch allows you to continue watching the events.

The Finalized event is a little special in that we do hold off on giving this event back until we have seen the same block announced via the underlying block subscription (why: because once we have seen the block announced via the subscription, it is "pinned" and we have a guarantee that we can access details about it etc. Before this happens, we may get a block back which we then can't query for events or whatever).

As a result of this, if that underlying chainHead subscription stalls for some reason, then you will also no longer get the Finalized event for a submitted transaction by default. You can actually override this behaviour by setting ChainHeadBackendBuilder::submit_transactions_ignoring_follow_events, when you instantiate the backend, but this is generally not recommended as it can lead to errors owing to losing the guarantee that we can access the block hashes we see in Finalized events.
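The semantics described above can be illustrated with a toy model: both paths consume the same event stream, `submit` returns once the transaction is validated, and `submit_and_watch` keeps consuming events, e.g. until `Finalized` — which never arrives if the underlying subscription stalls. The `TxStatus` enum and function signatures below are illustrative assumptions, not subxt's real API.

```rust
// Toy model of the two submission flavours (illustrative types, not subxt's).
#[derive(Debug, PartialEq, Clone, Copy)]
enum TxStatus {
    Validated,
    InBlock,
    Finalized,
}

// Fire-and-forget flavour: succeed as soon as the tx is validated.
fn submit(events: &[TxStatus]) -> Result<(), &'static str> {
    events
        .iter()
        .find(|e| **e == TxStatus::Validated)
        .map(|_| ())
        .ok_or("never validated")
}

// Watching flavour: wait (asynchronously, in the real code) for Finalized.
fn submit_and_watch(events: &[TxStatus]) -> Result<TxStatus, &'static str> {
    events
        .iter()
        .copied()
        .find(|e| *e == TxStatus::Finalized)
        .ok_or("stream ended before finalization")
}

fn main() {
    let events = [TxStatus::Validated, TxStatus::InBlock, TxStatus::Finalized];
    assert!(submit(&events).is_ok());
    assert_eq!(submit_and_watch(&events), Ok(TxStatus::Finalized));

    // If the chainHead subscription stalls, Finalized is held back:
    let stalled = [TxStatus::Validated, TxStatus::InBlock];
    assert!(submit(&stalled).is_ok()); // fire-and-forget still returns
    assert!(submit_and_watch(&stalled).is_err()); // the watcher never sees Finalized
}
```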

I would definitely be very interested if you find that there is something interrupting/stalling the underlying subscription on the ChainHeadBackend; offhand I can't see what this would be though, as this stream is driven by a spawned task and does not wait or block on any other task.

Even forgetting the stalling issue, the fire-and-forget approach is beneficial in any case for the janitor task so this PR still offers the following benefits:

* Removes resource contention between janitor and listener tasks

Resources are still shared between tasks. If you want to avoid this, the only thing you can do is construct a new OnlineClient for each task, to avoid any underlying resources being shared at all.
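A sketch of that suggestion, under the assumption of stand-in types (`Client`/`Backend` below are hypothetical placeholders, not the real subxt types): build a fresh client, and therefore a fresh backend/connection, per task instead of cloning one.

```rust
use std::sync::Arc;

// Stands in for the shared resources (ReconnectingRpcClient + ChainHeadBackend).
struct Backend;

struct Client {
    backend: Arc<Backend>,
}

impl Client {
    fn new() -> Self {
        Client { backend: Arc::new(Backend) }
    }
}

fn main() {
    // One client per task instead of cloning a single one:
    let listener = Client::new();
    let janitor = Client::new();
    // Distinct allocations, each with a refcount of 1: nothing is shared.
    assert!(!Arc::ptr_eq(&listener.backend, &janitor.backend));
    assert_eq!(Arc::strong_count(&listener.backend), 1);
    assert_eq!(Arc::strong_count(&janitor.backend), 1);
}
```

The trade-off is one WebSocket connection (and one chainHead subscription) per task rather than one for the whole process.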

* Makes janitor task fully non-blocking

Not really true since submit and submit_and_watch both basically do the same thing; the latter just waits for more events :)

* Better overall system responsiveness

* Appropriate for cleanup operations that don't require confirmation

I agree that if you don't care about guaranteeing that the cleanup thing makes it into a block, then submit is better than submit_and_watch. If you care about this then prefer the latter.

NOTE: while the updater task shares the same client/backend with the listener and the janitor, it doesn't appear to have the same blocking behavior that was potentially causing issues with the janitor. The key difference is:

* Janitor: Was submitting transactions and waiting for finalization (blocking operation)

* Updater: Only performs read operations and local updates (non-blocking)

Nothing is blocking in the sense that this is all async code :)

My expectation is that you won't see any difference w.r.t the stalling of tasks with this PR, since it doesn't fundamentally change anything. That said, if it does make a difference it will be very interesting!

If you get to the point where you can see the stalling issue happen reasonably reliably then we can def see if this sort of change has any impact on it or not! My guess at the moment is that it's more likely that the janitor task would timeout as an effect of the stalling rather than the stalling being an effect of the janitor task.

That all said, I think we just need to play around a bit. One way to check on the stalling might just be to have a small runner on the same machine connecting to the same node(s) and all it does is subscribe to finalized blocks and nothing else. I suspect we would see this also stall at the same times as the worker, but would be an interesting test to do :)

Otherwise, we may simply need more logging (or to enable more logging) to slowly hunt down the source of this issue!

@sigurpol
Contributor Author

sigurpol commented Aug 5, 2025

Thanks @jsdw !!!! Great comment 🙇

If you get to the point where you can see the stalling issue happen reasonably reliably then we can def see if this sort of change has any impact on it or not! My guess at the moment is that it's more likely that the janitor task would timeout as an effect of the stalling rather than the stalling being an effect of the janitor task.

I think it makes sense then to park this PR, and to deploy on WAH a version with #1119, which might make a difference 🤞. We can test with that for a few days and see if the stalling issue happens again. If it does, we can add more verbose logging, maybe use a client per task, etc. I'll try to see if I can deploy a dummy subscriber-only runner on the same machine; I need to check with the devops folks.

@sigurpol sigurpol changed the title Make janitor transaction submission non-blocking [DO NOT MERGE] Make janitor transaction submission non-blocking Aug 5, 2025
@sigurpol sigurpol marked this pull request as draft August 5, 2025 09:46
@sigurpol
Contributor Author

sigurpol commented Aug 5, 2025

My guess at the moment is that it's more likely that the janitor task would timeout as an effect of the stalling rather than the stalling being an effect of the janitor task.

But shouldn't the updater task stall too in this case? That is not what happens, though: from the logs I can see the updater task up and running normally, with only the listener stalled.
And if the janitor completed successfully, why does the stalling happen right after janitor execution?

Note also that the janitor task is only triggered once we detect a round increment (for which we need the listener task not to be stalled...).
And once we saw the listener stall, we could see that the janitor had actually completed its work:

2025-07-30T23:05:27.335644Z TRACE polkadot-staking-miner: Block #12389927, Phase Export(2) - nothing to do
2025-07-30T23:05:35.334052Z TRACE polkadot-staking-miner: Block #12389928, Phase Export(1) - nothing to do
2025-07-30T23:05:39.355584Z TRACE polkadot-staking-miner: Block #12389929, Phase Export(0) - nothing to do
2025-07-30T23:05:47.351553Z DEBUG polkadot-staking-miner: Detected round increment 226 -> 227
2025-07-30T23:05:47.351572Z TRACE polkadot-staking-miner: Sent janitor tick for round 227
2025-07-30T23:05:47.351576Z DEBUG polkadot-staking-miner: Round increment in Off phase, signaling snapshot cleanup
2025-07-30T23:05:47.351599Z TRACE polkadot-staking-miner: Running janitor cleanup for round 227
2025-07-30T23:05:47.351653Z TRACE polkadot-staking-miner: Scanning round 226 for old submissions
2025-07-30T23:05:47.411724Z TRACE polkadot-staking-miner: Janitor found no old submissions to clean up
2025-07-30T23:49:00.748383Z TRACE polkadot-staking-miner: upgrade to version: 1018013 failed: SameVersion
2025-07-31T00:06:52.969237Z TRACE polkadot-staking-miner: upgrade to version: 1018013 failed: SameVersion
2025-07-31T00:49:12.974870Z TRACE polkadot-staking-miner: upgrade to version: 1018013 failed: SameVersion
2025-07-31T01:20:47.192416Z TRACE polkadot-staking-miner: upgrade to version: 1018013 failed: SameVersion
2025-07-31T01:20:52.438180Z TRACE polkadot-staking-miner: upgrade to version: 1018013 failed: SameVersion
2025-07-31T01:21:00.574383Z TRACE polkadot-staking-miner: upgrade to version: 1018013 failed: SameVersion
2025-07-31T02:03:38.221614Z TRACE polkadot-staking-miner: upgrade to version: 1018013 failed: SameVersion
2025-07-31T02:46:12.979884Z TRACE polkadot-staking-miner: upgrade to version: 1018013 failed: SameVersion
2025-07-31T03:27:11.507256Z TRACE polkadot-staking-miner: upgrade to version: 1018013 failed: SameVersion
2025-07-31T03:39:37.079436Z TRACE polkadot-staking-miner: upgrade to version: 1018013 failed: SameVersion
2025-07-31T04:03:55.030211Z TRACE polkadot-staking-miner: upgrade to version: 1018013 failed: SameVersion
2025-07-31T04:45:07.343189Z TRACE polkadot-staking-miner: upgrade to version: 1018013 failed: SameVersion
2025-07-31T05:27:42.786072Z TRACE polkadot-staking-miner: upgrade to version: 1018013 failed: SameVersion
2025-07-31T05:29:12.829266Z TRACE polkadot-staking-miner: upgrade to version: 1018013 failed: SameVersion
2025-07-31T06:11:12.705342Z TRACE polkadot-staking-miner: upgrade to version: 1018013 failed: SameVersion
2025-07-31T06:52:36.782570Z TRACE polkadot-staking-miner: upgrade to version: 1018013 failed: SameVersion
2025-07-31T07:35:30.571017Z TRACE polkadot-staking-miner: upgrade to version: 1018013 failed: SameVersion
2025-07-31T08:18:54.667867Z TRACE polkadot-staking-miner: upgrade to version: 1018013 failed: SameVersion

The same logs also show the updater running fine after the listener stalled: the `upgrade to version: 1018013 failed: SameVersion` lines above are emitted by the updater task, and they keep appearing long after the last listener block notification.

@sigurpol
Contributor Author

sigurpol commented Aug 5, 2025

A subscriber-only miner, ideally to be deployed and tested on the same runner where we run the official miner on WAH so we can check whether it also hits the stalling issue, is here: #1121

@sigurpol sigurpol closed this Sep 8, 2025