Fix stalling issue if subscription.next() is stuck #1119
Conversation
1. Replaced `tokio::select!` with `tokio::time::timeout`, which directly wraps `subscription.next()`.
2. Simplified the timeout logic: it is now a straightforward 60-second timeout on each subscription call.
3. Fixed the stalling issue: the timeout will always fire after 60 seconds, even if `subscription.next()` is stuck.

The key improvement is that the timeout now wraps the potentially hanging operation directly, guaranteeing detection of a stalled subscription. This eliminates the future starvation issue that could prevent the original timeout from firing.
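A minimal sketch of the pattern, assuming a generic stream in place of the real block subscription (the stuck stream and the handling branches are illustrative stand-ins, not the actual code):

```rust
use std::time::Duration;

use futures::StreamExt;
use tokio::time::timeout;

#[tokio::main]
async fn main() {
    // A stream that never yields simulates a stuck `subscription.next()`.
    let mut subscription = futures::stream::pending::<u32>();

    loop {
        // The timeout wraps the potentially hanging call directly, so it
        // always fires after 60s even if the stream never produces an item.
        match timeout(Duration::from_secs(60), subscription.next()).await {
            Ok(Some(block)) => println!("processing block {block}"),
            Ok(None) => break, // stream ended; the caller would resubscribe
            Err(_elapsed) => {
                eprintln!("no item within 60s; treating the subscription as stalled");
                break;
            }
        }
    }
}
```

Unlike the `tokio::select!` version, nothing else is polled inside the timeout, so a stuck `next()` cannot starve the timer.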
NOTE: CI is failing because it would require Rust > 1.85.0 to build the polkadot binaries. This is actually fixed by #1118, which I hope to get merged very soon.
src/main.rs
Outdated
```rust
        std::time::Duration::from_secs(60 * 60),
        update_stream.next(),
    )
    .await
    {
        Ok(maybe_update) => match maybe_update {
```
The code looks better than the previous version, but I would expect runtime updates to happen fairly infrequently, and so I would assume that this is constantly logging the "No runtime updates received for 1 hour" message and recreating the runtime update subscription?
There was a problem hiding this comment.
This makes me wonder about adding an internal timeout to Subxt so that we can say "if we don't receive any event for X seconds then error". At least in the case of the ChainHead APIs (where runtime updates come on the same stream as other block events), this would be a more elegant way to detect block event + runtime update streams failing.
For now though, I guess recreating the subscription every hour isn't the end of the world, though it does risk being recreated at just such a moment as to miss an actual update (via the legacy APIs, at least) :)
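For illustration, a rough sketch of that internal-timeout idea using tokio-stream's `timeout` adapter as a stand-in (an assumption about the shape of such an API, not anything Subxt provides today):

```rust
use std::time::Duration;

use tokio_stream::StreamExt; // requires the tokio-stream `time` feature

#[tokio::main]
async fn main() {
    // A never-yielding stream stands in for a quiet event subscription.
    let quiet = tokio_stream::pending::<u32>();

    // The adapter yields `Err(Elapsed)` whenever no item arrives within
    // the window; the consumer can treat that as a dead subscription.
    let mut watched = Box::pin(quiet.timeout(Duration::from_secs(5)));

    while let Some(next) = watched.next().await {
        match next {
            Ok(event) => println!("event: {event}"),
            Err(_elapsed) => {
                eprintln!("no event for 5s; treating the stream as dead");
                break;
            }
        }
    }
}
```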
Very good point! And I definitely agree that a "better" fix would likely involve the subxt library implementing its own health checks or heartbeat mechanism. This 1h timeout is definitely kicking in too often for something which happens so infrequently.
Maybe for the updater, I could at least raise the timeout to, dunno, 24h?
bfb9a48: I've added a periodic check for the updater subscription so we just check if the connection is healthy once per hour - and we recreate the subscription only if the check fails instead of doing it every time.
The code change looks like an improvement to me; just a Q about the runtime update timeout :)
Added hourly RPC health checks to detect hung connections, removing the need to recreate the updater subscription every hour.
To investigate the stalling subscription issue (see #1119), this adds an ultra-simplified version that doesn't actually try to mine and clean up. The idea is to run it, ideally on the same runner which runs the real miner on WAH, and observe whether it stalls. If it does, we can safely exclude any side effect coming from the janitor task or the miner task itself.
…ut and diagnostics (#1123)

Problem: The listener task occasionally stalls after janitor cleanup operations, despite PR #1119's subscription timeout improvements. Critical observation: the runtime updater task continues receiving updates while sharing the same ChainClient and ChainHeadBackend, which strongly suggests that the underlying subscription infrastructure is healthy and that the hang occurs in the listener's internal processing, not in the subscription itself.

Root Cause Analysis:
- Since the updater task works, we assume subscription.next() is NOT blocking.
- The hang must therefore occur AFTER subscription.next() returns successfully.
- The current timeout only wraps subscription.next(), missing internal processing hangs.
- The janitor's synchronous submit_and_watch() + wait_tx_in_finalized_block() shouldn't create transient resource contention affecting the listener's storage queries, but for a cleanup task the `submit`-only approach is better anyway.

Solution:
1. Comprehensive timeout protection (see the sketch after this message):
   - Wrap the entire block processing logic, not just subscription.next().
   - Add a 90s timeout for all post-subscription processing.
   - Detect "zombie" processing, where the subscription works but processing hangs.
2. Enhanced diagnostics:
   - Add detailed trace logging throughout the processing pipeline; the extra logging is very verbose at `trace` level.
   - Track storage query durations with Prometheus metrics.
   - Separate subscription health metrics from block processing health metrics.
   - Monitor exact hang points (storage queries, phase checks, BlockDetails).
3. Improve the janitor task:
   - Change clear_old_round_data() from submit_and_watch() to submit() and remove wait_tx_in_finalized_block() to avoid blocking.

New Metrics:
- staking_miner_storage_query_duration_ms
- staking_miner_block_state_duration_ms
- staking_miner_block_details_duration_ms
- staking_miner_block_processing_stalls_total
- staking_miner_last_block_processing_timestamp

Most of these metrics can be removed once we have successfully verified the fix or better tracked the stalling issue. This fix aims to address what we suspect is the actual problem (internal processing hangs) rather than the subscription stalls, based on the key evidence that the updater task continues working normally.
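A minimal sketch of the comprehensive-timeout idea from the solution above, with hypothetical stand-ins (`process_block`, `handle_block`) for the real listener internals:

```rust
use std::time::Duration;

use tokio::time::timeout;

// Hypothetical stand-in for the listener's post-subscription work:
// storage queries, phase checks, BlockDetails construction, etc.
async fn process_block(block: u32) -> Result<(), String> {
    println!("processing block {block}");
    Ok(())
}

// The 90s timeout covers everything that happens after
// `subscription.next()` returns, so a hang inside processing
// ("zombie" processing) is detected, not just a stuck subscription.
async fn handle_block(block: u32) -> Result<(), String> {
    match timeout(Duration::from_secs(90), process_block(block)).await {
        Ok(result) => result,
        Err(_elapsed) => {
            // Here the real code would bump a stall metric such as
            // staking_miner_block_processing_stalls_total.
            Err(format!("processing of block {block} stalled for 90s"))
        }
    }
}

#[tokio::main]
async fn main() {
    if let Err(e) = handle_block(42).await {
        eprintln!("{e}");
    }
}
```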
Changes
In the listener task:
The key improvement is that the timeout now wraps the potentially hanging operation directly, providing more reliable detection of a stalled subscription. This should eliminate the future starvation issue that could prevent the original timeout from firing.
For `runtime_upgrade_task` we follow a different approach. We expect to receive a runtime upgrade very infrequently, so a timeout approach doesn't work well. What we do now is check periodically, once per hour, whether the updater subscription is healthy, and recreate it only if it's not. This way we avoid recreating it unconditionally, with the unnecessary overhead and the risk of recreating the connection just while we are actually receiving an update.
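A minimal sketch of this periodic health check, assuming a stubbed probe (`connection_is_healthy`) and a generic stream in place of the real updater subscription:

```rust
use std::time::Duration;

use futures::StreamExt;
use tokio::time::{interval, timeout, MissedTickBehavior};

// Hypothetical probe; in the real code this could be any cheap RPC
// round-trip that fails or hangs when the connection is dead.
async fn connection_is_healthy() -> bool {
    true
}

#[tokio::main]
async fn main() {
    // Stand-in for the runtime update subscription.
    let mut update_stream = futures::stream::pending::<u32>();

    let mut health_check = interval(Duration::from_secs(60 * 60));
    health_check.set_missed_tick_behavior(MissedTickBehavior::Delay);
    health_check.tick().await; // the first tick completes immediately

    loop {
        tokio::select! {
            // Runtime upgrades are rare, so the stream itself carries
            // no timeout and may stay quiet for a long time.
            maybe_update = update_stream.next() => match maybe_update {
                Some(update) => println!("applying runtime update {update}"),
                None => break, // stream ended; the caller would resubscribe
            },
            // Once per hour, probe the connection; recreate the
            // subscription only if the probe fails or itself hangs.
            _ = health_check.tick() => {
                match timeout(Duration::from_secs(10), connection_is_healthy()).await {
                    Ok(true) => {} // healthy: keep the existing subscription
                    _ => {
                        eprintln!("health check failed; recreating subscription");
                        break;
                    }
                }
            }
        }
    }
}
```

Recreating only on a failed probe avoids the hourly resubscription overhead and the risk of dropping the connection in the middle of an actual update.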