Fix stalling issue if subscription.next() is stuck #1119

Merged
sigurpol merged 4 commits into main from stalling-subscription-connection
Aug 4, 2025
Conversation

Contributor
@sigurpol sigurpol commented Aug 2, 2025

Changes

In the listener task:

  1. Replaced tokio::select! with tokio::time::timeout that directly wraps subscription.next()
  2. Simplified the timeout logic - now it's a straightforward 60-second timeout on each subscription call
  3. Fixed the stalling issue - the timeout will always fire after 60 seconds, even if subscription.next() is stuck

The key improvement is that the timeout now wraps the potentially hanging operation directly, providing more reliable detection of a stalled subscription. This should eliminate the future starvation issue that could prevent the original timeout from firing.

For runtime_upgrade_task we follow a different approach. We expect to receive a runtime upgrade very infrequently, so a timeout approach doesn't work well. What we do now is check periodically, once per hour, whether the updater subscription is healthy, and recreate it only if it isn't. This way we avoid recreating it unconditionally, with the unnecessary overhead and the risk of recreating the connection just while we are actually receiving an update.

Contributor Author

sigurpol commented Aug 2, 2025

NOTE: CI is failing because it would require rust > 1.85.0 to build polkadot binaries. This is actually fixed by #1118 which I hope to get merged very soon.

@sigurpol sigurpol requested review from Ank4n, jsdw and kianenigma August 2, 2025 12:20
src/main.rs Outdated
Comment on lines +221 to +226

```rust
std::time::Duration::from_secs(60 * 60),
update_stream.next(),
)
.await
{
Ok(maybe_update) => match maybe_update {
```
Contributor
The code looks better than the previous version, but I would expect runtime updates to happen fairly infrequently, and so I would assume that this is constantly logging the "No runtime updates received for 1 hour" message and recreating the runtime update subscription?

Contributor

This makes me wonder about adding an internal timeout to Subxt so that we can say "if we don't receive any event for X seconds then error". At least in the case of the ChainHead APIs (where runtime updates come on the same stream as other block events), this would be a more elegant way to detect block event + runtime update streams failing.

For now though, I guess recreating the subscription every hour isn't the end of the world, though it does risk being recreated at just such a moment as to miss an actual update (via the legacy APIs, at least) :)

@sigurpol sigurpol Aug 4, 2025
Very good point! And I definitely agree that a "better" fix would likely involve the subxt library implementing its own health checks or heartbeat mechanism. This 1h timeout is definitely kicking in too often for something that happens so infrequently.
Maybe for the updater, I could at least raise the timeout to, dunno, 24h?

Contributor Author

bfb9a48: I've added a periodic check for the updater subscription so we just check if the connection is healthy once per hour - and we recreate the subscription only if the check fails instead of doing it every time.

jsdw commented Aug 4, 2025

The code change looks like an improvement to me; just a Q about the runtime update timeout :)

Added RPC health checks every hour to detect hung connections without
the need to recreate the updater subscription every hour.
@sigurpol sigurpol merged commit ba1fe68 into main Aug 4, 2025
9 of 10 checks passed
@sigurpol sigurpol deleted the stalling-subscription-connection branch August 4, 2025 15:09
sigurpol added a commit that referenced this pull request Aug 5, 2025
To investigate the issue of the stalling subscription (see #1119), this version
doesn't actually try to mine and clean up.

The idea is to run it, ideally on the same runner that runs the real
miner on WAH, and observe whether it stalls in this ultra-simplified version.

If it does, we can safely exclude any side effect coming from the
janitor task or the miner task itself.
sigurpol added a commit that referenced this pull request Aug 8, 2025
…ut and diagnostics

Problem: The listener task occasionally stalls after janitor cleanup operations,
despite PR #1119's subscription timeout improvements. Critical observation: the
runtime updater task continues receiving updates while sharing the same ChainClient
and ChainHeadBackend, which strongly suggests the underlying subscription
infrastructure is healthy.
This in turn suggests that the hang occurs in the listener's internal processing,
not in the subscription itself.

Root Cause Analysis:
- Since updater task works, we assume subscription.next() is NOT blocking
- The hang then must occur AFTER subscription.next() returns successfully
- Current timeout only wraps subscription.next(), missing internal processing hangs
- Janitor's synchronous submit_and_watch() + wait_tx_in_finalized_block() shouldn't
  create transient resource contention affecting the listener's storage queries,
  but for a cleanup task the `submit`-only approach is better anyway.

Solution:
1. Comprehensive timeout protection:
   - Wrap entire block processing logic (not just subscription.next())
   - Add 90s timeout for all post-subscription processing
   - Detect "zombie" processing where subscription works but processing hangs

2. Enhanced diagnostics:
   - Add detailed trace logging throughout processing pipeline. Extra
     logging is very verbose with `trace` level
   - Track storage query durations with prometheus metrics
   - Separate subscription health from block processing health metrics
   - Monitor exact hang points (storage queries, phase checks, BlockDetails)

3. Improve janitor task:
   - Change clear_old_round_data() from submit_and_watch() to submit()
     and remove wait_tx_in_finalized_block() to avoid blocking

New Metrics:
- staking_miner_storage_query_duration_ms
- staking_miner_block_state_duration_ms
- staking_miner_block_details_duration_ms
- staking_miner_block_processing_stalls_total
- staking_miner_last_block_processing_timestamp

Most of these metrics can be removed once we have successfully verified
the fix or better tracked the stalling issue.

This fix aims to address what we suspect to be the actual problem (internal
processing hangs) rather than subscription stalls, based on the key evidence
that the updater task continues working normally.
sigurpol added a commit that referenced this pull request Aug 8, 2025
…ut and diagnostics (#1123)
