Fix stalling issue if subscription.next() is stuck #1119
Conversation
1. Replaced `tokio::select!` with `tokio::time::timeout`, which directly wraps `subscription.next()`.
2. Simplified the timeout logic: it is now a straightforward 60-second timeout on each subscription call.
3. Fixed the stalling issue: the timeout will always fire after 60 seconds, even if `subscription.next()` is stuck.

The key improvement is that the timeout now wraps the potentially hanging operation directly, guaranteeing detection of a stalled subscription. This eliminates the future starvation issue that could prevent the original timeout from firing.
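A minimal sketch of the pattern, assuming a generic stream in place of the real block subscription (the stuck stream and the handling branches are illustrative stand-ins, not the actual code):

```rust
use std::time::Duration;

use futures::StreamExt;
use tokio::time::timeout;

#[tokio::main]
async fn main() {
    // A stream that never yields simulates a stuck `subscription.next()`.
    let mut subscription = futures::stream::pending::<u32>();

    loop {
        // The timeout wraps the potentially hanging call directly, so it
        // always fires after 60s even if the stream never produces an item.
        match timeout(Duration::from_secs(60), subscription.next()).await {
            Ok(Some(block)) => println!("processing block {block}"),
            Ok(None) => break, // stream ended; the caller would resubscribe
            Err(_elapsed) => {
                eprintln!("no item within 60s; treating the subscription as stalled");
                break;
            }
        }
    }
}
```

Unlike the `tokio::select!` version, nothing else is polled inside the timeout, so a stuck `next()` cannot starve the timer.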
NOTE: CI is failing because it would require Rust > 1.85.0 to build the polkadot binaries. This is actually fixed by #1118, which I hope to get merged very soon.
src/main.rs
Outdated
```rust
        std::time::Duration::from_secs(60 * 60),
        update_stream.next(),
    )
    .await
    {
        Ok(maybe_update) => match maybe_update {
```
The code looks better than the previous version, but I would expect runtime updates to happen fairly infrequently, and so I would assume that this is constantly logging the "No runtime updates received for 1 hour" message and recreating the runtime update subscription?
There was a problem hiding this comment.
This makes me wonder about adding an internal timeout to Subxt so that we can say "if we don't receive any event for X seconds then error". At least in the case of the ChainHead APIs (where runtime updates come on the same stream as other block events), this would be a more elegant way to detect block event + runtime update streams failing.
For now though, I guess recreating the subscription every hour isn't the end of the world, though it does risk being recreated at just such a moment as to miss an actual update (via the legacy APIs, at least) :)
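For illustration, a rough sketch of that internal-timeout idea using tokio-stream's `timeout` adapter as a stand-in (an assumption about the shape of such an API, not anything Subxt provides today):

```rust
use std::time::Duration;

use tokio_stream::StreamExt; // requires the tokio-stream `time` feature

#[tokio::main]
async fn main() {
    // A never-yielding stream stands in for a quiet event subscription.
    let quiet = tokio_stream::pending::<u32>();

    // The adapter yields `Err(Elapsed)` whenever no item arrives within
    // the window; the consumer can treat that as a dead subscription.
    let mut watched = Box::pin(quiet.timeout(Duration::from_secs(5)));

    while let Some(next) = watched.next().await {
        match next {
            Ok(event) => println!("event: {event}"),
            Err(_elapsed) => {
                eprintln!("no event for 5s; treating the stream as dead");
                break;
            }
        }
    }
}
```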
Very good point! And I definitely agree that a "better" fix would likely involve the subxt library implementing its own health checks or heartbeat mechanism. This 1h timeout is definitely kicking in too often for something which happens so infrequently.
Maybe for the updater, I could at least raise the timeout to, dunno, 24h?
bfb9a48: I've added a periodic check for the updater subscription so we just check if the connection is healthy once per hour - and we recreate the subscription only if the check fails instead of doing it every time.
The code change looks like an improvement to me; just a Q about the runtime update timeout :)
Added hourly RPC health checks to detect hung connections, removing the need to recreate the updater subscription every hour.
To investigate the stalling subscription issue (see #1119), this adds an ultra-simplified version that doesn't actually try to mine and clean up. The idea is to run it, ideally on the same runner which runs the real miner on WAH, and observe whether it stalls. If it does, we can safely exclude any side effect coming from the janitor task or the miner task itself.
…ut and diagnostics (#1123)

Problem: The listener task occasionally stalls after janitor cleanup operations, despite PR #1119's subscription timeout improvements. Critical observation: the runtime updater task continues receiving updates while sharing the same ChainClient and ChainHeadBackend, which strongly suggests that the underlying subscription infrastructure is healthy and that the hang occurs in the listener's internal processing, not in the subscription itself.

Root Cause Analysis:
- Since the updater task works, we assume subscription.next() is NOT blocking.
- The hang must therefore occur AFTER subscription.next() returns successfully.
- The current timeout only wraps subscription.next(), missing internal processing hangs.
- The janitor's synchronous submit_and_watch() + wait_tx_in_finalized_block() shouldn't create transient resource contention affecting the listener's storage queries, but for a cleanup task the `submit`-only approach is better anyway.

Solution:
1. Comprehensive timeout protection (see the sketch after this message):
   - Wrap the entire block processing logic, not just subscription.next().
   - Add a 90s timeout for all post-subscription processing.
   - Detect "zombie" processing, where the subscription works but processing hangs.
2. Enhanced diagnostics:
   - Add detailed trace logging throughout the processing pipeline; the extra logging is very verbose at `trace` level.
   - Track storage query durations with Prometheus metrics.
   - Separate subscription health metrics from block processing health metrics.
   - Monitor exact hang points (storage queries, phase checks, BlockDetails).
3. Improve the janitor task:
   - Change clear_old_round_data() from submit_and_watch() to submit() and remove wait_tx_in_finalized_block() to avoid blocking.

New Metrics:
- staking_miner_storage_query_duration_ms
- staking_miner_block_state_duration_ms
- staking_miner_block_details_duration_ms
- staking_miner_block_processing_stalls_total
- staking_miner_last_block_processing_timestamp

Most of these metrics can be removed once we have successfully verified the fix or better tracked the stalling issue. This fix aims to address what we suspect is the actual problem (internal processing hangs) rather than the subscription stalls, based on the key evidence that the updater task continues working normally.
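A minimal sketch of the comprehensive-timeout idea from the solution above, with hypothetical stand-ins (`process_block`, `handle_block`) for the real listener internals:

```rust
use std::time::Duration;

use tokio::time::timeout;

// Hypothetical stand-in for the listener's post-subscription work:
// storage queries, phase checks, BlockDetails construction, etc.
async fn process_block(block: u32) -> Result<(), String> {
    println!("processing block {block}");
    Ok(())
}

// The 90s timeout covers everything that happens after
// `subscription.next()` returns, so a hang inside processing
// ("zombie" processing) is detected, not just a stuck subscription.
async fn handle_block(block: u32) -> Result<(), String> {
    match timeout(Duration::from_secs(90), process_block(block)).await {
        Ok(result) => result,
        Err(_elapsed) => {
            // Here the real code would bump a stall metric such as
            // staking_miner_block_processing_stalls_total.
            Err(format!("processing of block {block} stalled for 90s"))
        }
    }
}

#[tokio::main]
async fn main() {
    if let Err(e) = handle_block(42).await {
        eprintln!("{e}");
    }
}
```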
Changes
In the listener task:
The key improvement is that the timeout now wraps the potentially hanging operation directly, providing more reliable detection of a stalled subscription. This should eliminate the future starvation issue that could prevent the original timeout from firing.
For `runtime_upgrade_task` we follow a different approach. We expect to receive a runtime upgrade very infrequently, so a timeout approach doesn't work well. What we do now is check periodically, once per hour, whether the updater subscription is healthy, and recreate it only if it's not. This way we avoid recreating it unconditionally, with the unnecessary overhead and the risk of recreating the connection just while we are actually receiving an update.
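A minimal sketch of this periodic health check, assuming a stubbed probe (`connection_is_healthy`) and a generic stream in place of the real updater subscription:

```rust
use std::time::Duration;

use futures::StreamExt;
use tokio::time::{interval, timeout, MissedTickBehavior};

// Hypothetical probe; in the real code this could be any cheap RPC
// round-trip that fails or hangs when the connection is dead.
async fn connection_is_healthy() -> bool {
    true
}

#[tokio::main]
async fn main() {
    // Stand-in for the runtime update subscription.
    let mut update_stream = futures::stream::pending::<u32>();

    let mut health_check = interval(Duration::from_secs(60 * 60));
    health_check.set_missed_tick_behavior(MissedTickBehavior::Delay);
    health_check.tick().await; // the first tick completes immediately

    loop {
        tokio::select! {
            // Runtime upgrades are rare, so the stream itself carries
            // no timeout and may stay quiet for a long time.
            maybe_update = update_stream.next() => match maybe_update {
                Some(update) => println!("applying runtime update {update}"),
                None => break, // stream ended; the caller would resubscribe
            },
            // Once per hour, probe the connection; recreate the
            // subscription only if the probe fails or itself hangs.
            _ = health_check.tick() => {
                match timeout(Duration::from_secs(10), connection_is_healthy()).await {
                    Ok(true) => {} // healthy: keep the existing subscription
                    _ => {
                        eprintln!("health check failed; recreating subscription");
                        break;
                    }
                }
            }
        }
    }
}
```

Recreating only on a failed probe avoids the hourly resubscription overhead and the risk of dropping the connection in the middle of an actual update.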