Skip to content

cumulus/metrics: Measure the time of including a tx in a backed block#8902

Open
lexnv wants to merge 27 commits intomasterfrom
lexnv/cumulus-to-rpc-layer-metrics
Open

cumulus/metrics: Measure the time of including a tx in a backed block#8902
lexnv wants to merge 27 commits intomasterfrom
lexnv/cumulus-to-rpc-layer-metrics

Conversation

@lexnv
Copy link
Copy Markdown
Contributor

@lexnv lexnv commented Jun 18, 2025

This PR introduces a metric in Cumulus to measure the time it took a transaction received on the RPC layer to be included a backed block.

  • A new TransactionMonitorHandle is exposed from the transaction RPC V2 API to listen on monitoring events emitted by substrate
  • The monitoring event contains the block of the transaction with the timestamp associated
  • A cumulus task is added to group the backed events from the relay chain with the monitoring events exposed by the RPC layer
  • The spawning of RPC spawn_tasks now returns the additional TransactionMonitorHandle
  • The metric is exposed under the following histogram: parachain_transaction_backed_duration

Testing Done

  • Started a small network with 2 validators and 2 collators (charlie and dave -- using small_network.toml as base)
  • Used subxt with a minimal working code to send a transaction to charlie
  • Used the RPC-V2 API (ie ChainHeadBackend)
use std::sync::Arc;
use subxt::{
    OnlineClient, PolkadotConfig,
    backend::{chain_head::ChainHeadBackend, rpc::RpcClient},
};
use subxt_signer::sr25519::dev;

#[subxt::subxt(runtime_metadata_insecure_url = "ws://127.0.0.1:44919")]
pub mod polkadot {}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let rpc = RpcClient::from_url("ws://127.0.0.1:44919".to_string()).await?;
    let backend = ChainHeadBackend::builder().build_with_background_driver(rpc.clone());
    let api: OnlineClient<PolkadotConfig> = OnlineClient::from_backend(Arc::new(backend)).await?;

    let from = dev::charlie();
    let dest = dev::dave().public_key().into();
    let tx = Box::new(polkadot::tx().balances().transfer_allow_death(dest, 10_000));
    api.tx()
        .sign_and_submit_then_watch_default(&tx, &from)
        .await?
        .wait_for_finalized_success()
        .await?;

    Ok(())
}
  • After the transaction gets finalized:
parachain_transaction_backed_duration_sum{chain="asset-hub-rococo-local"} 13.097583666
parachain_transaction_backed_duration_count{chain="asset-hub-rococo-local"} 1

Closes #8383

cc @paritytech/sdk-node

lexnv added 14 commits June 16, 2025 17:10
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Alexandru Vasile <[email protected]>
@lexnv lexnv self-assigned this Jun 18, 2025
@lexnv lexnv added the T0-node This PR/Issue is related to the topic “node”. label Jun 18, 2025
@lexnv
Copy link
Copy Markdown
Contributor Author

lexnv commented Jun 19, 2025

/cmd prdoc --audience node_dev,node_operator --bump patch

@lexnv
Copy link
Copy Markdown
Contributor Author

lexnv commented Jun 20, 2025

/cmd prdoc --audience node_dev --bump patch

@paritytech-workflow-stopper
Copy link
Copy Markdown

All GitHub workflows were cancelled due to failure one of the required jobs.
Failed workflow url: https://github.com/paritytech/polkadot-sdk/actions/runs/15775302503
Failed job name: build-templates-node

@sandreim
Copy link
Copy Markdown
Contributor

sandreim commented Jun 20, 2025

Getting it in a backed block is half way, and it is useful. Can we also add a metric to measure the transaction time to finality? This is for example what Mythical is measuring as latency.

@lexnv
Copy link
Copy Markdown
Contributor Author

lexnv commented Jun 23, 2025

Getting it in a backed block is half way, and it is useful. Can we also add a metric to measure the transaction time to finality? This is for example what Mythical is measuring as latency.

The tx time to transition into different states (including finality) should be solved by:

/// The parachain informant service.
pub struct ParachainInformant<Block: BlockT> {
/// Relay chain interface to interact with the relay chain.
relay_chain_interface: Arc<dyn RelayChainInterface>,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: feel free to ignore, but this can also rely on generics. RelayChainInterface is Send + Sync so we just need clone here. A generic for ParachainInformant like RI: RelayChainInterface + Clone can still do it. I don't think there is a big concern with Arc approach, but previously the code did not use Arc.

All in all I posted this nit because I am curious if there was a reason for why you chose Arc here. Either way should be fine though.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like Arc<dyn> better in general because generics are slower to compile and more importantly, much more cumbersome to work with. So I support this decision 👍

(sp_core::H256::from_slice(block_hash.as_ref()), submitted_at),
};

if self.backed_blocks.peek(&block_hash).is_some() {
Copy link
Copy Markdown
Contributor

@iulianbarbu iulianbarbu Jul 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dq: I imagine the in_block event will hardly be sent after the block is backed, so this is a very improbable corner case, right?

/// Ensures the RPC backed blocks reflect into the metrics and
/// performs the parachain logging.
async fn handle_import_notification(&mut self, n: RelayHeader) {
let candidate_events = match self.relay_chain_interface.candidate_events(n.hash()).await {
Copy link
Copy Markdown
Contributor

@iulianbarbu iulianbarbu Jul 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't found this info that easy, but are these in order from oldest to most recent?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, had to also dig a bit to catch up with the prior context:

)?;

let mut transaction_v2_handles = Vec::with_capacity(2);
transaction_v2_handles.push(rpc_server_handle.transaction_v2_handle);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dq: what's the difference between the rpc_server_module and the in_memory_rpc?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC, the in-memory RPC is not externally exposed to customers. Still, it allows substrate subsystems to call into the same functionality, however, I believe we had a good reason back then to use in-memory (but cant remember offhand):

/// A running RPC service that can perform in-memory RPC queries.
#[derive(Clone)]
pub struct RpcHandlers {
// This is legacy and may be removed at some point, it was for WASM stuff before smoldot was a
// thing. https://github.com/paritytech/polkadot-sdk/pull/5038#discussion_r1694971805
rpc_module: Arc<RpcModule<()>>,
// This can be used to introspect the port the RPC server is listening on. SDK consumers are
// depending on this and it should be supported even if in-memory query support is removed.
listen_addresses: Vec<Multiaddr>,
}

);

// Check the monitor produces the same event
let result = tokio::time::timeout(std::time::Duration::from_secs(5), tx_monitor.next())
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: not sure how 5 seconds stands against our CI hiccups. Thought that this should be more resilient, as in 10x what we'd expect to wait under normal CI load. If 5s is the number, please ignore.

Suggested change
let result = tokio::time::timeout(std::time::Duration::from_secs(5), tx_monitor.next())
let result = tokio::time::timeout(std::time::Duration::from_secs(5), tx_monitor.next())

Copy link
Copy Markdown
Contributor

@sistemd sistemd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice to see the parachain informant turn into a proper type

/// The parachain informant service.
pub struct ParachainInformant<Block: BlockT> {
/// Relay chain interface to interact with the relay chain.
relay_chain_interface: Arc<dyn RelayChainInterface>,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like Arc<dyn> better in general because generics are slower to compile and more importantly, much more cumbersome to work with. So I support this decision 👍

@sandreim
Copy link
Copy Markdown
Contributor

sandreim commented Oct 6, 2025

Why do we want to add this extra complexity here ?

This can be easily done from an external application/process like this:

  • submit tx, note submission_timestamp
  • observe TX status -> InBlock
  • observe the block import, we should see the tx hash there, note the parachain block head hash
  • observe the RC CandidateBacked event which contains the para block head hash paraHead: H256
  • compute the time to backed as rc_block_ts - submission_timestamp

LE: Overall I'd put all of this monitoring stuff into https://github.com/paritytech/polkadot-introspector and keep the node lean.

@sandreim
Copy link
Copy Markdown
Contributor

sandreim commented Oct 6, 2025

AFAIU it could make sense to do it, if it is not easy to do it from the outside -> how does the tool get the txs ? is it feasible to submit them from the tool ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

T0-node This PR/Issue is related to the topic “node”.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

RPC: Add backed event to the metrics

8 participants