turbine: Fix and cleanup redundant metric #9287

Merged
steviez merged 2 commits into anza-xyz:master from steviez:turbine_fix_metrics1 on Nov 26, 2025

Conversation

@steviez commented Nov 26, 2025

Problem

The broadcast-process-shreds-interrupted-stats and broadcast-process-shreds-stats metrics have duplicate fields for the number of shreds per slot (probably an artifact of when we rolled merkle shreds out):

  • num_data_shreds & num_merkle_data_shreds
  • num_coding_shreds & num_merkle_coding_shreds

Additionally, there is currently a bug that reports the wrong values when a slot is interrupted. When that happens, we call the function below to create a LAST_IN_SLOT shred, which signals to the rest of the cluster that we're abandoning the block:

```rust
if self.slot != bank.slot() {
    // Finish previous slot if it was interrupted.
    if !self.completed {
        let shreds =
            self.finish_prev_slot(keypair, bank.ticks_per_slot() as u8, process_stats);
```

In the function, we accumulate into the passed-in stats:

```rust
fn finish_prev_slot(
    &mut self,
    keypair: &Keypair,
    max_ticks_in_slot: u8,
    stats: &mut ProcessShredsStats,
) -> Vec<Shred> {
    if self.completed {
        return vec![];
    }
    // Set the reference_tick as if the PoH completed for this slot
    let reference_tick = max_ticks_in_slot;
    let shreds: Vec<_> =
        Shredder::new(self.slot, self.parent, reference_tick, self.shred_version)
            .unwrap()
            .make_merkle_shreds_from_entries(
                keypair,
                &[],  // entries
                true, // is_last_in_slot,
                self.chained_merkle_root,
                self.next_shred_index,
                self.next_code_index,
                &self.reed_solomon_cache,
                stats,
            )
            .inspect(|shred| stats.record_shred(shred))
            .collect();
```

but then report metrics on `self.process_shreds_stats` without first accumulating the just-modified stats:

```rust
self.report_and_reset_stats(/*was_interrupted:*/ true);
```

Instead, the stats object now has some counters updated for the interrupted slot that get aggregated into `self.process_shreds_stats` (i.e., into the next "active" slot) later on:

```rust
self.process_shreds_stats += *process_stats;
```

Assuming we generated S shreds in this function, this results in under-reporting the number of shreds for the interrupted slot by S and over-reporting the number of shreds for the next slot by S.

Summary of Changes

  • First commit: Operate on self.process_shreds_stats within finish_prev_slot()
  • Second commit: Remove the duplicate metric and keep num_data_shreds; including "merkle" seemed redundant to me since all shreds are now of the Merkle variant

There is probably some more cleanup that could be done; e.g., this feels unnecessary:

```rust
.inspect(|shred| {
    process_stats.record_shred(shred);
    let next_index = match shred.shred_type() {
        ShredType::Code => &mut self.next_code_index,
        ShredType::Data => &mut self.next_shred_index,
    };
    *next_index = (*next_index).max(shred.index() + 1);
```

However, I decided to keep the PR slimmer to keep our backport (BP) options open.

@steviez force-pushed the turbine_fix_metrics1 branch from ab93c62 to 18fb482 on November 26, 2025 05:07
@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 82.6%. Comparing base (df2b614) to head (18fb482).

Additional details and impacted files
```
@@            Coverage Diff            @@
##           master    #9287     +/-   ##
=========================================
- Coverage    82.6%    82.6%   -0.1%
=========================================
  Files         892      892
  Lines      321007   320992     -15
=========================================
- Hits       265334   265306     -28
- Misses      55673    55686     +13
```

@steviez steviez marked this pull request as ready for review November 26, 2025 05:45
@steviez steviez requested a review from a team as a code owner November 26, 2025 05:45
@steviez steviez requested a review from AshwinSekar November 26, 2025 05:45
@gregcusack left a comment

This change looks mostly good, but why do we have a ProcessShredsStats as a member of StandardBroadcastRun (see:

```rust
process_shreds_stats: ProcessShredsStats,
```

) and also create a ProcessShredsStats that we pass around within StandardBroadcastRun (see:

```rust
let mut process_stats = ProcessShredsStats::default();
let receive_results = broadcast_utils::recv_slot_entries(
    receiver,
    &mut self.carryover_entry,
    &mut process_stats,
)?;
// TODO: Confirm that last chunk of coding shreds
// will not be lost or delayed for too long.
self.process_receive_results(
    keypair,
    blockstore,
    socket_sender,
    blockstore_sender,
    receive_results,
    &mut process_stats,
)?
```

)? Would it make sense to fully get rid of the one we pass around within StandardBroadcastRun, since that duplication seems to be part of the root problem here?

@steviez (Author) commented Nov 26, 2025

> Would it make sense to fully get rid of the one we pass around within StandardBroadcastRun, since that duplication seems to be part of the root problem here?

Yes, the multiple stats objects are part of the problem. One of the reasons for that is this function:

```rust
let receive_results = broadcast_utils::recv_slot_entries(
    receiver,
    &mut self.carryover_entry,
    &mut process_stats,
)?;
```

That function updates some stats before we know whether those stats belong to the previous (interrupted) slot or the new (next) slot. We don't figure that out until later, so we need to hold them in a separate object until we do:

```rust
if self.slot != bank.slot() {
```

Generally speaking, yes, I think there could be more cleanup on this stats object (i.e., split the struct between the fields updated by Shredder and the fields updated by BroadcastRun implementations). The duplicate-objects problem could potentially be resolved with some refactoring.

But I decided to make a more surgical change to keep backport (BP) options open, as I want to use this metric to help support another perf change that I may try to backport.

@gregcusack gregcusack self-requested a review November 26, 2025 16:27
@gregcusack left a comment

Ahh yes, I see what you're talking about; missed that. Ya, could use a refactor in a follow-up. In that case, this PR LGTM!! Thanks for finding/debugging this!

@steviez steviez merged commit 89ead14 into anza-xyz:master Nov 26, 2025
47 checks passed
@steviez steviez deleted the turbine_fix_metrics1 branch November 26, 2025 17:10
@steviez steviez added the v3.1 Backport to v3.1 branch label Nov 26, 2025
@mergify bot commented Nov 26, 2025

Backports to the beta branch are to be avoided unless absolutely necessary for fixing bugs, security issues, and perf regressions. Changes intended for backport should be structured such that a minimum effective diff can be committed separately from any refactoring, plumbing, cleanup, etc that are not strictly necessary to achieve the goal. Any of the latter should go only into master and ride the normal stabilization schedule. Exceptions include CI/metrics changes, CLI improvements and documentation updates on a case by case basis.

mergify bot pushed a commit that referenced this pull request Nov 26, 2025
The broadcast-process-shred-stats (and interrupted variant) have duplicate fields for the number of data and coding shreds. This is probably a relic of when we rolled merkle shreds out. So, remove the duplicate fields.

This also addresses a small issue where some shreds would be counted
in metrics for the wrong slot when we have an interrupted slot

(cherry picked from commit 89ead14)
steviez added a commit that referenced this pull request Nov 26, 2025
…9303)

turbine: Fix and cleanup redundant metric (#9287)

(cherry picked from commit 89ead14)

Co-authored-by: steviez <[email protected]>
AvhiMaz pushed a commit to AvhiMaz/agave that referenced this pull request Nov 28, 2025

Labels

v3.1 Backport to v3.1 branch

4 participants