turbine: Fix and cleanup redundant metric (#9287)
Codecov Report: ✅ All modified and coverable lines are covered by tests.

```
@@            Coverage Diff             @@
##            master    #9287    +/-   ##
=========================================
- Coverage     82.6%    82.6%    -0.1%
=========================================
  Files          892      892
  Lines       321007   320992      -15
=========================================
- Hits        265334   265306      -28
- Misses       55673    55686      +13
```
this change looks mostly good. but why do we have `ProcessShredsStats` as a member of `StandardBroadcastRun` and also a `ProcessShredsStats` that we pass around within `StandardBroadcastRun` (see: agave/turbine/src/broadcast_stage/standard_broadcast_run.rs, Lines 457 to 471 in 9169c96)? that duplication seems to be part of the root problem here.
Yes, the multiple stats objects are part of the problem. One of the reasons for that is this function: agave/turbine/src/broadcast_stage/standard_broadcast_run.rs, Lines 458 to 462 in 9169c96

That function updates some stats before we know whether those stats are for the previous (interrupted) or new (next) slot. We don't figure that out until later (link below), so we need to hold them in a separate object until we do.

Generally speaking, yes, I think there could be more cleanup on this stats object (i.e. split the struct between stuff updated by `Shredder` and stuff updated by `BroadcastRun` implementations). The duplicate-objects problem could potentially be resolved with some refactoring. But I decided to make a more surgical change to keep backport (BP) options open, as I want to use this metric to help support another perf change that I may try to BP.
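The "hold stats in a separate object until we know which slot they belong to" pattern described above can be sketched as follows. This is a hypothetical, heavily simplified illustration; `ProcessShredsStats`, `aggregate`, and the field names here are stand-ins, not the actual agave types.

```rust
// Sketch: shreds are produced (and stats accumulated) before we know whether
// they belong to the interrupted slot or the next one, so the counts live in
// a standalone object until attribution is decided.

#[derive(Default, Debug)]
struct ProcessShredsStats {
    num_data_shreds: u64,
}

impl ProcessShredsStats {
    // Merge another stats object into this one.
    fn aggregate(&mut self, other: &ProcessShredsStats) {
        self.num_data_shreds += other.num_data_shreds;
    }
}

fn main() {
    // Stats accumulated while making shreds, before slot attribution is known.
    let pending = ProcessShredsStats { num_data_shreds: 5 };

    let mut interrupted_slot_stats = ProcessShredsStats::default();
    let mut next_slot_stats = ProcessShredsStats::default();

    // Only later do we learn which slot the shreds belong to.
    let belongs_to_interrupted_slot = true;
    if belongs_to_interrupted_slot {
        interrupted_slot_stats.aggregate(&pending);
    } else {
        next_slot_stats.aggregate(&pending);
    }

    assert_eq!(interrupted_slot_stats.num_data_shreds, 5);
    assert_eq!(next_slot_stats.num_data_shreds, 0);
}
```

The deferred merge is exactly what makes it easy to report the member stats at the wrong moment, which is the bug this PR fixes.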
gregcusack left a comment:
ahh yess i see what you're talking about. missed that. ya could use a refactor in a follow up. in that case, this pr lgtm!! thanks for finding/debugging this!
Backports to the beta branch are to be avoided unless absolutely necessary for fixing bugs, security issues, and perf regressions. Changes intended for backport should be structured such that a minimum effective diff can be committed separately from any refactoring, plumbing, cleanup, etc. that are not strictly necessary to achieve the goal. Any of the latter should go only into master and ride the normal stabilization schedule. Exceptions include CI/metrics changes, CLI improvements and documentation updates on a case by case basis.
The broadcast-process-shred-stats (and interrupted variant) metrics have duplicate fields for the number of data and coding shreds. This is probably a relic of when we rolled Merkle shreds out, so remove the duplicate fields. This also addresses a small issue where some shreds would be counted in metrics for the wrong slot when we have an interrupted slot. (cherry picked from commit 89ead14)
…9303) turbine: Fix and cleanup redundant metric (#9287) (cherry picked from commit 89ead14) Co-authored-by: steviez <[email protected]>
Problem
The `broadcast-process-shreds-interrupted-stats` and `broadcast-process-shreds-stats` metrics have duplicate fields for the number of shreds per slot (probably an artifact of when we rolled Merkle shreds out):
- `num_data_shreds` & `num_merkle_data_shreds`
- `num_coding_shreds` & `num_merkle_coding_shreds`

Additionally, there is currently a bug that will report the wrong values when a slot is interrupted. When the slot is interrupted, we call the function below to create a `LAST_IN_SLOT` shred, which signals to the rest of the cluster that we're abandoning the block:
agave/turbine/src/broadcast_stage/standard_broadcast_run.rs
Lines 208 to 212 in df2b614
In the function, we accumulate into the passed-in `stats`:
agave/turbine/src/broadcast_stage/standard_broadcast_run.rs
Lines 77 to 102 in df2b614
but then report metrics on `self.process_shred_stats` without accumulating the just-modified `stats`:
agave/turbine/src/broadcast_stage/standard_broadcast_run.rs
Line 106 in df2b614
Instead, the `stats` object now has some counters updated for the interrupted slot that will get aggregated into `self.process_shred_stats` (the next "active" slot) later on:
agave/turbine/src/broadcast_stage/standard_broadcast_run.rs
Line 331 in df2b614
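The resulting miscount can be reproduced with a minimal sketch. All names here are illustrative stand-ins for the real code in standard_broadcast_run.rs, not the actual agave types:

```rust
// Sketch of the bug: shreds produced while finishing the interrupted slot are
// accumulated into a local `stats`, but metrics for that slot are reported
// from the member stats before the local is merged in. The local is merged
// later, crediting those shreds to the *next* slot instead.

#[derive(Default, Clone, Copy)]
struct Stats {
    num_data_shreds: u64,
}

// Returns (shreds reported for the interrupted slot, shreds reported for the
// next slot) when `s` shreds are generated while finishing the prior slot.
fn report_with_bug(s: u64) -> (u64, u64) {
    // Accumulated into the passed-in `stats` while making LAST_IN_SLOT shreds.
    let mut local_stats = Stats::default();
    local_stats.num_data_shreds += s;

    // Metrics for the interrupted slot are reported from the member stats,
    // which does not yet include `local_stats`.
    let mut member_stats = Stats::default();
    let reported_for_interrupted = member_stats.num_data_shreds;

    // `local_stats` is aggregated into the member stats later, so the next
    // slot's report is inflated by the same `s` shreds.
    member_stats.num_data_shreds += local_stats.num_data_shreds;
    let reported_for_next = member_stats.num_data_shreds;

    (reported_for_interrupted, reported_for_next)
}

fn main() {
    let (interrupted, next) = report_with_bug(4);
    assert_eq!(interrupted, 0); // under-reports the interrupted slot by S
    assert_eq!(next, 4); // over-reports the next slot by S
}
```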
Assuming we generated `S` shreds in this function, this results in us under-reporting the number of shreds for the interrupted slot by `S` and over-reporting the number of shreds for the next slot by `S`.

Summary of Changes
- Report and reset `self.process_shred_stats` within `finish_prev_slot()`
- Remove the duplicate fields and keep `num_data_shreds` / `num_coding_shreds`; including `merkle` in the names seemed redundant to me since all shreds are now of the Merkle variant

There is probably some more cleanup that could be done; i.e., this feels unnecessary:
agave/turbine/src/broadcast_stage/standard_broadcast_run.rs
Lines 135 to 141 in df2b614
However, I decided to keep the PR slimmer to keep our options for backport (BP) open.
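The "report and reset within `finish_prev_slot()`" change above can be sketched as follows. This is a simplified, hypothetical illustration of the direction of the fix; `Run`, `Stats`, and the method shape are stand-ins for `StandardBroadcastRun` and its stats member, not the actual agave code.

```rust
// Sketch of the fix direction: accumulate the LAST_IN_SLOT shreds into the
// member stats, then report and reset those stats inside finish_prev_slot()
// so the interrupted slot's counters never leak into the next slot.

#[derive(Default)]
struct Stats {
    num_data_shreds: u64,
}

struct Run {
    process_shred_stats: Stats,
}

impl Run {
    // Returns the shred count reported for the interrupted slot.
    fn finish_prev_slot(&mut self, shreds_for_last_in_slot: u64) -> u64 {
        // Count the LAST_IN_SLOT shreds against the interrupted slot...
        self.process_shred_stats.num_data_shreds += shreds_for_last_in_slot;
        // ...report for the interrupted slot, then reset so the next slot
        // starts from zero.
        let reported = self.process_shred_stats.num_data_shreds;
        self.process_shred_stats = Stats::default();
        reported
    }
}

fn main() {
    let mut run = Run { process_shred_stats: Stats { num_data_shreds: 10 } };
    let reported = run.finish_prev_slot(4);
    assert_eq!(reported, 14); // interrupted slot gets all of its shreds
    assert_eq!(run.process_shred_stats.num_data_shreds, 0); // next slot starts clean
}
```

Reporting and resetting at the slot boundary removes the need to defer attribution at all, which is what makes the duplicate stats objects a candidate for the follow-up refactor discussed above.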