turbine: Fix and cleanup redundant metric (#9287)
Codecov Report: ✅ All modified and coverable lines are covered by tests.

```
@@            Coverage Diff             @@
##            master    #9287    +/-   ##
=========================================
- Coverage     82.6%    82.6%    -0.1%
=========================================
  Files          892      892
  Lines       321007   320992      -15
=========================================
- Hits        265334   265306      -28
- Misses       55673    55686      +13
```
this change looks mostly good. but why do we have `ProcessShredsStats` as a member of `StandardBroadcastRun` and also a `ProcessShredsStats` that we pass around within `StandardBroadcastRun` (see: agave/turbine/src/broadcast_stage/standard_broadcast_run.rs, Lines 457 to 471 in 9169c96)? that duplication seems to be part of the root problem here.
Yes, the multiple stats objects are part of the problem. One of the reasons for that is this function: agave/turbine/src/broadcast_stage/standard_broadcast_run.rs, Lines 458 to 462 in 9169c96

That function updates some stats before we know whether those stats are for the previous (interrupted) or new (next) slot. We don't figure that out until later (link below), so we need to hold them in a separate object until we do.

Generally speaking, yes, I think there could be more cleanup on this stats object (i.e. split the struct between stuff updated by `Shredder` and stuff updated by `BroadcastRun` implementations). The duplicate-objects problem could potentially be resolved with some refactoring. But I decided to make a more surgical change to keep backport (BP) options open, as I want to use this metric to help support another perf change that I may try to BP.
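The "hold stats in a separate object until we know which slot they belong to" pattern described above can be sketched as follows. This is a hypothetical, heavily simplified illustration; `ProcessShredsStats`, `aggregate`, and the field names here are stand-ins, not the actual agave types.

```rust
// Sketch: shreds are produced (and stats accumulated) before we know whether
// they belong to the interrupted slot or the next one, so the counts live in
// a standalone object until attribution is decided.

#[derive(Default, Debug)]
struct ProcessShredsStats {
    num_data_shreds: u64,
}

impl ProcessShredsStats {
    // Merge another stats object into this one.
    fn aggregate(&mut self, other: &ProcessShredsStats) {
        self.num_data_shreds += other.num_data_shreds;
    }
}

fn main() {
    // Stats accumulated while making shreds, before slot attribution is known.
    let pending = ProcessShredsStats { num_data_shreds: 5 };

    let mut interrupted_slot_stats = ProcessShredsStats::default();
    let mut next_slot_stats = ProcessShredsStats::default();

    // Only later do we learn which slot the shreds belong to.
    let belongs_to_interrupted_slot = true;
    if belongs_to_interrupted_slot {
        interrupted_slot_stats.aggregate(&pending);
    } else {
        next_slot_stats.aggregate(&pending);
    }

    assert_eq!(interrupted_slot_stats.num_data_shreds, 5);
    assert_eq!(next_slot_stats.num_data_shreds, 0);
}
```

The deferred merge is exactly what makes it easy to report the member stats at the wrong moment, which is the bug this PR fixes.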
gregcusack left a comment:
ahh yess i see what you're talking about. missed that. ya could use a refactor in a follow up. in that case, this pr lgtm!! thanks for finding/debugging this!
Backports to the beta branch are to be avoided unless absolutely necessary for fixing bugs, security issues, and perf regressions. Changes intended for backport should be structured such that a minimum effective diff can be committed separately from any refactoring, plumbing, cleanup, etc. that are not strictly necessary to achieve the goal. Any of the latter should go only into master and ride the normal stabilization schedule. Exceptions include CI/metrics changes, CLI improvements and documentation updates on a case by case basis.
The broadcast-process-shred-stats (and interrupted variant) metrics have duplicate fields for the number of data and coding shreds. This is probably a relic of when we rolled Merkle shreds out, so remove the duplicate fields. This also addresses a small issue where some shreds would be counted in metrics for the wrong slot when we have an interrupted slot. (cherry picked from commit 89ead14)
…9303) turbine: Fix and cleanup redundant metric (#9287) (cherry picked from commit 89ead14) Co-authored-by: steviez <[email protected]>
Problem
The `broadcast-process-shreds-interrupted-stats` and `broadcast-process-shreds-stats` metrics have duplicate fields for the number of shreds per slot (probably an artifact of when we rolled Merkle shreds out):
- `num_data_shreds` & `num_merkle_data_shreds`
- `num_coding_shreds` & `num_merkle_coding_shreds`

Additionally, there is currently a bug that will report the wrong values when a slot is interrupted. When the slot is interrupted, we call the function below to create a `LAST_IN_SLOT` shred, which signals to the rest of the cluster that we're abandoning the block:
agave/turbine/src/broadcast_stage/standard_broadcast_run.rs
Lines 208 to 212 in df2b614
In the function, we accumulate into the passed-in `stats`:
agave/turbine/src/broadcast_stage/standard_broadcast_run.rs
Lines 77 to 102 in df2b614
but then report metrics on `self.process_shred_stats` without accumulating the just-modified `stats`:
agave/turbine/src/broadcast_stage/standard_broadcast_run.rs
Line 106 in df2b614
Instead, the `stats` object now has some counters updated for the interrupted slot that will get aggregated into `self.process_shred_stats` (the next "active" slot) later on:
agave/turbine/src/broadcast_stage/standard_broadcast_run.rs
Line 331 in df2b614
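The resulting miscount can be reproduced with a minimal sketch. All names here are illustrative stand-ins for the real code in standard_broadcast_run.rs, not the actual agave types:

```rust
// Sketch of the bug: shreds produced while finishing the interrupted slot are
// accumulated into a local `stats`, but metrics for that slot are reported
// from the member stats before the local is merged in. The local is merged
// later, crediting those shreds to the *next* slot instead.

#[derive(Default, Clone, Copy)]
struct Stats {
    num_data_shreds: u64,
}

// Returns (shreds reported for the interrupted slot, shreds reported for the
// next slot) when `s` shreds are generated while finishing the prior slot.
fn report_with_bug(s: u64) -> (u64, u64) {
    // Accumulated into the passed-in `stats` while making LAST_IN_SLOT shreds.
    let mut local_stats = Stats::default();
    local_stats.num_data_shreds += s;

    // Metrics for the interrupted slot are reported from the member stats,
    // which does not yet include `local_stats`.
    let mut member_stats = Stats::default();
    let reported_for_interrupted = member_stats.num_data_shreds;

    // `local_stats` is aggregated into the member stats later, so the next
    // slot's report is inflated by the same `s` shreds.
    member_stats.num_data_shreds += local_stats.num_data_shreds;
    let reported_for_next = member_stats.num_data_shreds;

    (reported_for_interrupted, reported_for_next)
}

fn main() {
    let (interrupted, next) = report_with_bug(4);
    assert_eq!(interrupted, 0); // under-reports the interrupted slot by S
    assert_eq!(next, 4); // over-reports the next slot by S
}
```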
Assuming we generated `S` shreds in this function, this results in us under-reporting the number of shreds for the interrupted slot by `S` and over-reporting the number of shreds for the next slot by `S`.

Summary of Changes
- Report and reset `self.process_shred_stats` within `finish_prev_slot()`
- Remove the duplicate fields and keep `num_data_shreds` / `num_coding_shreds`; including `merkle` in the names seemed redundant to me since all shreds are now of the Merkle variant

There is probably some more cleanup that could be done; i.e., this feels unnecessary:
agave/turbine/src/broadcast_stage/standard_broadcast_run.rs
Lines 135 to 141 in df2b614
However, I decided to keep the PR slimmer to keep our options for backport (BP) open.
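The "report and reset within `finish_prev_slot()`" change above can be sketched as follows. This is a simplified, hypothetical illustration of the direction of the fix; `Run`, `Stats`, and the method shape are stand-ins for `StandardBroadcastRun` and its stats member, not the actual agave code.

```rust
// Sketch of the fix direction: accumulate the LAST_IN_SLOT shreds into the
// member stats, then report and reset those stats inside finish_prev_slot()
// so the interrupted slot's counters never leak into the next slot.

#[derive(Default)]
struct Stats {
    num_data_shreds: u64,
}

struct Run {
    process_shred_stats: Stats,
}

impl Run {
    // Returns the shred count reported for the interrupted slot.
    fn finish_prev_slot(&mut self, shreds_for_last_in_slot: u64) -> u64 {
        // Count the LAST_IN_SLOT shreds against the interrupted slot...
        self.process_shred_stats.num_data_shreds += shreds_for_last_in_slot;
        // ...report for the interrupted slot, then reset so the next slot
        // starts from zero.
        let reported = self.process_shred_stats.num_data_shreds;
        self.process_shred_stats = Stats::default();
        reported
    }
}

fn main() {
    let mut run = Run { process_shred_stats: Stats { num_data_shreds: 10 } };
    let reported = run.finish_prev_slot(4);
    assert_eq!(reported, 14); // interrupted slot gets all of its shreds
    assert_eq!(run.process_shred_stats.num_data_shreds, 0); // next slot starts clean
}
```

Reporting and resetting at the slot boundary removes the need to defer attribution at all, which is what makes the duplicate stats objects a candidate for the follow-up refactor discussed above.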