Conversation
|
This should be backported to v3.1 and v3.0 before the next testnet restart |
steviez
left a comment
There was a problem hiding this comment.
So I guess the thought here is that we want visibility into what all the nodes in the cluster (or at least the ones reporting metrics) see in gossip ? This metric will allow us to determine what percent nodes see, but obviously we won't know which nodes aren't visible to some operator unless we're able to get logs from them.
some nodes saw 80% of stake online and some did not.
The ones that don't see 80% won't start and won't emit validator-new and all the regular "steady-state" metrics.
@alexpyattaev & @gregcusack - Thoughts on whether you think this metric would be helpful ? I know y'all (and maybe others) looked into the most recent failed restarts during Breakpoint
|
Well we know quite well what the problem is (gossip handshake is slowing things down too much). Gossip today reports how many nodes are online (which is not quite the same), so these metrics might be helpful in case we get issues, we are designing a fix for gossip join. |
|
This will provide visibility into a class of bugs that might only manifest during WFSM. We should have added it after this bug: We expect some noise in the % of stake visible in gossip, but both the 2024 bug and the recent one produced multiple modes which we wouldn't expect just from noise. Seeing multiple modes early in a restart would provide more time to investigate while the cluster is WFSM. |
alexpyattaev
left a comment
There was a problem hiding this comment.
I think this extra visibility is good to have. One real concern is the wrong shred_version stat - it should be always zero in current gossip, so likely that tracking it is not useful. I'd prefer to remove it, any propagation of gossip messages with wrong shred_version is considered to be a bug in gossip.
steviez
left a comment
There was a problem hiding this comment.
This will provide visibility into a class of bugs that might only manifest during WFSM.
Cool, I'm onboard and will defer approval & merging of this one to Alex and/or Greg as I'll likely be reviewing the v3.1 and/or v3.1 BP's.
Thanks for the PR Will !
|
|
For the future, I think we probably could have continued in this PR. I can't see the setting since the PR is closed now, but there is this option "Maintainers are allowed to edit this pull request" that most people have set. Ie, here it is from #9961 Also, Will opened the PR and responded to my question pretty quickly so I think let's let him continue if he's wanting/willing. We want the change soon but not a vuln or anything so I think it is fine if another couple days pass before we merge |
|
@willhickey - I think CI failed for a reason that isn't related to your changes. Would you mind rebasing to tip of master ? Also, while we're at it, would you mind squashing the commits down ? We don't enforce it uniformly but if you already have to rebase then you'll have to force push. Not a requirement tho |
… 0 and the associated code will be cleaned up in a future PR
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #9803 +/- ##
=======================================
Coverage 82.5% 82.5%
=======================================
Files 844 844
Lines 316758 316761 +3
=======================================
+ Hits 261578 261628 +50
+ Misses 55180 55133 -47 🚀 New features to boost your workflow:
|
|
@gregcusack maybe worth backporting this? |
yesssss good idea |
|
Backports to the beta branch are to be avoided unless absolutely necessary for fixing bugs, security issues, and perf regressions. Changes intended for backport should be structured such that a minimum effective diff can be committed separately from any refactoring, plumbing, cleanup, etc that are not strictly necessary to achieve the goal. Any of the latter should go only into master and ride the normal stabilization schedule. Exceptions include CI/metrics changes, CLI improvements and documentation updates on a case by case basis. |
* Add wfsm metric. Add trace logging for peers. * Remove trace logging, since peers are already logged by gossip * Remove wrong_shred_stake from wfsm_gossip metric. This will always be 0 and the associated code will be cleaned up in a future PR (cherry picked from commit 14a5055)

Problem
The 2025-12-03 testnet restart failed because some nodes saw 80% of stake online and some did not.
Summary of Changes
Add a metric to make it easier to monitor WFSM status in future restarts.