Add WFSM stake metric #9803

Merged
alexpyattaev merged 3 commits into anza-xyz:master from willhickey:wfsm_metrics
Jan 14, 2026

@willhickey

Problem

The 2025-12-03 testnet restart failed because some nodes saw 80% of stake online and some did not.

Summary of Changes

Add a metric to make it easier to monitor wait-for-supermajority (WFSM) status in future restarts.
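
For readers unfamiliar with the check, the quantity being monitored can be sketched roughly as follows. This is an illustrative Rust sketch, not the code in this PR: the function names `visible_stake_percent` and `meets_threshold`, the string validator ids, and the map layout are all assumptions made for the example. Only the 80% figure comes from the problem statement above.

```rust
use std::collections::{HashMap, HashSet};

/// Percentage (0..=100, rounded down) of total stake that belongs to
/// validators currently visible in gossip.
/// `stakes` maps a hypothetical validator id to its stake;
/// `visible` is the set of ids this node currently sees.
fn visible_stake_percent(stakes: &HashMap<&str, u64>, visible: &HashSet<&str>) -> u64 {
    // Sum in u128 so that large stake values cannot overflow.
    let total: u128 = stakes.values().map(|&s| s as u128).sum();
    if total == 0 {
        return 0;
    }
    // Ids that are visible but carry no stake contribute nothing.
    let online: u128 = visible
        .iter()
        .filter_map(|id| stakes.get(*id))
        .map(|&s| s as u128)
        .sum();
    (online * 100 / total) as u64
}

/// True when the visible stake reaches the required percentage.
/// The threshold is a parameter here; 80 matches the figure quoted above.
fn meets_threshold(percent: u64, threshold_percent: u64) -> bool {
    percent >= threshold_percent
}
```

With stakes of 50, 30 and 20, a node that sees only the first two validators computes 80 and just meets an 80% threshold, while a node that sees only the first computes 50 and keeps waiting. Two nodes with slightly different gossip views can therefore land on opposite sides of the threshold.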

@mergify mergify bot requested a review from a team January 5, 2026 20:06
@willhickey
Author

This should be backported to v3.1 and v3.0 before the next testnet restart

@steviez steviez left a comment

So I guess the thought here is that we want visibility into what all the nodes in the cluster (or at least the ones reporting metrics) see in gossip? This metric will allow us to determine what percent of stake each node sees, but obviously we won't know which nodes aren't visible to a given operator unless we're able to get logs from them.

some nodes saw 80% of stake online and some did not.

The ones that don't see 80% won't start and won't emit validator-new and all the regular "steady-state" metrics.

@alexpyattaev & @gregcusack - Thoughts on whether this metric would be helpful? I know y'all (and maybe others) looked into the most recent failed restarts during Breakpoint.

@alexpyattaev

Well, we know quite well what the problem is: the gossip handshake is slowing things down too much. Gossip today reports how many nodes are online, which is not quite the same thing, so these metrics might be helpful if we run into issues. We are designing a fix for gossip join.

@alexpyattaev alexpyattaev self-requested a review January 12, 2026 07:36
@willhickey
Author

This will provide visibility into a class of bugs that might only manifest during WFSM. We should have added it after this bug:
https://discord.com/channels/428295358100013066/478692221441409024/1293642426959003678

We expect some noise in the % of stake visible in gossip, but both the 2024 bug and the recent one produced multiple modes which we wouldn't expect just from noise. Seeing multiple modes early in a restart would provide more time to investigate while the cluster is WFSM.
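
To make "multiple modes" concrete, here is one rough way to flag them from the values that nodes report. This is a hypothetical sketch and not part of the PR: the helper names `cluster_reports` and `looks_multimodal`, the integer percentages, and the gap and minimum-size parameters are all assumptions chosen for illustration. In practice the same thing could simply be read off a dashboard.

```rust
/// Groups reported visible-stake percentages into clusters: after sorting,
/// a jump larger than `max_gap` between neighbouring values starts a new cluster.
fn cluster_reports(reports: &[u64], max_gap: u64) -> Vec<Vec<u64>> {
    let mut sorted = reports.to_vec();
    sorted.sort_unstable();
    let mut clusters: Vec<Vec<u64>> = Vec::new();
    for value in sorted {
        // Compare with the last value of the current cluster, if there is one.
        // Values are ascending, so `value - prev` cannot underflow.
        let start_new = match clusters.last().and_then(|c| c.last()) {
            Some(&prev) => value - prev > max_gap,
            None => true,
        };
        if start_new {
            clusters.push(vec![value]);
        } else if let Some(cluster) = clusters.last_mut() {
            cluster.push(value);
        }
    }
    clusters
}

/// True when more than one cluster contains at least `min_nodes` reports,
/// i.e. the spread looks like distinct groups rather than noise or a lone outlier.
fn looks_multimodal(reports: &[u64], max_gap: u64, min_nodes: usize) -> bool {
    cluster_reports(reports, max_gap)
        .iter()
        .filter(|c| c.len() >= min_nodes)
        .count()
        > 1
}
```

With a gap of 5 points and a minimum of 2 nodes per cluster, reports of 79, 81, 80, 62, 61, 63 and 80 split into two groups and are flagged, whereas 78 through 82 form a single group and are treated as ordinary noise.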

@alexpyattaev alexpyattaev added the CI Pull Request is ready to enter CI label Jan 12, 2026
@anza-team anza-team removed the CI Pull Request is ready to enter CI label Jan 12, 2026

@alexpyattaev alexpyattaev left a comment

I think this extra visibility is good to have. One real concern is the wrong shred_version stat: it should always be zero in current gossip, so tracking it is likely not useful. I'd prefer to remove it, since any propagation of gossip messages with the wrong shred_version is considered a bug in gossip.

@steviez steviez left a comment

This will provide visibility into a class of bugs that might only manifest during WFSM.

Cool, I'm on board and will defer approval & merging of this one to Alex and/or Greg, as I'll likely be reviewing the v3.0 and/or v3.1 BPs.

Thanks for the PR, Will!

@gregcusack

gregcusack commented Jan 12, 2026

closing in favor of: #9961 (Will's commit + comments above)

@gregcusack gregcusack closed this Jan 12, 2026
@steviez

steviez commented Jan 13, 2026

closing in favor of: #9961 (Will's commit + comments above)

For the future, I think we probably could have continued in this PR. I can't see the setting since the PR is closed now, but there is an option, "Maintainers are allowed to edit this pull request", that most people have set. I.e., here it is from #9961:
[image: the setting as shown in #9961]

Also, Will opened the PR and responded to my question pretty quickly, so I think we should let him continue if he's willing. We want the change soon, but it isn't a vuln fix or anything, so I think it is fine if another couple of days pass before we merge.

@gregcusack gregcusack reopened this Jan 13, 2026
@gregcusack gregcusack added the CI Pull Request is ready to enter CI label Jan 13, 2026
@anza-team anza-team removed the CI Pull Request is ready to enter CI label Jan 13, 2026
@steviez

steviez commented Jan 13, 2026

@willhickey - I think CI failed for a reason that isn't related to your changes. Would you mind rebasing to the tip of master? Also, while we're at it, would you mind squashing the commits down? We don't enforce it uniformly, but if you already have to rebase then you'll have to force push anyway. Not a requirement though.

@steviez steviez added the CI Pull Request is ready to enter CI label Jan 14, 2026
@anza-team anza-team removed the CI Pull Request is ready to enter CI label Jan 14, 2026
@codecov-commenter

Codecov Report

❌ Patch coverage is 0% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.5%. Comparing base (804c089) to head (c02095d).
⚠️ Report is 2 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #9803   +/-   ##
=======================================
  Coverage    82.5%    82.5%           
=======================================
  Files         844      844           
  Lines      316758   316761    +3     
=======================================
+ Hits       261578   261628   +50     
+ Misses      55180    55133   -47     

@alexpyattaev alexpyattaev left a comment

Looks great thank you!

@alexpyattaev

@gregcusack maybe worth backporting this?

@gregcusack gregcusack left a comment

lgtm! thank you!

@alexpyattaev alexpyattaev added this pull request to the merge queue Jan 14, 2026
@alexpyattaev alexpyattaev removed this pull request from the merge queue due to a manual request Jan 14, 2026
@gregcusack

@gregcusack maybe worth backporting this?

yesssss good idea

@alexpyattaev alexpyattaev added this pull request to the merge queue Jan 14, 2026
@alexpyattaev alexpyattaev added the v3.1 Backport to v3.1 branch label Jan 14, 2026
@mergify

mergify bot commented Jan 14, 2026

Backports to the beta branch are to be avoided unless absolutely necessary for fixing bugs, security issues, and perf regressions. Changes intended for backport should be structured such that a minimum effective diff can be committed separately from any refactoring, plumbing, cleanup, etc that are not strictly necessary to achieve the goal. Any of the latter should go only into master and ride the normal stabilization schedule. Exceptions include CI/metrics changes, CLI improvements and documentation updates on a case by case basis.

Merged via the queue into anza-xyz:master with commit 14a5055 Jan 14, 2026
49 checks passed
mergify bot pushed a commit that referenced this pull request Jan 14, 2026
* Add wfsm metric. Add trace logging for peers.

* Remove trace logging, since peers are already logged by gossip

* Remove wrong_shred_stake from wfsm_gossip metric. This will always be 0 and the associated code will be cleaned up in a future PR

(cherry picked from commit 14a5055)
alexpyattaev pushed a commit that referenced this pull request Jan 14, 2026
Add WFSM stake metric (#9803)

* Add wfsm metric. Add trace logging for peers.

(cherry picked from commit 14a5055)

Co-authored-by: Will Hickey <will.hickey@anza.xyz>