Add WFSM stake metric by willhickey · Pull Request #9803 · anza-xyz/agave

willhickey · 2026-01-05T20:06:16Z

Problem

The 2025-12-03 testnet restart failed because some nodes saw 80% of stake online and some did not.

Summary of Changes

Add a metric to make it easier to monitor WFSM status in future restarts.
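As a rough illustration of what such a metric could report, the sketch below computes the share of stake a node currently sees in gossip and checks it against the 80% figure from the Problem statement. All names here (`NodeId`, `WfsmStakeView`, `compute_stake_view`) are hypothetical and are not taken from this PR or from `core/src/validator.rs`; the real code works with validator pubkeys, gossip state, and the metrics pipeline, none of which is modeled here.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical stand-in for a validator identity (a pubkey in a real node).
type NodeId = &'static str;

/// Stake visibility as a single node sees it while waiting for supermajority.
/// Hypothetical type, for illustration only.
#[derive(Debug, PartialEq)]
struct WfsmStakeView {
    total_stake: u64,
    online_stake: u64,
}

impl WfsmStakeView {
    /// Percent of total stake visible in gossip, rounded down, in 0..=100.
    fn online_percent(&self) -> u64 {
        if self.total_stake == 0 {
            return 0;
        }
        // Widen to u128 so the multiplication by 100 cannot overflow.
        (u128::from(self.online_stake) * 100 / u128::from(self.total_stake)) as u64
    }

    /// True once the visible stake reaches the given threshold (e.g. 80).
    fn meets_threshold(&self, threshold_percent: u64) -> bool {
        self.online_percent() >= threshold_percent
    }
}

/// Sums the stake of every staked node that this node can see in gossip.
fn compute_stake_view(
    staked_nodes: &HashMap<NodeId, u64>,
    visible_in_gossip: &HashSet<NodeId>,
) -> WfsmStakeView {
    let total_stake: u64 = staked_nodes.values().sum();
    let online_stake: u64 = staked_nodes
        .iter()
        .filter(|(id, _)| visible_in_gossip.contains(*id))
        .map(|(_, stake)| *stake)
        .sum();
    WfsmStakeView {
        total_stake,
        online_stake,
    }
}
```

With stakes of 50, 30, and 20 and only the first two nodes visible, `online_percent()` returns 80 and `meets_threshold(80)` is true. Reporting the raw percentage rather than only the pass/fail result is what would let operators compare views across nodes.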

willhickey · 2026-01-05T20:07:34Z

This should be backported to v3.1 and v3.0 before the next testnet restart

steviez

So I guess the thought here is that we want visibility into what all the nodes in the cluster (or at least the ones reporting metrics) see in gossip? This metric will let us determine what percentage of stake each node sees, but obviously we won't know which nodes aren't visible to a given operator unless we're able to get logs from them.

> some nodes saw 80% of stake online and some did not.

The ones that don't see 80% won't start and won't emit validator-new and all the regular "steady-state" metrics.

@alexpyattaev & @gregcusack - Thoughts on whether this metric would be helpful? I know y'all (and maybe others) looked into the most recent failed restarts during Breakpoint

alexpyattaev · 2026-01-12T07:36:02Z

Well, we know quite well what the problem is (the gossip handshake is slowing things down too much). Gossip today reports how many nodes are online (which is not quite the same), so these metrics might be helpful in case we get issues. We are designing a fix for gossip join.

willhickey · 2026-01-12T17:31:33Z

This will provide visibility into a class of bugs that might only manifest during WFSM. We should have added it after this bug:
https://discord.com/channels/428295358100013066/478692221441409024/1293642426959003678

We expect some noise in the % of stake visible in gossip, but both the 2024 bug and the recent one produced multiple modes which we wouldn't expect just from noise. Seeing multiple modes early in a restart would provide more time to investigate while the cluster is WFSM.
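One way the "multiple modes" signal could be checked on the monitoring side is sketched below: per-node reports of the percentage of stake visible are sorted and split into clusters wherever the gap between neighbours exceeds a tolerance. This is a hypothetical analysis helper, not part of this PR; the function names, `max_gap`, and `min_nodes` are assumptions chosen for illustration.

```rust
/// Splits per-node "percent of stake visible" reports into clusters.
/// Two neighbouring values (after sorting) share a cluster when they
/// differ by at most `max_gap` percentage points.
fn cluster_reports(mut reports: Vec<u64>, max_gap: u64) -> Vec<Vec<u64>> {
    reports.sort_unstable();
    let mut clusters: Vec<Vec<u64>> = Vec::new();
    for value in reports {
        // Values are sorted, so `value` is never below the last cluster member.
        let starts_new_cluster = match clusters.last() {
            Some(cluster) => value - cluster[cluster.len() - 1] > max_gap,
            None => true,
        };
        if starts_new_cluster {
            clusters.push(vec![value]);
        } else if let Some(cluster) = clusters.last_mut() {
            cluster.push(value);
        }
    }
    clusters
}

/// Counts clusters backed by at least `min_nodes` reports, so that a
/// single outlier node is treated as noise rather than as a mode.
fn count_modes(reports: Vec<u64>, max_gap: u64, min_nodes: usize) -> usize {
    cluster_reports(reports, max_gap)
        .iter()
        .filter(|cluster| cluster.len() >= min_nodes)
        .count()
}
```

A result greater than one would be the cue to start investigating while the cluster is still waiting for supermajority. The gap and minimum cluster size would need tuning against the noise level actually observed.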

alexpyattaev

I think this extra visibility is good to have. One real concern is the wrong shred_version stat: it should always be zero in current gossip, so tracking it is likely not useful. I'd prefer to remove it; any propagation of gossip messages with the wrong shred_version is considered a bug in gossip.

core/src/validator.rs

steviez

> This will provide visibility into a class of bugs that might only manifest during WFSM.

Cool, I'm on board and will defer approval & merging of this one to Alex and/or Greg, as I'll likely be reviewing the v3.1 and/or v3.0 BPs.

Thanks for the PR Will !

core/src/validator.rs

gregcusack · 2026-01-12T23:21:59Z

~~closing in favor of: #9961 (Will's commit + comments above)~~

steviez · 2026-01-13T01:47:35Z

> closing in favor of: #9961 (Will's commit + comments above)

For the future, I think we probably could have continued in this PR. I can't see the setting since the PR is closed now, but there is an option, "Maintainers are allowed to edit this pull request", that most people have set. I.e., here it is from #9961

Also, Will opened the PR and responded to my question pretty quickly, so I think we should let him continue if he's wanting/willing. We want the change soon, but it's not a vuln or anything, so I think it is fine if another couple of days pass before we merge

steviez · 2026-01-13T17:24:53Z

@willhickey - I think CI failed for a reason that isn't related to your changes. Would you mind rebasing to the tip of master? Also, while we're at it, would you mind squashing the commits down? We don't enforce it uniformly, but if you already have to rebase then you'll have to force push anyway. Not a requirement tho

codecov-commenter · 2026-01-14T13:37:38Z

Codecov Report

❌ Patch coverage is 0% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.5%. Comparing base (804c089) to head (c02095d).
⚠️ Report is 2 commits behind head on master.

Additional details and impacted files

@@           Coverage Diff           @@
##           master    #9803   +/-   ##
=======================================
  Coverage    82.5%    82.5%           
=======================================
  Files         844      844           
  Lines      316758   316761    +3     
=======================================
+ Hits       261578   261628   +50     
+ Misses      55180    55133   -47

alexpyattaev

Looks great thank you!

alexpyattaev · 2026-01-14T18:47:38Z

@gregcusack maybe worth backporting this?

gregcusack

lgtm! thank you!

gregcusack · 2026-01-14T18:52:30Z

> @gregcusack maybe worth backporting this?

yesssss good idea

mergify · 2026-01-14T19:38:14Z

Backports to the beta branch are to be avoided unless absolutely necessary for fixing bugs, security issues, and perf regressions. Changes intended for backport should be structured such that a minimum effective diff can be committed separately from any refactoring, plumbing, cleanup, etc that are not strictly necessary to achieve the goal. Any of the latter should go only into master and ride the normal stabilization schedule. Exceptions include CI/metrics changes, CLI improvements and documentation updates on a case by case basis.

* Add wfsm metric. Add trace logging for peers.
* Remove trace logging, since peers are already logged by gossip
* Remove wrong_shred_stake from wfsm_gossip metric. This will always be 0 and the associated code will be cleaned up in a future PR

(cherry picked from commit 14a5055)

Add WFSM stake metric (#9803)

* Add wfsm metric. Add trace logging for peers.

(cherry picked from commit 14a5055)

Co-authored-by: Will Hickey <will.hickey@anza.xyz>

mergify bot added community need:merge-assist labels Jan 5, 2026

mergify bot requested a review from a team January 5, 2026 20:06

steviez reviewed Jan 12, 2026

alexpyattaev self-requested a review January 12, 2026 07:36

alexpyattaev added the CI Pull Request is ready to enter CI label Jan 12, 2026

anza-team removed the CI Pull Request is ready to enter CI label Jan 12, 2026

alexpyattaev reviewed Jan 12, 2026

steviez reviewed Jan 12, 2026

gregcusack mentioned this pull request Jan 12, 2026

Wfsm stake metric. Continuation of pr 9803 #9961

Closed

gregcusack closed this Jan 12, 2026

gregcusack reopened this Jan 13, 2026

gregcusack added the CI Pull Request is ready to enter CI label Jan 13, 2026

anza-team removed the CI Pull Request is ready to enter CI label Jan 13, 2026

steviez requested review from alexpyattaev and gregcusack January 13, 2026 17:18

willhickey added 3 commits January 14, 2026 07:04

Add wfsm metric. Add trace logging for peers.

1ce6eec

Remove trace logging, since peers are already logged by gossip

2a758ad

Remove wrong_shred_stake from wfsm_gossip metric. This will always be…

c02095d

… 0 and the associated code will be cleaned up in a future PR

steviez force-pushed the wfsm_metrics branch from 74e8c9a to c02095d Compare January 14, 2026 13:04

steviez added the CI Pull Request is ready to enter CI label Jan 14, 2026

anza-team removed the CI Pull Request is ready to enter CI label Jan 14, 2026

alexpyattaev approved these changes Jan 14, 2026

gregcusack approved these changes Jan 14, 2026

alexpyattaev added this pull request to the merge queue Jan 14, 2026

alexpyattaev removed this pull request from the merge queue due to a manual request Jan 14, 2026

alexpyattaev added this pull request to the merge queue Jan 14, 2026

alexpyattaev added the v3.1 Backport to v3.1 branch label Jan 14, 2026

Merged via the queue into anza-xyz:master with commit 14a5055 Jan 14, 2026
49 checks passed

mergify bot mentioned this pull request Jan 14, 2026

v3.1: Add WFSM stake metric (backport of #9803) #10035

Merged

gregcusack mentioned this pull request Jan 21, 2026

WFSM: Remove logging of nodes with wrong shred version #9962

Merged
