Conversation
Codecov Report ✅ All modified and coverable lines are covered by tests.

```
@@           Coverage Diff           @@
##           master   #10760   +/-  ##
=======================================
  Coverage    83.0%    83.0%
=======================================
  Files         849      849
  Lines      318370   318370
=======================================
+ Hits       264488   264504    +16
+ Misses      53882    53866    -16
```
vadorovsky
left a comment
> as best as i can tell we dont have anything simple to measure stakes cache and rewards distribution
What I do personally is grab a ledger a few minutes after crossing an epoch boundary and then repeatedly use agave-ledger-tool to replay the epoch boundary, but that's pretty annoying to do (especially because I have to catch the moment of the epoch calculations; I was only able to do it by setting gdb/lldb breakpoints 😅), so having some bench would be nice.
That said, I think such a benchmark should at least match the mainnet load, see the inline comment.
runtime/benches/epoch_turnover.rs (outdated)

```rust
const NUM_STAKE_ACCOUNTS: usize = 100_000;
const NUM_VOTE_ACCOUNTS: usize = 10;
```
These values are unfortunately way lower than what we see on mainnet, and I don't think they would be capable of reproducing some of the performance issues I've been fixing at the epoch boundary (#7742, #8065).
I think we should have two sets of values:
- Matching mainnet load - ~1_000 validators/vote accounts and ~1_000_000 stake accounts.
- Some even larger "stress" values, zillions both for vote and stake accounts, as close to OOMing a devbox as possible. 🙂 I think this would let us find even more performance issues and come up with some nice improvements.
yea i just chose these numbers as the largest that lets the rust bench harness finish within single-digit minutes. switching to Criterion and cranking down the sample size, i can do 1m stake / 1k vote fine:
```
bench_epoch_turnover/HANA
        time:   [160.48 ms 161.36 ms 161.99 ms]
        change: [−0.7566% +0.0322% +0.7800%] (p = 0.95 > 0.05)
        No change in performance detected.

Benchmarking bench_epoch_rewards_period/HANA: Warming up for 1.0000 s
Warning: Unable to complete 10 samples in 10.0s. You may wish to increase target time to 56.2s.

bench_epoch_rewards_period/HANA
        time:   [4.1164 s 4.1551 s 4.1983 s]
        change: [−3.1529% −1.8001% −0.3664%] (p = 0.03 < 0.05)
        Change within noise threshold.
```
I did not examine the actual benchmark, but will butt my head in here anyway with an opinion some others may not share. We have added similar microbenchmarks in the past, and what I've tended to see is that we use the benchmark to rapidly iterate on several improvements. Then the benchmark is rarely or never run again, but fairly consistently needs maintenance as interfaces change. I've even seen us functionally break benches and go months without noticing, since no one runs them. I would encourage thoughtfulness on whether this will be used as a one-off or longer-term when deciding to merge to main - or just use it in near-term PRs to show improvement.
```rust
    std::sync::Arc,
    test::{Bencher, black_box},
};
```
Can we enable jemalloc in this bench?
```rust
#[cfg(not(any(target_env = "msvc", target_os = "freebsd")))]
#[global_allocator]
static GLOBAL: jemallocator::Jemalloc = jemallocator::Jemalloc;
```

You'll also need to add the following to the Cargo.toml of the local crate:

```toml
[target.'cfg(not(any(target_env = "msvc", target_os = "freebsd")))'.dependencies]
jemallocator = { workspace = true }
```

Without this, the bench will use the glibc allocator, which is way slower in such scenarios. Usually when I profile benches without jemalloc, all I see is page faults and drops taking most of the time. 😅
ive added jemalloc and changed the benches to use the product of trivial/full votes and trivial/full stakes. this makes it easy to add bigger cases when testing locally. tbh the complete case is already slow as hell tho, if anything jemalloc may have made it slightly worse
I'm usually on board with Alessandro and Trent in their crusade against benchmarks, but in this case I'm actually in favor of adding one, given that my comments above are addressed and we make it as close to the mainnet behavior as possible. Usually the main reason benchmarks are inaccurate is pretty much what I commented on inline - numbers that are too low and the glibc allocator. Reasons I'm in favor of improving and merging this one:
And yes, I would use it pretty much daily.
Problem
as best as i can tell we dont have anything simple to measure stakes cache and rewards distribution
Summary of Changes
add two simple benches, one measuring how long going from one epoch to the next takes, and the other how long the rewards period takes. this captures the full cost of stakes cache-related work, including accounts-db access, which seems better than just measuring `StakesCache` functions in isolation

however these also seem to be extremely well-targeted: with 10 stake accounts, we take 150us/iter and 3.5ms/iter. with 100k stake accounts, we take 16ms/iter and 350ms/iter. `bench_epoch_rewards_period()` spends 95% of its measured time, and `bench_epoch_turnover()` >99%, inside the `update_epoch_time_us` metric, which contains `process_new_epoch()`, `update_epoch_stakes()`, and `distribute_partitioned_epoch_rewards()`