runtime: Collect stake delegations only once during epoch activation #8065

Merged
vadorovsky merged 1 commit into anza-xyz:master from vadorovsky:epoch-one-iteration
Nov 7, 2025

vadorovsky (Member) commented Sep 16, 2025

Problem

Processing a new epoch (Bank::process_new_epoch) involves collecting stake delegations twice:

  1. In Stakes::activate_epoch, to create a stake history entry and refresh vote accounts.
  2. In Bank::filter_stake_delegations, which is then used in Bank::calculate_stake_vote_rewards to calculate rewards for stakers and voters.

The overall time of crossing the epoch boundary is ~519ms:

update_epoch_us=519953i

The two heaviest operations are the collect() calls on stake delegations, each taking ~200-220ms:

[Screenshots: before_0, before_1]

Summary of Changes

Reduce that to a single collect into a Vec<(&Pubkey, &StakeAccount)>, done at the beginning of Bank::process_new_epoch, and pass the collected stake delegations to the other methods.
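The idea can be sketched outside of the runtime as follows. This is a minimal illustration, not the code from this PR: `Key`, `Delegation`, `total_stake`, `stake_per_voter` and `process_epoch` are hypothetical names, and a std `HashMap` stands in for `im::HashMap`.

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for `Pubkey` and `StakeAccount`; not the real agave types.
type Key = u64;

#[derive(Debug, Clone, PartialEq)]
pub struct Delegation {
    pub stake: u64,
    pub voter: Key,
}

/// Collects the map into a vector of references exactly once.
/// A std `HashMap` is used here in place of `im::HashMap`.
pub fn collect_delegations(map: &HashMap<Key, Delegation>) -> Vec<(&Key, &Delegation)> {
    map.iter().collect()
}

/// First consumer: total stake (loosely, the input for a stake history entry).
pub fn total_stake(delegations: &[(&Key, &Delegation)]) -> u64 {
    delegations.iter().map(|(_, d)| d.stake).sum()
}

/// Second consumer: stake grouped by voter (loosely, the input for reward calculation).
pub fn stake_per_voter(delegations: &[(&Key, &Delegation)]) -> HashMap<Key, u64> {
    let mut out = HashMap::new();
    for (_, d) in delegations {
        *out.entry(d.voter).or_insert(0) += d.stake;
    }
    out
}

/// Hypothetical epoch-boundary driver: one collect, two consumers sharing the slice.
pub fn process_epoch(map: &HashMap<Key, Delegation>) -> (u64, HashMap<Key, u64>) {
    let delegations = collect_delegations(map);
    (total_stake(&delegations), stake_per_voter(&delegations))
}
```

Both consumers borrow the same slice, so the expensive map traversal happens once, no matter how many later stages need the data.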

The new time of crossing the epoch boundary is ~337ms:

update_epoch_us=337371i

There is now only one heavy collect() on stake delegations, which still takes most of the main thread's time. But that's the best we can do while still using im::HashMap.

[Screenshot: after_collect]

Making that change possible required several refactors:

  • Take &PointValue in Bank::create_epoch_rewards_sysvar. That makes it easier to operate on references of PartitionedRewardsCalculation. Copying integers from PointValue is cheap and has no visible performance impact.
  • Split Stakes::activate_epoch, which was performing calculations and mutating the cache at the same time. The calculations were moved to Stakes::calculate_activated_stake, which takes &self.
  • Add the Stakes::stake_delegations_vec method. Stake delegations are stored as a hash array mapped trie (HAMT)[0], which means that inserts, deletions and lookups are average-case O(1) and worst-case O(log n). However, iteration is slow due to depth-first traversal and jumps, and it is currently impossible to iterate over it with rayon. That issue is known and handled by converting the HAMT to a vector with stakes.stake_delegations.iter().collect(). Move that trick to a dedicated method that documents the performance consequences.
  • Add the FilteredStakeDelegation wrapper type, which wraps a vector of stake delegations and acts as a lazy iterator that filters out those with insufficient stake.
  • Split the code dealing with rewards calculation and vote rewards distribution into separate methods:
    • Bank::calculate_rewards that takes &self and does not acquire any locks.
    • Bank::begin_partitioned_rewards that takes &mut self, sets calculation status and creates a sysvar.
    • Bank::distribute_vote_rewards that stores partitioned rewards and increases capitalization.
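The split between a read-only calculation and a separate mutating step, described in the list above, can be sketched like this. `StakesSketch` and `cross_epoch` are hypothetical, heavily simplified names used only for illustration; the real `Stakes` type holds far more state and has different signatures.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified cache; not the real `Stakes` type.
#[derive(Debug, Default)]
pub struct StakesSketch {
    pub epoch: u64,
    pub delegations: HashMap<u64, u64>, // delegation id -> staked lamports
    pub history: Vec<(u64, u64)>,       // (epoch, total activated stake)
}

impl StakesSketch {
    /// Calculation step: takes `&self`, so it cannot mutate the cache.
    /// It works on delegations that the caller has already collected.
    pub fn calculate_activated_stake(&self, delegations: &[(&u64, &u64)]) -> u64 {
        delegations.iter().map(|(_, stake)| **stake).sum()
    }

    /// Mutating step: only applies a result that was computed earlier.
    pub fn activate_epoch(&mut self, next_epoch: u64, activated: u64) {
        self.history.push((self.epoch, activated));
        self.epoch = next_epoch;
    }
}

/// Hypothetical driver: compute under a shared borrow, then mutate.
pub fn cross_epoch(stakes: &mut StakesSketch, next_epoch: u64) -> u64 {
    let activated = {
        // The collected vector borrows the cache immutably; the borrow ends with this block.
        let delegations: Vec<(&u64, &u64)> = stakes.delegations.iter().collect();
        stakes.calculate_activated_stake(&delegations)
    };
    stakes.activate_epoch(next_epoch, activated);
    activated
}
```

Because the calculation only needs a shared reference, the collected delegations can stay borrowed while it runs, and the mutable borrow is taken only for the short update at the end.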

[0] https://en.wikipedia.org/wiki/Hash_array_mapped_trie
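A lazy filtering wrapper over an already collected vector, as described for FilteredStakeDelegation, might look roughly like the sketch below. The names `Account` and `FilteredSketch` and the plain `u64` threshold are assumptions made for this example; the actual type in the PR may be shaped differently.

```rust
/// Hypothetical stand-in for a stake account.
#[derive(Debug, PartialEq)]
pub struct Account {
    pub stake: u64,
}

/// Wraps delegations that were collected once, plus a minimum stake threshold.
pub struct FilteredSketch<'a> {
    delegations: Vec<(&'a u64, &'a Account)>,
    min_stake: u64,
}

impl<'a> FilteredSketch<'a> {
    pub fn new(delegations: Vec<(&'a u64, &'a Account)>, min_stake: u64) -> Self {
        Self { delegations, min_stake }
    }

    /// Lazily yields entries that meet the threshold.
    /// Only references are copied; no second vector is built.
    pub fn iter(&self) -> impl Iterator<Item = (&'a u64, &'a Account)> + '_ {
        let min = self.min_stake;
        self.delegations
            .iter()
            .copied()
            .filter(move |(_, acc)| acc.stake >= min)
    }
}
```

The filter runs as the consumer iterates, so delegations with insufficient stake are skipped without allocating a filtered copy of the vector.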

Fixes: #8282

@vadorovsky vadorovsky force-pushed the epoch-one-iteration branch 5 times, most recently from 2b1439a to 5525c4f Compare September 23, 2025 11:54
@vadorovsky vadorovsky changed the title from "runtime: Iterate over stake delegations only once during epoch activation" to "runtime: Collect stake delegations only once during epoch activation" Sep 29, 2025
@vadorovsky vadorovsky force-pushed the epoch-one-iteration branch 12 times, most recently from b52cfbf to 3b50554 Compare October 3, 2025 09:59

codecov-commenter commented Oct 3, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 81.9%. Comparing base (0f761dc) to head (b84c5cf).
⚠️ Report is 3 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff            @@
##           master    #8065    +/-   ##
========================================
  Coverage    81.9%    81.9%            
========================================
  Files         860      860            
  Lines      326456   326603   +147     
========================================
+ Hits       267624   267784   +160     
+ Misses      58832    58819    -13     

@vadorovsky vadorovsky marked this pull request as ready for review October 3, 2025 10:50
@vadorovsky vadorovsky force-pushed the epoch-one-iteration branch from 3b50554 to d5159c1 Compare October 4, 2025 08:15

HaoranYi commented Oct 6, 2025

There is an issue with this PR for epoch_reward_cache.

The PR moved the cache check after the computation. Before the PR, the cache was checked before computing rewards in calculate_rewards_and_distribute_vote_rewards. After the PR, the cache is only populated in save_rewards, which happens after the expensive computation.

vadorovsky (Member, Author) commented Oct 6, 2025

> After the PR, the cache is only populated in save_rewards, which happens after the expensive computation.

And your worry is that it will take more than one slot? Or is there something else you have in mind?

To be precise - the computation you're talking about, currently takes around 50ms. And the entire epoch boundary after this change - 330ms. So I think we are fine. The overall goal of my optimizations here is to keep epoch boundary below one slot.


HaoranYi commented Oct 7, 2025

> > After the PR, the cache is only populated in save_rewards, which happens after the expensive computation.
>
> And your worry is that it will take more than one slot? Or is there something else you have in mind?
>
> To be precise - the computation you're talking about, currently takes around 50ms. And the entire epoch boundary after this change - 330ms. So I think we are fine. The overall goal of my optimizations here is to keep epoch boundary below one slot.

Yes. We used to have many forks at the epoch boundary, and the cache was introduced to avoid computing the rewards again on forks. If we are certain that there will be no forks, we can remove the cache. In this PR, we store to the cache but never read from it, which seems like a waste.
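The check-before-compute pattern discussed here can be sketched as follows. This is a generic illustration under assumptions: `RewardsCache`, `get_or_compute` and the `u64` parent identifier are hypothetical and do not mirror the actual epoch reward cache in the runtime.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical cache of reward calculations, keyed by an identifier of the parent block.
pub struct RewardsCache {
    entries: Mutex<HashMap<u64, Arc<Vec<u64>>>>,
}

impl RewardsCache {
    pub fn new() -> Self {
        Self { entries: Mutex::new(HashMap::new()) }
    }

    /// Looks up the cache first and runs `compute` only on a miss.
    /// For simplicity, the lock is held while computing.
    pub fn get_or_compute(
        &self,
        parent_id: u64,
        compute: impl FnOnce() -> Vec<u64>,
    ) -> Arc<Vec<u64>> {
        let mut entries = self.entries.lock().unwrap();
        if let Some(hit) = entries.get(&parent_id) {
            // Another fork with the same parent already paid for the calculation.
            return Arc::clone(hit);
        }
        let value = Arc::new(compute());
        entries.insert(parent_id, Arc::clone(&value));
        value
    }
}
```

If the lookup were moved after the computation, every fork sharing the same parent would repeat the expensive work and the stored entry would never be read.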

@vadorovsky vadorovsky marked this pull request as ready for review November 3, 2025 17:30
@vadorovsky vadorovsky force-pushed the epoch-one-iteration branch 3 times, most recently from ef1f93b to 8ec02be Compare November 5, 2025 08:15
@vadorovsky vadorovsky requested a review from jstarry November 5, 2025 09:47
}

// Calculate rewards from previous epoch and distribute vote rewards
pub(in crate::bank) fn calculate_rewards_and_distribute_vote_rewards(

jstarry commented Nov 5, 2025

Hmm well you also split out Bank::store_vote_accounts_partitioned (also inside Bank::save_rewards) from Bank::calculate_rewards_and_distribute_vote_rewards so the core part of vote reward distribution is actually not in there. But I see your point about the other distribution code being in there still. Do you think we could move all of that code into Bank::save_rewards (maybe rename this to distribute_vote_rewards) so that all the code for vote reward distribution is in the same place?

Specifically:

  • Bank::update_vote_rewards
  • Capitalization update

And then Bank::create_epoch_rewards_sysvar can be called after Bank::save_rewards.

I don't care a lot about keeping the datapoints ("epoch_rewards" and "epoch-rewards-status-update") consistent but others may disagree.

@vadorovsky vadorovsky force-pushed the epoch-one-iteration branch 2 times, most recently from 1d141dd to 4f7bb25 Compare November 6, 2025 13:45
jstarry previously approved these changes Nov 6, 2025

This looks correct to me. I added a comment with some more suggested refactorings but this is fine as is already. Nice work!

@vadorovsky vadorovsky force-pushed the epoch-one-iteration branch 3 times, most recently from 260508d to 0f0253f Compare November 7, 2025 09:29
@vadorovsky vadorovsky requested a review from jstarry November 7, 2025 11:00
self.begin_partitioned_rewards(
parent_slot,
parent_height,
parent_epoch,

Sorry, the commit with my suggestions had a mistake: these params are out of order. parent_epoch should be before parent_slot.

Processing a new epoch (`Bank::process_new_epoch`) involves collecting
stake delegations twice:

1) In `Bank::compute_new_epoch_caches_and_rewards`, to create a stake
   history entry and refresh vote accounts.
2) In `Bank::get_epoch_reward_calculate_param_info`, which is then used
   in `Bank::calculate_stake_vote_rewards` to calculate rewards for
   stakers and voters.

The overall time of crossing the epoch boundary is ~519ms:

```
update_epoch_us=519953i
```

Where the two heaviest operations are `collect()` calls on stake
delegations, each of them taking ~200-220ms.

Reduce that to just one collect by passing the vector from 1), together
with the freshly computed stake history and vote accounts, to
`Bank::begin_partitioned_rewards`. This way, we can avoid calling
`Bank::get_epoch_reward_calculate_param_info`.

The new time of crossing the epoch boundary is ~337ms:

```
update_epoch_us=337371i
```

Making that change possible required several refactors:

* Take `&PointValue` in `Bank::create_epoch_rewards_sysvar`. That makes
  it easier to operate on references of `PartitionedRewardsCalculation`.
  Copying integers from `PointValue` is cheap and has no visible
  performance impact.
* Split `Stakes::activate_epoch`, that was performing calculations and
  mutating the cache at the same time. The calculations got split to
  `Stakes::calculate_activated_stake` that takes `&self`.
* Add `Stakes::stake_delegations_vec` method. Stake delegations are
  stored as hash array mapped trie (HAMT)[0], which means that inserts,
  deletions and lookups are average-case O(1) and worst-case O(log n).
  However, the performance of iterations is poor due to depth-first
  traversal and jumps. Currently it's also impossible to iterate over it
  with rayon. That issue is known and handled by converting the HAMT to
  a vector with `stakes.stake_delegations.iter().collect()`. Move that
  trick to a dedicated method that describes the performance
  consequences.
* Add `FilteredStakeDelegation` wrapper type, that wraps a vector of
  stake delegations and acts as a lazy iterator that filters out ones
  with insufficient stake.
* Split the code dealing with rewards calculation and vote rewards
  distribution into separate methods:
  * `Bank::calculate_rewards` that takes `&self` and does not acquire
    any locks.
  * `Bank::begin_partitioned_rewards` that takes `&mut self`, sets
    calculation status and creates a sysvar.
  * `Bank::distribute_vote_rewards` that stores partitioned rewards and
    increases capitalization.

[0] https://en.wikipedia.org/wiki/Hash_array_mapped_trie

Fixes: anza-xyz#8282
@vadorovsky vadorovsky added this pull request to the merge queue Nov 7, 2025
Merged via the queue into anza-xyz:master with commit 3a2abd6 Nov 7, 2025
44 checks passed
@vadorovsky vadorovsky deleted the epoch-one-iteration branch November 7, 2025 13:29
rustopian pushed a commit to rustopian/agave that referenced this pull request Nov 20, 2025
@vadorovsky vadorovsky added the v3.1 Backport to v3.1 branch label Nov 27, 2025

mergify bot commented Nov 27, 2025

Backports to the beta branch are to be avoided unless absolutely necessary for fixing bugs, security issues, and perf regressions. Changes intended for backport should be structured such that a minimum effective diff can be committed separately from any refactoring, plumbing, cleanup, etc that are not strictly necessary to achieve the goal. Any of the latter should go only into master and ride the normal stabilization schedule. Exceptions include CI/metrics changes, CLI improvements and documentation updates on a case by case basis.

mergify bot pushed a commit that referenced this pull request Nov 27, 2025
(cherry picked from commit 3a2abd6)
vadorovsky added a commit that referenced this pull request Dec 2, 2025
…ation (backport of #8065) (#9321)

runtime: Collect stake delegations only once during epoch activation (#8065)

(cherry picked from commit 3a2abd6)

Co-authored-by: Michal R <[email protected]>
Development

Successfully merging this pull request may close these issues.

runtime: Stake delegations are collected twice during Bank::process_new_epoch

6 participants