Skip to content

Conversation

@AshwinSekar
Copy link
Contributor

@AshwinSekar AshwinSekar commented Mar 10, 2025

Problem

The replay voting loop needs the ability to query if a certain slot is skip_certified

Certain situations can cause multiple skip range certificates to be available in the skip pool (see test_multi_cert for an example). In these scenarios we need the ability to query all skip certs to see if the voting loop slot is skip_certified.

Current skip pool impl also has incorrect behavior around consecutive slots and contributors (see test_consecutive_slots and test_contributor_removed)

Summary of Changes

Add functions for skip_certified, and a general multi cert scan fn. Does not remove the query implementation, will potentially remove when integrating the voting loop / maybe start leader.

.iter()
.any(|range| range.contains(&slot))
{
// If we are already have a certificate no reason to rescan (potentially costly)
Copy link
Contributor Author

@AshwinSekar AshwinSekar Mar 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the assumption here is if a slot is already part of a skip certificate then it's fine to not update, as once part of a skip certificate it cannot be removed. Also the up_to_date is to avoid scanning the range every time a new vote is added, however if you feel it's clearer we can mimic the behavior in query

generally I think we need to have protections around ingesting all-to-all / gossip when catching up as it could overwrite votes from replay that are necessary to skip certify "old" slots leading to catch up stalling.
i don't know the best way to deal with this yet, will explore.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

generally I think we need to have protections around ingesting all-to-all / gossip when catching up as it could overwrite votes from replay that are necessary to skip certify "old" slots leading to catch up stalling.

Each block in replay should have the skip/notarization certificates necessary to prove block replay, so checking the vote states should be sufficient to prove the block is valid, shouldn't require checking the skip pool here I think.

As long as we're getting valid blocks, catch up shouldn't stall.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that's true, just trying to figure out the best way to query.

If we're catching up, we should only look at the skip certificates from replayed votes, however if we're at tip then we should ingest all-to-all / gossip as well. We can't really tell which scenario we're in so I think we could:

  • Maintain a separate skip pool with only votes from replay, check the replay skip pool first and then this general skip pool
  • Tag votes with their source here and allow storage of 1 vote per source per validator
    Open to any better suggestions haha

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense

  1. For new blocks, check vote state. A separate skip pool could work, or we could run a simple algorithm that scans across each vote state and checks if the most recent skip vote covers the entire range (M, N) where M is the parent and N is this replayed block. If so, increment the stake counter.
  2. We stream the latest votes during replay into this main skip pool

let certs = self
.tree
.iter()
.scan(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might be less nesting here to use a filter_map:

 let mut accumulated = 0f64;
 let mut current_contributors = BTreeSet::new();
 let mut cert: Option<(Slot, BTreeSet<Pubkey>)> = None;

let certs = self.tree.iter().filter_map(move |(slot, (starts, ends))| {

   // Once we hit threshold return the cert
    if accumulated < threshold_stake {
            let (start_slot, contributors)) = cert.take().expect("must have some contributors");
             return Some((start_slot..=*slot, contributors));
        }
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense, got too eager trying to skyline 81b4b02

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lmao, write all your code in one line

.iter()
.any(|range| range.contains(&slot))
{
// If we are already have a certificate no reason to rescan (potentially costly)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

generally I think we need to have protections around ingesting all-to-all / gossip when catching up as it could overwrite votes from replay that are necessary to skip certify "old" slots leading to catch up stalling.

Each block in replay should have the skip/notarization certificates necessary to prove block replay, so checking the vote states should be sufficient to prove the block is valid, shouldn't require checking the skip pool here I think.

As long as we're getting valid blocks, catch up shouldn't stall.

Comment on lines +128 to +131
if accumulated <= threshold_stake {
if let Some((start_slot, contributors)) = &current_cert {
// Skip certificate has ended, reset and publish
certs.push(((*start_slot, *slot), contributors.clone()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the threshold never falls below threshold_stake before the loop ends, we should still push the current cert if it exists

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this can actually happen right, since every vote has to end accumulated is guaranteed to be <= threshold_stake before the loop ends.

Added a debug assert for this case 4c08dbe

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah yeah, we separate out the ends, makes sense

@AshwinSekar AshwinSekar requested a review from carllin March 11, 2025 15:16
@AshwinSekar AshwinSekar merged commit 91f3deb into anza-xyz:master Mar 11, 2025
7 checks passed
@AshwinSekar AshwinSekar deleted the skip-pool branch March 11, 2025 22:52
bw-solana pushed a commit to bw-solana/alpenglow that referenced this pull request Aug 1, 2025
bw-solana pushed a commit to bw-solana/alpenglow that referenced this pull request Aug 1, 2025
bw-solana pushed a commit to bw-solana/alpenglow that referenced this pull request Aug 1, 2025
bw-solana pushed a commit to bw-solana/alpenglow that referenced this pull request Aug 1, 2025
bw-solana pushed a commit to bw-solana/alpenglow that referenced this pull request Aug 1, 2025
bw-solana pushed a commit to bw-solana/alpenglow that referenced this pull request Aug 1, 2025
bw-solana pushed a commit to bw-solana/alpenglow that referenced this pull request Aug 2, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants