@AshwinSekar AshwinSekar commented Nov 19, 2025

Problem

We currently do not advance MigrationStatus during startup at all; it is only set at the bank forks root.
This causes problems if we restart after the migration has succeeded, or during the migrationary period before we've advanced the root.
The old root could indicate that we are still in the migration even though blockstore has already been cleaned up and Alpenglow blocks are present.

Additionally, we do not process the GenesisCertificate marker in the first Alpenglow block. This marker is key: it tells startup (and sometimes a lagging steady state) that the migration was successful. If this marker is processed there is no need to attempt genesis discovery; we can go straight into Alpenglow.

Summary of Changes

Allow processing of the GenesisCertificate marker:

  • Immediately move to ReadyToEnable, since processing the GenesisCertificate through a block means the genesis block is already frozen
  • Allow header and genesis certificate markers to be processed during the migrationary period, for cases where we missed the migration's completion and are receiving the first Alpenglow block

Advance MigrationStatus during startup in load_frozen_forks:

  • If we root the feature flag activation, transition from PreFeatureActivation to Migration
  • Don't root startup blocks that are in the Migration
  • If we process a GenesisCertificate marker, go to ReadyToEnable
  • If we are ReadyToEnable, enable Alpenglow immediately and retry processing from the current block onwards as Alpenglow blocks
  • Do not do genesis discovery - if we don't process a GenesisCertificate then we rely on the standard compute_bank_stats pathway in replay to find the genesis block post startup
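
The startup transitions listed above can be pictured as a small state machine. The following is only a minimal sketch: `MigrationPhase`, `StartupEvent`, and `advance` are hypothetical names made up for illustration and are not the types in this PR, and the assumption that the GenesisCertificate event is only accepted from the migration phase is mine.

```rust
/// Hypothetical, simplified stand-in for the migration state tracked at startup.
/// The real MigrationStatus type is not reproduced here.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MigrationPhase {
    PreFeatureActivation,
    Migration,
    ReadyToEnable,
    AlpenglowEnabled,
}

/// Illustrative events that startup replay can observe.
#[derive(Debug, Clone, Copy)]
enum StartupEvent {
    /// We rooted the block that activated the feature flag.
    RootedFeatureActivation,
    /// We processed a GenesisCertificate marker while replaying a block.
    GenesisCertificateProcessed,
    /// Startup decided to switch over to Alpenglow.
    EnableAlpenglow,
}

/// Apply one event to the current phase. Any pairing not listed leaves the
/// phase unchanged (an assumption of this sketch).
fn advance(phase: MigrationPhase, event: StartupEvent) -> MigrationPhase {
    use MigrationPhase::*;
    use StartupEvent::*;
    match (phase, event) {
        (PreFeatureActivation, RootedFeatureActivation) => Migration,
        (Migration, GenesisCertificateProcessed) => ReadyToEnable,
        (ReadyToEnable, EnableAlpenglow) => AlpenglowEnabled,
        (unchanged, _) => unchanged,
    }
}
```

Note that there is no transition driven by genesis discovery: in this sketch, as in the description above, a node that never sees the marker simply stays in the migration phase until steady-state replay takes over.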

Add a test where a node restarts post migration from a root before the migration and ensure it can catch up.
Add a test where a node misses the migration entirely and ensure that it can catch up.

Note: The certificate in the GenesisCertificate marker is not validated yet. A future PR will plug in BLS verification.

@AshwinSekar AshwinSekar force-pushed the process-genesis-marker branch from f18a2a3 to e004e48 Compare November 20, 2025 19:05
@AshwinSekar AshwinSekar changed the title blockstore_processor: step migration during startup blockstore_processor: advance migration during startup Nov 21, 2025
@AshwinSekar AshwinSekar force-pushed the process-genesis-marker branch 2 times, most recently from f832097 to edc6120 Compare November 21, 2025 03:56
@AshwinSekar AshwinSekar left a comment

Startup replay is slightly different than steady state replay so unfortunately more code 😭

Almost done with the migration, thank y'all for reviewing. Remaining work:

  • BLS verify the certificate in GenesisCertificate block marker

// being verified as a TowerBFT one.
//
// We are safe to cleanly transition to alpenglow here
if migration_status.is_ready_to_enable() {

This is basically analogous to when we enable Alpenglow during steady state in replay.
The differences here are:

  • We reach ReadyToEnable by trying to process the first Alpenglow block as a TowerBFT block and failing. While processing we observed the GenesisCertificate marker, so we know that the migration happened and which block is the genesis.
  • We don't have to purge the blocks > genesis; instead we just reprocess them as Alpenglow blocks (ticks adjusted and markers allowed).
  • We have to reset the dead status and retry this first alpenglow block since we just failed to process it as a TowerBFT block
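
The retry described in these bullets can be sketched as follows. This is a toy model, not the code in this PR: `StartupState`, `Block`, `process_block`, and `replay_with_retry` are hypothetical names, and the reason the TowerBFT attempt fails is modeled abstractly as "the block carries a GenesisCertificate marker", which is an assumption of the sketch.

```rust
use std::collections::HashSet;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ReplayMode {
    TowerBft,
    Alpenglow,
}

/// Hypothetical block summary: only the fields this sketch needs.
struct Block {
    slot: u64,
    has_genesis_certificate_marker: bool,
}

#[derive(Debug, PartialEq, Eq)]
enum ReplayError {
    FailedAsTowerBft,
}

/// Hypothetical startup bookkeeping.
struct StartupState {
    mode: ReplayMode,
    ready_to_enable: bool,
    dead_slots: HashSet<u64>,
}

/// Replay one block in the current mode. In TowerBFT mode a block carrying the
/// marker fails, is marked dead, and flips the state to "ready to enable".
fn process_block(state: &mut StartupState, block: &Block) -> Result<(), ReplayError> {
    if state.mode == ReplayMode::TowerBft && block.has_genesis_certificate_marker {
        state.ready_to_enable = true;
        state.dead_slots.insert(block.slot);
        return Err(ReplayError::FailedAsTowerBft);
    }
    Ok(())
}

/// If the failed attempt left us ready to enable, switch to Alpenglow, clear
/// the dead status, and retry the same block instead of purging anything.
fn replay_with_retry(state: &mut StartupState, block: &Block) -> Result<(), ReplayError> {
    match process_block(state, block) {
        Err(_) if state.ready_to_enable => {
            state.mode = ReplayMode::Alpenglow;
            state.ready_to_enable = false;
            state.dead_slots.remove(&block.slot);
            process_block(state, block)
        }
        other => other,
    }
}
```

The point of the sketch is the ordering: the dead status has to be cleared before the second attempt, otherwise the retry would be skipped as a known-bad slot.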

}.filter(|new_root_bank| {
// In the case that we've restarted while the migrationary period is going on but before alpenglow
// is enabled, don't root blocks past the migration slot
migration_status.should_root_during_startup(new_root_bank.slot())

startup equivalent of

// We do not root during the migration - post genesis rooting is handled by votor
migration_status.should_report_commitment_or_root(*root)
});
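
For readers unfamiliar with the pattern, here is a rough sketch of filtering a candidate root while in the migrationary period. It does not reproduce `should_root_during_startup`; `may_root_at_startup` and `select_new_root` are hypothetical, and treating a candidate exactly at the migration slot as rootable is an assumption.

```rust
/// Hypothetical startup rooting check. `migration_slot` is `Some` only while
/// the node is inside the migrationary period (an assumption of this sketch).
fn may_root_at_startup(migration_slot: Option<u64>, candidate_root: u64) -> bool {
    match migration_slot {
        // Inside the migration: refuse to root anything past the migration slot.
        Some(slot) => candidate_root <= slot,
        // Outside the migration: no restriction from this check.
        None => true,
    }
}

/// Mirrors the `Option::filter` shape of the snippet above: the candidate root
/// is dropped, rather than rejected with an error, when it may not be rooted.
fn select_new_root(migration_slot: Option<u64>, candidate_root: Option<u64>) -> Option<u64> {
    candidate_root.filter(|slot| may_root_at_startup(migration_slot, *slot))
}
```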

root_retain_us += m.as_us();

// If this root bank activated the feature flag, update migration status
if migration_status.is_pre_feature_activation() {

startup equivalent of

// Check if we've rooted a bank that will tell us the migration slot
if migration_status.is_pre_feature_activation() {

@AshwinSekar AshwinSekar marked this pull request as ready for review November 21, 2025 04:08
@AshwinSekar AshwinSekar force-pushed the process-genesis-marker branch from edc6120 to c466111 Compare November 21, 2025 16:29
}

let genesis_cert = Certificate::from(genesis_cert);
// TODO(ashwin): verify genesis cert using bls sigverify and bank

Created #609 so I don't forget

}

#[test]
// This test requires alpenglow repair
@AshwinSekar AshwinSekar Nov 25, 2025

leaving this test here as a stretch goal that we can revisit once we have alpenglow repair.

It's a pretty gnarly situation:

  • Our node observes the start of the migration and views 5 TowerBFT blocks after the migration slot
  • Our node partitions/goes offline
  • The remaining nodes finish the migration, clean everything past genesis and root a block in Alpenglow
  • Since they root a block they stop broadcasting the genesis certificate
  • Our node rejoins. It can repair the missing blocks, but it has 5 TowerBFT blocks in place of the Alpenglow ones
  • Once alpenglow repair is implemented it'll repair the 5 Alpenglow blocks into the alternate blockstore column

We then need some signal that tells the node to try replaying the Alpenglow blocks from the alternate column; doing so would let it view the GenesisCertificate block marker and enable Alpenglow.
In practice it might be simpler to just tell the operator to wait for a new snapshot and restart if they end up in this state.

Another alternative is for nodes to keep broadcasting the genesis cert A2A every 10 seconds, either for a couple of hours or until the end of the epoch.

This is much easier, and if someone's node is partitioned for so long that it still misses the cert we can instruct operators to restart with fresh snapshot.
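
A rough sketch of what the rebroadcast decision could look like, purely to make the proposal concrete. Nothing here exists in the PR: `should_rebroadcast` and `REBROADCAST_INTERVAL` are hypothetical, and the caller is assumed to pick the window (for example two hours, or the time remaining in the epoch).

```rust
use std::time::Duration;

/// Interval taken from the comment above: rebroadcast every 10 seconds.
const REBROADCAST_INTERVAL: Duration = Duration::from_secs(10);

/// Hypothetical check: keep rebroadcasting the genesis certificate while we are
/// still inside the window and at least one interval has passed since the last
/// broadcast.
fn should_rebroadcast(
    elapsed_since_genesis: Duration,
    since_last_broadcast: Duration,
    window: Duration,
) -> bool {
    elapsed_since_genesis <= window && since_last_broadcast >= REBROADCAST_INTERVAL
}
```

A node partitioned for longer than the window would still miss the certificate, which is the case where operators would be told to restart from a fresh snapshot.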
