-
Notifications
You must be signed in to change notification settings - Fork 33
blockstore_processor: advance migration during startup #608
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
f18a2a3 to
e004e48
Compare
f832097 to
edc6120
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Startup replay is slightly different than steady state replay so unfortunately more code 😭
Almost done with migration thank y'all for reviewing. Remaining work:
- BLS verify the certificate in
GenesisCertificateblock marker
| // being verified as a TowerBFT one. | ||
| // | ||
| // We are safe to cleanly transition to alpenglow here | ||
| if migration_status.is_ready_to_enable() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is basically analogous to when we enable alpenglow during steady state in replay.
Difference here is:
- We reach
ReadyToEnableby trying to process the first alpenglow block as a TowerBFT block and failing. While processing we observed theGenesisCertificatemarker so we know that the migration happened, and which block is the genesis. - We don't have to purge the blocks > genesis, instead we just reprocesses them as Alpenglow blocks (ticks adjusted and markers allowed).
- We have to reset the dead status and retry this first alpenglow block since we just failed to process it as a TowerBFT block
| }.filter(|new_root_bank| { | ||
| // In the case that we've restarted while the migrationary period is going on but before alpenglow | ||
| // is enabled, don't root blocks past the migration slot | ||
| migration_status.should_root_during_startup(new_root_bank.slot()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
startup equivalent of
alpenglow/core/src/replay_stage.rs
Lines 2814 to 2816 in d6f08a5
| // We do not root during the migration - post genesis rooting is handled by votor | |
| migration_status.should_report_commitment_or_root(*root) | |
| }); |
| root_retain_us += m.as_us(); | ||
|
|
||
| // If this root bank activated the feature flag, update migration status | ||
| if migration_status.is_pre_feature_activation() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
startup equivalent of
alpenglow/core/src/replay_stage.rs
Lines 2843 to 2844 in d6f08a5
| // Check if we've rooted a bank that will tell us the migration slot | |
| if migration_status.is_pre_feature_activation() { |
edc6120 to
c466111
Compare
| } | ||
|
|
||
| let genesis_cert = Certificate::from(genesis_cert); | ||
| // TODO(ashwin): verify genesis cert using bls sigverify and bank |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Created #609 so I don't forget
| } | ||
|
|
||
| #[test] | ||
| // This test requires alpenglow repair |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
leaving this test here as a stretch goal that we can revisit once we have alpenglow repair.
It's a pretty gnarly situation:
- Our node observes the start of the migration and views 5 TowerBFT blocks after the migration slot
- Our node partitions/goes offline
- The remaining nodes finish the migration, clean everything past genesis and root a block in Alpenglow
- Since they root a block they stop broadcasting the genesis certificate
- Our node rejoins, it can repair the missing blocks but it has 5 TowerBFT blocks in place of the alpenglow ones
- Once alpenglow repair is implemented it'll repair the 5 Alpenglow blocks into the alternate blockstore column
Need some signal to decide to try to replay the alpenglow blocks from the alternate column -> views the GenesisCertificate block marker -> enables alpenglow.
In practice it might be simpler to just tell the operator to wait for a new snapshot and restart if they end up in this state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
another alternative is for nodes to continue broadcasting the genesis cert A2A every 10 seconds for a couple hours or until the end of the epoch.
This is much easier, and if someone's node is partitioned for so long that it still misses the cert we can instruct operators to restart with fresh snapshot.
Problem
We currently do not advance
MigrationStatusduring startup at all, it's just set at the bank forks root.This causes problems if we restart after the migration has succeeded or during the migrationary period before we've advanced the root.
The old root could indicate that we are still in the migration while blockstore has already been cleaned up and we have Alpenglow blocks present.
Additionally we do not process the
GenesisCertificatemarker in the first Alpenglow block. This is key to inform startup and sometimes lagging steady state that the migration was successful. There's no need to try to do genesis discovery if this marker is processed, we can just go straight into Alpenglow.Summary of Changes
Allow processing of the
GenesisCertificatemarker:ReadyToEnableas the processing of theGenesisCertificatevia block means we have the genesis block frozenAdvance
MigrationStatusduring startup inload_frozen_forks:PreFeatureActivationtoMigrationMigrationGenesisCertificatemarker go toReadyToEnableReadyToEnable, enable alpenglow immediately and retry processing the current block onwards as Alpenglow blocks insteadGenesisCertificatethen we rely on the standardcompute_bank_statspathway in replay to find the genesis block post startupAdd a test where a node restarts post migration from a root before the migration and ensure it can catch up.
Add a test where a node misses the migration entirely and ensure that it can catch up.
Note: The certificate in the
GenesisCertificatemarker is not validated yet. Future PR will plug in BLS verification.