-
Notifications
You must be signed in to change notification settings - Fork 28
fix: do not use fixed start time for proposal duties #787
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Claude finished @petarjuki7's task in 3m 7s —— View job Code Review CompleteI've completed a comprehensive review of PR #787. The implementation is generally well-structured and follows Rust best practices, but there are several important issues that should be addressed. 🔴 Critical Issues (Must Fix)1. Incomplete Timer Reset Signal CoverageLocation: The timer reset signal is only set when receiving a proposal for a future round, but there are other round advancement paths that may also need timer resets in Relative mode:
Action: Review all code paths that update 2. Potential Race Condition in Timer Reset LogicLocation: The timer reset check happens AFTER message processing, but timeout calculation happens BEFORE receiving the next message. This creates a timing window where the first timeout after a round change may use an incorrect duration. Sequence:
Action: Restructure so timeout recalculation happens immediately when the timer needs to be reset. 3. Inconsistent Timer Reset MechanismsLocation: Natural timeout path uses direct Action: Use a consistent approach for both paths. Direct assignment is simpler and less error-prone.
|
| debug!(old_round = ?self.current_round, new_round = ?round, "Updating to future round from proposal"); | ||
| self.current_round = round; | ||
| self.round_change_timer_reset = true; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Critical: Timer reset signal is only set when receiving a proposal for a future round. Consider whether other round advancement paths should also trigger timer resets in Relative mode:
- Line 1156:
self.set_round(round)in round change consensus - Line 1167:
self.current_round = roundwhen receiving f+1 round changes
These paths may also need to set self.round_change_timer_reset = true for correct behavior in Relative mode. Please review all places where self.current_round is updated and determine if they should reset the timer.
anchor/qbft_manager/src/instance.rs
Outdated
| // Check if timer should be reset due to justified proposal (Relative mode | ||
| // only) | ||
| if let QbftInstance::Initialized(initialized) = &mut instance { | ||
| initialized.check_timer_reset(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potential race condition: The timer reset check happens AFTER message processing, but the timeout calculation in recv() (lines 200-204) happens BEFORE receiving the next message. This creates a timing window:
recv()calculates timeout using currentstart_time- Message arrives and sets
round_change_timer_reset = true check_timer_reset()updatesstart_timetoInstant::now()- But the timeout in the
select!was already calculated with the oldstart_time
The first timeout after a round change may use an incorrect duration. Consider restructuring so timeout recalculation happens immediately when needed.
anchor/qbft_manager/src/instance.rs
Outdated
| warn!("Round timer elapsed"); | ||
| initialized.qbft.end_round(); | ||
| // Reset timer for new round in Relative mode | ||
| if initialized.timeout_mode == TimeoutMode::Relative { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Inconsistency: Timer reset on natural timeout (here) uses direct assignment, while reset on receiving messages (line 315) uses a signal-based mechanism with take_timer_reset_signal(). These should use a consistent approach.
The signal-based approach is more fragile - if check_timer_reset() is called multiple times before the next recv(), only the first call sees the signal. Consider using direct start_time updates in both paths for simplicity and consistency.
anchor/qbft_manager/src/instance.rs
Outdated
| impl<D: QbftData<Hash = Hash256>> Initialized<D> { | ||
| fn check_timer_reset(&mut self) { | ||
| if self.timeout_mode == TimeoutMode::Relative && self.qbft.take_timer_reset_signal() { | ||
| debug!("Resetting round timer due to justified proposal"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestion: Add the round number to this debug log for better debugging of timeout issues:
let round = self.qbft.get_round();
debug!(round = ?round, "Resetting round timer due to justified proposal");| // - Relative: single-round from Instant::now() after sleep | ||
| // | ||
| // When start_time is now, both should behave similarly for the first run, | ||
| // but the internal calculations differ. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Test weakness: This test doesn't actually demonstrate a behavioral difference between SlotTime and Relative modes. When start_time is set to Instant::now(), both modes behave identically for the duration.
To effectively test the difference, set start_time to the PAST:
- With SlotTime and past start_time: rounds should timeout based on cumulative time from the past (potentially immediately)
- With Relative and past start_time: after the initial sleep, rounds should get full duration starting from
Instant::now()
This would demonstrate that Relative mode truly ignores historical start_time after initialization.
shane-moore
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great job with this! I've taken some time to understand the different scenarios that trigger a round change:
- receive
PROPOSEfor round > current - natural timeout -> round timer expires
- receive 2f + 1
ROUND_CHANGEmessages - receive f +1
ROUND_CHANGEmessages
This PR currently handles scenarios 1 and 2 well. We could extend it to cover 3 and 4 by adding the timer reset signal in the round change handlers
anchor/anchor/common/qbft/src/lib.rs
Line 1156 in 3b61266
| self.set_round(round); |
anchor/anchor/common/qbft/src/lib.rs
Line 1167 in 3b61266
| self.current_round = round; |
but i think a more future proof option would actually be to keep your current solution for scenario 2 and handle scenarios 1, 3, and 4 via adding logic at
anchor/anchor/qbft_manager/src/instance.rs
Lines 303 to 311 in 3b61266
| } | |
| // We got a new network message, this should be passed onto the instance | |
| QbftMessageKind::NetworkMessage(message) => { | |
| // We use `WrappedQbftMessage`'s `Display` implementation here for a brief | |
| // summary of the most important fields. This is brief enough for reasonable | |
| // log file size while maintaining debuggability for the testing phase. | |
| // Can be removed as Anchor approaches maturity. | |
| debug!(msg = %message, "Received message in qbft_instance"); | |
| instance.receive(message); |
a. Capture
old_round before processing the messageb. Process the message, which may advance the round
c. Check if
new_round > old_round and reset timer if in Relative mode
we'd handle 3 scenarios in one place, which is some nice future proofing. and seems less brittle since we won't need to remember setting signals in multiple qbft code paths
lmk if I'm missing something or if you see any issues with this approach!
|
Thanks @shane-moore! I think it makes sense because either way that's the thing we are checking in the |
shane-moore
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm!
anchor/qbft_manager/src/lib.rs
Outdated
| message_id: MessageId, | ||
| /// The time when the first round is supposed to start. Rounds will be advanced based on this. | ||
| /// The time reference for timeout calculations. | ||
| start_time: Instant, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
start_time now has multiple possible meaning based on the TimeoutMode. This makes it a bit harder to understand and might be a source of bugs in future.
What do you think about e.g. making the TimeoutMode instead contain this:
pub enum TimeoutMode {
SlotTime {
instance_start_time: Instant,
},
Relative {
current_round_start_time: Instant,
},
}This forces using sites to check which mode we are in
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, I was considering both approaches, but I agree this makes it more explicit. Updated, thanks!
dknopik
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Issue Addressed
Closes issue #758