Conversation

@petarjuki7
Member

Issue Addressed

Closes issue #758

@claude-code-actions-sigp

claude-code-actions-sigp bot commented Jan 29, 2026

Claude finished @petarjuki7's task in 3m 7s


Code Review Complete

I've completed a comprehensive review of PR #787. The implementation is generally well-structured and follows Rust best practices, but there are several important issues that should be addressed.


🔴 Critical Issues (Must Fix)

1. Incomplete Timer Reset Signal Coverage

Location: anchor/common/qbft/src/lib.rs:613

The timer reset signal is only set when receiving a proposal for a future round, but there are other round advancement paths that may also need timer resets in Relative mode:

  • Line 1156: self.set_round(round) in round change consensus
  • Line 1167: self.current_round = round when receiving f+1 round changes

Action: Review all code paths that update self.current_round and determine if they should set round_change_timer_reset = true.

2. Potential Race Condition in Timer Reset Logic

Location: anchor/qbft_manager/src/instance.rs:311-316

The timer reset check happens AFTER message processing, but timeout calculation happens BEFORE receiving the next message. This creates a timing window where the first timeout after a round change may use an incorrect duration.

Sequence:

  1. recv() calculates timeout using current start_time (lines 200-204)
  2. Message arrives, sets round_change_timer_reset = true
  3. check_timer_reset() updates start_time to Instant::now()
  4. But the timeout in the select! was already calculated with old start_time

Action: Restructure so timeout recalculation happens immediately when the timer needs to be reset.

3. Inconsistent Timer Reset Mechanisms

Location: anchor/qbft_manager/src/instance.rs:327 vs line 315

Natural timeout path uses direct start_time assignment, while the message path uses signal-based take_timer_reset_signal(). The signal-based approach is more fragile—if check_timer_reset() is called multiple times before the next recv(), only the first call sees the signal.

Action: Use a consistent approach for both paths. Direct assignment is simpler and less error-prone.


⚠️ Important Warnings

4. Missing Context in Debug Logs

Location: anchor/qbft_manager/src/instance.rs:192

The timer reset log doesn't include the round number, which would be valuable for debugging timeout issues.

5. Weak Test Coverage

Location: anchor/qbft_manager/src/tests.rs:1113

The test test_relative_vs_slottime_timing_difference doesn't effectively demonstrate the behavioral difference between modes. When start_time is set to Instant::now(), both modes behave identically. The test should use a past start_time to show that:

  • SlotTime calculates cumulative timeouts from the past time
  • Relative uses Instant::now() after the initial sleep

6. Missing Test Case

No test validates timer reset behavior when receiving a justified proposal while the instance is already running.


✅ Positive Aspects

  1. Memory Safety: Proper use of std::mem::take, no .unwrap() in production paths, good Option/Result handling
  2. Overflow Protection: All arithmetic uses checked_mul, checked_sub, checked_add
  3. Clean Enum Design: TimeoutMode uses appropriate derives and is well-documented
  4. Test Quality: Tests use start_paused = true for deterministic timing
  5. Performance: Minimal overhead, no allocations in hot paths
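
To illustrate the overflow-protection point, here is a minimal sketch of a cumulative round deadline built only from checked operations. `ROUND_DURATION`, `cumulative_deadline`, and `remaining` are hypothetical names chosen for illustration, not identifiers from the PR, and the 2-second round length is an assumption.

```rust
use std::time::{Duration, Instant};

/// Hypothetical per-round duration, chosen only for illustration.
const ROUND_DURATION: Duration = Duration::from_secs(2);

/// Deadline for `round` (1-based), measured cumulatively from `start`.
/// Returns `None` instead of panicking if any step would overflow.
pub fn cumulative_deadline(start: Instant, round: u32) -> Option<Instant> {
    let total = ROUND_DURATION.checked_mul(round)?;
    start.checked_add(total)
}

/// Time left until `deadline`, or `None` if `now` is already past it.
pub fn remaining(deadline: Instant, now: Instant) -> Option<Duration> {
    deadline.checked_duration_since(now)
}
```

Returning `Option` pushes the overflow decision to the caller, which can then end the round or log an error rather than panic.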

💡 Suggestions

  • Naming: Consider more descriptive names like CumulativeTimeout vs ResettingTimeout instead of SlotTime vs Relative
  • Metrics: Add a counter tracking timer resets in Relative mode for production monitoring
  • Documentation: Add protocol-level reasoning for why block proposals need different timeout behavior than attestations

Files Reviewed

  • anchor/common/qbft/src/lib.rs (+10/-0) - Timer reset signal mechanism
  • anchor/qbft_manager/src/instance.rs (+35/-3) - Timer reset logic implementation
  • anchor/qbft_manager/src/lib.rs (+16/-1) - TimeoutMode enum definition
  • anchor/qbft_manager/src/timeout.rs (+37/-12) - Timeout calculation functions
  • anchor/qbft_manager/src/tests.rs (+135/-4) - Test coverage
  • anchor/validator_store/src/lib.rs (+15/-1) - TimeoutMode usage

Summary

The design is sound and addresses the issue of using fixed start times for proposal duties. The implementation correctly distinguishes between cumulative (SlotTime) and resetting (Relative) timeout modes. However, the three critical issues around timer reset completeness, race conditions, and consistency must be addressed before merging to ensure correct behavior in all scenarios.

The codebase follows Rust best practices with proper error handling, overflow protection, and memory safety. With the suggested fixes, this will be a solid implementation.

debug!(old_round = ?self.current_round, new_round = ?round, "Updating to future round from proposal");
self.current_round = round;
self.round_change_timer_reset = true;
}

Critical: Timer reset signal is only set when receiving a proposal for a future round. Consider whether other round advancement paths should also trigger timer resets in Relative mode:

  • Line 1156: self.set_round(round) in round change consensus
  • Line 1167: self.current_round = round when receiving f+1 round changes

These paths may also need to set self.round_change_timer_reset = true for correct behavior in Relative mode. Please review all places where self.current_round is updated and determine if they should reset the timer.

// Check if timer should be reset due to justified proposal (Relative mode
// only)
if let QbftInstance::Initialized(initialized) = &mut instance {
    initialized.check_timer_reset();

Potential race condition: The timer reset check happens AFTER message processing, but the timeout calculation in recv() (lines 200-204) happens BEFORE receiving the next message. This creates a timing window:

  1. recv() calculates timeout using current start_time
  2. Message arrives and sets round_change_timer_reset = true
  3. check_timer_reset() updates start_time to Instant::now()
  4. But the timeout in the select! was already calculated with the old start_time

The first timeout after a round change may use an incorrect duration. Consider restructuring so timeout recalculation happens immediately when needed.

warn!("Round timer elapsed");
initialized.qbft.end_round();
// Reset timer for new round in Relative mode
if initialized.timeout_mode == TimeoutMode::Relative {

Inconsistency: Timer reset on natural timeout (here) uses direct assignment, while reset on receiving messages (line 315) uses a signal-based mechanism with take_timer_reset_signal(). These should use a consistent approach.

The signal-based approach is more fragile - if check_timer_reset() is called multiple times before the next recv(), only the first call sees the signal. Consider using direct start_time updates in both paths for simplicity and consistency.

impl<D: QbftData<Hash = Hash256>> Initialized<D> {
    fn check_timer_reset(&mut self) {
        if self.timeout_mode == TimeoutMode::Relative && self.qbft.take_timer_reset_signal() {
            debug!("Resetting round timer due to justified proposal");

Suggestion: Add the round number to this debug log for better debugging of timeout issues:

let round = self.qbft.get_round();
debug!(round = ?round, "Resetting round timer due to justified proposal");

// - Relative: single-round from Instant::now() after sleep
//
// When start_time is now, both should behave similarly for the first run,
// but the internal calculations differ.

Test weakness: This test doesn't actually demonstrate a behavioral difference between SlotTime and Relative modes. When start_time is set to Instant::now(), both modes behave identically for the duration.

To effectively test the difference, set start_time to the PAST:

  • With SlotTime and past start_time: rounds should timeout based on cumulative time from the past (potentially immediately)
  • With Relative and past start_time: after the initial sleep, rounds should get full duration starting from Instant::now()

This would demonstrate that Relative mode truly ignores historical start_time after initialization.
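
The difference a past start_time makes can be sketched with a simplified model. This is not the PR's timeout code: `Mode`, `first_round_remaining`, and the 2-second `ROUND_DURATION` are hypothetical, and `late_by` stands for how far in the past start_time lies when the instance begins running.

```rust
use std::time::Duration;

/// Simplified stand-in for the two timeout modes discussed in the review.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Mode {
    SlotTime,
    Relative,
}

/// Hypothetical per-round duration, chosen only for illustration.
const ROUND_DURATION: Duration = Duration::from_secs(2);

/// Time left in the first round when the instance starts `late_by`
/// after its nominal start_time.
pub fn first_round_remaining(mode: Mode, late_by: Duration) -> Duration {
    match mode {
        // Cumulative: the round is measured from the (past) start_time,
        // so lateness eats into it and may leave nothing at all.
        Mode::SlotTime => ROUND_DURATION.saturating_sub(late_by),
        // Resetting: the round is measured from "now", so lateness is ignored.
        Mode::Relative => ROUND_DURATION,
    }
}
```

With `late_by` of zero the two modes agree, which is why a test that sets start_time to Instant::now() cannot tell them apart. With `late_by` of 3 seconds, this model gives zero remaining time for SlotTime and the full 2 seconds for Relative.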

@shane-moore shane-moore left a comment

great job with this! I've taken some time to understand the different scenarios that trigger a round change:

  1. receive PROPOSE for round > current
  2. natural timeout -> round timer expires
  3. receive 2f + 1 ROUND_CHANGE messages
  4. receive f + 1 ROUND_CHANGE messages

This PR currently handles scenarios 1 and 2 well. We could extend it to cover 3 and 4 by adding the timer reset signal in the round change handlers:

self.set_round(round);
and
self.current_round = round;

but I think a more future-proof option would be to keep your current solution for scenario 2 and handle scenarios 1, 3, and 4 by adding logic at

}
// We got a new network message, this should be passed onto the instance
QbftMessageKind::NetworkMessage(message) => {
    // We use `WrappedQbftMessage`'s `Display` implementation here for a brief
    // summary of the most important fields. This is brief enough for reasonable
    // log file size while maintaining debuggability for the testing phase.
    // Can be removed as Anchor approaches maturity.
    debug!(msg = %message, "Received message in qbft_instance");
    instance.receive(message);
to:
a. Capture old_round before processing the message
b. Process the message, which may advance the round
c. Check if new_round > old_round and reset timer if in Relative mode

That way we'd handle three scenarios in one place, which is some nice future-proofing. It also seems less brittle, since we won't need to remember to set signals in multiple qbft code paths.
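
Steps a to c can be sketched roughly as follows. This is a simplified model, not the Anchor code: `FakeQbft`, `Instance`, `handle_message`, and the data-free `TimeoutMode` are hypothetical stand-ins, and a message is reduced to the round number it carries.

```rust
use std::time::Instant;

/// Simplified stand-in for the timeout mode (no payload here).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum TimeoutMode {
    SlotTime,
    Relative,
}

/// Hypothetical stand-in for the QBFT state machine; it only tracks the round.
pub struct FakeQbft {
    pub current_round: u64,
}

impl FakeQbft {
    /// Assumed behaviour: a message for a higher round advances the round.
    pub fn receive(&mut self, msg_round: u64) {
        if msg_round > self.current_round {
            self.current_round = msg_round;
        }
    }
}

pub struct Instance {
    pub qbft: FakeQbft,
    pub timeout_mode: TimeoutMode,
    pub start_time: Instant,
}

impl Instance {
    /// Processes a message and reports whether the round timer was reset.
    pub fn handle_message(&mut self, msg_round: u64, now: Instant) -> bool {
        // a. Capture the round before processing.
        let old_round = self.qbft.current_round;
        // b. Process the message, which may advance the round.
        self.qbft.receive(msg_round);
        // c. Reset the timer only if the round advanced in Relative mode.
        let advanced = self.qbft.current_round > old_round;
        if advanced && self.timeout_mode == TimeoutMode::Relative {
            self.start_time = now;
            return true;
        }
        false
    }
}
```

Because the check only compares rounds before and after, it does not care which path inside the state machine advanced the round.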

lmk if I'm missing something or if you see any issues with this approach!

@petarjuki7
Member Author

Thanks @shane-moore! I think it makes sense, because either way that's what we are checking in the received_propose() function, right? The only other thing we need to check is whether we are in TimeoutMode::Relative. Fixed!

shane-moore previously approved these changes Jan 30, 2026
@shane-moore shane-moore left a comment

lgtm!

@dknopik added the v2.0.0 label (the release shipping the next network upgrade) on Feb 2, 2026
message_id: MessageId,
/// The time when the first round is supposed to start. Rounds will be advanced based on this.
/// The time reference for timeout calculations.
start_time: Instant,
Member

start_time now has multiple possible meanings depending on the TimeoutMode. This makes it a bit harder to understand and might be a source of bugs in the future.

What do you think about e.g. making the TimeoutMode instead contain this:

pub enum TimeoutMode {
    SlotTime {
        instance_start_time: Instant,
    },
    Relative {
        current_round_start_time: Instant,
    },
}

This forces use sites to check which mode we are in.
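
A rough sketch of what a use site could look like with this shape. The enum and its field names follow the suggestion above; `ROUND_DURATION`, `round_deadline`, and `on_round_advanced` are hypothetical and only illustrate the pattern, they are not taken from the PR.

```rust
use std::time::{Duration, Instant};

/// Hypothetical per-round duration, chosen only for illustration.
const ROUND_DURATION: Duration = Duration::from_secs(2);

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum TimeoutMode {
    SlotTime { instance_start_time: Instant },
    Relative { current_round_start_time: Instant },
}

impl TimeoutMode {
    /// Deadline of `round` (1-based); `None` if the arithmetic would overflow.
    pub fn round_deadline(&self, round: u32) -> Option<Instant> {
        match *self {
            // Cumulative: every round is measured from the instance start.
            TimeoutMode::SlotTime { instance_start_time } => {
                instance_start_time.checked_add(ROUND_DURATION.checked_mul(round)?)
            }
            // Resetting: only the current round's own start matters.
            TimeoutMode::Relative { current_round_start_time } => {
                current_round_start_time.checked_add(ROUND_DURATION)
            }
        }
    }

    /// Called when the round advances; SlotTime has nothing to update.
    pub fn on_round_advanced(&mut self, now: Instant) {
        if let TimeoutMode::Relative { current_round_start_time } = self {
            *current_round_start_time = now;
        }
    }
}
```

Since each variant carries its own field, code cannot read a start time without first matching on the mode, so the two meanings cannot be mixed up by accident.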

Member Author

Makes sense, I was considering both approaches, but I agree this makes it more explicit. Updated, thanks!

Member

@dknopik dknopik left a comment

LGTM

@mergify mergify bot merged commit b862f7c into sigp:unstable Feb 3, 2026
18 checks passed

Labels

ready-for-merge v2.0.0 The release shipping the next network upgrade

3 participants