Skip to content

fix: remove duplicate receive loops in P2PNode#10

Merged
dirvine merged 1 commit intofix/p2p-event-source-peer-idfrom
fix/remove-duplicate-receive-loops
Jan 30, 2026
Merged

fix: remove duplicate receive loops in P2PNode#10
dirvine merged 1 commit intofix/p2p-event-source-peer-idfrom
fix/remove-duplicate-receive-loops

Conversation

@mickvandijke
Copy link
Collaborator

Summary

  • Removed duplicate background loop spawning from P2PNode::new() — the constructor now only builds the struct without spawning any tasks
  • start() is now the single entry point for spawning accept loops, receive loops, and the connection monitor
  • Updated all callers (examples, tests, messaging service) to explicitly call start() after new()

Context

P2PNode::new() and start() both called start_network_listeners() and start_message_receiving_system(), spawning duplicate accept and receive loops competing on the same QUIC endpoint. Two receive loops calling receive_any() race on every incoming message — only one wins, so each loop sees roughly half the traffic, causing flaky message delivery.

Dependencies

Merge [#PR for fix/p2p-event-source-peer-id] first — this branch is based on fix/p2p-event-source-peer-id.

Test plan

  • cargo build — compiles cleanly
  • cargo clippy -- -D warnings -D clippy::unwrap_used -D clippy::expect_used — zero warnings
  • cargo test --lib — all 1332 unit tests pass
  • CI pipeline validates integration tests

🤖 Generated with Claude Code

…ew() to start()

P2PNode::new() and start() both called start_network_listeners() and
start_message_receiving_system(), spawning duplicate accept and receive
loops competing on the same QUIC endpoint. This caused messages to be
randomly split between loops, leading to flaky delivery.

Now new() only constructs the struct; start() is the single place where
background loops are spawned. All callers updated to call start() after
new().

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@dirvine
Copy link
Collaborator

dirvine commented Jan 30, 2026

@claude review please

@claude
Copy link

claude bot commented Jan 30, 2026

Claude finished @dirvine's task in 1m 48s —— View job


Code Review Complete ✅

Summary

This PR successfully eliminates duplicate receive loops by cleanly separating node construction from network initialization. The fix is correct and well-implemented. All call sites have been properly updated, and the changes follow repository standards.


Review Findings

✅ Strengths

  1. Root Cause Fix: Correctly addresses the race condition where two receive_any() loops competed for incoming messages (src/network.rs:1221), causing each to see only ~50% of traffic.

  2. Clean Separation of Concerns:

    • P2PNode::new() now purely constructs the struct (lines 1040-1042)
    • start() is the single entry point for spawning all background tasks (lines 1150, 1153, 1160)
    • Helpful log message guides users: "Created P2P node with peer ID: {} (call start() to begin networking)" (line 1040)
  3. Comprehensive Updates: All 8 call sites properly updated:

    • examples/chat.rs:67 - Added node.start().await?;
    • examples/test_network.rs:74 - Added .start() with proper error context
    • src/messaging/service.rs:296 - Added node.start().await?;
    • ✅ All test files (60+ test updates in network_wiring_e2e_test.rs alone)
  4. No Production Code Panics: Adheres to CLAUDE.md's zero-tolerance policy - all error handling uses ? operator or proper Result returns.

  5. Correct Sequencing in start(): The order is logical and correct:

    // 1. Set running state (line 1147)
    // 2. Start network listeners (line 1150)
    // 3. Start connection monitor (line 1153)
    // 4. Start message receiving system (line 1160)
    // 5. Connect to bootstrap peers (line 1163)
  6. Test Coverage: Excellent test updates with consistent error messages (e.g., "Failed to start node1") making debugging easier.


🔍 Observations (Not Issues)

  1. Comment Cleanup: Lines 1273-1275 in src/network.rs show // MCP removed comments. These could be cleaned up but don't affect functionality.

  2. Backward Compatibility: The run() method (line 1277-1292) correctly handles the case where a user might call run() without calling start() first:

    if !*self.running.read().await {
        self.start().await?;
    }

    This is good defensive programming.

  3. End-to-End Test Comment: tests/end_to_end_scenarios_test.rs:58-59 has a helpful comment explaining that start() is already called in the constructor chain.


⚠️ Minor Suggestions (Optional)

  1. Documentation: Consider adding a doc comment to new() mentioning that start() must be called:

    /// Create a new P2P node.
    ///
    /// # Important
    /// This only constructs the node. Call `start()` to begin networking operations.
    
  2. Error Context: Some test updates could benefit from more specific error messages:

    // Current
    node.start().await.expect("Failed to start node");
    
    // Could be
    node.start().await.expect("Failed to start node - networking initialization failed");

    But current messages are adequate.


Verification

Manual Checks Performed:

  • ✅ Reviewed the diff against base branch fix/p2p-event-source-peer-id
  • ✅ Verified no .unwrap() or .expect() in production code (only in tests)
  • ✅ Confirmed all spawned tasks are now in start() only
  • ✅ Checked that receive_any() is called exactly once (line 1221)
  • ✅ Validated error handling uses ? operator consistently

Test Status (from PR description):

  • cargo build - clean
  • cargo clippy -- -D warnings -D clippy::unwrap_used -D clippy::expect_used - zero warnings
  • cargo test --lib - all 1332 unit tests pass
  • ⏳ CI integration tests - pending (will validate on CI)

Recommendation

✅ APPROVE - This PR is ready to merge after the dependency PR is merged.

The implementation is clean, correct, and complete. All call sites have been updated, and the fix directly addresses the root cause of the duplicate receive loop race condition. The code follows all repository standards including the strict no-panic policy.

Dependencies to merge first:

  • PR for fix/p2p-event-source-peer-id (as noted in the PR description)

@dirvine
Copy link
Collaborator

dirvine commented Jan 30, 2026

Code Review: PR #10 - fix: remove duplicate receive loops in P2PNode

Hey @mickvandijke, nice clean fix! Here's my review.


📝 Summary

This PR addresses a race condition where P2PNode::new() and start() both spawned accept/receive loops, causing:

  • Duplicate loops competing on the same QUIC endpoint
  • Each loop seeing ~50% of traffic (non-deterministic)
  • Flaky message delivery

The fix: Constructor (new()) only builds the struct; start() is the single entry point for all background tasks.


✅ What Looks Good

  1. Clean architectural separation: new() now purely constructs state, start() initiates all networking
  2. Comprehensive updates: All callers (examples, tests, messaging service) correctly updated to call start()
  3. Minimal diff to core logic: Only 4 lines changed in src/network.rs (removing calls from new() and reorganizing start())
  4. Good test coverage: 60 new lines in network_wiring_e2e_test.rs adding start() calls throughout
  5. Clear logging: Updated info log to indicate (call start() to begin networking)
  6. Test checklist completed: Builds clean, zero clippy warnings, 1332 unit tests pass

⚠️ Minor Observations

1. Connection monitor placement

// src/network.rs - start() method
self.start_network_listeners().await?;
self.start_connection_monitor().await;  // Was in new(), now in start()

This is correct - the connection monitor should be started after listeners are ready. Good change.

2. Comment cleanup

// src/network.rs:1159 - removed "// MCP removed"

Good - dead comment removed.

3. Test file comment update

// tests/end_to_end_scenarios_test.rs:59
async fn start(&self) -> Result<()> {
    // Node is already started after P2PNode::new() + start() in TestUser::new()

Good documentation of the new pattern.


🔗 Dependency Note

This PR is based on branch fix/p2p-event-source-peer-id (PR #9), which needs to be merged first. The stacking is clean.


✅ Verdict

Approved - This is a well-executed fix for a real concurrency bug. The change is minimal, focused, and correctly applied across the codebase.

The pattern of separating construction from initialization (new() vs start()) is a Rust best practice that makes the lifecycle explicit and testable.


📋 Merge Checklist

🤖 Reviewed by Claude

@dirvine dirvine merged commit c656a28 into fix/p2p-event-source-peer-id Jan 30, 2026
1 check passed
@claude
Copy link

claude bot commented Jan 30, 2026

Missing guard in start() method - The start() method can be called multiple times and spawn duplicate tasks. Add a guard: if *self.running.read().await { return Ok(()); } at the beginning of start() to prevent duplicate task spawning. See src/network.rs:1122-1166

@claude
Copy link

claude bot commented Jan 30, 2026

Detailed review: The start() method at src/network.rs lines 1122-1166 needs a guard to prevent duplicate task spawning. Current code spawns tasks unconditionally. Suggested fix: Add check at start of method - if running.read().await is true, return Ok early. This prevents the exact race condition the PR fixes. Reference the run() method which already has this guard pattern.

claude bot pushed a commit that referenced this pull request Feb 12, 2026
…ing)

Implements 11 hardening improvements from PR #21 review comments:

1. Request Leak/Cleanup:
   - Added RAII RequestCleanupGuard for automatic cleanup on drop/cancel
   - Atomic check+insert under single write lock to prevent races

2. Input Validation:
   - Added TransportError::ValidationError for input validation errors
   - validate_protocol_name() now uses ValidationError instead of StreamError

3. Timeout Handling:
   - Added MIN_REQUEST_TIMEOUT (100ms) to prevent Duration::ZERO immediate timeout
   - send_request() clamps timeout to [100ms, 5min] range

4. Response Routing:
   - Improved logging for failed pending.send() to clarify timeout scenario

5. Documentation:
   - Added documentation to ReplicationResult clarifying remote-only counts
   - Fixed brittle error assertion in tests to check error variant

6. Testing:
   - New tests/request_response_e2e_test.rs with 7 comprehensive tests:
     * Successful request/response routing
     * Timeout cleanup behavior
     * Invalid protocol rejection (empty, /, \, \0)
     * Protocol validation in send_response()
     * Minimum timeout enforcement
     * Trust reporting on failure

Items already correct (verified):
- #3: Protocol validation in send_response() already present
- #6: Response-origin mismatch uses get() before remove()
- #7: Unmatched /rr/ responses already suppressed
- #9: Trust reporting on send_message() failure already implemented
- #10: PeerStoreOutcome docs correct (no latency mention)

Closes #23

Co-authored-by: David Irvine <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants