fix: remove duplicate receive loops in P2PNode by mickvandijke · Pull Request #10 · saorsa-labs/saorsa-core

mickvandijke · 2026-01-30T13:34:48Z

Summary

Removed duplicate background loop spawning from P2PNode::new() — the constructor now only builds the struct without spawning any tasks
start() is now the single entry point for spawning accept loops, receive loops, and the connection monitor
Updated all callers (examples, tests, messaging service) to explicitly call start() after new()

Context

P2PNode::new() and start() both called start_network_listeners() and start_message_receiving_system(), spawning duplicate accept and receive loops competing on the same QUIC endpoint. Two receive loops calling receive_any() race on every incoming message — only one wins, so each loop sees roughly half the traffic, causing flaky message delivery.

Dependencies

Merge [#PR for fix/p2p-event-source-peer-id] first — this branch is based on fix/p2p-event-source-peer-id.

Test plan

cargo build — compiles cleanly
cargo clippy -- -D warnings -D clippy::unwrap_used -D clippy::expect_used — zero warnings
cargo test --lib — all 1332 unit tests pass
CI pipeline validates integration tests

🤖 Generated with Claude Code

…ew() to start() P2PNode::new() and start() both called start_network_listeners() and start_message_receiving_system(), spawning duplicate accept and receive loops competing on the same QUIC endpoint. This caused messages to be randomly split between loops, leading to flaky delivery. Now new() only constructs the struct; start() is the single place where background loops are spawned. All callers updated to call start() after new(). Co-Authored-By: Claude Opus 4.5 <[email protected]>

dirvine · 2026-01-30T13:35:50Z

@claude review please

claude · 2026-01-30T13:36:04Z

Claude finished @dirvine's task in 1m 48s —— View job

Code Review Complete ✅

Summary

This PR successfully eliminates duplicate receive loops by cleanly separating node construction from network initialization. The fix is correct and well-implemented. All call sites have been properly updated, and the changes follow repository standards.

Review Findings

✅ Strengths

Root Cause Fix: Correctly addresses the race condition where two receive_any() loops competed for incoming messages (src/network.rs:1221), causing each to see only ~50% of traffic.
Clean Separation of Concerns:
- P2PNode::new() now purely constructs the struct (lines 1040-1042)
- start() is the single entry point for spawning all background tasks (lines 1150, 1153, 1160)
- Helpful log message guides users: "Created P2P node with peer ID: {} (call start() to begin networking)" (line 1040)
Comprehensive Updates: All 8 call sites properly updated:
- ✅ examples/chat.rs:67 - Added node.start().await?;
- ✅ examples/test_network.rs:74 - Added .start() with proper error context
- ✅ src/messaging/service.rs:296 - Added node.start().await?;
- ✅ All test files (60+ test updates in network_wiring_e2e_test.rs alone)
No Production Code Panics: Adheres to CLAUDE.md's zero-tolerance policy - all error handling uses ? operator or proper Result returns.

Correct Sequencing in start(): The order is logical and correct:

// 1. Set running state (line 1147)
// 2. Start network listeners (line 1150)
// 3. Start connection monitor (line 1153)
// 4. Start message receiving system (line 1160)
// 5. Connect to bootstrap peers (line 1163)

Test Coverage: Excellent test updates with consistent error messages (e.g., "Failed to start node1") making debugging easier.

🔍 Observations (Not Issues)

Comment Cleanup: Lines 1273-1275 in src/network.rs show // MCP removed comments. These could be cleaned up but don't affect functionality.
Backward Compatibility: The run() method (line 1277-1292) correctly handles the case where a user might call run() without calling start() first:
```
if !*self.running.read().await {
    self.start().await?;
}
```
This is good defensive programming.
End-to-End Test Comment: tests/end_to_end_scenarios_test.rs:58-59 has a helpful comment explaining that start() is already called in the constructor chain.

⚠️ Minor Suggestions (Optional)

Documentation: Consider adding a doc comment to new() mentioning that start() must be called:

/// Create a new P2P node.
///
/// # Important
/// This only constructs the node. Call `start()` to begin networking operations.

Error Context: Some test updates could benefit from more specific error messages:

// Current
node.start().await.expect("Failed to start node");

// Could be
node.start().await.expect("Failed to start node - networking initialization failed");

But current messages are adequate.

Verification

Manual Checks Performed:

✅ Reviewed the diff against base branch fix/p2p-event-source-peer-id
✅ Verified no .unwrap() or .expect() in production code (only in tests)
✅ Confirmed all spawned tasks are now in start() only
✅ Checked that receive_any() is called exactly once (line 1221)
✅ Validated error handling uses ? operator consistently

Test Status (from PR description):

✅ cargo build - clean
✅ cargo clippy -- -D warnings -D clippy::unwrap_used -D clippy::expect_used - zero warnings
✅ cargo test --lib - all 1332 unit tests pass
⏳ CI integration tests - pending (will validate on CI)

Recommendation

✅ APPROVE - This PR is ready to merge after the dependency PR is merged.

The implementation is clean, correct, and complete. All call sites have been updated, and the fix directly addresses the root cause of the duplicate receive loop race condition. The code follows all repository standards including the strict no-panic policy.

Dependencies to merge first:

PR for fix/p2p-event-source-peer-id (as noted in the PR description)

dirvine · 2026-01-30T13:36:20Z

Code Review: PR #10 - fix: remove duplicate receive loops in P2PNode

Hey @mickvandijke, nice clean fix! Here's my review.

📝 Summary

This PR addresses a race condition where P2PNode::new() and start() both spawned accept/receive loops, causing:

Duplicate loops competing on the same QUIC endpoint
Each loop seeing ~50% of traffic (non-deterministic)
Flaky message delivery

The fix: Constructor (new()) only builds the struct; start() is the single entry point for all background tasks.

✅ What Looks Good

Clean architectural separation: new() now purely constructs state, start() initiates all networking
Comprehensive updates: All callers (examples, tests, messaging service) correctly updated to call start()
Minimal diff to core logic: Only 4 lines changed in src/network.rs (removing calls from new() and reorganizing start())
Good test coverage: 60 new lines in network_wiring_e2e_test.rs adding start() calls throughout
Clear logging: Updated info log to indicate (call start() to begin networking)
Test checklist completed: Builds clean, zero clippy warnings, 1332 unit tests pass

⚠️ Minor Observations

1. Connection monitor placement

// src/network.rs - start() method
self.start_network_listeners().await?;
self.start_connection_monitor().await;  // Was in new(), now in start()

This is correct - the connection monitor should be started after listeners are ready. Good change.

2. Comment cleanup

// src/network.rs:1159 - removed "// MCP removed"

Good - dead comment removed.

3. Test file comment update

// tests/end_to_end_scenarios_test.rs:59
async fn start(&self) -> Result<()> {
    // Node is already started after P2PNode::new() + start() in TestUser::new()

Good documentation of the new pattern.

🔗 Dependency Note

This PR is based on branch fix/p2p-event-source-peer-id (PR #9), which needs to be merged first. The stacking is clean.

✅ Verdict

Approved - This is a well-executed fix for a real concurrency bug. The change is minimal, focused, and correctly applied across the codebase.

The pattern of separating construction from initialization (new() vs start()) is a Rust best practice that makes the lifecycle explicit and testable.

📋 Merge Checklist

Code builds cleanly
Clippy passes with strict lints
Unit tests pass (1332)
CI integration tests (pending)
Merge PR fix: use transport peer ID as P2PEvent source for correct message routing #9 first

🤖 Reviewed by Claude

claude · 2026-01-30T13:40:58Z

Missing guard in start() method - The start() method can be called multiple times and spawn duplicate tasks. Add a guard: if *self.running.read().await { return Ok(()); } at the beginning of start() to prevent duplicate task spawning. See src/network.rs:1122-1166

claude · 2026-01-30T13:41:33Z

Detailed review: The start() method at src/network.rs lines 1122-1166 needs a guard to prevent duplicate task spawning. Current code spawns tasks unconditionally. Suggested fix: Add check at start of method - if running.read().await is true, return Ok early. This prevents the exact race condition the PR fixes. Reference the run() method which already has this guard pattern.

…ing) Implements 11 hardening improvements from PR #21 review comments: 1. Request Leak/Cleanup: - Added RAII RequestCleanupGuard for automatic cleanup on drop/cancel - Atomic check+insert under single write lock to prevent races 2. Input Validation: - Added TransportError::ValidationError for input validation errors - validate_protocol_name() now uses ValidationError instead of StreamError 3. Timeout Handling: - Added MIN_REQUEST_TIMEOUT (100ms) to prevent Duration::ZERO immediate timeout - send_request() clamps timeout to [100ms, 5min] range 4. Response Routing: - Improved logging for failed pending.send() to clarify timeout scenario 5. Documentation: - Added documentation to ReplicationResult clarifying remote-only counts - Fixed brittle error assertion in tests to check error variant 6. Testing: - New tests/request_response_e2e_test.rs with 7 comprehensive tests: * Successful request/response routing * Timeout cleanup behavior * Invalid protocol rejection (empty, /, \, \0) * Protocol validation in send_response() * Minimum timeout enforcement * Trust reporting on failure Items already correct (verified): - #3: Protocol validation in send_response() already present - #6: Response-origin mismatch uses get() before remove() - #7: Unmatched /rr/ responses already suppressed - #9: Trust reporting on send_message() failure already implemented - #10: PeerStoreOutcome docs correct (no latency mention) Closes #23 Co-authored-by: David Irvine <[email protected]>

dirvine merged commit c656a28 into fix/p2p-event-source-peer-id Jan 30, 2026
1 check passed

dirvine mentioned this pull request Jan 30, 2026

fix: use transport peer ID as P2PEvent source for correct message routing #9

Merged

3 tasks

claude bot mentioned this pull request Feb 5, 2026

feat: node discovery and much more #16

Merged

claude bot mentioned this pull request Feb 12, 2026

fix: address review feedback from PR #21 (request/response API hardening) #23

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: remove duplicate receive loops in P2PNode#10

fix: remove duplicate receive loops in P2PNode#10
dirvine merged 1 commit intofix/p2p-event-source-peer-idfrom
fix/remove-duplicate-receive-loops

mickvandijke commented Jan 30, 2026

Uh oh!

dirvine commented Jan 30, 2026

Uh oh!

claude bot commented Jan 30, 2026 •

edited

Loading

Uh oh!

dirvine commented Jan 30, 2026

Uh oh!

Uh oh!

claude bot commented Jan 30, 2026

Uh oh!

claude bot commented Jan 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

mickvandijke commented Jan 30, 2026

Summary

Context

Dependencies

Test plan

Uh oh!

dirvine commented Jan 30, 2026

Uh oh!

claude bot commented Jan 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review Complete ✅

Summary

Review Findings

✅ Strengths

🔍 Observations (Not Issues)

⚠️ Minor Suggestions (Optional)

Verification

Recommendation

Uh oh!

dirvine commented Jan 30, 2026

Code Review: PR #10 - fix: remove duplicate receive loops in P2PNode

📝 Summary

✅ What Looks Good

⚠️ Minor Observations

1. Connection monitor placement

2. Comment cleanup

3. Test file comment update

🔗 Dependency Note

✅ Verdict

📋 Merge Checklist

Uh oh!

Uh oh!

claude bot commented Jan 30, 2026

Uh oh!

claude bot commented Jan 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

claude bot commented Jan 30, 2026 •

edited

Loading