Problem
In small network tests (e.g., six-peer-regression), peers form a star topology (all connected only to gateway) instead of a mesh topology where peers connect to each other. This causes test failures when operations need to route through peers that aren't directly connected.
User-Visible Impact
- Flaky tests (six-peer-regression, subscription forwarding tests)
- Slow network formation in production
- Operations fail when routing requires peer-to-peer paths that don't exist
Root Cause Analysis
Four parallel investigations identified these contributing factors:
1. Serial Connection Acquisition (CRITICAL)
Location: crates/core/src/ring/mod.rs:428-466
The maintenance loop uses live_tx to track pending connections, but only allows ONE connection attempt at a time:
```rust
if let Some(ideal_location) = pending_conn_adds.pop_first() {
    if live_tx.is_none() { // <- only proceeds if no pending transaction
        live_tx = self.acquire_new(...).await?;
    } else {
        pending_conn_adds.insert(ideal_location); // re-queue and wait
    }
}
```

Impact: If a connection attempt takes 60 seconds (OPERATION_TTL), ALL other connection attempts are blocked for those 60 seconds. With a min_connections target of 25 and serial acquisition, reaching the minimum takes far longer than test windows allow.
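The cost of serialization can be sketched with back-of-envelope arithmetic (constants taken from the figures above; the helper function is illustrative, not from the codebase):

```rust
// Worst-case time to reach min_connections when every attempt stalls
// until OPERATION_TTL, under serial vs. bounded-parallel acquisition.
const OPERATION_TTL_SECS: u64 = 60;
const MIN_CONNECTIONS: u64 = 25;

fn worst_case_secs(max_concurrent: u64) -> u64 {
    // Attempts proceed in waves of `max_concurrent`, each bounded by the TTL.
    let waves = (MIN_CONNECTIONS + max_concurrent - 1) / max_concurrent; // ceiling division
    waves * OPERATION_TTL_SECS
}

fn main() {
    println!("serial:     {}s", worst_case_secs(1)); // 25 waves * 60s = 1500s
    println!("3 parallel: {}s", worst_case_secs(3)); // 9 waves * 60s = 540s
}
```

Even with every attempt hitting the TTL, a bound of 3 cuts the worst case from 25 minutes to 9.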
2. No Timeout on WaitingForResponses (CRITICAL)
Location: crates/core/src/operations/connect.rs:769-792
The JoinerState has a last_progress: Instant field that is tracked but never enforced:
```rust
pub(crate) struct JoinerState {
    pub target_connections: usize,
    pub last_progress: Instant, // tracked but NEVER checked for timeout
    pub accepted: HashSet<PeerKeyLocation>,
    // ...
}
```

Impact: If a connect operation sends requests but receives no responses, it waits indefinitely. There is no mechanism to abort stalled operations.
Forward attempts timeout after 20 seconds (FORWARD_ATTEMPT_TIMEOUT), but this only records the failure for the estimator - it doesn't retry or fail the overall operation.
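A minimal sketch of the missing enforcement, with JoinerState reduced to the one field that matters (names are illustrative; the real struct lives in connect.rs):

```rust
use std::time::{Duration, Instant};

// Simplified stand-in for the real JoinerState.
struct JoinerState {
    last_progress: Instant,
}

const JOINER_PROGRESS_TIMEOUT: Duration = Duration::from_secs(30);

/// True when no progress has been recorded within the timeout window.
fn is_stalled(state: &JoinerState, now: Instant) -> bool {
    now.duration_since(state.last_progress) > JOINER_PROGRESS_TIMEOUT
}

fn main() {
    // 45 seconds without progress exceeds the 30-second budget.
    let stalled = JoinerState {
        last_progress: Instant::now() - Duration::from_secs(45),
    };
    println!("stalled: {}", is_stalled(&stalled, Instant::now()));
}
```

A check like this would run either on every incoming response or from a periodic tick, as suggested in Priority 2 below.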
3. Tests Start Before Network Ready (MODERATE)
Location: crates/freenet-macros/src/codegen.rs:543-580
Tests use a fixed 15-second sleep instead of condition-based waiting:
```rust
tracing::info!("Waiting {} seconds for nodes to start up", startup_wait_secs);
tokio::time::sleep(Duration::from_secs(startup_wait_secs)).await;
```

Impact: For 6-peer networks, 15 seconds may not be enough for peer-to-peer connections to form, especially with serial acquisition.
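Condition-based waiting can be sketched as a poll-until-deadline helper (the predicate name is illustrative; in practice it would query each node's open-connection count):

```rust
use std::time::{Duration, Instant};

/// Poll `is_ready` until it holds or `timeout` elapses, instead of
/// sleeping for a fixed interval. Returns whether readiness was reached.
fn wait_until(mut is_ready: impl FnMut() -> bool, timeout: Duration, poll: Duration) -> bool {
    let deadline = Instant::now() + timeout;
    while Instant::now() < deadline {
        if is_ready() {
            return true;
        }
        std::thread::sleep(poll);
    }
    is_ready() // one final check at the deadline
}

fn main() {
    let mut polls = 0;
    // Toy predicate: becomes true on the third poll.
    let ready = wait_until(
        || { polls += 1; polls >= 3 },
        Duration::from_secs(5),
        Duration::from_millis(10),
    );
    println!("ready = {ready}");
}
```

This returns as soon as the network is formed and only pays the full timeout in the failure case, unlike a fixed sleep which always pays it.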
4. Initial Join is Fire-and-Forget (MODERATE)
Location: crates/core/src/node/p2p_impl.rs:106-114
The initial_join_procedure() is spawned as a background task but never awaited:
```rust
let join_handle = connect::initial_join_procedure(...).await?;
self.initial_join_task = Some(join_handle); // stored but never awaited
```

Impact: Node startup completes before the join procedure finishes, leading to race conditions.
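The difference between the two patterns can be illustrated with std threads (the real code spawns a tokio task; only the handle-handling pattern is shown here):

```rust
use std::thread;
use std::time::Duration;

// Fire-and-forget: the handle is dropped, so the caller returns
// before the join work has finished.
fn startup_fire_and_forget() {
    let _ = thread::spawn(|| thread::sleep(Duration::from_millis(20)));
}

// Awaited: the handle is kept and joined, so startup only completes
// after the join work does.
fn startup_awaited() -> &'static str {
    let join_handle = thread::spawn(|| {
        thread::sleep(Duration::from_millis(20));
        "joined network"
    });
    join_handle.join().expect("join task panicked")
}

fn main() {
    startup_fire_and_forget(); // returns immediately
    println!("awaited startup: {}", startup_awaited());
}
```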
Recommended Fixes
Priority 1: Allow Parallel Connection Attempts
File: crates/core/src/ring/mod.rs
Instead of single live_tx: Option<Transaction>, use a bounded set:
```rust
const MAX_CONCURRENT_CONNECTIONS: usize = 3;

// In maintenance loop:
if live_tx_tracker.active_count() < MAX_CONCURRENT_CONNECTIONS {
    if let Some(ideal_location) = pending_conn_adds.pop_first() {
        if let Some(tx) = self.acquire_new(ideal_location, ...).await? {
            live_tx_tracker.add(tx);
        }
    }
}
```

Priority 2: Add Timeout to WaitingForResponses
File: crates/core/src/operations/connect.rs
Enforce last_progress timeout (e.g., 30 seconds of no progress = fail):
```rust
const JOINER_PROGRESS_TIMEOUT: Duration = Duration::from_secs(30);

// In handle_response or a periodic check:
if now.duration_since(state.last_progress) > JOINER_PROGRESS_TIMEOUT {
    self.state = Some(ConnectState::Failed);
    return Err(OpError::Timeout);
}
```

Priority 3: Condition-Based Test Waiting (if the fixes above don't resolve the flakiness)
File: crates/freenet-macros/src/codegen.rs
Tests should use wait_for_connections = true instead of fixed sleep:
```rust
#[freenet_test(
    nodes = ["gateway", "peer1", ...],
    wait_for_connections = true,
    expected_connections = 2, // each peer should have 2+ connections
)]
```

Priority 4: Retry Logic for Failed Forwards (Nice-to-have)
File: crates/core/src/operations/connect.rs
When forward attempts timeout, retry with a different peer instead of just recording failure:
```rust
// After FORWARD_ATTEMPT_TIMEOUT:
if let Some(alternative_peer) = self.find_alternative_forward_target(&skip_list) {
    actions.forward = Some((alternative_peer, request.clone()));
}
```

Testing
After fixes, these should pass reliably:
- six-peer-regression
- six-peer-connection-cap
- Subscription forwarding tests
- Any test with 5+ peers
Notes for Agents
- Priority 1 and 2 are the core fixes - the others may not be needed if these work
- The `live_tx_tracker` already exists and tracks transactions per peer - extend it to support multiple concurrent transactions
- Be careful not to overwhelm the network - 3 parallel connections is a reasonable starting point
- Consider adding metrics/logging to track connection acquisition latency
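The tracker extension suggested above could take this hypothetical shape (Transaction simplified to an integer id; the real tracker's API will differ):

```rust
use std::collections::HashSet;

// Stand-in for the real transaction type.
type Transaction = u64;

/// Tracks in-flight connection transactions, bounded by a concurrency budget.
struct LiveTxTracker {
    active: HashSet<Transaction>,
    max_concurrent: usize,
}

impl LiveTxTracker {
    fn new(max_concurrent: usize) -> Self {
        Self { active: HashSet::new(), max_concurrent }
    }

    fn active_count(&self) -> usize {
        self.active.len()
    }

    /// Returns false when the concurrency budget is exhausted.
    fn add(&mut self, tx: Transaction) -> bool {
        self.active_count() < self.max_concurrent && self.active.insert(tx)
    }

    /// Frees a slot once a transaction completes (or times out).
    fn complete(&mut self, tx: &Transaction) {
        self.active.remove(tx);
    }
}

fn main() {
    let mut tracker = LiveTxTracker::new(3);
    assert!(tracker.add(1) && tracker.add(2) && tracker.add(3));
    assert!(!tracker.add(4)); // fourth attempt rejected: budget of 3 in flight
    tracker.complete(&1);
    assert!(tracker.add(4)); // slot freed, attempt proceeds
    println!("active = {}", tracker.active_count());
}
```

The maintenance loop would call `complete` on both success and TTL expiry, so a stalled attempt can only occupy a slot for one OPERATION_TTL rather than blocking the whole queue.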
Related
- Part of ongoing work to fix flaky CI tests
- Connected to PR #2167, refactor(routing): add upstream_addr for connection-based routing
[AI-assisted - Claude]