
fix: Star topology formation due to serialized connection acquisition #2173

@sanity

Description

Problem

In small network tests (e.g., six-peer-regression), peers form a star topology (all connected only to the gateway) instead of a mesh topology where peers also connect to each other. This causes test failures whenever an operation needs to route through peers that aren't directly connected.

User-Visible Impact

  • Flaky tests (six-peer-regression, subscription forwarding tests)
  • Slow network formation in production
  • Operations fail when routing requires peer-to-peer paths that don't exist

Root Cause Analysis

Four parallel investigations identified these contributing factors:

1. Serial Connection Acquisition (CRITICAL)

Location: crates/core/src/ring/mod.rs:428-466

The maintenance loop uses live_tx to track pending connections, but only allows ONE connection attempt at a time:

if let Some(ideal_location) = pending_conn_adds.pop_first() {
    if live_tx.is_none() {  // <- Only proceeds if no pending transaction
        live_tx = self.acquire_new(...).await?;
    } else {
        pending_conn_adds.insert(ideal_location);  // Re-queue and wait
    }
}

Impact: If a connection attempt takes 60 seconds (OPERATION_TTL), all other connection attempts are blocked for those 60 seconds. With a min_connections target of 25 and serial acquisition, the worst case is roughly 25 × 60 s ≈ 25 minutes just to reach the minimum, far longer than the test windows allow.

2. No Timeout on WaitingForResponses (CRITICAL)

Location: crates/core/src/operations/connect.rs:769-792

The JoinerState has a last_progress: Instant field that is tracked but never enforced:

pub(crate) struct JoinerState {
    pub target_connections: usize,
    pub last_progress: Instant,  // Tracked but NEVER checked for timeout
    pub accepted: HashSet<PeerKeyLocation>,
    // ...
}

Impact: If a connect operation sends requests but receives no responses, it waits indefinitely. There's no mechanism to abort stalled operations.

Forward attempts time out after 20 seconds (FORWARD_ATTEMPT_TIMEOUT), but this only records the failure for the estimator; it doesn't retry or fail the overall operation.

3. Tests Start Before Network Ready (MODERATE)

Location: crates/freenet-macros/src/codegen.rs:543-580

Tests use a fixed 15-second sleep instead of condition-based waiting:

tracing::info!("Waiting {} seconds for nodes to start up", startup_wait_secs);
tokio::time::sleep(Duration::from_secs(startup_wait_secs)).await;

Impact: For 6-peer networks, 15 seconds may not be enough for peer-to-peer connections to form, especially with serial acquisition.

4. Initial Join is Fire-and-Forget (MODERATE)

Location: crates/core/src/node/p2p_impl.rs:106-114

The initial_join_procedure() is spawned as a background task but never awaited:

let join_handle = connect::initial_join_procedure(...).await?;
self.initial_join_task = Some(join_handle);  // Stored but never awaited

Impact: Node startup completes before the join procedure finishes, leading to race conditions.
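
One possible direction is to bound node startup on the join task instead of detaching it entirely. A minimal sketch, assuming initial_join_task holds a tokio JoinHandle<()> and picking an arbitrary 10-second budget (both are assumptions, not the current API):

use std::time::Duration;

// Sketch: await the stored join handle with a deadline so startup can't
// silently race ahead of the join procedure.
if let Some(join_handle) = self.initial_join_task.take() {
    match tokio::time::timeout(Duration::from_secs(10), join_handle).await {
        Ok(Ok(())) => tracing::debug!("initial join completed during startup"),
        Ok(Err(join_err)) => tracing::warn!(%join_err, "initial join task failed"),
        Err(_elapsed) => tracing::warn!("initial join still pending after 10s; continuing startup"),
    }
}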


Recommended Fixes

Priority 1: Allow Parallel Connection Attempts

File: crates/core/src/ring/mod.rs

Instead of a single live_tx: Option<Transaction>, use a bounded set of in-flight transactions:

const MAX_CONCURRENT_CONNECTIONS: usize = 3;

// In the maintenance loop:
if live_tx_tracker.active_count() < MAX_CONCURRENT_CONNECTIONS {
    if let Some(ideal_location) = pending_conn_adds.pop_first() {
        if let Some(tx) = self.acquire_new(ideal_location, ...).await? {
            live_tx_tracker.add(tx);
        }
    }
}
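
The live_tx_tracker would need a small extension to count and release multiple in-flight transactions. A minimal sketch of that shape; the field and method names here are illustrative, not the existing tracker API:

use std::collections::HashSet;

// Illustrative bounded tracker; `Transaction` stands for the existing
// transaction id type (assumed to be Eq + Hash).
struct LiveTxTracker {
    in_flight: HashSet<Transaction>,
    max_concurrent: usize,
}

impl LiveTxTracker {
    fn active_count(&self) -> usize {
        self.in_flight.len()
    }

    fn add(&mut self, tx: Transaction) {
        self.in_flight.insert(tx);
    }

    // Must be called when a connect operation completes, fails, or times out;
    // otherwise the slot leaks and acquisition stalls again.
    fn remove(&mut self, tx: &Transaction) {
        self.in_flight.remove(tx);
    }
}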

Priority 2: Add Timeout to WaitingForResponses

File: crates/core/src/operations/connect.rs

Enforce the last_progress timeout (e.g., fail the operation after 30 seconds without progress):

const JOINER_PROGRESS_TIMEOUT: Duration = Duration::from_secs(30);

// In handle_response or a periodic check:
if now.duration_since(state.last_progress) > JOINER_PROGRESS_TIMEOUT {
    self.state = Some(ConnectState::Failed);
    return Err(OpError::Timeout);
}
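
If the check runs from a periodic tick rather than from handle_response, it can be a simple sweep over live joiner operations. A sketch, assuming connect.rs can iterate (transaction, last_progress) pairs somewhere; the helper name and failure callback are hypothetical:

use std::time::{Duration, Instant};

const JOINER_PROGRESS_TIMEOUT: Duration = Duration::from_secs(30);

// Sketch only: generic over the transaction id type so it doesn't assume the
// real Transaction API; `fail_with_timeout` stands in for whatever failure
// path connect.rs uses (e.g., completing the op with OpError::Timeout).
fn sweep_stalled_joins<T: Clone>(
    pending_joins: &mut Vec<(T, Instant)>, // (transaction, last_progress)
    now: Instant,
    mut fail_with_timeout: impl FnMut(T),
) {
    pending_joins.retain(|(tx, last_progress)| {
        if now.duration_since(*last_progress) > JOINER_PROGRESS_TIMEOUT {
            fail_with_timeout(tx.clone());
            false // drop the stalled operation
        } else {
            true
        }
    });
}

Whatever path handles accepted responses should also reset last_progress, so only genuinely idle operations get swept.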

Priority 3: Condition-Based Test Waiting (if the fixes above don't fully resolve the flakiness)

File: crates/freenet-macros/src/codegen.rs

Tests should use wait_for_connections = true instead of a fixed sleep:

#[freenet_test(
    nodes = ["gateway", "peer1", ...],
    wait_for_connections = true,
    expected_connections = 2,  // Each peer should have 2+ connections
)]
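
Under the hood, the macro (or a shared test helper) would poll instead of sleeping. A sketch of that shape, assuming some way to query a node's current connection count; TestNode and connected_peer_count are hypothetical names, not existing helpers:

use std::time::{Duration, Instant};

// Sketch: poll every node until each reports at least `expected` connections,
// failing with a descriptive error instead of silently proceeding after a
// fixed sleep.
async fn wait_for_connections(
    nodes: &[TestNode],
    expected: usize,
    deadline: Duration,
) -> Result<(), String> {
    let start = Instant::now();
    loop {
        let mut below_target = Vec::new();
        for node in nodes {
            let count = node.connected_peer_count().await;
            if count < expected {
                below_target.push((node.name(), count));
            }
        }
        if below_target.is_empty() {
            return Ok(());
        }
        if start.elapsed() > deadline {
            return Err(format!("nodes below {expected} connections: {below_target:?}"));
        }
        tokio::time::sleep(Duration::from_millis(500)).await;
    }
}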

Priority 4: Retry Logic for Failed Forwards (Nice-to-have)

File: crates/core/src/operations/connect.rs

When forward attempts time out, retry with a different peer instead of just recording the failure:

// After FORWARD_ATTEMPT_TIMEOUT:
if let Some(alternative_peer) = self.find_alternative_forward_target(&skip_list) {
    actions.forward = Some((alternative_peer, request.clone()));
}
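
find_alternative_forward_target doesn't exist yet; one plausible shape is a greedy pick of the closest non-skipped candidate on the ring. A sketch with deliberately simplified types (string peer ids, f64 locations), not the real PeerKeyLocation/Location API:

use std::collections::HashSet;

// Sketch only: choose the candidate closest to the desired ring location
// while honoring the skip list.
fn find_alternative_forward_target(
    candidates: &[(String, f64)], // (peer id, ring location)
    skip_list: &HashSet<String>,
    desired: f64,
) -> Option<(String, f64)> {
    candidates
        .iter()
        .filter(|(peer, _)| !skip_list.contains(peer))
        .min_by(|(_, a), (_, b)| {
            ring_distance(*a, desired)
                .partial_cmp(&ring_distance(*b, desired))
                .unwrap_or(std::cmp::Ordering::Equal)
        })
        .cloned()
}

// Distance on the unit ring [0, 1).
fn ring_distance(a: f64, b: f64) -> f64 {
    let d = (a - b).abs();
    d.min(1.0 - d)
}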

Testing

After fixes, these should pass reliably:

  • six-peer-regression
  • six-peer-connection-cap
  • Subscription forwarding tests
  • Any test with 5+ peers

Notes for Agents

  • Priorities 1 and 2 are the core fixes; the others may not be needed if those land
  • The live_tx_tracker already exists and tracks transactions per peer; extend it to support multiple concurrent transactions
  • Be careful not to overwhelm the network; 3 parallel connections is a reasonable starting point
  • Consider adding metrics/logging to track connection acquisition latency; see the sketch below
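
For the latency metric, a lightweight option is to time each acquisition attempt and emit a structured tracing event. A sketch; the target name and fields are illustrative, and the acquire_new arguments are elided as in the snippets above:

use std::time::Instant;

// Sketch: record how long each acquisition attempt takes so slow or stalled
// attempts show up directly in logs/metrics.
let started = Instant::now();
let result = self.acquire_new(ideal_location, /* ... */).await;
tracing::info!(
    target: "freenet_core::ring::connect_latency",
    elapsed_ms = started.elapsed().as_millis() as u64,
    success = result.as_ref().map(|tx| tx.is_some()).unwrap_or(false),
    "connection acquisition attempt finished"
);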

Related

[AI-assisted - Claude]

    Labels

    A-developer-xp (Area: developer experience), A-networking (Area: Networking, ring protocol, peer discovery), E-medium (Experience needed to fix/implement: Medium / intermediate), P-critical (Critical priority), T-bug (Type: Something is broken)
