
fix: Star topology formation due to serialized connection acquisition #2173

@sanity

Description

Problem

In small network tests (e.g., six-peer-regression), peers form a star topology (all connected only to the gateway) instead of a mesh topology where peers also connect to each other. This causes test failures whenever an operation needs to route through peers that aren't directly connected.

User-Visible Impact

  • Flaky tests (six-peer-regression, subscription forwarding tests)
  • Slow network formation in production
  • Operations fail when routing requires peer-to-peer paths that don't exist

Root Cause Analysis

Four parallel investigations identified these contributing factors:

1. Serial Connection Acquisition (CRITICAL)

Location: crates/core/src/ring/mod.rs:428-466

The maintenance loop uses live_tx to track pending connections, but only allows ONE connection attempt at a time:

if let Some(ideal_location) = pending_conn_adds.pop_first() {
    if live_tx.is_none() {  // <- Only proceeds if no pending transaction
        live_tx = self.acquire_new(...).await?;
    } else {
        pending_conn_adds.insert(ideal_location);  // Re-queue and wait
    }
}

Impact: If a connection attempt takes 60 seconds (OPERATION_TTL), all other connection attempts are blocked for those 60 seconds. With a min_connections target of 25 and serial acquisition, the worst case is roughly 25 × 60 s ≈ 25 minutes just to reach the minimum, far longer than the test windows allow.

2. No Timeout on WaitingForResponses (CRITICAL)

Location: crates/core/src/operations/connect.rs:769-792

The JoinerState has a last_progress: Instant field that is tracked but never enforced:

pub(crate) struct JoinerState {
    pub target_connections: usize,
    pub last_progress: Instant,  // Tracked but NEVER checked for timeout
    pub accepted: HashSet<PeerKeyLocation>,
    // ...
}

Impact: If a connect operation sends requests but receives no responses, it waits indefinitely. There's no mechanism to abort stalled operations.

Forward attempts time out after 20 seconds (FORWARD_ATTEMPT_TIMEOUT), but this only records the failure for the estimator; it doesn't retry or fail the overall operation.

3. Tests Start Before Network Ready (MODERATE)

Location: crates/freenet-macros/src/codegen.rs:543-580

Tests use a fixed 15-second sleep instead of condition-based waiting:

tracing::info!("Waiting {} seconds for nodes to start up", startup_wait_secs);
tokio::time::sleep(Duration::from_secs(startup_wait_secs)).await;

Impact: For 6-peer networks, 15 seconds may not be enough for peer-to-peer connections to form, especially with serial acquisition.

4. Initial Join is Fire-and-Forget (MODERATE)

Location: crates/core/src/node/p2p_impl.rs:106-114

The initial_join_procedure() is spawned as a background task but never awaited:

let join_handle = connect::initial_join_procedure(...).await?;
self.initial_join_task = Some(join_handle);  // Stored but never awaited

Impact: Node startup completes before the join procedure finishes, leading to race conditions.
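
One possible direction is to bound node startup on the join task instead of detaching it entirely. A minimal sketch, assuming initial_join_task holds a tokio JoinHandle<()> and picking an arbitrary 10-second budget (both are assumptions, not the current API):

use std::time::Duration;

// Sketch: await the stored join handle with a deadline so startup can't
// silently race ahead of the join procedure.
if let Some(join_handle) = self.initial_join_task.take() {
    match tokio::time::timeout(Duration::from_secs(10), join_handle).await {
        Ok(Ok(())) => tracing::debug!("initial join completed during startup"),
        Ok(Err(join_err)) => tracing::warn!(%join_err, "initial join task failed"),
        Err(_elapsed) => tracing::warn!("initial join still pending after 10s; continuing startup"),
    }
}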


Recommended Fixes

Priority 1: Allow Parallel Connection Attempts

File: crates/core/src/ring/mod.rs

Instead of a single live_tx: Option<Transaction>, use a bounded set of in-flight transactions:

const MAX_CONCURRENT_CONNECTIONS: usize = 3;

// In the maintenance loop:
if live_tx_tracker.active_count() < MAX_CONCURRENT_CONNECTIONS {
    if let Some(ideal_location) = pending_conn_adds.pop_first() {
        if let Some(tx) = self.acquire_new(ideal_location, ...).await? {
            live_tx_tracker.add(tx);
        }
    }
}
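
The live_tx_tracker would need a small extension to count and release multiple in-flight transactions. A minimal sketch of that shape; the field and method names here are illustrative, not the existing tracker API:

use std::collections::HashSet;

// Illustrative bounded tracker; `Transaction` stands for the existing
// transaction id type (assumed to be Eq + Hash).
struct LiveTxTracker {
    in_flight: HashSet<Transaction>,
    max_concurrent: usize,
}

impl LiveTxTracker {
    fn active_count(&self) -> usize {
        self.in_flight.len()
    }

    fn add(&mut self, tx: Transaction) {
        self.in_flight.insert(tx);
    }

    // Must be called when a connect operation completes, fails, or times out;
    // otherwise the slot leaks and acquisition stalls again.
    fn remove(&mut self, tx: &Transaction) {
        self.in_flight.remove(tx);
    }
}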

Priority 2: Add Timeout to WaitingForResponses

File: crates/core/src/operations/connect.rs

Enforce the last_progress timeout (e.g., fail the operation after 30 seconds without progress):

const JOINER_PROGRESS_TIMEOUT: Duration = Duration::from_secs(30);

// In handle_response or a periodic check:
if now.duration_since(state.last_progress) > JOINER_PROGRESS_TIMEOUT {
    self.state = Some(ConnectState::Failed);
    return Err(OpError::Timeout);
}
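
If the check runs from a periodic tick rather than from handle_response, it can be a simple sweep over live joiner operations. A sketch, assuming connect.rs can iterate (transaction, last_progress) pairs somewhere; the helper name and failure callback are hypothetical:

use std::time::{Duration, Instant};

const JOINER_PROGRESS_TIMEOUT: Duration = Duration::from_secs(30);

// Sketch only: generic over the transaction id type so it doesn't assume the
// real Transaction API; `fail_with_timeout` stands in for whatever failure
// path connect.rs uses (e.g., completing the op with OpError::Timeout).
fn sweep_stalled_joins<T: Clone>(
    pending_joins: &mut Vec<(T, Instant)>, // (transaction, last_progress)
    now: Instant,
    mut fail_with_timeout: impl FnMut(T),
) {
    pending_joins.retain(|(tx, last_progress)| {
        if now.duration_since(*last_progress) > JOINER_PROGRESS_TIMEOUT {
            fail_with_timeout(tx.clone());
            false // drop the stalled operation
        } else {
            true
        }
    });
}

Whatever path handles accepted responses should also reset last_progress, so only genuinely idle operations get swept.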

Priority 3: Condition-Based Test Waiting (if the fixes above don't fully resolve the flakiness)

File: crates/freenet-macros/src/codegen.rs

Tests should use wait_for_connections = true instead of a fixed sleep:

#[freenet_test(
    nodes = ["gateway", "peer1", ...],
    wait_for_connections = true,
    expected_connections = 2,  // Each peer should have 2+ connections
)]
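
Under the hood, the macro (or a shared test helper) would poll instead of sleeping. A sketch of that shape, assuming some way to query a node's current connection count; TestNode and connected_peer_count are hypothetical names, not existing helpers:

use std::time::{Duration, Instant};

// Sketch: poll every node until each reports at least `expected` connections,
// failing with a descriptive error instead of silently proceeding after a
// fixed sleep.
async fn wait_for_connections(
    nodes: &[TestNode],
    expected: usize,
    deadline: Duration,
) -> Result<(), String> {
    let start = Instant::now();
    loop {
        let mut below_target = Vec::new();
        for node in nodes {
            let count = node.connected_peer_count().await;
            if count < expected {
                below_target.push((node.name(), count));
            }
        }
        if below_target.is_empty() {
            return Ok(());
        }
        if start.elapsed() > deadline {
            return Err(format!("nodes below {expected} connections: {below_target:?}"));
        }
        tokio::time::sleep(Duration::from_millis(500)).await;
    }
}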

Priority 4: Retry Logic for Failed Forwards (Nice-to-have)

File: crates/core/src/operations/connect.rs

When forward attempts time out, retry with a different peer instead of just recording the failure:

// After FORWARD_ATTEMPT_TIMEOUT:
if let Some(alternative_peer) = self.find_alternative_forward_target(&skip_list) {
    actions.forward = Some((alternative_peer, request.clone()));
}
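
find_alternative_forward_target doesn't exist yet; one plausible shape is a greedy pick of the closest non-skipped candidate on the ring. A sketch with deliberately simplified types (string peer ids, f64 locations), not the real PeerKeyLocation/Location API:

use std::collections::HashSet;

// Sketch only: choose the candidate closest to the desired ring location
// while honoring the skip list.
fn find_alternative_forward_target(
    candidates: &[(String, f64)], // (peer id, ring location)
    skip_list: &HashSet<String>,
    desired: f64,
) -> Option<(String, f64)> {
    candidates
        .iter()
        .filter(|(peer, _)| !skip_list.contains(peer))
        .min_by(|(_, a), (_, b)| {
            ring_distance(*a, desired)
                .partial_cmp(&ring_distance(*b, desired))
                .unwrap_or(std::cmp::Ordering::Equal)
        })
        .cloned()
}

// Distance on the unit ring [0, 1).
fn ring_distance(a: f64, b: f64) -> f64 {
    let d = (a - b).abs();
    d.min(1.0 - d)
}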

Testing

After fixes, these should pass reliably:

  • six-peer-regression
  • six-peer-connection-cap
  • Subscription forwarding tests
  • Any test with 5+ peers

Notes for Agents

  • Priorities 1 and 2 are the core fixes; the others may not be needed if those land
  • The live_tx_tracker already exists and tracks transactions per peer; extend it to support multiple concurrent transactions
  • Be careful not to overwhelm the network; 3 parallel connections is a reasonable starting point
  • Consider adding metrics/logging to track connection acquisition latency; see the sketch below
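
For the latency metric, a lightweight option is to time each acquisition attempt and emit a structured tracing event. A sketch; the target name and fields are illustrative, and the acquire_new arguments are elided as in the snippets above:

use std::time::Instant;

// Sketch: record how long each acquisition attempt takes so slow or stalled
// attempts show up directly in logs/metrics.
let started = Instant::now();
let result = self.acquire_new(ideal_location, /* ... */).await;
tracing::info!(
    target: "freenet_core::ring::connect_latency",
    elapsed_ms = started.elapsed().as_millis() as u64,
    success = result.as_ref().map(|tx| tx.is_some()).unwrap_or(false),
    "connection acquisition attempt finished"
);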

Related

[AI-assisted - Claude]

    Labels

    A-developer-xp (Area: developer experience), A-networking (Area: Networking, ring protocol, peer discovery), E-medium (Experience needed to fix/implement: Medium / intermediate), P-critical (Critical priority), T-bug (Type: Something is broken)
