
feat(nat): rotate hole-punch coordinators with capped per-attempt timeout#43

Merged
mickvandijke merged 3 commits into rc-2026.4.1 from feat/hole-punch-coordinator-rotation on Apr 6, 2026

Conversation

mickvandijke (Collaborator) commented Apr 6, 2026

Summary

  • Replaces the single-coordinator hole-punch hint with a ranked list (Tier 2): callers supply candidates best-first, and the connect loop rotates through them with a short per-attempt timeout (1.5s) for non-final candidates and the strategy's full timeout for the last. Worst-case wait drops from K * 8s to (K-1) * 1.5s + 8s.
  • Adds coordinator-side back-pressure (Tier 4 lite): BootstrapCoordinator tracks active relay slots keyed by (initiator_peer_id, target_peer_id) and silently refuses incoming PUNCH_ME_NOW relays once at capacity. Inline garbage-collection sweep reclaims stale slots so ghost sessions cannot leak the counter. Configurable via two new NatTraversalConfig fields: coordinator_max_active_relays (default 32) and coordinator_relay_slot_timeout (default 5s).
  • No wire-protocol changes — back-pressure is purely timing-based (silent refusal triggers Tier 2 rotation on the initiator). Backwards-compatible with older peers.
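The per-attempt timeout rule above can be sketched as a small helper (the constant name matches the PR; the helper function itself is illustrative, not the PR's actual code):

```rust
use std::time::Duration;

/// Quick timeout for non-final rotation attempts (value from the PR).
const PER_COORDINATOR_QUICK_HOLEPUNCH_TIMEOUT: Duration = Duration::from_millis(1500);

/// Pick the timeout for attempt `idx` out of `total` preferred coordinators:
/// every attempt but the last gets the short quick timeout, and the final
/// attempt gets the strategy's full hole-punch timeout. Summing over all K
/// attempts gives the (K-1) * 1.5s + full_timeout worst case from the summary.
fn attempt_timeout(idx: usize, total: usize, full_timeout: Duration) -> Duration {
    if idx + 1 >= total {
        full_timeout
    } else {
        PER_COORDINATOR_QUICK_HOLEPUNCH_TIMEOUT
    }
}
```

With K = 3 and an 8s strategy timeout, the worst-case wait is 1.5s + 1.5s + 8s = 11s instead of 24s.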

Why

Today every cold-starting peer pins one of the small set of bootstrap nodes (saorsa-2 / saorsa-3) as its hole-punch coordinator, concentrating active coordinator load on infrastructure that is already serving DHT traffic. With this change a single overloaded coordinator stops blocking the dial: the initiator times out fast (1.5s) and rotates to the next preferred candidate, and the coordinator itself sheds excess load by silently refusing relays once it has 32 in flight.

Design choice

Considered an explicit `HOLE_PUNCH_COORDINATOR_BUSY` wire frame for instant initiator notification but the cost — new frame type, codec, parser dispatch, version compatibility — was not justified. Timing-based back-pressure wastes ~1.5s per busy coordinator but ships in roughly 1/3 the code with zero protocol surface and is fully backwards-compatible.

Test plan

  • `cargo fmt --all -- --check` clean
  • `cargo clippy --all-targets -- -D warnings -D clippy::unwrap_used -D clippy::expect_used` clean
  • `cargo test --lib` passes (1480 tests, +10 new)
  • 5 new unit tests for `P2pEndpoint::merge_preferred_coordinators`: empty preferred no-op, front insertion in order, dedup of pre-existing entries, non-preferred entries keep their original order, empty candidate list
  • 5 new unit tests for `BootstrapCoordinator` back-pressure: under-capacity accept, at-capacity silent refuse with stat increment, slot re-arm without consuming additional capacity, sweep reclaims stale slots, non-relay frames do not consume slots
  • No breaking API changes — old `set_hole_punch_preferred_coordinator` is retained as a thin wrapper around the new list form
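The retained single-coordinator setter as a thin wrapper over the list form might look like this (a sketch; the real `P2pEndpoint` stores a `DashMap<SocketAddr, Vec<SocketAddr>>` and its signatures may differ):

```rust
use std::collections::HashMap;
use std::net::SocketAddr;

struct P2pEndpoint {
    // Ranked preferred coordinators per peer, best-first (simplified
    // stand-in for the PR's DashMap<SocketAddr, Vec<SocketAddr>>).
    preferred: HashMap<SocketAddr, Vec<SocketAddr>>,
}

impl P2pEndpoint {
    /// New list form: callers supply a ranked list, best-first.
    fn set_hole_punch_preferred_coordinators(&mut self, peer: SocketAddr, coords: Vec<SocketAddr>) {
        self.preferred.insert(peer, coords);
    }

    /// Old single form, kept for API compatibility as a thin wrapper
    /// around the list form (a one-element ranked list).
    fn set_hole_punch_preferred_coordinator(&mut self, peer: SocketAddr, coord: SocketAddr) {
        self.set_hole_punch_preferred_coordinators(peer, vec![coord]);
    }
}
```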

Companion PR

Second of three PRs landing smarter hole-punch coordinator selection. The first (saorsa-labs/saorsa-core#TBD) added round-aware referrer ranking in the DHT layer and is independent of this one. The third will wire saorsa-core to call the new `set_hole_punch_preferred_coordinators` API once this merges.

🤖 Generated with Claude Code

Greptile Summary

This PR implements two back-pressure improvements to hole-punch coordinator selection: a ranked coordinator list with per-attempt rotation (Tier 2), where non-final candidates get a 1.5s quick timeout before advancing to the next, and a node-wide RelaySlotTable (Tier 4 lite) that caps simultaneous in-flight (initiator_addr, target_peer_id) relay sessions and silently refuses when at capacity. The implementation is correct, well-documented, and well-tested (10 new unit tests), and the previously flagged concern about P2pConfig::to_nat_config() not propagating the new fields has been addressed — both coordinator_max_active_relays and coordinator_relay_slot_idle_timeout flow correctly through the chain NatConfig → to_nat_config() → NatTraversalConfig → RelaySlotTable::new.

Confidence Score: 5/5

Safe to merge — no P0 or P1 issues found; implementation is correct, well-tested, and well-documented.

All 10 changed/new files reviewed thoroughly. The relay slot table logic, rotation bounds, lock ordering, config propagation, and Drop-based cleanup are all correct. Prior review concerns about to_nat_config() propagation were addressed. Remaining observations (duplicate-preferred inefficiency, stale relay_initiator_addr on QUIC migration) are P2 edge cases bounded by the 5s idle timeout and do not affect correctness or reliability in normal operation.

No files require special attention.

Important Files Changed

Filename Overview
src/relay_slot_table.rs New node-wide back-pressure table; lock-drop-before-warn pattern is correct, amortized sweep bounds per-frame cost, 6 focused inline tests cover all code paths
src/p2p_endpoint.rs Adds merge_preferred_coordinators (O(N+M), correct dedup), rotation index arithmetic is bounds-safe, debug_assert guards sync between index and strategy coordinator, 5 new unit tests
src/connection/nat_traversal.rs BootstrapCoordinator correctly threads relay_slot_table Arc; Drop releases initiator slots; relay_initiator_addr captured once (stale on migration, but bounded by 5s idle timeout)
src/nat_traversal_api.rs New config fields with serde defaults and validation; single RelaySlotTable Arc shared via clone across both server and client TransportConfig — cap is correctly node-wide
src/unified_config.rs NatConfig gains both new fields with correct defaults; to_nat_config() propagates them to NatTraversalConfig (addresses prior review concern)
src/config/transport.rs relay_slot_table: Option<Arc<RelaySlotTable>> added to TransportConfig; defaults to None for Quinn-style fixtures that bypass P2pEndpoint
src/connection/mod.rs Both NatTraversalState::new call sites correctly thread self.config.relay_slot_table.clone() from the frozen config
src/lib.rs Exposes pub mod relay_slot_table; no other changes
tests/relay_queue_tests.rs Existing test struct literals updated with new NatTraversalConfig fields; no logic changes
tests/security_regression_tests.rs Existing test helpers updated with new NatTraversalConfig fields; no logic changes

Sequence Diagram

sequenceDiagram
    participant I as Initiator
    participant C1 as Coordinator 1 (preferred)
    participant C2 as Coordinator 2 (preferred)
    participant C3 as Coordinator 3 (fallback)
    participant T as Target

    Note over I: set_hole_punch_preferred_coordinators([C1, C2])
    Note over I: preferred_coordinator_count = 2

    I->>C1: PUNCH_ME_NOW (attempt 1)
    Note over C1: try_acquire slot → capacity check
    alt C1 at capacity (Tier 4 back-pressure)
        C1-->>I: silent drop (no reply)
        Note over I: PER_COORDINATOR_QUICK_HOLEPUNCH_TIMEOUT (1.5s)
    else C1 available
        C1->>T: PUNCH_ME_NOW relay
        T-->>I: QUIC connection attempt
    end

    Note over I: Timeout after 1.5s → rotate
    Note over I: current_preferred_coordinator_idx = 0 → 1
    Note over I: is_final_rotation_attempt = true (1+1 >= 2)

    I->>C2: PUNCH_ME_NOW (attempt 2, full strategy timeout ~8s)
    alt C2 available
        C2->>T: PUNCH_ME_NOW relay
        T->>I: QUIC connection (success)
    else C2 fails / times out
        Note over I: should_retry_holepunch()?
        alt retry budget remaining
            I->>C2: retry with C2
        else exhausted
            I->>C3: MASQUE relay fallback
        end
    end


Reviews (2) — last reviewed commit: "refactor!(nat): node-wide coordinator ba..."

jacderida and others added 2 commits April 6, 2026 21:07
The per-connection rate limit of 5 coordination requests per 60-second
window is too restrictive for bootstrap nodes acting as coordinators.
On a network with 40+ NAT-restricted nodes, a client needs to send
~80 PUNCH_ME_NOW frames (2 rounds per target) during initial connection
establishment. The low limit causes 50% of relay requests to be silently
dropped, preventing hole punches from succeeding.

Raise to 50 per minute to allow clients to hole-punch to all targets
in a single burst while still providing protection against amplification
attacks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…eout

Adds two coordinated changes that move coordinator load off the small
set of bootstrap nodes that today serve as the de-facto hole-punch
coordinator for every cold-starting peer.

Tier 2 — list-form preferred coordinators with rotation:

- hole_punch_preferred_coordinators changes from
  DashMap<SocketAddr, SocketAddr> to DashMap<SocketAddr, Vec<SocketAddr>>.
  Callers (saorsa-core's DHT layer) supply a ranked list best-first;
  the existing single-coordinator setter remains as a thin wrapper.
- A new merge_preferred_coordinators helper inserts the ranked list
  at the front of coordinator_candidates, deduping any pre-existing
  copies. Pure function, unit-tested.
- The hole-punch loop in connect_with_fallback_inner rotates through
  the first N candidates (where N = preferred list length). All but
  the final attempt use a new short timeout
  (PER_COORDINATOR_QUICK_HOLEPUNCH_TIMEOUT = 1.5s) so a busy or
  unreachable coordinator is abandoned quickly; the final attempt
  uses the strategy's full hole-punch timeout to give it time to
  complete the punch. When no preferred list is set, the legacy
  single-coordinator retry behaviour is preserved.
- Bounds-safe rotation via .get() instead of indexing.
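One rotation step of the loop described above might be sketched like this (illustrative helper; the real loop lives in connect_with_fallback_inner and tracks more state):

```rust
use std::net::SocketAddr;

/// One rotation step: fetch candidate `idx` bounds-safely with `.get()`
/// instead of indexing, so a stale index can never panic, and report
/// whether this is the final attempt so the caller can choose between
/// the 1.5s quick timeout and the strategy's full hole-punch timeout.
fn rotation_step(preferred: &[SocketAddr], idx: usize) -> Option<(SocketAddr, bool)> {
    let coord = *preferred.get(idx)?; // None on a stale index, never a panic
    let is_final = idx + 1 >= preferred.len();
    Some((coord, is_final))
}
```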

Tier 4 (lite) — coordinator-side back-pressure with silent refusal:

- New NatTraversalConfig fields coordinator_max_active_relays
  (default 32) and coordinator_relay_slot_timeout (default 5s),
  validated in 1..=256 and 100ms..=60s respectively.
- Plumbed through TransportConfig via setters mirroring
  allow_loopback, into NatTraversalState::new and
  BootstrapCoordinator::new.
- BootstrapCoordinator gains a relay_slots HashMap keyed by
  (initiator_peer_id, target_peer_id) recording arrival Instant,
  plus a backpressure_refusals stat.
- process_punch_me_now_frame inline-sweeps stale slots before the
  back-pressure check (so ghost slots from crashed peers cannot
  leak the counter), then either accepts and inserts/refreshes the
  slot or silently returns Ok(None) and increments the refusal stat.
  Re-arming the same (initiator, target) pair refreshes the slot
  timestamp without consuming additional capacity.
- Only the relay branch (frames carrying target_peer_id) consumes a
  slot — non-relay echo frames are unchanged.
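The accept-or-silently-refuse logic above can be sketched as follows (field and method names are illustrative stand-ins, not the PR's actual types):

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Sketch of the coordinator-side back-pressure described above,
/// keyed by (initiator id, target id).
struct RelaySlots {
    slots: HashMap<([u8; 32], [u8; 32]), Instant>,
    max_active: usize,   // coordinator_max_active_relays, default 32
    idle_timeout: Duration, // coordinator_relay_slot_timeout, default 5s
    refusals: u64,       // backpressure_refusals stat
}

impl RelaySlots {
    /// Sweep stale slots first (so ghost slots cannot leak the counter),
    /// then either accept, inserting or refreshing the slot, or silently
    /// refuse. Re-arming an existing (initiator, target) pair refreshes
    /// its timestamp without consuming additional capacity.
    fn try_acquire(&mut self, initiator: [u8; 32], target: [u8; 32]) -> bool {
        let now = Instant::now();
        let timeout = self.idle_timeout;
        self.slots.retain(|_, t| now.duration_since(*t) < timeout);
        let key = (initiator, target);
        if self.slots.contains_key(&key) || self.slots.len() < self.max_active {
            self.slots.insert(key, now); // accept or refresh
            true
        } else {
            self.refusals += 1; // silent refusal: no reply is sent
            false
        }
    }
}
```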

The two tiers compose: a coordinator at capacity silently drops the
relay; the initiator's per-attempt timeout (Tier 2) drives it to its
next preferred coordinator. No new wire frame is needed, so this is
backwards-compatible with older peers.

Tests added:
- 5 unit tests for merge_preferred_coordinators (front insertion,
  ordering preservation, dedup, empty handling)
- 5 unit tests for BootstrapCoordinator back-pressure (under-cap
  accept, at-cap silent refuse, slot re-arm without capacity
  consumption, sweep reclaims stale slots, non-relay frames don't
  consume slots)

Test files (security_regression_tests.rs, relay_queue_tests.rs)
updated to include the two new fields in their NatTraversalConfig
struct literals.

This is the second of three PRs landing smarter hole-punch coordinator
selection. The first (saorsa-core) added round-aware referrer ranking
in the DHT layer; the third will wire saorsa-core to call the new
list-form set_hole_punch_preferred_coordinators API.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
mickvandijke marked this pull request as draft April 6, 2026 22:42
…lotTable

Addresses PR #43 deep-review findings H1, H2, H3, M1, M2, L1–L6.

## H1 — node-wide scope (breaking)

Back-pressure state moves out of BootstrapCoordinator into a new shared
`RelaySlotTable` (`src/relay_slot_table.rs`) that every QUIC connection
spawned by a NatTraversalEndpoint references via Arc. The 32-slot cap is
now enforced *across all connections at the node*, not per-connection —
which is what the PR description originally promised but the code didn't
deliver. One table is instantiated in `create_inner_endpoint` and
injected into both server-side and client-side TransportConfig.

The old `TransportConfig::{coordinator_max_active_relays,
coordinator_relay_slot_timeout}` fields are gone; they are replaced by
`TransportConfig::relay_slot_table: Option<Arc<RelaySlotTable>>`.

BREAKING: `TransportConfig::coordinator_max_active_relays` and
`coordinator_relay_slot_timeout` setters removed. Callers using
`NatTraversalConfig` should switch to `coordinator_max_active_relays`
(unchanged) and the renamed `coordinator_relay_slot_idle_timeout`.

## H2 — explicit release on connection close

`impl Drop for BootstrapCoordinator` calls
`RelaySlotTable::release_for_initiator(addr)`, reclaiming every slot
owned by the closed connection immediately. The idle timeout is now an
honest safety net for crashes/NAT rebinds, not the only release path.
The coordinator cannot directly observe hole-punch outcomes (traffic
flows initiator↔target, bypassing the coordinator), so the three
release mechanisms are: connection close (fast), re-arm refresh, and
idle sweep. All three are documented on the config field.
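The Drop-based release might look roughly like this (a sketch under the names above; the real table stores more state and avoids `unwrap` per the project's clippy lints):

```rust
use std::collections::HashMap;
use std::net::SocketAddr;
use std::sync::{Arc, Mutex};
use std::time::Instant;

/// Node-wide slot table shared across connections via Arc (sketch).
struct RelaySlotTable {
    slots: Mutex<HashMap<(SocketAddr, [u8; 32]), Instant>>,
}

impl RelaySlotTable {
    /// Reclaim every slot owned by this initiator's address, leaving
    /// other initiators' slots untouched.
    fn release_for_initiator(&self, addr: SocketAddr) {
        if let Ok(mut slots) = self.slots.lock() {
            slots.retain(|(init, _), _| *init != addr);
        }
    }
}

struct BootstrapCoordinator {
    initiator_addr: SocketAddr,
    table: Arc<RelaySlotTable>,
}

impl Drop for BootstrapCoordinator {
    fn drop(&mut self) {
        // Fast path: slots are released the moment the connection closes;
        // the idle sweep remains only a safety net for crashes/NAT rebinds.
        self.table.release_for_initiator(self.initiator_addr);
    }
}
```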

## H3 — drop dead `from_peer` from slot key

Slot key is now `(SocketAddr, [u8; 32])` — `(initiator_addr,
target_peer_id)`. The old `from_peer = derive_peer_id_from_connection()`
was a connection-id hash that was constant per BootstrapCoordinator
instance, so keying on it added zero discrimination in production. The
socket address is stable across rounds within a session and distinct
across initiators, which is exactly what dedup needs.

## M1 — rotation / max_holepunch_rounds interaction documented

New paragraph on `set_hole_punch_preferred_coordinators` explains how
the rotation index and strategy round counter advance together, and the
edge case where a caller raising `max_holepunch_rounds` above K gives
the final coordinator extra retries.

## M2 — fields plumbed through `P2pConfig::nat`

`NatConfig` gains `coordinator_max_active_relays` and
`coordinator_relay_slot_idle_timeout`; `to_nat_config()` now reads from
`self.nat.*` instead of hardcoding defaults, so downstream P2pConfig
callers can tune the cap.

## Minor fixes

- L1: misleading "Coordination completed" trace on the refusal path is
  replaced with an accurate "refused by node-wide back-pressure" trace.
- L2: RelaySlotTable warns at the first refusal and every 16 thereafter
  so operators see a log line at the start of a storm.
- L3: debug_assert! in the connect loop proves
  `coordinator_candidates[idx] == strategy_coordinator` while rotating.
- L4: sweep is amortized — `sweep_if_due` runs the `retain` only if
  the previous sweep was at least 100ms ago, bounding per-frame cost.
- L5: `merge_preferred_coordinators` builds the merged list with one
  allocation instead of `Vec::insert(0, ..)` in a loop (O(N+M) not
  O(N·M)).
- L6: test helpers use `VarInt::from_u32` directly; dead `unwrap_or`
  fallback gone.

## Tests

- 6 new `RelaySlotTable` unit tests (under-cap, at-cap refuse, re-arm,
  idle sweep, release_for_initiator, refusal counter).
- 4 new `BootstrapCoordinator`-integration tests verifying the shared
  table is consulted for the relay branch, the non-relay branch doesn't
  consume a slot, capacity refusal is silent, and — new for H2 — that
  dropping the coordinator releases exactly the slots it owned while
  leaving other initiators' slots alone.
- Old 5 per-coordinator back-pressure tests are gone (the data
  structure they exercised no longer exists).
- Existing relay_queue_tests.rs and security_regression_tests.rs
  struct literals updated for the renamed field.

1485 lib tests + 24 integration tests pass; cargo fmt clean;
cargo clippy --all-targets -- -D warnings -D clippy::unwrap_used
-D clippy::expect_used clean.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
mickvandijke marked this pull request as ready for review April 6, 2026 23:13
mickvandijke merged commit 2441896 into rc-2026.4.1 Apr 6, 2026
3 of 5 checks passed
mickvandijke added a commit that referenced this pull request Apr 7, 2026
… key

The coordinator-rotation flow added in PR #43 was producing a "duplicate
connection" close storm under symmetric NAT during testnet validation.
Each rotation round opened a parallel A→B QUIC dial via the per-round
`post_direct` probe alongside the relayed B→A return; under symmetric
NAT each return arrived on a fresh source port that the SocketAddr-keyed
dedup in `spawn_accept_loop` could not collapse, so several connections
accumulated for the same logical peer. The first one promoted while the
losers were closed as `b"duplicate"` — and any of those closes that
happened to be the one saorsa-core's lifecycle monitor was tracking
killed the identity exchange, producing the consistent 15 s timeouts the
testnet logs showed.

Three changes in `src/p2p_endpoint.rs`:

1. **Bump `PER_COORDINATOR_QUICK_HOLEPUNCH_TIMEOUT` from 1.5 s to 4 s.**
   1.5 s was below one full PUNCH_ME_NOW round trip on cross-region
   links, so rotation routinely fired while the previous round's
   relayed return was still in flight. 4 s gives a comfortably high
   margin and dramatically reduces the racing-attempt window. Worst-
   case wait for K preferred coordinators is now `(K-1) * 4s +
   holepunch_timeout` instead of `K * holepunch_timeout`, still well
   below `K * 8s`.

2. **Remove the per-rotation `post_direct` probe.** The probe is
   replaced by a new `try_post_rotation_direct` helper that fires
   exactly **once** at the end of the rotation chain, after every
   coordinator has been tried. This preserves the optimisation — the
   accumulated NAT bindings from the rotation may finally let a direct
   dial through — without producing a per-round race.

3. **Add peer-identity dedup at `P2pEndpoint::accept`.** A new
   `find_live_connection_by_public_key` helper scans `connected_peers`
   for any existing entry whose TLS public key matches the new
   incoming connection. On a hit the new connection is closed silently
   with `b"duplicate-peer-id"` and never registered, so symmetric-NAT
   rebinds collapse to the single surviving authoritative connection
   regardless of which `SocketAddr` they arrived on. The helper filters
   stale entries via `self.inner.is_connected(addr)` so a dead
   connection that hasn't been swept yet does not block its
   replacement.
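The key-based dedup lookup might be sketched like this (names and the `is_live` check are illustrative stand-ins for the PR's helpers and `connected_peers` map):

```rust
use std::collections::HashMap;
use std::net::SocketAddr;

/// Scan connected peers for a live entry whose TLS public key matches the
/// incoming connection's key, filtering out stale entries so a dead
/// connection that hasn't been swept yet does not block its replacement.
fn find_live_connection_by_public_key(
    connected: &HashMap<SocketAddr, [u8; 32]>, // addr -> peer public key
    is_live: impl Fn(SocketAddr) -> bool,      // stand-in for inner.is_connected
    incoming_key: &[u8; 32],
) -> Option<SocketAddr> {
    connected
        .iter()
        .find(|&(addr, key)| key == incoming_key && is_live(*addr))
        .map(|(addr, _)| *addr)
}
```

On a hit the new connection would be closed with `b"duplicate-peer-id"`, collapsing symmetric-NAT rebinds onto the surviving connection regardless of source `SocketAddr`.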

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
grumbach added a commit to grumbach/saorsa-transport that referenced this pull request Apr 8, 2026
…ing it

The accept loop was closing the NEWER incoming connection when a live
connection to the same address existed, keeping the older one. This is
backwards: the newer connection is the one the remote peer just completed
a handshake on and is actively using. Closing it kills their identity
exchange, causing 15s timeouts.

This was the root cause of the regression from PR saorsa-labs#43 (coordinator
rotation): when the rotation opened a new connection to a coordinator
that already had an old connection, the acceptor closed the new one,
leaving neither side with a working connection.

Fix: replace old connection with new one (consistent with add_connection
which also always overwrites with the newer connection). Also reset the
emitted set so the replacement gets a reader task and PeerConnected event.

Defense in depth: try_hole_punch now checks the DashMap in addition to
connected_peers before opening a new coordinator connection, preventing
unnecessary duplicates from being created in the first place.