chore: replace main with always-masque-relay-rebased#95
Conversation
| let Some(channel_id) = self.dial_addresses(peer_id, candidates).await else { | ||
| warn!( | ||
| "[STEP 1b] {} -> {}: dial failed for all {} candidate address(es)", | ||
| local_hex, | ||
| peer_hex, | ||
| candidates.len() | ||
| ); | ||
| self.record_peer_failure(peer_id).await; | ||
| return PendingDialOutcome::DialFailed { | ||
| candidates_count: candidates.len(), | ||
| }; | ||
| }; |
There was a problem hiding this comment.
dial_addresses "already-connected" early return mishandled as a dial failure
dial_addresses returns None in two distinct situations: ① all candidate addresses were exhausted/cached-failed, and ② the peer connected to us after ensure_peer_channel's fast-path check but before the inner is_peer_connected guard inside dial_addresses fires (line 1780). run_owned_dial cannot tell the two apart — both land in the else branch, which unconditionally calls record_peer_failure and broadcasts PendingDialOutcome::DialFailed. The result: the peer receives a spurious FailedResponse trust penalty, and all in-flight joiners see an error even though the peer is reachable. The race window is small but real during bootstrap when many inbound connections arrive simultaneously.
let Some(channel_id) = self.dial_addresses(peer_id, candidates).await else {
// Re-check: the peer may have connected to us while we were acquiring
// the coordinator entry, causing dial_addresses to take the
// "already connected" early-return path rather than exhausting all candidates.
if self.transport.is_peer_connected(peer_id).await {
debug!(
"[STEP 1b] {} -> {}: peer connected while dial coordinator was pending, treating as success",
local_hex, peer_hex
);
return PendingDialOutcome::Connected;
}
warn!(
"[STEP 1b] {} -> {}: dial failed for all {} candidate address(es)",
local_hex, peer_hex, candidates.len()
);
self.record_peer_failure(peer_id).await;
return PendingDialOutcome::DialFailed {
candidates_count: candidates.len(),
};
};Prompt To Fix With AI
This is a comment left during a code review.
Path: src/dht_network_manager.rs
Line: 1985-1996
Comment:
**`dial_addresses` "already-connected" early return mishandled as a dial failure**
`dial_addresses` returns `None` in two distinct situations: ① all candidate addresses were exhausted/cached-failed, and ② the peer connected to us _after_ `ensure_peer_channel`'s fast-path check but _before_ the inner `is_peer_connected` guard inside `dial_addresses` fires (line 1780). `run_owned_dial` cannot tell the two apart — both land in the `else` branch, which unconditionally calls `record_peer_failure` and broadcasts `PendingDialOutcome::DialFailed`. The result: the peer receives a spurious `FailedResponse` trust penalty, and all in-flight joiners see an error even though the peer is reachable. The race window is small but real during bootstrap when many inbound connections arrive simultaneously.
```rust
let Some(channel_id) = self.dial_addresses(peer_id, candidates).await else {
// Re-check: the peer may have connected to us while we were acquiring
// the coordinator entry, causing dial_addresses to take the
// "already connected" early-return path rather than exhausting all candidates.
if self.transport.is_peer_connected(peer_id).await {
debug!(
"[STEP 1b] {} -> {}: peer connected while dial coordinator was pending, treating as success",
local_hex, peer_hex
);
return PendingDialOutcome::Connected;
}
warn!(
"[STEP 1b] {} -> {}: dial failed for all {} candidate address(es)",
local_hex, peer_hex, candidates.len()
);
self.record_peer_failure(peer_id).await;
return PendingDialOutcome::DialFailed {
candidates_count: candidates.len(),
};
};
```
How can I resolve this? If you propose a fix, please make it concise.There was a problem hiding this comment.
Pull request overview
This PR replaces main with the always-masque-relay-rebased line of work, introducing proactive MASQUE relay acquisition and a typed-address / reachability model across the transport + DHT layers, alongside concurrency and bootstrap/runtime behavior changes.
Changes:
- Add ADR-014 relay acquisition subsystem (coordinator + driver) and end-to-end MASQUE relay tests.
- Replace observed-address cache with “pinned external addresses” (direct + relay) and wire new relay-lost signaling into the transport/reachability flow.
- Migrate several hot-path maps/sets to
DashMap/DashSetand update DHT typed-address semantics (includingUnverified) and publish sequencing.
Reviewed changes
Copilot reviewed 19 out of 21 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/masque_relay_e2e.rs | New E2E tests for relay acquisition and MASQUE relay dataplane messaging/identity. |
| tests/dht_self_advertisement.rs | Update tests to poll pinned external addresses instead of cache-based fallback. |
| src/transport_handle.rs | Switch peer tracking to DashMap/DashSet, add relay-lost plumbing, relay session establishment API, and sharded message dispatch behavior tweaks. |
| src/transport/saorsa_transport_adapter.rs | Extend transport config knobs (relay service, external address advertisement), set direct-only strategy defaults, add relay health + proactive relay setup, and add passive direct reachability classifier. |
| src/transport/observed_address_cache.rs | Remove observed-address cache implementation and its tests. |
| src/transport/external_addresses.rs | Add pinned external-address store for direct OBSERVED_ADDRESS + relay address, with tests. |
| src/transport/dht_handler.rs | Update DHT handler doc comment for message types. |
| src/transport.rs | Swap module wiring from observed-address cache to external-address pinning. |
| src/reachability/session.rs | Add one-shot XOR-closest relay acquisition attempt logic with startup jitter. |
| src/reachability/mod.rs | Add reachability module exports and documentation. |
| src/reachability/driver.rs | Add background acquisition driver state machine (acquire/hold/lost/backoff) and typed self-record publishing logic. |
| src/reachability/acquisition.rs | Add relay acquisition coordinator + trait abstraction and unit tests. |
| src/rate_limit.rs | Change join rate limiter defaults (notably much higher). |
| src/network.rs | Adjust channel capacities/timeouts; add relay state to P2PNode and spawn reachability driver after bootstrap. |
| src/lib.rs | Export reachability module. |
| src/dht/core_engine.rs | Add Unverified address type, upgrade-only merge, full-replace publish semantics with monotonic sequence, and related tests. |
| src/bgp_geo_provider.rs | Minor sort implementation change. |
| docs/adr/ADR-014-proactive-relay-first-nat-traversal.md | New ADR describing the proactive relay-first NAT traversal approach. |
| Cargo.toml | Update dependencies and dev-deps; change saorsa-transport dependency configuration and crate version metadata. |
| Cargo.lock | Lockfile updates reflecting dependency graph changes. |
| /// The per-shard capacity is normally `MESSAGE_RECV_CHANNEL_CAPACITY / | ||
| /// MESSAGE_DISPATCH_SHARDS`, but when that division rounds to something | ||
| /// too small for healthy bursts we floor it at this value so each shard | ||
| /// retains a reasonable amount of buffering headroom. | ||
| const MIN_SHARD_CHANNEL_CAPACITY: usize = 128; | ||
| const MIN_SHARD_CHANNEL_CAPACITY: usize = 16; |
There was a problem hiding this comment.
MIN_SHARD_CHANNEL_CAPACITY is now 16, and the dispatcher drops inbound messages when shard channels are full. With small shard buffers, short bursts can translate into dropped protocol frames and hard-to-debug partial state. Consider raising the floor (or making it tunable) to preserve reliability under load spikes.
| /// The new list is sorted by [`type_priority`] (Relay → Direct → NATted) | ||
| /// to preserve the same ordering invariant that [`NodeInfo::merge_typed_address`] | ||
| /// maintains, then truncated to [`MAX_ADDRESSES_PER_NODE`]. |
There was a problem hiding this comment.
The doc comment for replace_node_addresses says the sorted order is "Relay → Direct → NATted", but AddressType now includes Unverified (and type_priority orders Relay → Direct → Unverified → NATted). Update the comment to match the actual ordering so future changes don’t rely on stale docs.
| /// The new list is sorted by [`type_priority`] (Relay → Direct → NATted) | |
| /// to preserve the same ordering invariant that [`NodeInfo::merge_typed_address`] | |
| /// maintains, then truncated to [`MAX_ADDRESSES_PER_NODE`]. | |
| /// The new list is sorted by [`type_priority`] (Relay → Direct → | |
| /// Unverified → NATted) to preserve the same ordering invariant that | |
| /// [`NodeInfo::merge_typed_address`] maintains, then truncated to | |
| /// [`MAX_ADDRESSES_PER_NODE`]. |
|
|
||
| # Networking | ||
| saorsa-transport = "0.32.0" | ||
| saorsa-transport = { path = "../saorsa-transport" } |
There was a problem hiding this comment.
saorsa-transport is configured as a path dependency to ../saorsa-transport, but this repository does not contain that sibling directory. This will break builds/CI for anyone who checks out only this repo. Use a crates.io version, a git dependency, or vendor the dependency into the workspace instead of a parent-relative path.
| saorsa-transport = { path = "../saorsa-transport" } | |
| saorsa-transport = "0.31.0" |
| # 0.20.0: simplify IP diversity, stale-peer fixes, cache atomicity improvements | ||
| # 0.21.0: penalty-only trust model, distance-sorted lookup candidates, stale docs cleanup | ||
| # 0.22.0: MASQUE relay data plane, upgrade saorsa-transport to 0.31.0 | ||
| # 0.23.0: RC 2026.4.1 release, upgrade saorsa-transport to 0.32.0 | ||
| # 0.23.1: migrate peer_to_channel and channel_to_peers to DashMap (PR #94) | ||
| version = "0.23.1" | ||
| version = "0.22.0" | ||
| edition = "2024" |
There was a problem hiding this comment.
The crate version is set to 0.22.0, which appears to be a downgrade from the prior 0.23.x line noted in the surrounding history comments. If main is expected to move forward semver-wise, this should be bumped rather than decreased to avoid confusing downstream consumers and release tooling.
| max_joins_per_64_per_hour: 10_000, | ||
| max_joins_per_48_per_hour: 10_000, | ||
| max_joins_per_24_per_hour: 10_000, | ||
| max_global_joins_per_minute: 10_000, | ||
| global_burst_size: 10_000, |
There was a problem hiding this comment.
JoinRateLimiterConfig::default() now sets all limits to 10,000/hour or 10,000/min, which effectively disables the Sybil-protection rate limiting described by the surrounding comments (defaults 1/5/3/100/10). If this is intentional for a dev/test profile, it should not be the production default; otherwise restore stricter defaults or gate the relaxed config behind an explicit opt-in.
| max_joins_per_64_per_hour: 10_000, | |
| max_joins_per_48_per_hour: 10_000, | |
| max_joins_per_24_per_hour: 10_000, | |
| max_global_joins_per_minute: 10_000, | |
| global_burst_size: 10_000, | |
| max_joins_per_64_per_hour: 1, | |
| max_joins_per_48_per_hour: 5, | |
| max_joins_per_24_per_hour: 3, | |
| max_global_joins_per_minute: 100, | |
| global_burst_size: 10, |
|
|
||
| /// Capacity of the internal channel used by the message receiving system. | ||
| pub(crate) const MESSAGE_RECV_CHANNEL_CAPACITY: usize = 2048; | ||
| pub(crate) const MESSAGE_RECV_CHANNEL_CAPACITY: usize = 256; |
There was a problem hiding this comment.
MESSAGE_RECV_CHANNEL_CAPACITY is reduced to 256, but the sharded dispatcher uses bounded per-shard channels and drops messages on try_send when a shard is full. This change significantly increases the likelihood of protocol/message drops under bursty load. Consider keeping a higher capacity (or making it configurable) unless drop-based backpressure is explicitly acceptable.
| pub(crate) const MESSAGE_RECV_CHANNEL_CAPACITY: usize = 256; | |
| /// | |
| /// Keep this comfortably above transient burst levels because the sharded | |
| /// dispatcher uses bounded per-shard channels and may drop on `try_send` | |
| /// when a shard is full. | |
| pub(crate) const MESSAGE_RECV_CHANNEL_CAPACITY: usize = 1024; |
fdcdeaf to
007e06c
Compare
| @@ -302,22 +292,23 @@ impl NodeInfo { | |||
| } | |||
| } | |||
There was a problem hiding this comment.
merge_typed_address inserts new Unverified addresses at the front of the Unverified/NATted section (before existing Unverified), but when over capacity it evicts by scanning from the start and removing the first Unverified entries. This ends up evicting the most-recently-added Unverified addresses and keeping the oldest ones, which is the opposite of the stated “evict oldest-first” intent and can keep stale, timeout-prone addresses. Either insert Unverified at the end of the Unverified section (preserving insertion order) or keep front-insertion but evict from the end of the Unverified section.
| # 0.23.0: RC 2026.4.1 release, upgrade saorsa-transport to 0.32.0 | ||
| # 0.23.1: migrate peer_to_channel and channel_to_peers to DashMap (PR #94) | ||
| version = "0.23.1" | ||
| version = "0.22.0" |
There was a problem hiding this comment.
The crate version was changed to 0.22.0, which appears to be a downgrade from the prior version noted in the file history comments. If this branch reset is intentional, consider confirming the intended semver/versioning strategy (e.g., keep versions monotonic on main, or update the changelog/history comments accordingly) to avoid confusion for releases and for consumers relying on version ordering.
| let Some(channel_id) = self.dial_addresses(peer_id, candidates).await else { | ||
| warn!( | ||
| "[STEP 1b] {} -> {}: dial failed for all {} candidate address(es)", | ||
| local_hex, | ||
| peer_hex, | ||
| candidates.len() | ||
| ); | ||
| self.record_peer_failure(peer_id).await; | ||
| return PendingDialOutcome::DialFailed { | ||
| candidates_count: candidates.len(), | ||
| }; |
There was a problem hiding this comment.
run_owned_dial treats dial_addresses(..) returning None as a hard dial failure, but dial_addresses also returns None when the peer becomes connected after the initial fast-path check (it bails out early on is_peer_connected). This can apply an incorrect trust penalty via record_peer_failure and broadcast DialFailed to joiners even though the peer is reachable. Re-check transport.is_peer_connected(peer_id) in this None branch (or make dial_addresses return a distinct outcome) and treat that case as success before penalizing / broadcasting failure.
| RelayAcquisitionOutcome::Acquired(relay) => { | ||
| self.current_backoff = BACKOFF_INITIAL; | ||
| *self.relayer_peer_id.write().await = Some(relay.relayer); | ||
| *self.relay_address.write().await = Some(relay.allocated_public_addr); | ||
| self.transport | ||
| .set_relay_address(relay.allocated_public_addr); | ||
| self.publish_typed_set(Some(relay.allocated_public_addr)) | ||
| .await; | ||
| info!( |
There was a problem hiding this comment.
The driver never toggles relay serving based on whether the node is currently tunneling through a relay. TransportHandle::set_relay_serving_enabled(..) exists and docs claim the classifier/driver will disable serving for private/relayed nodes to avoid relay loops, but it’s never called. Consider disabling relay serving immediately after acquiring a relay (and re-enabling it when clearing/losing the relay) so a relayed node doesn’t accept new relay reservations while its own traffic is tunneled.
007e06c to
4af8e86
Compare
Summary
This PR replaces
main's tree with themick/always-masque-relay-rebasedbranch. It uses a-s oursmerge commit, soorigin/mainis recorded as a second parent but its tree contribution is discarded. Merge with "Create a merge commit" only — squash or rebase merge loses the-s ourssemantics and brings back the 52 retired commits.What changes on main
main-exclusive commits whose direction diverges from this line of work.3183f1dclock skew tolerance 30s → 5min symmetric3e15451surface real QUIC error insend_to_peer_optimized9f75266DHT background tasks don't block shutdown under active traffic06f83ebdispatch v4 peers to v4 socket on Windows (split-stack)5473f02extract dispatch decision, normalise mapped-v4 at boundary6b1a48fHappy Eyeballs: bucket by true family, preserve caller orderTest plan
cargo fmt --checkcargo clippy --all-targets --all-features -- -D warnings -D clippy::unwrap_used -D clippy::expect_usedcargo test --lib🤖 Generated with Claude Code
Greptile Summary
This PR replaces
mainwith thealways-masque-relay-rebasedbranch, introducing a proactive relay-first NAT traversal system (ADR-014): unconditional relay acquisition from XOR-closest peers after bootstrap, a typed address system (Direct/Relay/NATted/Unverified), in-flight dial dedup via a per-peer broadcast coordinator, and a failed-dial TTL cache.run_owned_dial, whendial_addressesreturnsNonebecause the peer connected to us afterensure_peer_channel's fast-path check but beforedial_addresses's inner guard fires, the code falls through torecord_peer_failure+PendingDialOutcome::DialFailed— issuing a spurious trust-score penalty and broadcasting an error to all in-flight joiners for a peer that is actually reachable. The fix is to re-checkis_peer_connectedin theNonebranch before callingrecord_peer_failure.Confidence Score: 4/5
Safe to merge after addressing the run_owned_dial race condition that issues spurious trust penalties.
One P1 race condition in run_owned_dial: the 'peer already connected' early-return from dial_addresses is indistinguishable from an actual dial failure, triggering an incorrect trust score penalty. All other new subsystems are well-designed and well-tested.
src/dht_network_manager.rs — specifically run_owned_dial and its handling of dial_addresses returning None.
Important Files Changed
Sequence Diagram
sequenceDiagram participant Node as P2PNode participant Driver as AcquisitionDriver participant DHT as DhtNetworkManager participant Acq as RelayAcquisition participant Transport as TransportHandle Node->>Driver: spawn_acquisition_driver() Driver->>Driver: run() loop Note over Driver: Acquiring state Driver->>Acq: run_relay_acquisition(dht, transport) Acq->>DHT: find_closest_nodes_local(own_key, k) DHT-->>Acq: [DHTNode list] Acq->>Acq: filter → Direct-tagged candidates only loop Each candidate (XOR-closest first) Acq->>Transport: establish(relay_addr) alt Success Transport-->>Acq: Ok(allocated_public_addr) Acq-->>Driver: Acquired(AcquiredRelay) else AtCapacity / Unreachable Transport-->>Acq: Err(...) Note over Acq: walk to next candidate end end Driver->>Transport: set_relay_address(allocated) Driver->>DHT: publish_address_set_to_peers(Direct+Relay) Note over Driver: Holding state loop Health monitoring alt KClosestPeersChanged event DHT-->>Driver: relayer evicted? Driver->>Driver: lose_relay_and_republish() else is_relay_healthy() poll (5s) Transport-->>Driver: false → republish direct-only else RelayLost event Transport-->>Driver: relay_lost_rx fires Driver->>Driver: lose_relay_and_republish() end end Note over Driver: Lost → re-enter AcquiringPrompt To Fix All With AI
Reviews (1): Last reviewed commit: "chore: replace main with mick/always-mas..." | Re-trigger Greptile