Skip to content

Conversation

@sanity
Copy link
Collaborator

@sanity sanity commented Nov 29, 2025

Problem

Subscribe responses in NAT scenarios were being sent to wrong addresses. The gateway tried to send ReturnSub to 127.0.0.1:31337 (peer's self-reported address) instead of 172.26.0.102:31337 (actual NAT-translated address observed by gateway).

This caused "Timeout waiting for subscribe response" failures in the docker_nat test.

Root Cause

Two separate bugs were causing address routing failures:

  1. Error path bug (first commit): The subscribe_op_end function wasn't setting target_addr in OperationResult for error responses, causing them to be sent to the wrong address.

  2. Bridge event bug (second commit): The handle_bridge_msg function was discarding the target address from PeerId when creating ConnEvent::OutboundMessage. The event loop then extracted the target from msg.target().addr() - the peer's self-reported address - instead of using the transport address passed from P2pBridge::send().

This Solution

Commit 1: Fix error path in subscribe.rs

  • Set target_addr from upstream_addr for all error responses (BroadcastStateConflict, error_response)
  • This ensures error responses are routed to the correct NAT-translated address

Commit 2: Preserve target address through bridge event

  • Add new ConnEvent::OutboundMessageWithTarget variant with explicit target_addr: SocketAddr
  • Modify handle_bridge_msg to use this variant, extracting target.addr from the PeerId
  • Add handler that uses target_addr directly for connection lookup instead of msg.target().addr()

Testing

  • Docker NAT test now shows 100% connectivity
  • River message flow test progresses past the subscribe timeout:
    • riverctl room create succeeded in 1.15s
    • riverctl invite accept succeeded in 81.75ms

Fixes

Part of #2164 (peer identity restructuring)

[AI-assisted - Claude]

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@sanity sanity force-pushed the refactor/wire-protocol-cleanup-2164 branch from 324bd29 to ed93d91 Compare November 29, 2025 19:47
@sanity sanity force-pushed the fix/subscribe-nat-routing-2164 branch 2 times, most recently from bb74066 to eea52ed Compare November 29, 2025 21:23
@sanity sanity force-pushed the refactor/wire-protocol-cleanup-2164 branch from cfee166 to 945fdb3 Compare November 29, 2025 22:38
@sanity sanity force-pushed the fix/subscribe-nat-routing-2164 branch from 1c90c91 to 0a9183a Compare November 29, 2025 22:38
sanity added a commit that referenced this pull request Nov 29, 2025
ObservedAddress and ConnectResponse messages must route through the
upstream connection (source_addr) rather than directly to target
addresses, since relay peers may not have direct connections to joiners.

This mirrors the fix from PR #2171 that addressed the same NAT routing
issue for different message types.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
@sanity sanity force-pushed the refactor/wire-protocol-cleanup-2164 branch from 945fdb3 to 642a8d4 Compare November 29, 2025 23:33
@sanity sanity force-pushed the fix/subscribe-nat-routing-2164 branch from 0a9183a to 2906500 Compare November 29, 2025 23:34
@sanity sanity force-pushed the refactor/wire-protocol-cleanup-2164 branch from 642a8d4 to a2d4cfe Compare November 29, 2025 23:42
@sanity sanity force-pushed the fix/subscribe-nat-routing-2164 branch from 2906500 to 4ba0035 Compare November 29, 2025 23:43
@sanity sanity force-pushed the refactor/wire-protocol-cleanup-2164 branch from a2d4cfe to 3230675 Compare November 30, 2025 00:13
@sanity sanity force-pushed the fix/subscribe-nat-routing-2164 branch from 4ba0035 to f849c3a Compare November 30, 2025 00:14
@sanity sanity requested a review from Copilot November 30, 2025 04:48
Copilot finished reviewing on behalf of sanity November 30, 2025 04:52
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@sanity
Copy link
Collaborator Author

sanity commented Nov 30, 2025

@claude please do a code review of this PR. This is part 3 of a 4-PR stack - it fixes Subscribe operation routing to work correctly with NAT peers.

@claude
Copy link

claude bot commented Nov 30, 2025

Claude finished @sanity's task —— View job


Code Review: Subscribe Operation NAT Routing Fix

This PR successfully fixes NAT routing issues in the Subscribe operation by ensuring error responses are sent to the correct upstream address. The implementation is consistent with the wire protocol cleanup and follows the same patterns established in the Get operation.


✅ Strengths

1. Correct NAT Routing Pattern

  • The PR properly uses upstream_addr for all error responses in subscribe.rs
  • The new ConnEvent::OutboundMessageWithTarget variant correctly preserves the transport-level address
  • Consistent with the approach used in get.rs (lines 1402-1404)

2. Comprehensive Error Path Coverage
The fix covers all error scenarios in subscribe.rs:

  • Line 474-478: BroadcastStateConflict error response
  • Line 589-598: return_not_subbed() helper for all SeekNode error cases
  • Line 608, 644, 659, 663, 670, 741: All call sites properly use the helper

3. Bridge Event Handling
The handle_bridge_msg implementation (p2p_protoc.rs:2318-2331) correctly extracts target.addr from the PeerId, which contains the address from P2pBridge::send().

4. Logging Improvements

  • connect.rs changes use pub_key() instead of peer() to avoid panics with Unknown addresses
  • get.rs adds proper null checks before calling sender.peer() (lines 858-865)

🔍 Issues & Suggestions

Critical Issue: Inconsistent Return Type in Error Helper

Location: crates/core/src/operations/subscribe.rs:589-598

The return_not_subbed closure is changed from returning OperationResult to returning Result<OperationResult, OpError>, but there's an inconsistency:

let return_not_subbed = || -> Result<OperationResult, OpError> {
    let return_msg = SubscribeMsg::ReturnSub { ... };
    build_op_result(*id, None, Some(return_msg), upstream_addr)
};

Problem: All call sites now use return return_not_subbed(); instead of return Ok(return_not_subbed());. This works because build_op_result returns Result<OperationResult, OpError>, but it's confusing because:

  1. The closure signature says it returns Result<OperationResult, OpError>
  2. But build_op_result already returns that type
  3. So the closure is actually returning Result<Result<OperationResult, OpError>, OpError> - which Rust auto-flattens

Recommendation: Change the closure to match the actual return type:

let return_not_subbed = || -> Result<OperationResult, OpError> {
    build_op_result(*id, None, Some(return_msg), upstream_addr)
};

The call sites are already correct: return return_not_subbed();

Minor: Missing Error Context

Location: crates/core/src/node/network_bridge/p2p_protoc.rs:640-648

When no connection is found for the explicit target address, the code calls ctx.bridge.op_manager.completed(*msg.id()) to clean up. Consider whether this should also trigger a retry mechanism or notify the operation layer that routing failed, rather than silently completing.

Current:

None => {
    tracing::error!(...);
    ctx.bridge.op_manager.completed(*msg.id());
}

Suggestion: Consider whether operations expecting a response should be notified of routing failure, or if silent completion is the correct behavior here.

Minor: Potential PeerId Construction Issue

Location: crates/core/src/operations/get.rs:905-907, 936-938

The code now constructs PeerId::new(addr, pub_key) only when socket_addr() is known. This is safer, but there's a potential inconsistency:

if let Some(addr) = sender.socket_addr() {
    tried_peers.insert(PeerId::new(addr, sender.pub_key().clone()));
}

Question: If socket_addr() returns None, should we:

  1. Skip adding to tried_peers (current behavior) - peer might be retried
  2. Use a different mechanism to track peers without known addresses
  3. Treat this as an error condition

The current approach seems reasonable but should be documented in the code.


🎯 Consistency Check

Across Operations (Subscribe, Get, Connect):

Subscribe (this PR):

  • Uses upstream_addr for ReturnSub responses
  • Uses build_op_result helper consistently
  • Proper null handling for Unknown addresses

Get (baseline):

  • Uses upstream_addr for ReturnGet responses (line 1402)
  • Has build_op_result helper with same pattern (line 1389-1418)
  • Added null checks for sender lookup (lines 858-865)

Connect (this PR):

  • Logging changes use pub_key() instead of peer()
  • Removes unnecessary if let Some(upstream) checks (lines 833-835, 877-882)
  • Consistent with connection-based routing model

Pattern is consistent across all three operations.


📋 Testing Recommendations

While I cannot run tests, I recommend verifying:

  1. NAT Scenario Test: Confirm docker_nat test shows 100% connectivity and no "Timeout waiting for subscribe response" errors
  2. Error Path Coverage: Verify all error scenarios (HTL=0, max subscribers, no caching peers) correctly route responses
  3. Address Handling: Test that peers with PeerAddr::Unknown are handled gracefully throughout
  4. Connection Lookup: Verify that OutboundMessageWithTarget can find connections in all scenarios where a response is expected

✨ Summary

Overall Assessment: This is a well-structured fix that correctly addresses the NAT routing issues identified in #2164. The implementation is consistent across operations and follows established patterns.

Recommendation:Approve with minor suggestions

The one critical issue (closure return type confusion) doesn't affect correctness due to Rust's type flattening, but cleaning it up would improve code clarity. The other suggestions are minor documentation/design considerations.

Key Achievement: This PR completes the Subscribe operation's migration to connection-based routing, enabling proper NAT traversal for subscription responses - a critical capability for peer-to-peer networks.


sanity added a commit that referenced this pull request Nov 30, 2025
ObservedAddress and ConnectResponse messages must route through the
upstream connection (source_addr) rather than directly to target
addresses, since relay peers may not have direct connections to joiners.

This mirrors the fix from PR #2171 that addressed the same NAT routing
issue for different message types.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
@sanity sanity force-pushed the refactor/wire-protocol-cleanup-2164 branch from 6a99400 to 3dd47e0 Compare November 30, 2025 22:19
sanity added a commit that referenced this pull request Nov 30, 2025
ObservedAddress and ConnectResponse messages must route through the
upstream connection (source_addr) rather than directly to target
addresses, since relay peers may not have direct connections to joiners.

This mirrors the fix from PR #2171 that addressed the same NAT routing
issue for different message types.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
sanity added a commit that referenced this pull request Nov 30, 2025
ObservedAddress and ConnectResponse messages must route through the
upstream connection (source_addr) rather than directly to target
addresses, since relay peers may not have direct connections to joiners.

This mirrors the fix from PR #2171 that addressed the same NAT routing
issue for different message types.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
sanity and others added 14 commits November 30, 2025 20:35
… address

Key changes:
- Replace `peer: PeerId` with `pub_key: TransportPublicKey` + `peer_addr: PeerAddr`
- Add PeerAddr enum with Unknown/Known variants for explicit address state
- Add accessor methods: pub_key(), addr(), socket_addr(), peer()
- Add constructors: new(), with_unknown_addr(), with_location()
- Implement Ord/PartialOrd based on socket address

This separates cryptographic identity (pub_key) from network address (peer_addr),
enabling proper handling of peers behind NAT who don't know their external address.

Part of #2164 peer identity restructuring.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
This commit applies all wire protocol cleanup changes from PR #2169
on top of the rebased PR #2167 base:

- Remove sender field from GetMsg, PutMsg, SubscribeMsg, UpdateMsg, ConnectMsg
- Use upstream_addr for routing responses instead of embedded sender fields
- Delete transient_manager.rs (no longer needed)
- Update freenet-macros code generation for new message structure

The routing logic now derives the response target from the connection's
observed address (upstream_addr) rather than trusting sender fields in
messages. This is more reliable for NAT traversal scenarios.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Phase 1.3 of peer identity restructuring (issue #2164).

Uses rust-analyzer SSR to convert:
- .peer.pub_key -> .pub_key()
- .peer.addr -> .addr() (for read accesses)

Assignment operations (.peer.addr = x) are kept as direct field access
for now since the addr() method returns a copy, not a reference.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
… address

Key changes:
- Replace `peer: PeerId` with `pub_key: TransportPublicKey` + `peer_addr: PeerAddr`
- Add PeerAddr enum with Unknown/Known variants for explicit address state
- Add accessor methods: pub_key(), addr(), socket_addr(), peer()
- Add constructors: new(), with_unknown_addr(), with_location()
- Implement Ord/PartialOrd based on socket address

This separates cryptographic identity (pub_key) from network address (peer_addr),
enabling proper handling of peers behind NAT who don't know their external address.

Part of #2164 peer identity restructuring.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Addresses Nacho's architectural feedback to avoid raw SocketAddr in
protocol layer. Uses ObservedAddr newtype to wrap transport-layer
addresses, making the address semantics explicit at the type level.

Changes:
- Add ObservedAddr newtype in transport/mod.rs
- Update Operation trait to use Option<ObservedAddr> for source_addr
- Update all operation implementations (connect, get, put, subscribe, update)
- Update node/mod.rs and p2p_protoc.rs to use ObservedAddr
- Wrap incoming source_addr in ObservedAddr::new() at transport boundary
- Convert back to SocketAddr at network send boundaries

The conversion to raw SocketAddr happens at transport boundaries:
- build_op_result() converts for target_addr
- network_bridge.send() calls use .socket_addr()

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
- Use peer() method instead of peer field access
- Use PeerKeyLocation::with_location() constructor
- Simplify get_peer_by_addr to only check connections_by_location

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
This commit applies all wire protocol cleanup changes from PR #2169
on top of the rebased PR #2167 base:

- Remove sender field from GetMsg, PutMsg, SubscribeMsg, UpdateMsg, ConnectMsg
- Use upstream_addr for routing responses instead of embedded sender fields
- Delete transient_manager.rs (no longer needed)
- Update freenet-macros code generation for new message structure

The routing logic now derives the response target from the connection's
observed address (upstream_addr) rather than trusting sender fields in
messages. This is more reliable for NAT traversal scenarios.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Apply subscribe NAT routing fixes on top of rebased wire protocol cleanup:
- Use upstream_addr for routing subscribe responses
- Fix pub_key() usage in logging for Unknown addresses
- Preserve target address through bridge event

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
When accepting a connect request, route the response and observed address
messages through the upstream peer (where the request came from) rather than
trying to send directly to the joiner. The accepting peer may not have a
direct connection to the joiner in relay scenarios.

This mirrors how response forwarding already works at line 989-994.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
When processing ReturnGet with empty value, the code previously assumed
sender_from_addr was always Some and that addresses were always Known.
This caused panics in CI:
1. expect() on None sender_from_addr
2. .peer() called on PeerKeyLocation with Unknown address

Changes:
- Return error gracefully when sender lookup fails
- Use pub_key() instead of peer() for logging
- Guard tried_peers insertions with socket_addr() checks

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
- Use addr() instead of peer() for live_tx_tracker methods (expect SocketAddr)
- Remove duplicate get_peer_location_by_addr function

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
@sanity sanity force-pushed the fix/subscribe-nat-routing-2164 branch from f9e9ce5 to d1bf045 Compare December 1, 2025 02:35
@sanity sanity requested a review from Copilot December 1, 2025 05:17
Copilot finished reviewing on behalf of sanity December 1, 2025 05:21
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

}
ConnEvent::OutboundMessageWithTarget { target_addr, msg } => {
// This variant uses an explicit target address from OperationResult.target_addr,
// which is critical for NAT scenarios where the address in the message
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when would that happen?

@sanity
Copy link
Collaborator Author

sanity commented Dec 1, 2025

@iduartgomez Re: "when would that happen?" on OutboundMessageWithTarget

This variant is used in subscribe operations for NAT scenarios. Specifically:

When it happens:

  • Peer behind NAT subscribes via gateway
  • Peer's self-reported address: 127.0.0.1:31337 (in message)
  • Actual transport address observed by gateway: 172.26.0.102:31337 (from UDP socket)

Why we need it:
Without this variant, the gateway extracts the target from msg.target().addr() which gives the peer's self-reported private address. The ReturnSub response goes to 127.0.0.1:31337 (unreachable) instead of 172.26.0.102:31337 (correct NAT address).

How it's used:
In subscribe error paths, OperationResult.target_addr is set from upstream_addr (the observed transport address). P2pBridge::send() receives this and creates OutboundMessageWithTarget to preserve it through the event loop.

This fixes the "Timeout waiting for subscribe response" failures in docker_nat tests.

[AI-assisted - Claude]

@iduartgomez
Copy link
Collaborator

Without this variant, the gateway extracts the target from msg.target().addr() which gives the peer's self-reported private address.

Well that's the problem, that shouldn't happen and be mixed up. Why the private socket addr gets to other peer in the first place is what should be addressed.

But I think we can get to the gist of it in the next PR in the stack.

OutboundMessage(NetMessage),
/// Outbound message with explicit target address from OperationResult.target_addr.
/// Used when the target address differs from what's in the message (NAT scenarios).
OutboundMessageWithTarget {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we were doing things correctly this variant wouldn't be needed cause we wouldn't be ever leaking out the internal address to other peers.

Copy link
Collaborator

@iduartgomez iduartgomez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved with reservations, the enxt one in the stack is where we need to address the inconsisntencies.

I am almost sure if we properly encode the different addresses the compiler is not gonna allow us leaking them out.

sanity added a commit that referenced this pull request Dec 1, 2025
PR #2172 added detailed tracing that calls subscriber.peer() BEFORE
filling in the address from source_addr. This causes a panic when the
subscriber has PeerAddr::Unknown (NAT peers).

The correct order (from PR #2171) is:
1. Check if address is unknown
2. Fill in address from source_addr
3. THEN call .peer() for tracing

This regression was introduced when merging PR #2172's tracing
improvements without preserving the correct order of operations.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
@sanity sanity merged commit 814ad70 into refactor/wire-protocol-cleanup-2164 Dec 2, 2025
19 checks passed
@sanity sanity deleted the fix/subscribe-nat-routing-2164 branch December 2, 2025 00:11
sanity added a commit that referenced this pull request Dec 2, 2025
Consolidates approved changes from PRs #2169 and #2171:

- Remove ObservedAddr in favor of std::net::SocketAddr
- Add PeerAddr enum to explicitly represent known/unknown addresses
- Update PeerKeyLocation to use PeerAddr for NAT scenarios
- Clean up wire protocol types for address handling
- Remove unused transient_manager module

This is part of the larger #2164 effort to simplify peer identity.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants