-
-
Notifications
You must be signed in to change notification settings - Fork 107
fix: use pub_key for PeerId equality to fix NAT connection lookup #2163
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…n completion When a PUT operation completes with subscribe=true, a child subscription operation is spawned. Previously, the response waited for this subscription to complete before returning to the client, causing PUT timeouts. Sub-operations like subscriptions are "fire and forget" from the client's perspective - they want to know their PUT succeeded, not wait for the subscription to complete. This change returns the finalized state immediately regardless of pending sub-operations. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
PeerId equality was based on socket address, causing connection lookup failures when peers behind NAT report different addresses than what the gateway observes. Changed Hash/PartialEq to use pub_key, which is the stable identifier for a peer regardless of NAT. Also updated PeerId::random() and Arbitrary impl to generate unique keypairs, since distinct peers must have distinct pub_keys for proper equality semantics. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
When a NAT peer receives its externally-observed address from a gateway during the connect handshake, update the peer's own PeerId.addr field. Without this fix, NAT peers embed their local address (127.0.0.1) in operation messages. When remote peers receive these messages and try to send responses back, they attempt to connect to 127.0.0.1 which fails. This complements the recent PeerId equality fix (pub_key based) - that fix allows lookups to succeed regardless of address, but we still need the correct address for establishing new connections. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
When a peer receives a PutForward and is the final destination (e.g., has no ring location yet), try_to_broadcast was hitting the catch-all error case because no prior operation state existed. This fix adds a specific match arm to handle this scenario by sending SuccessfulPut back to the upstream peer. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
| } | ||
| } | ||
|
|
||
| thread_local! { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the reason this exists is because for many tests where we dont care about the peerid (e.g. no integration or e2e tests) generating a large amount of peers is dramatic (comptutationally expensive). For the cases we care about we use the explicit PeerId::random
Arbitrary is used for the most part for unit tests. Be careful with the impact of this change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I investigated this and found that the only problematic usage was in aof.rs tests, which generate 10k-100k log entries, each previously getting a unique PeerId. Those tests validate log serialization/deserialization—they don't actually need unique peer identities.
I've pushed a fix that reuses a single PeerId per test in the three affected functions (test_aof_read_write_complex, test_aof_complex_reconstruction, test_aof_sequential_ids_reconstruction). All other usages in the codebase are small counts (5-45 peers per test), where the overhead is negligible.
With this fix, the cache is no longer needed.
[AI-assisted - Claude]
These tests validate log I/O, not peer identity. Generating 10k-100k unique keypairs was unnecessary and would slow tests significantly after the PeerId::random() change to generate unique keys. [AI-assisted - Claude]
|
Closing this PR as superseded by the PR stack #2167 → #2169 → #2171 → #2172. That stack takes a more comprehensive approach to NAT routing:
This keeps identity (pub_key) and routing (observed address) as separate concerns rather than changing [AI-assisted - Claude] |
Summary
This PR improves PeerId semantics by basing equality/hashing on public key rather than socket address.
Status: Draft - This is a defensive correctness fix but does NOT resolve the PUT timeout issue under Docker NAT that motivated the investigation.
Changes
PeerIdHash/PartialEq to usepub_keyinstead ofaddrPEER_IDthread-local cache (eachrandom()call now generates unique keypairs)Ordimplementation with a comment clarifying it's only for ordering, not equalitytry_current()instead of creating nested runtime)Rationale
Even though this doesn't fix the PUT timeout:
Testing
Next Steps
The real PUT timeout issue is still being investigated. The failure occurs even with this fix applied, suggesting the root cause is elsewhere in the PUT operation flow.
🤖 Generated with Claude Code