transport: Fix double lock and state overwrite on disconnected peers by lexnv · Pull Request #179 · paritytech/litep2p

lexnv · 2024-07-17T17:02:36Z

This PR fixes an issue where the secondary connection is rejected because it has a different connection ID.

The issue comes from fn dial_address which can overwrite the state of a disconnected peer with in-flight dial requests.

High level repro case

T0: dial_address 1
T1: dial_address 2 which overwrites address 1
T2: dialed address 1 responds with a connection
T3: connection is rejected

While at it, this PR also:

removes the need to double lock the peer state (once to read and once to write)
enforces connection ID checks on dial records in on_dial_failure
adjust the repro case from transport: Reject secondary connections with different connection IDs #176 for the fix

Detailed repro case

T0: Dial address A connection ID 1 -> State : Dialing { in-flight-1 }
T1: Connection is established by remote address C connection ID 3 -> State : Connected { in-flight-dial 1 }
T2: Closed connection established in T2 -> State : Disconnected { in-flight-dial 1}

// This is where the state was wrongfully overwritten

T3: Dial address B connection ID 2 -> State: Dialing { in-flight-2 }
T4: Connection is established by remote address C connection ID 3 -> State : Connected { in-flight-dial 2 }

// The issue was detected here

T5: Dial from T0 receives a response however it expects in-flight-dial 1 and the state has in-flight-dial 2

Closes: #172

Signed-off-by: Alexandru Vasile <[email protected]>

lexnv · 2024-07-18T08:30:44Z

Believe there might be another path that triggers #172, will see after some more triaging

dmitry-markin · 2024-07-31T14:40:09Z

src/transport/manager/mod.rs

        ) {
            PeerState::Dialing { ref mut record } => {
                debug_assert_eq!(record.connection_id(), &Some(connection_id));
+                if record.connection_id() != &Some(connection_id) {


Isn't this triggered if we dial two or more addresses simultaneously? Basically, the case when we know more than one address of a peer and employ dialing parallelism?

Thinking out loud, I remember the state machine being similar to:

fn dial_addr (single_address) -> PeerState::Dialing { record: Some( single_address) } fn dial () -> this is where we employ the parallel dial. All dial records share the same connection ID -> PeerState::Opening { records, connection_id, ...transports } fn dial() -> multiple parallel dials (max 8) Disconnected ---------------------------------> Opening on_open_sucesss: negotiate + cancel other dials Opening -----> Dialing { record: first successful record } if other dial attempts succeed before canceling, then we return InvalidState (and just log it) on_open_failure: lose track of the dial Opening ------> Opening

That points me towards adding a nice state machine and documenting these states.
It gets complicated by just looking at the code and we can easily forget it.

A few other things that I could address here:

remove debug_assert from on_open_failure on_open_sucess; since we can race with the transports

a dial failure can come later (same as above) and on_dial_failure we have a todo if we are in Opening

You are right, looked into the sources and now understand I was missing we replace the address record with the first succeeded record.

src/transport/manager/mod.rs

dmitry-markin · 2024-07-31T14:53:07Z

src/transport/manager/mod.rs

+                    // However, the record score cannot be updated in the future.
+                    // TODO: context.addresses.insert(record.clone());


May be this is what you mean already by this TODO, but for me it looks like we have to store multiple records at least in PeerState::Dialing state to correctly handle concurrent dials (do we dial 8 addresses in parallel?) and track all connection_ids. This way the debug_assert!s in the on_dial_failure() shouldn't be triggered.

May be it's slightly more complicated than this, as we have to track what dials succeeded and drop the extra connections when there is more than two of them, so more bookkeeping might be needed.

// However, the record score cannot be updated in the future.

Can you also elaborate on this?

I think the Opening state should handle this, but let me double check

Yeaa, the comment is wrong it should be "The record score can be updated in the future" :D
Basically, we can keep track of this potential address in the future, instead of discarding it. So if we ever try to fn dial() 8 addresses, this address can be one of the dialed ones

src/transport/manager/mod.rs

dmitry-markin · 2024-07-31T14:56:13Z

(Actually wanted just to comment, but misclicked Approve. Don't merge the PR yet, please.)

dmitry-markin

Looks good! (This time the approval was intentional.)

dmitry-markin · 2024-08-01T10:01:20Z

src/transport/manager/mod.rs

+    }
+
+    #[tokio::test]
+    async fn reject_unknown_secondary_connections_with_different_connection_ids() {


How do we check for the unknown connection id being rejected below? Or we don't have the issue of getting the unknown connection id after this PR?

Indeed the edge-case that goes from those states is fixed now, I've added another test to check we reject connections from a theoretical state.

We might still have some other gaps, the state machine is quite complex

Signed-off-by: Alexandru Vasile <[email protected]>

…ager-fix Signed-off-by: Alexandru Vasile <[email protected]>

Signed-off-by: Alexandru Vasile <[email protected]>

lexnv · 2024-08-02T15:58:05Z

Few followups from my side:

document the state machine
fix the Opening todo! panic on dial failure

## [0.7.0] - 2024-09-05 This release introduces several new features, improvements, and fixes to the litep2p library. Key updates include enhanced error handling, configurable connection limits, and a new API for managing public addresses. ### [Exposing Public Addresses API](#212) A new `PublicAddresses` API has been added, enabling users to manage the node's public addresses. This API allows developers to add, remove, and retrieve public addresses, which are shared with peers through the Identify protocol. ```rust // Public addresses are accessible from the main litep2p instance. let public_addresses = litep2p.public_addresses(); // Add a new public address to the node. if let Err(err) = public_addresses.add_address(multiaddr) { eprintln!("Failed to add public address: {:?}", err); } // Remove a public address from the node. public_addresses.remove_address(&multiaddr); // Retrieve all public addresses of the node. for address in public_addresses.get_addresses() { println!("Public address: {}", address); } ``` **Breaking Change**: The Identify protocol no longer includes public addresses in its configuration. Instead, use the new `PublicAddresses` API. Migration Guide: ```rust // Before: let (identify_config, identify_event_stream) = IdentifyConfig::new( "/substrate/1.0".to_string(), Some(user_agent), config.public_addresses, ); // After: let (identify_config, identify_event_stream) = IdentifyConfig::new("/substrate/1.0".to_string(), Some(user_agent)); // Public addresses must now be added using the `PublicAddresses` API: for address in config.public_addresses { if let Err(err) = public_addresses.add_address(address) { eprintln!("Failed to add public address: {:?}", err); } } ``` ### [Dial Error and List Dial Failures Event](#206) The `DialFailure` event has been enhanced with a new `DialError` enum for more precise error reporting when a dial attempt fails. Additionally, a `ListDialFailures` event has been introduced, listing all dialed addresses and their corresponding errors when multiple addresses are involved. Other litep2p errors, such as `ParseError`, `AddressError`, and `NegotiationError`, have been refactored for improved error propagation. ### [Immediate Dial Error and Request-Response Rejection Reasons](#227) This new feature paves the way for better error handling in the `litep2p` library and moves away from the overarching `litep2p::error::Error` enum. The newly added `ImmediateDialError` enum captures errors occurring before a dial request is sent (e.g., missing peer IDs). It also enhances the `RejectReason` enum for request-response protocols, offering more detailed rejection reasons. ```rust match error { RequestResponseError::Rejected(reason) => { match reason { RejectReason::ConnectionClosed => "connection-closed", RejectReason::DialFailed(Some(ImmediateDialError::AlreadyConnected)) => "already-connected", _ => "other", } } _ => "other", } ``` ### [Connection Limits](#185) Developers can now set limits on the number of inbound and outbound connections to manage resources and optimize performance. ```rust // Configure connection limits for inbound and outbound established connections. let litep2p_config = Config::default() .with_connection_limits(ConnectionLimitsConfig::default() .max_incoming_connections(Some(3)) .max_outgoing_connections(Some(2)) ); ``` ### [Feature Flags for Optional Transports](#192) The library now supports feature flags to selectively enable or disable transport protocols. By default, only the `TCP` transport is enabled. Optional transports include: - `quic` - Enables QUIC transport. - `websocket` - Enables WebSocket transport. - `webrtc` - Enables WebRTC transport. ### [Configurable Keep-Alive Timeout](#155) The keep-alive timeout for connections is now configurable, providing more control over connection lifecycles. ```rust // Set keep alive timeout for connections. let litep2p_config = Config::default() .with_keep_alive_timeout(Duration::from_secs(30)); ``` Thanks for contributing to this @[Ma233](https://github.com/Ma233)! ### Added - errors: Introduce immediate dial error and request-response rejection reasons ([#227](#227)) - Expose API for `PublicAddresses` ([#212](#212)) - transport: Implement `TransportService::local_peer_id()` ([#224](#224)) - find_node: Optimize parallelism factor for slow to respond peers ([#220](#220)) - kad: Handle `ADD_PROVIDER` & `GET_PROVIDERS` network requests ([#213](#213)) - errors: Add `DialError` error and `ListDialFailures` event for better error reporting ([#206](#206)) - kad: Add support for provider records to `MemoryStore` ([#200](#200)) - transport: Add accept_pending/reject_pending for inbound connections and introduce inbound limits ([#194](#194)) - transport/manager: Add connection limits for inbound and outbound established connections ([#185](#185)) - kad: Add query id to log messages ([#174](#174)) ### Changed - transport: Replace trust_dns_resolver with hickory_resolver ([#223](#223)) - crypto/noise: Generate keypair only for Curve25519 ([#214](#214)) - transport: Allow manual setting of keep-alive timeout ([#155](#155)) - kad: Update connection status of an existing bucket entry ([#181](#181)) - Make transports optional ([#192](#192)) ### Fixed - kad: Fix substream opening and dialing race ([#222](#222)) - query-executor: Save the task waker on empty futures ([#219](#219)) - substream: Use write_all instead of manually writing bytes ([#217](#217)) - minor: fix tests without `websocket` feature ([#215](#215)) - Fix TCP, WebSocket, QUIC leaking connection IDs in `reject()` ([#198](#198)) - transport: Fix double lock and state overwrite on disconnected peers ([#179](#179)) - kad: Do not update memory store on incoming `GetRecordSuccess` ([#190](#190)) - transport: Reject secondary connections with different connection IDs ([#176](#176)) ### Testing Done - pulled the latest changes into Substrate - performed warp sync in kusama cc @paritytech/networking --------- Signed-off-by: Alexandru Vasile <[email protected]>

lexnv added 8 commits July 17, 2024 12:31

transport: Fix double lock and state overwrite on disconnected peers

c6f4191

Signed-off-by: Alexandru Vasile <[email protected]>

transport: Warn missmatching dial failures wrt connection IDs

c54c0c8

Signed-off-by: Alexandru Vasile <[email protected]>

transport: Update address score accordingly on on_dial_failure

3461347

Signed-off-by: Alexandru Vasile <[email protected]>

transport: Track dial addresses

f902a78

Signed-off-by: Alexandru Vasile <[email protected]>

transport/tests: Ensure multiple dials do not overwrite the current dial

0c4aa51

Signed-off-by: Alexandru Vasile <[email protected]>

transport: Saturating add

9a804a7

Signed-off-by: Alexandru Vasile <[email protected]>

transport: Extend todo with more information for separate PR

68bb679

Signed-off-by: Alexandru Vasile <[email protected]>

transport/tests: Backport test from PR #176

daa147b

Signed-off-by: Alexandru Vasile <[email protected]>

lexnv mentioned this pull request Jul 17, 2024

transport-manager: Unknown connection opened #172

Closed

lexnv self-assigned this Jul 17, 2024

lexnv mentioned this pull request Jul 17, 2024

transport: Reject secondary connections with different connection IDs #176

Merged

lexnv mentioned this pull request Jul 29, 2024

transport/manager: Add connection limits for inbound and outbound established connections #185

Merged

dmitry-markin approved these changes Jul 31, 2024

View reviewed changes

dmitry-markin self-requested a review July 31, 2024 14:55

dmitry-markin approved these changes Aug 1, 2024

View reviewed changes

lexnv added 4 commits August 2, 2024 18:20

transport/manager: Update comment with link to issue

062ed6d

Signed-off-by: Alexandru Vasile <[email protected]>

Merge remote-tracking branch 'origin/master' into lexnv/transport-man…

07d516b

…ager-fix Signed-off-by: Alexandru Vasile <[email protected]>

transport/tests: Reject secondary unknown from theoretical state

e659baf

Signed-off-by: Alexandru Vasile <[email protected]>

transport/manager: Remove debug_asserts

adc4657

Signed-off-by: Alexandru Vasile <[email protected]>

lexnv merged commit 7befb3a into master Aug 2, 2024

lexnv deleted the lexnv/transport-manager-fix branch August 2, 2024 15:58

lexnv mentioned this pull request Sep 5, 2024

chore: Release litep2p 0.7.0 #240

Merged

lexnv mentioned this pull request Oct 16, 2024

[litep2p-milestone]: Improve litep2p robustness #140

Closed

68 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

transport: Fix double lock and state overwrite on disconnected peers#179

transport: Fix double lock and state overwrite on disconnected peers#179
lexnv merged 12 commits intomasterfrom
lexnv/transport-manager-fix

lexnv commented Jul 17, 2024

Uh oh!

lexnv commented Jul 18, 2024

Uh oh!

dmitry-markin Jul 31, 2024

Uh oh!

lexnv Jul 31, 2024

Uh oh!

dmitry-markin Aug 1, 2024

Uh oh!

Uh oh!

dmitry-markin Jul 31, 2024

Uh oh!

lexnv Jul 31, 2024

Uh oh!

Uh oh!

dmitry-markin commented Jul 31, 2024

Uh oh!

dmitry-markin left a comment •

edited

Loading

Uh oh!

dmitry-markin Aug 1, 2024

Uh oh!

lexnv Aug 2, 2024

Uh oh!

lexnv commented Aug 2, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		// However, the record score cannot be updated in the future.
		// TODO: context.addresses.insert(record.clone());

Conversation

lexnv commented Jul 17, 2024

High level repro case

Detailed repro case

Uh oh!

lexnv commented Jul 18, 2024

Uh oh!

dmitry-markin Jul 31, 2024

Choose a reason for hiding this comment

Uh oh!

lexnv Jul 31, 2024

Choose a reason for hiding this comment

Uh oh!

dmitry-markin Aug 1, 2024

Choose a reason for hiding this comment

Uh oh!

Uh oh!

dmitry-markin Jul 31, 2024

Choose a reason for hiding this comment

Uh oh!

lexnv Jul 31, 2024

Choose a reason for hiding this comment

Uh oh!

Uh oh!

dmitry-markin commented Jul 31, 2024

Uh oh!

dmitry-markin left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dmitry-markin Aug 1, 2024

Choose a reason for hiding this comment

Uh oh!

lexnv Aug 2, 2024

Choose a reason for hiding this comment

Uh oh!

lexnv commented Aug 2, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

dmitry-markin left a comment •

edited

Loading