Conversation
yamux `0.11.1` and `0.12.1` contains fixes for how its GC works and while upgrading to `0.11.1` would be trivial, the fix was not enough not enough to reduce memory leakage. `0.12.1`, on the other hand, contians significant API breakage and likely requires a rewrite of TCP/WebSocket connection handlers as `ControlledConnection` and `Control` were removed In the intersest of saving time, switch to `0.11.1`, backport the GC fix from `0.12.1` to `0.11.1`, and move source code of yamux to litep2p. This is by no means an ideal solution but considering that fixing the memory leaks are of highest importance and rewriting the connection handlers likely requires significant rewrite and extensive testing, it's easiest to just use yamux directly, like was done with multistream-select.
Rewrite connection handler to handle backpressure better by not blocking when sending notifications and instead allowing the connection handler to process inbound notifications while the outbound substream is blocked
Use the same size as in Polkadot SDK
To follow how Polkadot SDK's notification protocol implementation, if the outbound channel is clogged, close the connection.
Wait 10 seconds for reads/writes to complete after which the operation is canceled. Keeping the substream open without any time limit can keep the connection erroneously open because the substream holds a permit which keeps the connection open, preventing it from closing and releasing consumed memory.
1108ce0 to
c7082d8
Compare
dmitry-markin
approved these changes
Mar 11, 2024
Co-authored-by: Dmitry Markin <[email protected]>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
The big memory leak was related to outbound backpressure. Notification connection handler would block on
Substream::send_framed()and accumulated a lot of memory in the notification channel which apparently didn't get released appropriately, possibly due to connection handler not getting closed. The connection handler is rewritten in a way where it processes both inbound and outbound notifications without blocking on either one of them so if the node is trying to send a large notification/many notifications, it won't block the entire connection event loop and instead gradually writes the notification(s) to the substream. Size of the channel that's used to send notifications synchronously is adjusted to match the one in Polkadot SDK.NotificationHandle::send_sync_notification()is modified to match the implementation currently found in Polkadot SDK whereby it closes the connection if the channel is clogged. I don't think this is a particularly good idea in the first place and once the protocols in Polkadot SDK have been written in such a way that they can deal with backpressure, e.g., by prioritizing certain notifications over others, this behavior can (and probably should) be removed.Third change is the largest and that is to (temporarily) move yamux into litep2p. The flamegraphs indicated that yamux was consuming a lot of memory and while bumping it to
0.11.1, where the patch released fixed an issue with GC, helped, it was still leaking memory. The follow-up for the GC was fixed in0.12.1but0.12.0introduced significant API breakage, probably requiring a rewrite of the TCP and WebSocket connection handlers, which I'm not interested in doing at this time. I backported the memory leak fix from0.12.1to0.11.1and transferred the source code of yamux to litep2p and it improved things to a point where litep2p doesn't leak memory anymore. Moving yamux into litep2p is not optimal but considering that0.12.0removed the APIs that litep2p is relying on, thus requiring non-trivial rewrites of the connection handlers, and that litep2p is leaking too much memory on0.11.1, it seems like the best approach for now. Yamux source code doesn't need a review and can be skipped.Lastly, add timeouts for IPFS Identify, IPFS Ping, and request-response protocols. It may be possible for the read/write to block indefinitely which could cause the future to stay active which in turn would keep the connection open as each substream holds a permit, preventing the connection from getting closed. Now that each substream operation has a timeout, the connection won't be kept open erroneously, allowing resources to be released.
This is big PR but I've run a burn-in for these changes in Versi so it should be safe to merge this.