Skip to content

Fix memory leaks#55

Merged
altonen merged 8 commits intomasterfrom
altonen-fix-memory-leaks
Mar 11, 2024
Merged

Fix memory leaks#55
altonen merged 8 commits intomasterfrom
altonen-fix-memory-leaks

Conversation

@altonen
Copy link
Copy Markdown
Contributor

@altonen altonen commented Mar 10, 2024

The big memory leak was related to outbound backpressure. Notification connection handler would block on Substream::send_framed() and accumulated a lot of memory in the notification channel which apparently didn't get released appropriately, possibly due to connection handler not getting closed. The connection handler is rewritten in a way where it processes both inbound and outbound notifications without blocking on either one of them so if the node is trying to send a large notification/many notifications, it won't block the entire connection event loop and instead gradually writes the notification(s) to the substream. Size of the channel that's used to send notifications synchronously is adjusted to match the one in Polkadot SDK.

NotificationHandle::send_sync_notification() is modified to match the implementation currently found in Polkadot SDK whereby it closes the connection if the channel is clogged. I don't think this is a particularly good idea in the first place and once the protocols in Polkadot SDK have been written in such a way that they can deal with backpressure, e.g., by prioritizing certain notifications over others, this behavior can (and probably should) be removed.

Third change is the largest and that is to (temporarily) move yamux into litep2p. The flamegraphs indicated that yamux was consuming a lot of memory and while bumping it to 0.11.1, where the patch released fixed an issue with GC, helped, it was still leaking memory. The follow-up for the GC was fixed in 0.12.1 but 0.12.0 introduced significant API breakage, probably requiring a rewrite of the TCP and WebSocket connection handlers, which I'm not interested in doing at this time. I backported the memory leak fix from 0.12.1 to 0.11.1 and transferred the source code of yamux to litep2p and it improved things to a point where litep2p doesn't leak memory anymore. Moving yamux into litep2p is not optimal but considering that 0.12.0 removed the APIs that litep2p is relying on, thus requiring non-trivial rewrites of the connection handlers, and that litep2p is leaking too much memory on 0.11.1, it seems like the best approach for now. Yamux source code doesn't need a review and can be skipped.

Lastly, add timeouts for IPFS Identify, IPFS Ping, and request-response protocols. It may be possible for the read/write to block indefinitely which could cause the future to stay active which in turn would keep the connection open as each substream holds a permit, preventing the connection from getting closed. Now that each substream operation has a timeout, the connection won't be kept open erroneously, allowing resources to be released.

This is big PR but I've run a burn-in for these changes in Versi so it should be safe to merge this.

altonen added 7 commits March 9, 2024 13:12
yamux `0.11.1` and `0.12.1` contains fixes for how its GC works and
while upgrading to `0.11.1` would be trivial, the fix was not enough not
enough to reduce memory leakage. `0.12.1`, on the other hand, contians
significant API breakage and likely requires a rewrite of TCP/WebSocket
connection handlers as `ControlledConnection` and `Control` were removed

In the intersest of saving time, switch to `0.11.1`, backport the GC fix
from `0.12.1` to `0.11.1`, and move source code of yamux to litep2p.

This is by no means an ideal solution but considering that fixing the
memory leaks are of highest importance and rewriting the connection
handlers likely requires significant rewrite and extensive testing,
it's easiest to just use yamux directly, like was done with
multistream-select.
Rewrite connection handler to handle backpressure better by not blocking
when sending notifications and instead allowing the connection handler
to process inbound notifications while the outbound substream is blocked
Use the same size as in Polkadot SDK
To follow how Polkadot SDK's notification protocol implementation,
if the outbound channel is clogged, close the connection.
Wait 10 seconds for reads/writes to complete after which the operation
is canceled. Keeping the substream open without any time limit can keep
the connection erroneously open because the substream holds a permit
which keeps the connection open, preventing it from closing and
releasing consumed memory.
@altonen altonen requested a review from dmitry-markin March 10, 2024 10:40
@altonen altonen force-pushed the altonen-fix-memory-leaks branch from 1108ce0 to c7082d8 Compare March 10, 2024 10:50
@altonen altonen merged commit cc043b6 into master Mar 11, 2024
@altonen altonen deleted the altonen-fix-memory-leaks branch March 11, 2024 16:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants