-
Notifications
You must be signed in to change notification settings - Fork 1.4k
fix: Fix memory leak caused by asynchronous prefetch #14722
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from 5 commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -101,6 +101,17 @@ class DirectCoalescedLoad : public cache::CoalescedLoad { | |
| return size; | ||
| } | ||
|
|
||
| void cancel() override { | ||
| folly::SemiFuture<bool> waitFuture(false); | ||
| if (!loadOrFuture(&waitFuture)) { | ||
| waitFuture.wait(); | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. such wait can easily lead to deadlock in Prestissimo use case @kewang1024 @Yuhta There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can you share more details? Let me check if Gluten has the issue or not.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. |
||
| } | ||
| for (auto& request : requests_) { | ||
| pool_->freeNonContiguous(request.data); | ||
| } | ||
| CoalescedLoad::cancel(); | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Could you help confirm if it is better to call cancel at first?
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It is incorrect to call the cancel function first, because doing so will immediately set state_ to cancelled, and as a result, the loadOrFuture function will not perform any actual waiting. |
||
| } | ||
|
|
||
| private: | ||
| const std::shared_ptr<IoStatistics> ioStats_; | ||
| const std::shared_ptr<filesystems::File::IoStats> fsStats_; | ||
|
|
@@ -243,21 +254,11 @@ class DirectBufferedInput : public BufferedInput { | |
| bool prefetch, | ||
| const std::vector<int32_t>& groupEnds); | ||
|
|
||
| // Holds the reference on the memory pool for async load in case of early task | ||
| // terminate. | ||
| struct AsyncLoadHolder { | ||
| std::shared_ptr<cache::CoalescedLoad> load; | ||
| std::shared_ptr<memory::MemoryPool> pool; | ||
|
|
||
| ~AsyncLoadHolder() { | ||
| // Release the load reference before the memory pool reference. | ||
| // This is to make sure the memory pool is not destroyed before we free up | ||
| // the allocated buffers. | ||
| // This is to handle the case that the associated task has already | ||
| // destroyed before the async load is done. The async load holds | ||
| // the last reference to the memory pool in that case. | ||
| load.reset(); | ||
| pool.reset(); | ||
| } | ||
| }; | ||
|
|
||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The
cancel()function is invoked by~DirectBufferedInput(), which ultimately depends on the destructor of TableScan, as TableScan owns thestd::unique_ptr<connector::DataSource> dataSource_. This dependency relationship is illustrated in this discussion. As a result, TableScan must be destroyed before the memory manager, but the current approach in this PR does not explicitly guarantee that order.To ensure that the memory held by the load is always properly released, PR #8205 triggers close() through the TableScan::close() call chain to address the similar issue.
Would you like to share your opinions on the fixes? cc: @xiaoxmeng @FelixYBW Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The solution is either free the resources manually from top to bottom or free it in object destruction. Either way works.