
Conversation

@j7nhai j7nhai commented Sep 4, 2025

In Gluten, if the folly executor exits unexpectedly, the memory allocated from the pool cannot be reclaimed in time when Gluten's memory manager is destroyed. As a result, when the memory manager exits, it detects unreleased memory in the pool and throws an exception.

Simply increasing the pool's reference count does not solve the leak; we need to properly release the memory allocated by the asynchronous threads. Since the pool is only used for allocating request buffers, this change frees that memory during the cancellation phase of coalesced loads.
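
For context, the sketch below pieces together the cancellation path this PR adds, using the member names (requests_, pool_, state(), loadOrFuture()) that appear in the diff hunks quoted later in this thread; the surrounding class details are assumed rather than copied from the actual change.

void cancel() override {
  // If an asynchronous load is still running on the folly executor, wait for it
  // to finish so no thread is still touching the request buffers.
  folly::SemiFuture<bool> waitFuture(false);
  if (state() == State::kLoading && !loadOrFuture(&waitFuture)) {
    waitFuture.wait();
  }
  // Free the request buffers allocated from the memory pool so the pool holds no
  // outstanding allocations when the memory manager is destroyed.
  for (auto& request : requests_) {
    pool_->freeNonContiguous(request.data);
  }
  // Finally, mark the coalesced load as cancelled.
  CoalescedLoad::cancel();
}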

@j7nhai j7nhai requested a review from majetideepak as a code owner September 4, 2025 12:20

j7nhai commented Sep 4, 2025

Please review this PR: @xiaoxmeng @tanjialiang @fzhedu @zhztheplayer @majetideepak

@j7nhai j7nhai changed the title from "Fix memory leak caused by asynchronous prefetch" to "fix: Fix memory leak caused by asynchronous prefetch" Sep 4, 2025

j7nhai commented Sep 5, 2025

Hi @nimesh1601,
In issue #13168, it looks like there is a memory leak in TableScan when running Spark with Gluten, and this PR may fix it. You can try running the task with this patch.

@FelixYBW

@pedroerp could you prioritize this PR? We have already noticed the leak in our TPC-DS tests.

@pedroerp pedroerp requested a review from xiaoxmeng September 17, 2025 11:33
@pedroerp

@pedroerp could you prioritize this PR? We have already noticed the leak in our TPC-DS tests.

I pinged @xiaoxmeng and asked him to take a look. Thank you guys for looking into this.

@xiaoxmeng xiaoxmeng left a comment

@j7nhai thanks for the fix. So the pool is not destroyed on memory manager destruction? The async load is executed by the folly executor, and if we hold a reference to the pool in the async load, why does this cause the problem? The fix might introduce other memory pool lifecycle issues, as discovered in the Prestissimo use case. Can you explain in more detail what leads to the problem? I don't think holding a reference to the pool itself can cause an issue.


j7nhai commented Sep 18, 2025

@j7nhai thanks for the fix. So the pool is not destroyed on memory manager destruction? The async load is executed by the folly executor, and if we hold a reference to the pool in the async load, why does this cause the problem? The fix might introduce other memory pool lifecycle issues, as discovered in the Prestissimo use case. Can you explain in more detail what leads to the problem? I don't think holding a reference to the pool itself can cause an issue.

It's not the extra pool reference that causes the memory leak; there is indeed a leak here. During destruction, we need to wait for the asynchronous threads to exit so that the memory allocated from the pool is released before the memory manager is destroyed (otherwise Gluten's Velox memory manager throws an exception). I mistakenly thought your pool reference was intended to fix the leak, but it actually doesn't solve it. What's needed is to wait for the folly executor to finish execution in the destructor of DirectCoalescedLoad.

So I can keep your change; both fixes can coexist without conflict.

@j7nhai j7nhai requested a review from xiaoxmeng September 18, 2025 09:53

j7nhai commented Sep 22, 2025

I have kept the previous features and only fixed the memory leak issue. Please take another look, thanks! @xiaoxmeng @tanjialiang @fzhedu @zhztheplayer @majetideepak @FelixYBW

  for (auto& request : requests_) {
    pool_->freeNonContiguous(request.data);
  }
  CoalescedLoad::cancel();
Contributor

Could you help confirm whether it would be better to call cancel() first?

Author

It is incorrect to call cancel() first, because doing so immediately sets state_ to cancelled, and as a result loadOrFuture() will not perform any actual waiting.
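
To illustrate the point above, here is a sketch of the problematic ordering (same assumed member names as in the diff hunks; not the actual PR code):

void cancel() override {
  CoalescedLoad::cancel();  // state_ becomes cancelled right away ...
  folly::SemiFuture<bool> waitFuture(false);
  // ... so state() is no longer kLoading, the branch below is skipped, and we
  // never wait for the executor thread that may still be filling the buffers.
  if (state() == State::kLoading && !loadOrFuture(&waitFuture)) {
    waitFuture.wait();
  }
  for (auto& request : requests_) {
    pool_->freeNonContiguous(request.data);  // may race with the async load
  }
}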

@j7nhai j7nhai requested a review from PHILO-HE September 23, 2025 10:16

@rui-mo rui-mo left a comment

Thanks for the fix. I’d like to provide some additional context on this issue.

  return size;
}

void cancel() override {
Collaborator

The cancel() function is invoked by ~DirectBufferedInput(), which ultimately depends on the destructor of TableScan, since TableScan owns the std::unique_ptr<connector::DataSource> dataSource_. This dependency relationship is illustrated in this discussion. As a result, TableScan must be destroyed before the memory manager, but the current approach in this PR does not explicitly guarantee that order.

To ensure that the memory held by the load is always released, PR #8205 triggers close() through the TableScan::close() call chain to address a similar issue.

Would you like to share your opinions on the fixes? cc: @xiaoxmeng @FelixYBW Thanks.


The solution is either to free the resources manually from top to bottom or to free them in the object's destructor. Either way works.


FelixYBW commented Nov 21, 2025

@j7nhai We will pick this PR into oap/velox and use it in Gluten. FYI.

IBM#1388

@pedroerp @xiaoxmeng can you review the PR?

void cancel() override {
  folly::SemiFuture<bool> waitFuture(false);
  if (state() == State::kLoading && !loadOrFuture(&waitFuture)) {
    waitFuture.wait();
Contributor

Such a wait can easily lead to a deadlock in the Prestissimo use case. @kewang1024 @Yuhta


Can you share more details? Let me check if Gluten has the issue or not.

Contributor

@FelixYBW I explained this offline to @rui-mo
