
Conversation

@j7nhai j7nhai commented Sep 4, 2025

In Gluten, if the folly executor exits unexpectedly, the memory allocated from the pool cannot be reclaimed in time when Gluten's memory manager is destroyed. As a result, when the memory manager exits, it detects unreleased memory in the pool and throws an exception.

Simply increasing the pool's reference count does not solve the leak; we need to properly release the memory allocated by the asynchronous threads. Since the pool is only used for allocating request buffers, this change frees that memory during the cancellation phase of coalesced loads.
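
For context, the sketch below pieces together the cancellation path this PR adds, using the member names (requests_, pool_, state(), loadOrFuture()) that appear in the diff hunks quoted later in this thread; the surrounding class details are assumed rather than copied from the actual change.

void cancel() override {
  // If an asynchronous load is still running on the folly executor, wait for it
  // to finish so no thread is still touching the request buffers.
  folly::SemiFuture<bool> waitFuture(false);
  if (state() == State::kLoading && !loadOrFuture(&waitFuture)) {
    waitFuture.wait();
  }
  // Free the request buffers allocated from the memory pool so the pool holds no
  // outstanding allocations when the memory manager is destroyed.
  for (auto& request : requests_) {
    pool_->freeNonContiguous(request.data);
  }
  // Finally, mark the coalesced load as cancelled.
  CoalescedLoad::cancel();
}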

@j7nhai j7nhai requested a review from majetideepak as a code owner September 4, 2025 12:20

j7nhai commented Sep 4, 2025

Please review this PR: @xiaoxmeng @tanjialiang @fzhedu @zhztheplayer @majetideepak

@j7nhai j7nhai changed the title from "Fix memory leak caused by asynchronous prefetch" to "fix: Fix memory leak caused by asynchronous prefetch" Sep 4, 2025

j7nhai commented Sep 5, 2025

Hi @nimesh1601,
In issue #13168, it looks like there is a memory leak in TableScan when running Spark with Gluten, and this PR may fix it. You can try running the task with this patch.

@FelixYBW

@pedroerp could you prioritize this PR? We have already noticed the leak in our TPC-DS tests.

@pedroerp pedroerp requested a review from xiaoxmeng September 17, 2025 11:33
@pedroerp

@pedroerp could you prioritize this PR? We have already noticed the leak in our TPC-DS tests.

I pinged @xiaoxmeng and asked him to take a look. Thank you guys for looking into this.

@xiaoxmeng xiaoxmeng left a comment

@j7nhai thanks for the fix. So the pool is not destroyed on memory manager destruction? The async load is executed by the folly executor, and if we hold a reference to the pool in the async load, why does this cause the problem? The fix might introduce other memory pool lifecycle issues, as discovered in the Prestissimo use case. Can you explain in more detail what leads to the problem? I don't think holding a reference to the pool itself can cause an issue.


j7nhai commented Sep 18, 2025

@j7nhai thanks for the fix. So the pool is not destroyed on memory manager destruction? The async load is executed by the folly executor, and if we hold a reference to the pool in the async load, why does this cause the problem? The fix might introduce other memory pool lifecycle issues, as discovered in the Prestissimo use case. Can you explain in more detail what leads to the problem? I don't think holding a reference to the pool itself can cause an issue.

It's not the extra pool reference that causes the memory leak; there is indeed a leak here. During destruction, we need to wait for the asynchronous threads to exit so that the memory allocated from the pool is released before the memory manager is destroyed (otherwise Gluten's Velox memory manager throws an exception). I mistakenly thought your pool reference was intended to fix the leak, but it actually doesn't solve it. What's needed is to wait for the folly executor to finish execution in the destructor of DirectCoalescedLoad.

So I can keep your change; both fixes can coexist without conflict.

@j7nhai j7nhai requested a review from xiaoxmeng September 18, 2025 09:53

j7nhai commented Sep 22, 2025

I have kept the previous features and only fixed the memory leak issue. Please take another look, thanks! @xiaoxmeng @tanjialiang @fzhedu @zhztheplayer @majetideepak @FelixYBW

  for (auto& request : requests_) {
    pool_->freeNonContiguous(request.data);
  }
  CoalescedLoad::cancel();
Contributor

Could you help confirm whether it would be better to call cancel() first?

Author

It is incorrect to call cancel() first, because doing so immediately sets state_ to cancelled, and as a result loadOrFuture() will not perform any actual waiting.
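
To illustrate the point above, here is a sketch of the problematic ordering (same assumed member names as in the diff hunks; not the actual PR code):

void cancel() override {
  CoalescedLoad::cancel();  // state_ becomes cancelled right away ...
  folly::SemiFuture<bool> waitFuture(false);
  // ... so state() is no longer kLoading, the branch below is skipped, and we
  // never wait for the executor thread that may still be filling the buffers.
  if (state() == State::kLoading && !loadOrFuture(&waitFuture)) {
    waitFuture.wait();
  }
  for (auto& request : requests_) {
    pool_->freeNonContiguous(request.data);  // may race with the async load
  }
}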

@j7nhai j7nhai requested a review from PHILO-HE September 23, 2025 10:16

@rui-mo rui-mo left a comment

Thanks for the fix. I’d like to provide some additional context on this issue.

  return size;
}

void cancel() override {
Collaborator

The cancel() function is invoked by ~DirectBufferedInput(), which ultimately depends on the destructor of TableScan, since TableScan owns the std::unique_ptr<connector::DataSource> dataSource_. This dependency relationship is illustrated in this discussion. As a result, TableScan must be destroyed before the memory manager, but the current approach in this PR does not explicitly guarantee that order.

To ensure that the memory held by the load is always released, PR #8205 triggers close() through the TableScan::close() call chain to address a similar issue.

Would you like to share your opinions on the fixes? cc: @xiaoxmeng @FelixYBW Thanks.


The solution is either to free the resources manually from top to bottom or to free them in the object's destructor. Either way works.


FelixYBW commented Nov 21, 2025

@j7nhai We will pick this PR into oap/velox and use it in Gluten. FYI.

IBM#1388

@pedroerp @xiaoxmeng can you review the PR?

void cancel() override {
  folly::SemiFuture<bool> waitFuture(false);
  if (state() == State::kLoading && !loadOrFuture(&waitFuture)) {
    waitFuture.wait();
Contributor

Such a wait can easily lead to a deadlock in the Prestissimo use case. @kewang1024 @Yuhta


Can you share more details? Let me check if Gluten has the issue or not.

Contributor

@FelixYBW I explained this offline to @rui-mo
