fix: Fix the async direct coalesce load memory leak in case of early task failure in pyspark #12729
Conversation
This pull request was exported from Phabricator. Differential Revision: D71529929 |
tanjialiang left a comment:
Thanks for the quick fix
This pull request has been merged in 48b9803.
Conbench analyzed the 1 benchmark run on this commit. There were no benchmark performance regressions. 🎉 The full Conbench report has more details.
Reviewed By: tanjialiang, oerling
fbshipit-source-id: de1c545c48dc23602a79238b878020ce2f29c4f4
@xiaoxmeng
Actually, the core dump is caused by the memory leak. Refer to my PR fix: #14722
Summary:
Here is the race condition that can cause a memory leak in the race between the async direct coalesce load and an early task failure:
T1. The file reader triggers a stripe load, which does a coalesce prefetch.
T2. The coalesce prefetch kicks off and passes the cancellation check.
T3. The coalesce load does the memory allocations and reads from storage.
T4. Before the table scan does the on-demand load, the task fails, or the prefetched data is skipped by filtering and the task finishes.
T5. Task destruction frees the memory pool and hits the memory leak check failure.
T6. If we disable the memory leak check failure (with memory leak metric reporting in Meta production), then the buffer free throws with a bad memory pool pointer.
Verified the fix with a unit test that reproduces the race condition. This was exposed by a PySpark use case.
Differential Revision: D71529929