[SPARK-18487][SQL] Add completion listener to HashAggregate to avoid memory leak #15916
Conversation
Test build #68763 has finished for PR 15916 at commit
@viirya don't we use task completion listeners for this?

@hvanhovell Thanks for the hint. I am trying it. However, such a listener has to be added to operators case by case, e.g., HashAggregate here. If any other operators need to release resources after all rows are iterated, we would need to add similar listeners there too. Originally I was a little worried about a possible performance regression due to the extra consumption of rows; with a task completion listener, this is not an issue.
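The listener-based approach discussed above can be sketched with a simplified, self-contained model. FakeTaskContext and FakeHashAggregate are hypothetical stand-ins for illustration only, not Spark's actual classes:

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for Spark's TaskContext: a task registers callbacks
// that the executor invokes when the task completes.
class FakeTaskContext {
  private val listeners = ArrayBuffer.empty[() => Unit]
  def addTaskCompletionListener(f: () => Unit): Unit = listeners += f
  def markTaskCompleted(): Unit = listeners.foreach(_.apply())
}

// Hypothetical stand-in for an operator that holds releasable state
// (think: HashAggregate's hash map).
class FakeHashAggregate(ctx: FakeTaskContext, rows: Seq[Int]) {
  var memoryFreed = false
  // Register the release callback up front, so the resource is freed at
  // task completion even if the iterator below is never fully drained.
  ctx.addTaskCompletionListener(() => memoryFreed = true)
  def iterator: Iterator[Int] = rows.iterator
}

val ctx = new FakeTaskContext
val agg = new FakeHashAggregate(ctx, 1 to 100)
val taken = agg.iterator.take(5).toList  // partial consumption, like show()/take()
ctx.markTaskCompleted()                  // executor finishes the task
// agg.memoryFreed is now true despite the partial scan
```

The point of the pattern is that release no longer depends on the iterator being exhausted, so a limit-style partial scan cannot strand the resource.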
Did you actually see an issue with a memory leak?
@rxin Yeah, the added test case reports a memory-leak failure before this patch.
What do you mean by "report"? Was a warning message logged? If so, it was generated by the callback itself, which just releases the memory.
@rxin An exception will be thrown, not just a warning message. It is thrown at Executor after it calls taskMemoryManager.cleanUpAllAllocatedMemory. The exception looks like:
Of course it is not actually a real memory leak, because the memory is released at the end by calling taskMemoryManager.cleanUpAllAllocatedMemory anyway. Or we just need to turn the leak-check setting off.
Test build #68818 has finished for PR 15916 at commit
I'd maybe set it to false. You are just adding a completion listener, which does the same thing as taskMemoryManager.cleanUpAllAllocatedMemory anyway ...
Yeah, I see. As I said in a previous comment, the memory is released at the end anyway. I would guess the default setting is true in order to catch potential memory leaks during development, so is turning it to false a good idea? This patch comes from #15874, which hit the exception during @sethah's testing. If you still think this is not necessary, I can close it.
Test build #68824 has finished for PR 15916 at commit
@hvanhovell @rxin What do you think? Please let me know whether we need this or not. Thanks.
Hello @rxin @hvanhovell Can you let me know whether this is needed or not, so that I can close it if not? Thank you.
No, it's not needed if you are releasing memory in exactly the same way as the catch-all does. It is basically just the catch-all itself.
@rxin Thanks, I appreciate your feedback. I will close this now.
@rxin BTW, I see you merged #15989 to downgrade the error message level in TaskMemoryManager. I'd like to modify the error message in Executor too, because the current one is a little confusing: developers who don't know this part exactly would think a real memory leak happened. What do you think? If it is OK with you, I'll submit a PR for it. Thanks.
Can you show an example of a leak that would happen in Executor but not in the callback? Thanks.
The test case I added in this PR throws such an exception. I submitted this PR because @sethah encountered it during his testing. Other developers might hit it in the future, and if they don't know this part, the error message would make them think a real memory leak happened. To avoid this and provide more useful info, I'd like to modify this error message too.
I forgot to say: of course, this example throws the exception only when running in tests. Other developers could encounter it when writing test code in the future. If we provide more info in this error message, we can save them time investigating it. What do you think?
ping @rxin I'd appreciate your feedback on my suggestion when you have a chance. Thanks.
What changes were proposed in this pull request?
Methods such as Dataset.show and take use Limit (i.e., CollectLimitExec), which leverages SparkPlan.executeTake to efficiently collect the required number of elements back to the driver. However, under whole-stage codegen, we usually release resources only after all elements are consumed (e.g., in HashAggregate). In that case the resources are never released, and Dataset.show, for example, triggers a memory-leak report.
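To see why deferring release to end-of-iteration leaks under a limit, consider a simplified, self-contained sketch. DrainToFree is a hypothetical class for illustration, not Spark code; it mimics an operator that frees its buffer only once its iterator is exhausted:

```scala
// Hypothetical operator that releases its buffer only when the iterator
// reports exhaustion -- the pre-patch behavior, in miniature.
class DrainToFree(rows: Seq[Int]) extends Iterator[Int] {
  var freed = false
  private val underlying = rows.iterator
  override def hasNext: Boolean = {
    val more = underlying.hasNext
    if (!more) freed = true  // release happens only at exhaustion
    more
  }
  override def next(): Int = underlying.next()
}

val op = new DrainToFree(1 to 100)
val firstThree = op.take(3).toList  // CollectLimitExec-style partial scan
// take(3) stops before exhausting op, so op.freed is still false here:
// the buffer would be reported as leaked at task cleanup.
```

Because take(3) never advances past the third row, the exhaustion branch never runs, which is exactly the situation a task completion listener covers.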
An exception will be thrown in this case. It is thrown at Executor after it calls taskMemoryManager.cleanUpAllAllocatedMemory and finds memory that was not released at task completion. The exception looks like:
This PR adds a task completion listener to HashAggregate to avoid the memory leak.
How was this patch tested?
Added a test in DatasetSuite.