[SYCL] Fix memory leaks caused by failed kernel enqueue #7594

raaiq1 · 2022-11-30T18:13:25Z

This is a continuation of this PR

…ntel#5120) If a kernel enqueue fails, the runtime immediately tries to clean it up. However, if the command has any dependencies or users, the cleanup is skipped. This can cause the dependencies to stay alive and leak. These changes force a full sub-graph cleanup of the command if enqueuing failed. Additionally, sub-graph cleanup is changed to account for failed kernel enqueues and now removes the failed command from its leaves.

Signed-off-by: Rauf, Rana <[email protected]>
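
To illustrate the leak and the fix described above, here is a minimal toy model in plain C++. It is not the DPC++ scheduler code: `Buffer`, `Command`, `tryCleanup`, and `forceCleanupFailed` are hypothetical names invented for this sketch, and the real graph also tracks leaves per memory object, which is omitted here.

```cpp
#include <algorithm>
#include <cassert>
#include <memory>
#include <vector>

// Toy stand-ins; none of these names come from the DPC++ scheduler.
struct Buffer {}; // resource kept alive by a command (e.g. via an accessor)

struct Command {
  std::shared_ptr<Buffer> resource; // what this command keeps alive
  std::vector<Command *> deps;      // commands this one depends on
  std::vector<Command *> users;     // commands that depend on this one
  bool enqueueFailed = false;
};

// Regular post-enqueue cleanup: skipped while the command is still linked
// into the graph, which is how a failed command can keep its resource alive.
bool tryCleanup(Command &cmd) {
  if (!cmd.deps.empty() || !cmd.users.empty())
    return false; // still part of the graph, leave it alone
  cmd.resource.reset();
  return true;
}

// Forced cleanup for a failed command: unlink it from its dependencies and
// its users, then release whatever it was holding.
void forceCleanupFailed(Command &cmd) {
  assert(cmd.enqueueFailed && "only meant for commands whose enqueue failed");
  for (Command *dep : cmd.deps) {
    auto &u = dep->users;
    u.erase(std::remove(u.begin(), u.end(), &cmd), u.end());
  }
  for (Command *user : cmd.users) {
    auto &d = user->deps;
    d.erase(std::remove(d.begin(), d.end(), &cmd), d.end());
  }
  cmd.deps.clear();
  cmd.users.clear();
  cmd.resource.reset();
}
```

In this model, a failed command that depends on a producer is rejected by `tryCleanup`, so its buffer stays referenced. After `forceCleanupFailed`, the reference is dropped and the producer has no remaining users, so the regular cleanup can handle it.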

raaiq1 · 2022-12-08T19:47:30Z

Pinging @sergey-semenov for review

KseniyaTikhomirova · 2022-12-09T14:05:55Z

Hello @raaiq1, may I ask you to educate me a bit about this PR and the original issue? My questions are the following:

What is the benefit of replacing the failed kernel command with an empty command? If it will still be released in the pipeline, the only benefit I see is releasing memory objects earlier. Where exactly will the empty command be released? Why can't the replaced failed command be released there?
If the main purpose is to release memory buffers held by the failed kernel, why is it not enough to call clearStreams and clearAuxiliaryResources, plus whatever else is needed?
Thank you

KseniyaTikhomirova · 2022-12-09T18:59:44Z

@gmlueck hello, could you please help clarify the expected RT behavior when a kernel enqueue fails? Should we be able to "continue" work (discard the failed command and enqueue its users), or should we fail (skip execution and release resources) all other commands that depend on the failed one? I haven't found any description of the expected behavior for this case.
To me it looks like we need to "skip" all execution after the failed job, since it makes no sense to continue if data (input or intermediate) may not have been updated, in which case we would very likely produce incorrect results.

raaiq1 · 2022-12-12T18:15:10Z

> Hello @raaiq1, may I ask you to educate me a bit about this PR and the original issue? My questions are the following:
>
> What is the benefit of replacing the failed kernel command with an empty command? If it will still be released in the pipeline, the only benefit I see is releasing memory objects earlier. Where exactly will the empty command be released? Why can't the replaced failed command be released there?
>
> If the main purpose is to release memory buffers held by the failed kernel, why is it not enough to call clearStreams and clearAuxiliaryResources, plus whatever else is needed?
> Thank you

Pinging @steffenlarsen, as I'm not sure I can answer all of these questions.

raaiq1 · 2022-12-13T14:45:42Z

@KseniyaTikhomirova I talked with Stephen and these are some of the responses:

> What is the benefit of replacing the failed kernel command with an empty command?

Replacing the failed command with an empty command allows us to properly clean up the dependencies of the command, without being at the mercy of future changes.

> Where exactly will the empty command be released?

It is cleaned up by the normal post-enqueue cleanup process.

> Why can't the replaced failed command be released there?

The problem with the current one is that the cleanup skips the existing command if it has dependencies, such as accessors. The empty command won't have any.

gmlueck · 2022-12-19T15:59:08Z

> could you please help clarify the expected RT behavior when a kernel enqueue fails? Should we be able to "continue" work (discard the failed command and enqueue its users), or should we fail (skip execution and release resources) all other commands that depend on the failed one? I haven't found any description of the expected behavior for this case.

@KseniyaTikhomirova: This depends on whether the failure is synchronous or asynchronous. Are you asking about a failure that is diagnosed by throwing a synchronous exception from (e.g.) handler::parallel_for or queue::submit?

The asynchronous case is, unfortunately, not well defined in the spec right now. The committee has discussed this briefly in the past, but there was not much progress. In general, we feel that it's better to diagnose errors synchronously whenever possible.

raaiq1 · 2022-12-20T22:14:32Z

This PR might need to be redesigned to not use empty commands, since there are plans to remove them: #7832

KseniyaTikhomirova · 2022-12-21T09:56:59Z

> This PR might need to be redesigned to not use empty commands, since there are plans to remove them: #7832

Hi, I do not want to block you here, since this is a functional fix while the empty task removal is a refactoring. However, the questions/concerns we discussed in person still apply.

KseniyaTikhomirova · 2022-12-22T10:44:39Z

Hi @gmlueck, I think the main question is about async errors, since in that case the kernel cannot be validated in the submit path, which raises the question of what to do with its users.
Let's imagine the case where a host accessor/task uses a memory buffer. Then we submit kernel 1, which uses the same memory buffer, and kernel 2, which depends on kernel 1. Neither will be enqueued in its q.submit call, because the host accessor/task prevents them from being enqueued. After the host accessor/task completes, we asynchronously try to enqueue kernel 1 and kernel 2. If kernel 1 fails, we have a dilemma about what to do with kernel 2. If we simply report an async error to the queue and disregard the kernel 1 failure, we will execute kernel 2 and use resources with no prospect of getting a valid/expected result in the common case. It is also not very clear to the user which part of the work was affected. But if we do not execute the kernel and just skip this part of the work, it is not clear how to report that to the user, since we have no proper way to do so. For example, we have no "failed" state for events that map to submitted jobs, nor any other mechanism to report failed commands.

gmlueck · 2022-12-22T17:51:03Z

@KseniyaTikhomirova

> I think the main question is about async errors ...

I have a two-part answer. First, I think DPC++ should not defer error checking that normally happens synchronously just because a kernel is dependent upon a host task. Some errors are required to be diagnosed synchronously, such as the case when a kernel decorated with [[sycl::reqd_work_group_size]] is submitted with an nd-range that has a different work-group size. These errors must be diagnosed synchronously even when the kernel is dependent upon a host task. Other errors are not specifically mandated by the spec, but it's still better for the user if we diagnose them synchronously. I think we could achieve this on the OpenCL backend by using user-defined events. This allows you to submit a kernel to the device even before the host task it depends on finishes. DPC++ can then signal the event when the host task finishes, and this will tell the device driver that it is safe to start executing the kernel. I'm not sure if Level Zero has a similar feature, but we could request it if it doesn't exist.
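
The idea of validating at submit time while gating execution on a user event could be sketched roughly as follows. This is a plain C++ toy model, not OpenCL or DPC++ code: `UserEvent`, `KernelDesc`, and `submitGated` are hypothetical names for illustration, and the exception type stands in for whatever synchronous exception the runtime would throw.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical stand-in for a user-defined event: work registered against it
// is accepted immediately but only runs once the event is signalled.
class UserEvent {
public:
  void onComplete(std::function<void()> fn) {
    if (Signalled)
      fn();
    else
      Waiters.push_back(std::move(fn));
  }
  void signal() {
    assert(!Signalled && "a user event is signalled only once");
    Signalled = true;
    for (auto &fn : Waiters)
      fn();
    Waiters.clear();
  }

private:
  bool Signalled = false;
  std::vector<std::function<void()>> Waiters;
};

struct KernelDesc {
  std::size_t RequiredWorkGroupSize; // models [[sycl::reqd_work_group_size]]
};

// Validates synchronously in the caller's context, then defers execution
// until `Gate` (the host task's completion) is signalled.
void submitGated(const KernelDesc &K, std::size_t LocalSize, UserEvent &Gate,
                 std::function<void()> Body) {
  if (LocalSize != K.RequiredWorkGroupSize)
    throw std::invalid_argument("work-group size mismatch");
  Gate.onComplete(std::move(Body));
}
```

With this shape, a submission with a mismatched work-group size throws from `submitGated` right away, even though the gate has not been signalled yet. A valid submission is accepted but its body runs only when `signal()` is called.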

Even aside from the host-task scenario, there will be some errors that are reported asynchronously. As I mentioned before, this is not well defined in the spec currently. You can see the discussion in this internal Khronos issue if you are interested, but there is no resolution. (Khronos access is required to see that issue.) Since there is no resolution yet, I suggest that DPC++ should simply report the asynchronous error as specified in the spec and leave all dependencies unresolved. This means that any dependent commands will simply hang. The SYCL spec does not provide enough guarantees currently to allow an application to recover from an asynchronous error. I think that most applications will simply report some sort of error message from their async error handler and then terminate (if they bother to set up an async error handler at all).

If / when the Khronos group clarifies the behavior of the dependency graph after an async error, then we can adjust the DPC++ implementation.
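
The suggested behavior (report the asynchronous error and leave dependencies unresolved) can be modelled with a small plain C++ sketch of the host task / kernel 1 / kernel 2 scenario discussed earlier. It is only a toy model under stated assumptions: `Cmd`, `State`, and `drain` are hypothetical names, and the list of strings stands in for errors delivered to an async handler.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

enum class State { Pending, Done, Failed };

struct Cmd {
  std::string Name;
  std::vector<Cmd *> Deps;       // commands that must be Done first
  std::function<bool()> Enqueue; // returns false when the enqueue fails
  State St = State::Pending;
};

// Enqueues every command whose dependencies are all Done. A failed command
// is reported once and never marked Done, so its dependents stay Pending.
void drain(std::vector<Cmd *> &Graph, std::vector<std::string> &AsyncErrors) {
  bool Progress = true;
  while (Progress) {
    Progress = false;
    for (Cmd *C : Graph) {
      if (C->St != State::Pending)
        continue;
      bool Ready = true;
      for (Cmd *D : C->Deps)
        Ready = Ready && D->St == State::Done;
      if (!Ready)
        continue;
      assert(C->Enqueue && "every command needs an enqueue action");
      if (C->Enqueue()) {
        C->St = State::Done;
      } else {
        C->St = State::Failed;
        AsyncErrors.push_back(C->Name); // stands in for the async handler
      }
      Progress = true;
    }
  }
}
```

In this model, if kernel 1 fails after the host task completes, exactly one error is recorded and kernel 2 is never enqueued; it simply remains pending, which corresponds to the "dependent commands will hang" outcome.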

github-actions · 2024-09-16T02:00:48Z

This pull request is stale because it has been open 180 days with no activity. Remove stale label or comment or this will be automatically closed in 30 days.

github-actions · 2024-10-16T02:01:05Z

This pull request was closed because it has been stalled for 30 days with no activity.

raaiq1 force-pushed the memory_leak branch from e78fb70 to 97892dc Compare November 30, 2022 21:55

steffenlarsen and others added 9 commits December 1, 2022 14:19

Remove cl namepspace and regression

b517228

Signed-off-by: Rauf, Rana <[email protected]>

Expand BlockReason enum

45c989b

Signed-off-by: Rauf, Rana <[email protected]>

Fix build issues

2fda10b

Signed-off-by: Rauf, Rana <[email protected]>

Refactor test kernel in unittest

15c7cac

Signed-off-by: Rauf, Rana <[email protected]>

Fix build fails

8ce3179

Signed-off-by: Rauf, Rana <[email protected]>

Small

b944aad

Signed-off-by: Rauf, Rana <[email protected]>

Testing

c3af9cd

Signed-off-by: Rauf, Rana <[email protected]>

WIP

dfd8d95

Signed-off-by: Rauf, Rana <[email protected]>

raaiq1 force-pushed the memory_leak branch from 8980006 to dfd8d95 Compare December 5, 2022 16:03

raaiq1 marked this pull request as ready for review December 6, 2022 14:27

raaiq1 requested a review from a team as a code owner December 6, 2022 14:27

raaiq1 requested a review from cperkinsintel December 6, 2022 14:27

Added block reason

b5f65a3

Signed-off-by: Rauf, Rana <[email protected]>

raaiq1 marked this pull request as draft December 6, 2022 21:19

raaiq1 marked this pull request as ready for review December 6, 2022 22:18

cperkinsintel requested a review from sergey-semenov December 7, 2022 22:26

steffenlarsen mentioned this pull request Dec 9, 2022

Draft: [SYCL] Fixes memory dependency leaks caused by failed kernel enqueue #5788

Closed

github-actions bot added the Stale label Sep 16, 2024

github-actions bot closed this Oct 16, 2024
