[SYCL] Fix memory leaks caused by failed kernel enqueue #7594
…ntel#5120) If a kernel enqueue fails, the runtime immediately tries to clean it up. However, if the command has any dependencies or users, the cleanup is skipped. This can cause the dependencies to stay alive and leak. These changes force a full sub-graph cleanup of the command if enqueuing failed. Additionally, sub-graph cleanup now accounts for failed kernel enqueues and removes the failed command from its leaves.
Signed-off-by: Rauf, Rana <[email protected]>
Pinging @sergey-semenov for review
Hello @raaiq1, may I ask you to educate me a bit about this PR and the original issue? My questions are the following:
@gmlueck hello, could you please help clarify the expected RT behavior when a kernel enqueue fails? Should we be able to "continue" work (discard the failed command and enqueue the users of the failed job), or should we fail (skip execution and release resources) for all other commands that depend on the failed one? I haven't found any description of the expected behavior for this case.
Pinging @steffenlarsen, as I'm not sure I can answer all of these questions
@KseniyaTikhomirova I talked with Stephen and these are some of the responses:
Replacing failed command with an empty command allows us to rightfully clean up the dependencies of the command, without being at the mercy of future changes.
It's cleaned up by the normal post-enqueue cleanup process.
The problem with the current approach is that the cleanup skips the existing command if it has dependencies, like accessors. The empty command won't have those.
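To make the skipped-cleanup problem above concrete, here is a toy model in plain C++. It is not the DPC++ scheduler: `Command`, `Graph`, `cleanupIfIsolated` and `cleanupFailed` are hypothetical names chosen for illustration, and the model assumes a command may only be freed once it has no dependencies and no users.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <set>
#include <vector>

// Hypothetical stand-in for a scheduler command (not the real DPC++ class).
struct Command {
  std::vector<Command*> deps;  // commands this one depends on
  std::set<Command*> users;    // commands that depend on this one
};

class Graph {
public:
  // Adds a command that depends on `deps` and makes it a leaf.
  Command* add(const std::vector<Command*>& deps) {
    auto owned = std::make_unique<Command>();
    Command* cmd = owned.get();
    cmd->deps = deps;
    for (Command* dep : deps) {
      dep->users.insert(cmd);
      leaves_.erase(dep);  // a command with a user is no longer a leaf
    }
    leaves_.insert(cmd);
    commands_.push_back(std::move(owned));
    return cmd;
  }

  // Conservative cleanup: skipped while the command has deps or users.
  bool cleanupIfIsolated(Command* cmd) {
    if (!cmd->deps.empty() || !cmd->users.empty()) return false;
    destroy(cmd);
    return true;
  }

  // Forced cleanup for a command whose enqueue failed: unlink it from
  // both directions of the graph, then free it.
  void cleanupFailed(Command* cmd) {
    for (Command* dep : cmd->deps) {
      dep->users.erase(cmd);
      if (dep->users.empty()) leaves_.insert(dep);  // dep is a leaf again
    }
    for (Command* user : cmd->users) {
      auto& d = user->deps;
      d.erase(std::remove(d.begin(), d.end(), cmd), d.end());
    }
    cmd->deps.clear();
    cmd->users.clear();
    destroy(cmd);
  }

  std::size_t size() const { return commands_.size(); }
  bool isLeaf(Command* cmd) const { return leaves_.count(cmd) != 0; }

private:
  void destroy(Command* cmd) {
    assert(cmd->deps.empty() && cmd->users.empty());
    leaves_.erase(cmd);  // a freed command must not linger in the leaves
    commands_.erase(
        std::remove_if(commands_.begin(), commands_.end(),
                       [cmd](const std::unique_ptr<Command>& p) {
                         return p.get() == cmd;
                       }),
        commands_.end());
  }

  std::vector<std::unique_ptr<Command>> commands_;  // graph owns commands
  std::set<Command*> leaves_;
};
```

In this model, a failed kernel command that depends on an allocation command keeps both alive under `cleanupIfIsolated`, because each blocks the other. Calling `cleanupFailed` on the kernel command frees it and lets the allocation command be cleaned up afterwards.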
@KseniyaTikhomirova: This depends on whether the failure is synchronous or asynchronous. Are you asking about a failure that is diagnosed by throwing a synchronous exception from (e.g.) The asynchronous case is, unfortunately, not well defined in the spec right now. The committee has discussed this briefly in the past, but there was not much progress. In general, we feel that it's better to diagnose errors synchronously whenever possible. |
This PR might need to be redesigned to not use empty commands, since there are plans to remove them: #7832
Hi, I do not want to block you here, since this is a functional fix while the empty task removal is a refactoring. However, the questions/concerns we discussed in person are still applicable.
Hi @gmlueck, I think the main question is about async errors, since in that case the kernel cannot be validated in the submit path, which raises the question of what to do with its users.
I have a two-part answer.

First, I think DPC++ should not defer error checking that normally happens synchronously just because a kernel is dependent upon a host task. Some errors are required to be diagnosed synchronously such as the case when a kernel decorated with

Even aside from the host-task scenario, there will be some errors that are reported asynchronously. As I mentioned before, this is not well defined in the spec currently. You can see the discussion in this internal Khronos issue if you are interested, but there is no resolution. (Khronos access is required to see that issue.)

Since there is no resolution yet, I suggest that DPC++ should simply report the asynchronous error as specified in the spec and leave all dependencies unresolved. This means that any dependent commands will simply hang. The SYCL spec does not currently provide enough guarantees to allow an application to recover from an asynchronous error. I think that most applications will simply report some sort of error message from their async error handler and then terminate (if they bother to set up an async error handler at all). If / when the Khronos group clarifies the behavior of the dependency graph after an async error, then we can adjust the DPC++ implementation.
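For readers unfamiliar with the "report from the async handler" pattern mentioned above, here is a minimal sketch in plain C++ that mimics its shape without needing a SYCL toolchain. `ToyQueue`, `recordAsyncError`, `waitAndThrow` and `describe` are hypothetical names, not SYCL APIs. The only assumption borrowed from SYCL is that asynchronous errors are collected as a list of `std::exception_ptr` values and handed to a user-supplied handler at a synchronization point.

```cpp
#include <cassert>
#include <exception>
#include <functional>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical aliases; real SYCL has its own exception list and handler types.
using ExceptionList = std::vector<std::exception_ptr>;
using AsyncHandler = std::function<void(const ExceptionList&)>;

// Toy queue that only models deferred error reporting.
class ToyQueue {
public:
  explicit ToyQueue(AsyncHandler handler) : handler_(std::move(handler)) {
    assert(static_cast<bool>(handler_));  // this sketch requires a handler
  }

  // Records a failure detected after the submit call has already returned.
  void recordAsyncError(const std::string& what) {
    pending_.push_back(std::make_exception_ptr(std::runtime_error(what)));
  }

  // Synchronization point: passes all pending errors to the handler once.
  void waitAndThrow() {
    if (pending_.empty()) return;
    ExceptionList errors;
    errors.swap(pending_);
    handler_(errors);
  }

private:
  AsyncHandler handler_;
  ExceptionList pending_;
};

// Turns each stored exception into its message by rethrowing and catching it.
std::vector<std::string> describe(const ExceptionList& errors) {
  std::vector<std::string> messages;
  for (const std::exception_ptr& e : errors) {
    try {
      std::rethrow_exception(e);
    } catch (const std::exception& ex) {
      messages.push_back(ex.what());
    }
  }
  return messages;
}
```

A typical application handler would print the messages returned by `describe` and then terminate. Nothing in this sketch unblocks commands that depended on the failed one, which matches the "leave all dependencies unresolved" suggestion above.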
This pull request is stale because it has been open 180 days with no activity. Remove stale label or comment or this will be automatically closed in 30 days.
This pull request was closed because it has been stalled for 30 days with no activity.
This is a continuation of this PR