Skip to content

Conversation

@raaiq1
Copy link
Contributor

@raaiq1 raaiq1 commented Nov 30, 2022

This is a continuation of this PR

steffenlarsen and others added 9 commits December 1, 2022 14:19
…ntel#5120)

If a kernel enqueue fails the runtime will immediately try and clean it up. However, if it has any dependencies or users the cleanup will be skipped. This can cause the dependencies to stay alive and leak. These changes forces a full sub-graph cleanup of the command if enqueuing failed. Additionally, sub-graph cleanup is changed to account for failed kernel enqueues and will remove the failed command from its leaves.
Signed-off-by: Rauf, Rana <[email protected]>
Signed-off-by: Rauf, Rana <[email protected]>
Signed-off-by: Rauf, Rana <[email protected]>
Signed-off-by: Rauf, Rana <[email protected]>
Signed-off-by: Rauf, Rana <[email protected]>
Signed-off-by: Rauf, Rana <[email protected]>
@raaiq1 raaiq1 marked this pull request as ready for review December 6, 2022 14:27
@raaiq1 raaiq1 requested a review from a team as a code owner December 6, 2022 14:27
@raaiq1 raaiq1 requested a review from cperkinsintel December 6, 2022 14:27
Signed-off-by: Rauf, Rana <[email protected]>
@raaiq1 raaiq1 marked this pull request as draft December 6, 2022 21:19
@raaiq1 raaiq1 marked this pull request as ready for review December 6, 2022 22:18
@raaiq1
Copy link
Contributor Author

raaiq1 commented Dec 8, 2022

Pinging @sergey-semenov for review

@KseniyaTikhomirova
Copy link
Contributor

Hello @raaiq1 may I ask you to educate me a bit about this PR and original issue? My questions are the following:

  1. What is the benefit of replacing failed kernel command with empty command? If it still will be released in the pipeline the only benefit I see is to release memory objects earlier. Where exactly empty command will be released? Why replaced failed command could not be released there?
  2. if the main purpose is to release memory buffers hold by failed kernel why it is not enough to call clearStreams and clearAuxiliaryResources + some other possibly needed stuff?
    Thank you

@KseniyaTikhomirova
Copy link
Contributor

@gmlueck hello, could you please help to clarify expected RT behavior when kernel enqueue is failed? Should we be able to "continue" work (discard failed command and enqueue users of failed job) or we should fail (skip execution and release resources) for all other commands depending on failed one? I haven't found any description of expected behavior for this case.
To me it looks like we need to "skip" all execution after failed job since it has no sense to continue if data (input or intermediate) may be not updated and so we very likely will produce incorrect results.

@raaiq1
Copy link
Contributor Author

raaiq1 commented Dec 12, 2022

Hello @raaiq1 may I ask you to educate me a bit about this PR and original issue? My questions are the following:

  1. What is the benefit of replacing failed kernel command with empty command? If it still will be released in the pipeline the only benefit I see is to release memory objects earlier. Where exactly empty command will be released? Why replaced failed command could not be released there?
  2. if the main purpose is to release memory buffers hold by failed kernel why it is not enough to call clearStreams and clearAuxiliaryResources + some other possibly needed stuff?
    Thank you

Pinging @steffenlarsen , as I'm not sure I can answer all of these questions

@raaiq1
Copy link
Contributor Author

raaiq1 commented Dec 13, 2022

@KseniyaTikhomirova I talked with Stephen and these are some of the responses:

What is the benefit of replacing failed kernel command with empty command?

Replacing failed command with an empty command allows us to rightfully clean up the dependencies of the command, without being at the mercy of future changes.

Where exactly empty command will be released?

It's cleanup by the normal post enqueue cleanup process

Why replaced failed command could not be released there?

Problem with the current one is that the cleanup will skip cleaning up the existing command if it has dependencies, like accessors. The empty command won't have that.

@gmlueck
Copy link
Contributor

gmlueck commented Dec 19, 2022

could you please help to clarify expected RT behavior when kernel enqueue is failed? Should we be able to "continue" work (discard failed command and enqueue users of failed job) or we should fail (skip execution and release resources) for all other commands depending on failed one? I haven't found any description of expected behavior for this case.

@KseniyaTikhomirova: This depends on whether the failure is synchronous or asynchronous. Are you asking about a failure that is diagnosed by throwing a synchronous exception from (e.g.) handler::parallel_for or queue::submit?

The asynchronous case is, unfortunately, not well defined in the spec right now. The committee has discussed this briefly in the past, but there was not much progress. In general, we feel that it's better to diagnose errors synchronously whenever possible.

@raaiq1
Copy link
Contributor Author

raaiq1 commented Dec 20, 2022

This PR might need to be redesigned to not use empty commands since there're plans on removing it: #7832

@KseniyaTikhomirova
Copy link
Contributor

This PR might need to be redesigned to not use empty commands since there're plans on removing it: #7832

Hi, I do not want to block you here since this is functional fix while empty task removal is refactoring. Although the questions/concerns we discussed in person is still applicable

@KseniyaTikhomirova
Copy link
Contributor

Hi @gmlueck I think the main question is about async errors since it doesn't allow to validate kernel in submit path and brings the questions what to do with its users.
Lets imagine the case when we have host accessor/task using memory buffer. Then we submit kernel 1 using the same memory buffer and kernel 2 dependent on kernel 1. They both will not be enqueued in their q.submit call because host acc/task prevent them from being enqueued. And then after host acc/task completion we asynchronously trying to enqueue kernel 1 and kernel 2 if kernel 1 fails - we have dilemma what to do with kernel 2. If we simply report async error to queue and disregard kernel 1 failure - we will execute kernel 2 and utilize resources without perspective to get valid/expected result in common case. For user it also is not very clear what part of work was affected. But if we not execute kernel and just skip this part of work - it is not clear how to report it to user since we have no way to do that properly. For example, we have no "failed" state for event that maps with submitted jobs or any other mechanism to report failed commands.

@gmlueck
Copy link
Contributor

gmlueck commented Dec 22, 2022

@KseniyaTikhomirova

I think the main question is about async errors ...

I have a two part answer. First, I think DPC++ should not defer error checking that normally happens synchronously just because a kernel is dependent upon a host task. Some errors are required to be diagnosed synchronously such as the case when a kernel decorated with [[sycl::reqd_work_group_size]] is submitted with an nd-range that has a different work-group size. These errors must be diagnosed synchronously even when the kernel is dependent upon a host task. Other errors are are not specifically mandated by the spec, but it's still better for the user if we diagnose them synchronously. I think we could achieve this on the OpenCL backend by using user-defined events. This allows you to submit a kernel to the device even before its dependent host task finishes. DPC++ can then signal the event when the host task finishes, and this will tell the device driver that it is safe to start executing the kernel. I'm not sure if Level Zero has a similar feature, but we could request it if it doesn't exist.

Even aside from the host-task scenario, there will be some errors that are reported asynchronously. As I mentioned before, this is not well defined in the spec currently. You can see the discussion in this internal Khronos issue if you are interested, but there is no resolution. (Khronos access is required to see that issue.) Since there is no resolution yet, I suggest that DPC++ should simply report the asynchronous error as specified in the spec and leave all dependencies unresolved. This means that any dependent commands will simply hang. The SYCL spec does not provide enough guarantees currently to allow an application to recover from an asynchronous error. I think that most application will simply report some sort of error message from their async error handler and then terminate (if they bother to set up an async error handler at all).

If / when the Khronos group clarifies the behavior of the dependency graph after an async error, then we can adjust the DPC++ implementation.

@github-actions
Copy link
Contributor

This pull request is stale because it has been open 180 days with no activity. Remove stale label or comment or this will be automatically closed in 30 days.

@github-actions github-actions bot added the Stale label Sep 16, 2024
@github-actions
Copy link
Contributor

This pull request was closed because it has been stalled for 30 days with no activity.

@github-actions github-actions bot closed this Oct 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants