Skip to content

Conversation

@linehill
Copy link
Collaborator

OpenCL backend called ´chipstar::Event::addDependency()´ without ever calling ´chipstar::Event::releaseDependencies()` during HIP application's lifetime and this had two possible outcomes:

  1. Crash due to out of memory error because the unreleased event objects. This occured after a HIP program had streamed enough commands - like >30000 kernels.

  2. Crash due to stack overflow at program exit / chipStar uninitialization. Because the event dependencies were not released, this led to a build up of very long event dependency chain. At uninitialization, the destruction of a Queue's last event led to destruction of its dependent events which led to destruction their dependend events and this possibly kept going until the crash which was caused by stack overflow from numerous call frames.

Both cases are fixed by removing the addDependency() call. AFAIK, the event dependency system is meant for timing safe release of the backend driver objects (cl_events in this case). OpenCL backend does not need this as the driver releases the objects when they are not needed by the application or by the driver for internal in-progress tasks.

@pvelesko
Copy link
Collaborator

OpenCL backend does not need this as the driver releases the objects when they are not needed by the application or by the driver for internal in-progress tasks.

but the chipStar runtime might need these events to stay alive so that we can use them for syncing queues.

Henry Linjamäki added 2 commits April 29, 2024 11:36
OpenCL backend called ´chipstar::Event::addDependency()´ without ever
calling ´chipstar::Event::releaseDependencies()` during HIP
application's lifetime and this had two possible outcomes:

1) Crash due to out of memory error because the unreleased event
   objects. This occured after a HIP program had streamed enough
   commands - like >30000 kernels.

2) Crash due to stack overflow at program exit / chipStar
   uninitialization. Because the event dependencies were not released,
   this led to a build up of very long event dependency chain. At
   uninitialization, the destruction of a Queue's last event led to
   destruction of its dependent events which led to destruction their
   dependend events and this possibly kept going until the crash which
   was caused by stack overflow from numerous call frames.

Both cases are fixed by removing the `addDependency()` call. AFAIK,
the event dependency system is meant for timing safe release of the
backend driver objects (cl_events in this case). OpenCL backend does
not need this as the driver releaseses the objects when they are not
needed by the application or by the driver for in-progress tasks.
... before they were passed to driver.
@linehill linehill force-pushed the fix-event-memleak branch from 77497ae to e8bcd2c Compare April 29, 2024 08:54
@linehill
Copy link
Collaborator Author

but the chipStar runtime might need these events to stay alive so that we can use them for syncing queues.

Aren’t the event referenced via shared_ptrs which keeps them alive as long as needed? Or do you mean that some objects in the chipStar runtime might reference the events via raw pointers?

Henry Linjamäki added 2 commits April 29, 2024 14:52
Level0 backend is known to fail the check on iGPU. Assuming the check
fails on dGPU.
@pvelesko pvelesko merged commit 4d479e4 into main May 1, 2024
@pvelesko pvelesko deleted the fix-event-memleak branch May 1, 2024 09:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants