Implement CUDA event pool to minimize runtime resource allocation overhead #919

kingcrimsontianyu wants to merge 19 commits into rapidsai:main
Conversation
```cpp
if (event == nullptr) {
  // Create an event outside the lock to improve performance.
  // The pool is not updated here; the returned Event object will automatically
  // return the event to the pool when it goes out of scope.
  CUDA_DRIVER_TRY(cudaAPI::instance().EventCreate(&event, CU_EVENT_DISABLE_TIMING));
}
```
question: Should we have a max size on this event pool? Since we never destroy events, the pool could grow unboundedly in a long-running application. I suppose it depends on the maximum concurrency of threads issuing reads?
I think it is probably fine to let it grow unbounded. The idea I'm trying to implement in #921 is that each pread() builds a "pread context" that gets num_threads events from the pool, i.e. a single event for each thread. Each 4-MiB chunked read() originating from a specific pread() performs the following in sequence:

- I/O (file -> bounce buffer ring; [WIP] Optimize easy-handle remote I/O using bounce buffer ring #916)
- Async H2D copy (bounce buffer ring -> device) on a per-thread, per-context stream
- Enqueue the per-pread, per-thread, per-context event on the stream

This event is reused for all the chunks on the same thread originating from the same pread() call (see the sketch below). So the overall space complexity is O(num_threads * num_concurrent_pread), which I think is unlikely to blow up the RAM in a long-running application; for example, 64 threads and 8 concurrent pread() calls would hold at most 512 events.

But I do think that a limit on the resource pools (currently the bounce buffer pool, this event pool, and the libcurl easy handle pool) would be a good future feature.
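To make the ownership model concrete, here is a minimal, self-contained sketch of the pattern described above. It is not the actual KvikIO implementation; the names and signatures are illustrative and error handling is mostly elided. An Event borrows a CUevent from a per-context pool on construction and returns it automatically on destruction, so one event per thread can be reused across every chunk of a single pread() call.

```cpp
#include <cuda.h>

#include <mutex>
#include <unordered_map>
#include <vector>

// Illustrative sketch only -- not the exact KvikIO API.
class EventPool {
 public:
  static EventPool& instance()
  {
    static EventPool pool;
    return pool;
  }

  CUevent get(CUcontext context)
  {
    {
      std::lock_guard const lock(_mutex);
      auto& free_list = _pools[context];
      if (!free_list.empty()) {
        CUevent event = free_list.back();
        free_list.pop_back();
        return event;
      }
    }
    // Pool exhausted: create a new event outside the lock. It enters the pool
    // only when the owning Event object is destroyed, so the pool size tracks
    // the peak number of concurrently held events.
    CUevent event{};
    cuEventCreate(&event, CU_EVENT_DISABLE_TIMING);  // error handling elided
    return event;
  }

  void put(CUcontext context, CUevent event) noexcept
  {
    try {
      std::lock_guard const lock(_mutex);
      _pools[context].push_back(event);
    } catch (...) {
      cuEventDestroy(event);  // could not return it to the pool; avoid a leak
    }
  }

 private:
  mutable std::mutex _mutex;
  std::unordered_map<CUcontext, std::vector<CUevent>> _pools;
};

// RAII handle: acquire on construction, return to the pool on destruction.
class Event {
 public:
  explicit Event(CUcontext context)
    : _context{context}, _event{EventPool::instance().get(context)}
  {
  }
  ~Event() { EventPool::instance().put(_context, _event); }
  Event(Event const&)            = delete;
  Event& operator=(Event const&) = delete;

  CUevent handle() const noexcept { return _event; }

 private:
  CUcontext _context;
  CUevent _event;
};
```

Keying the pool by CUcontext matters because a CUDA event can only be recorded on a stream belonging to the context it was created in.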
cpp/src/detail/event.cpp
```cpp
  _pools[context].push_back(event);
} catch (...) {
  // If returning to pool fails, destroy the event
  cudaAPI::instance().EventDestroy(event);
}
```
question: Not checking the error code because this is called from ~Event()?
I think we should at least log an error in that case.
Thanks. Agreed. I've added the error logging.
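For illustration, a sketch of what the logged variant could look like. The destructor path must not throw, so the CUDA return code is checked manually rather than via the throwing CUDA_DRIVER_TRY macro; KVIKIO_LOG_ERROR is a stand-in for whatever logging facility the library actually provides, not a confirmed API.

```cpp
// Sketch only: returning an event to the pool from ~Event() without throwing.
try {
  std::lock_guard const lock(_mutex);
  _pools[context].push_back(event);
} catch (...) {
  // Returning to the pool failed (e.g. std::bad_alloc from push_back), so
  // destroy the event instead of leaking it, and log if even that fails.
  CUresult const err = cudaAPI::instance().EventDestroy(event);
  if (err != CUDA_SUCCESS) {
    KVIKIO_LOG_ERROR("EventPool: failed to destroy CUDA event");  // hypothetical macro
  }
}
```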
cpp/src/detail/event.cpp
```cpp
std::size_t EventPool::num_free_events(CUcontext context) const
{
  std::lock_guard const lock(_mutex);
  auto it = _pools.find(context);
  return (it != _pools.end()) ? it->second.size() : 0;
}

std::size_t EventPool::total_free_events() const
```
question: Do you need these for correct usage, or are they just going to be introspection facilities?
They are just going to be introspection facilities. I plan to include them in the unit test.
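A sketch of how such a test might use the introspection helpers (the test macros and the current_context() helper are assumptions for illustration, not the actual KvikIO test code):

```cpp
TEST(EventPoolTest, EventIsReturnedOnDestruction)
{
  CUcontext ctx = current_context();  // hypothetical helper returning the active context
  auto const baseline = EventPool::instance().num_free_events(ctx);
  {
    Event ev{ctx};  // borrows (or lazily creates) an event; it is not free while held
    EXPECT_EQ(EventPool::instance().num_free_events(ctx), baseline);
  }
  // ~Event() returned the event to the pool, so the free count grew by one.
  EXPECT_EQ(EventPool::instance().num_free_events(ctx), baseline + 1);
  EXPECT_GE(EventPool::instance().total_free_events(), baseline + 1);
}
```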
Co-authored-by: Lawrence Mitchell <wence@gmx.li>
Related PRs
Depends on #917
Addresses part of #914