
Implement CUDA event pool to minimize runtime resource allocation overhead #919

Draft
kingcrimsontianyu wants to merge 19 commits into rapidsai:main from kingcrimsontianyu:event-pool

Conversation

@kingcrimsontianyu
Contributor

@kingcrimsontianyu kingcrimsontianyu commented Feb 2, 2026

Related PR

Depends on #917
Addresses part of #914

@kingcrimsontianyu kingcrimsontianyu added the improvement (Improves an existing functionality), non-breaking (Introduces a non-breaking change), and c++ (Affects the C++ API of KvikIO) labels Feb 2, 2026
@copy-pr-bot
copy-pr-bot bot commented Feb 2, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


@kingcrimsontianyu kingcrimsontianyu changed the title Implement CUDA event pool to minimize runtime resource allocation overhead [WIP] Implement CUDA event pool to minimize runtime resource allocation overhead Feb 2, 2026
Comment on lines +70 to +75
if (event == nullptr) {
// Create an event outside the lock to improve performance.
// The pool is not updated here; the returned Event object will automatically return the event
// to the pool when it goes out of scope
CUDA_DRIVER_TRY(cudaAPI::instance().EventCreate(&event, CU_EVENT_DISABLE_TIMING));
}
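The snippet above creates the event outside the pool's lock so that a potentially slow driver call does not serialize other threads. A minimal sketch of that acquire pattern, using stand-in types (`Event`/`Context` instead of the real `CUevent`/`CUcontext`, and a counter in place of `cuEventCreate(CU_EVENT_DISABLE_TIMING)`), could look like:

```cpp
#include <cstdint>
#include <mutex>
#include <unordered_map>
#include <vector>

// Stand-ins for the CUDA driver types and calls used by the real code.
using Event   = std::uintptr_t;
using Context = int;

Event create_event()  // placeholder for the driver's event-creation call
{
  static Event next = 1;
  return next++;
}

class EventPool {
 public:
  Event get(Context ctx)
  {
    Event event{0};
    {
      std::lock_guard const lock(_mutex);
      auto& pool = _pools[ctx];
      if (!pool.empty()) {
        event = pool.back();
        pool.pop_back();
      }
    }
    // Create the event outside the lock to improve performance. The pool is
    // not updated here; the event re-enters the pool only via put().
    if (event == 0) { event = create_event(); }
    return event;
  }

  void put(Context ctx, Event event)
  {
    std::lock_guard const lock(_mutex);
    _pools[ctx].push_back(event);
  }

 private:
  mutable std::mutex _mutex;
  std::unordered_map<Context, std::vector<Event>> _pools;
};
```

This is a sketch of the locking pattern only; in the actual PR the returned `Event` object is an RAII wrapper that puts the event back automatically when it goes out of scope.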
Contributor

question: Should we have a max size on this event pool? Since we never destroy events, could the pool unboundedly grow in a long-running application? I suppose it will depend on the maximum concurrency of threads issuing reads?

Contributor Author

I think it is probably fine to let it grow unbounded. The idea I'm trying to implement in #921 is that each pread() builds a "pread context" that acquires num_threads events from the pool, i.e. a single event for each thread. Each 4-MiB chunked read() originating from a specific pread() performs the following in sequence:

This event is reused for all the chunks on the same thread originating from the same pread() call. So the overall space complexity is O(num_threads * num_concurrent_pread), which I think is unlikely to blow up the RAM in a long-running application.

But I do think that limiting the size of the resource pools (currently the bounce buffer pool, this event pool, and the libcurl easy handle pool) would be a good future feature.
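The per-pread bookkeeping described above can be sketched as follows (hypothetical names, with events stubbed as plain integers): each pread context holds one event slot per worker thread, and the worst-case number of live events is the product of the two concurrency factors.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch: a per-pread context checks out one event per worker
// thread; that event is reused for every 4-MiB chunk handled by that thread.
struct PreadContext {
  explicit PreadContext(std::size_t num_threads) : events(num_threads, 0) {}
  // Each worker thread reuses its own slot for all of its chunks.
  int& event_for_thread(std::size_t tid) { return events.at(tid); }
  std::vector<int> events;  // one event (stubbed as int) per thread
};

// Worst-case number of events alive at once across the process:
constexpr std::size_t max_live_events(std::size_t num_threads,
                                      std::size_t num_concurrent_pread)
{
  return num_threads * num_concurrent_pread;
}
```

For example, 64 worker threads and 16 concurrent pread() calls bound the pool at 1024 events, each of which is a small driver-side object.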

_pools[context].push_back(event);
} catch (...) {
// If returning to pool fails, destroy the event
cudaAPI::instance().EventDestroy(event);
Contributor

question: Not checking the error code because this is called from ~Event()?

I think we should at least log an error in that case.

Contributor Author

Thanks. Agreed. I've added the error logging.
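Since the return-to-pool path runs from the RAII Event wrapper's destructor, it must not throw; the agreed behavior is to destroy the event and log on failure. A minimal single-context sketch (stand-in types, a placeholder `destroy_event` in place of `cuEventDestroy`, and a hypothetical log message) might be:

```cpp
#include <cstddef>
#include <iostream>
#include <mutex>
#include <vector>

using Event = int;

bool destroy_event(Event) noexcept { return true; }  // placeholder for the driver call

class EventPool {
 public:
  // Called from the RAII Event wrapper's destructor, so it must not throw.
  void put(Event event) noexcept
  {
    try {
      std::lock_guard const lock(_mutex);
      _pool.push_back(event);
    } catch (...) {
      // If returning to the pool fails (e.g. push_back throws std::bad_alloc),
      // destroy the event instead; log rather than throw, since this may run
      // during stack unwinding.
      if (!destroy_event(event)) { std::cerr << "EventPool: failed to destroy event\n"; }
    }
  }

  std::size_t size() const
  {
    std::lock_guard const lock(_mutex);
    return _pool.size();
  }

 private:
  mutable std::mutex _mutex;
  std::vector<Event> _pool;
};
```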

Comment on lines 94 to 101
std::size_t EventPool::num_free_events(CUcontext context) const
{
std::lock_guard const lock(_mutex);
auto it = _pools.find(context);
return (it != _pools.end()) ? it->second.size() : 0;
}

std::size_t EventPool::total_free_events() const
Contributor

question: Do you need these for correct usage, or are they just going to be introspection facilities?

Contributor Author

They are just going to be introspection facilities. I plan to include them in the unit test.
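A unit test could exercise the two introspection methods roughly as follows. This is a stubbed stand-in pool (plain `int` events and contexts instead of `CUevent`/`CUcontext`) whose `num_free_events` mirrors the quoted diff and whose `total_free_events` sums all per-context pools under the lock:

```cpp
#include <cstddef>
#include <mutex>
#include <unordered_map>
#include <vector>

using Event   = int;
using Context = int;

class EventPool {
 public:
  void put(Context ctx, Event e)
  {
    std::lock_guard const lock(_mutex);
    _pools[ctx].push_back(e);
  }

  std::size_t num_free_events(Context ctx) const
  {
    std::lock_guard const lock(_mutex);
    auto it = _pools.find(ctx);
    return (it != _pools.end()) ? it->second.size() : 0;
  }

  std::size_t total_free_events() const
  {
    std::lock_guard const lock(_mutex);
    std::size_t total = 0;
    for (auto const& [ctx, pool] : _pools) { total += pool.size(); }
    return total;
  }

 private:
  mutable std::mutex _mutex;
  std::unordered_map<Context, std::vector<Event>> _pools;
};
```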

@kingcrimsontianyu kingcrimsontianyu changed the title [WIP] Implement CUDA event pool to minimize runtime resource allocation overhead Implement CUDA event pool to minimize runtime resource allocation overhead Feb 4, 2026

Labels

c++ Affects the C++ API of KvikIO
DO NOT MERGE
improvement Improves an existing functionality
non-breaking Introduces a non-breaking change


2 participants