Add per-block-device thread pools for multi-drive I/O parallelism #880
rapids-bot[bot] merged 10 commits into rapidsai:main
KyleFromNVIDIA
left a comment
Approved trivial CMake changes
madsbk
left a comment
Looks good!
In a follow-up PR, I suggest adding Python bindings for get_block_device_info() and including block device details in the Python benchmarks, for example in pprint_sys_info().
Additional notes
The utility function
    auto block_dev_info = get_block_device_info(file_path);

    // Check if we already have a thread pool for this block device
    std::lock_guard lock(mtx);
This lock is held until the function returns from the try block.
    try {
      // Fast path: check if this exact file path has been seen before
      {
        std::lock_guard lock(mtx);
This lock is held until the block ends. It prevents a race condition in which one thread updates file_path_to_thread_pool_map while another thread reads from it.
      }

      // Resolve file path to its underlying block device
      auto block_dev_info = get_block_device_info(file_path);
This part is deliberately left unlocked to permit concurrent execution.
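The three review comments above describe a lookup in three phases: a locked fast path on the file path, an unlocked resolution step, and a second locked step that finds or creates the pool. The sketch below is only an illustration of that shape, not the code in this PR. `ThreadPool`, `PoolRegistry`, and the pluggable `Resolver` are hypothetical names, and the resolver stands in for `get_block_device_info`.

```cpp
#include <cstddef>
#include <functional>
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for a real thread pool; only its identity matters here.
struct ThreadPool {
  explicit ThreadPool(unsigned n) : num_threads(n) {}
  unsigned num_threads;
};

// Hypothetical registry, not KvikIO's actual class.
class PoolRegistry {
 public:
  // Assumed interface: maps a file path to a block device name.
  using Resolver = std::function<std::string(std::string const&)>;

  PoolRegistry(Resolver resolver, unsigned threads_per_pool)
    : resolver_(std::move(resolver)), threads_per_pool_(threads_per_pool)
  {
  }

  std::shared_ptr<ThreadPool> get(std::string const& file_path)
  {
    // Fast path: the lock is released when this inner block ends.
    {
      std::lock_guard<std::mutex> lock(mtx_);
      auto it = path_to_pool_.find(file_path);
      if (it != path_to_pool_.end()) { return it->second; }
    }

    // Unlocked: other threads can resolve their own paths concurrently.
    std::string device = resolver_(file_path);

    // Second lock: find or create the pool for this device, then cache the path.
    std::lock_guard<std::mutex> lock(mtx_);
    auto& pool = device_to_pool_[device];
    if (!pool) { pool = std::make_shared<ThreadPool>(threads_per_pool_); }
    path_to_pool_.try_emplace(file_path, pool);
    return pool;
  }

  std::size_t num_pools() const
  {
    std::lock_guard<std::mutex> lock(mtx_);
    return device_to_pool_.size();
  }

 private:
  Resolver resolver_;
  unsigned threads_per_pool_;
  mutable std::mutex mtx_;
  std::unordered_map<std::string, std::shared_ptr<ThreadPool>> path_to_pool_;
  std::unordered_map<std::string, std::shared_ptr<ThreadPool>> device_to_pool_;
};
```

With a resolver that maps two paths to the same device, both calls to `get` return the same pool, while a path on another device gets its own.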
    {
      KVIKIO_NVTX_FUNC_RANGE();

      static std::mutex mtx;
Similar to the internal function get_thread_pool_per_block_device, the public API get_block_device_info locks the mutex twice to prevent race conditions. The block device resolution itself is not locked, which avoids serialization and permits concurrent execution. Two threads may concurrently resolve the same uncached file path; in that case only one thread updates the file_path_to_info_map cache, and the second thread, holding the same result, skips the update.
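The duplicate-resolution case in this comment can be made concrete with a small cache that splits the two locked steps into separate calls. This is a sketch under assumptions: `InfoCache`, `lookup`, `publish`, and the single-field `BlockDeviceInfo` are hypothetical, and only the map name is taken from the comment. `try_emplace` leaves an existing entry untouched, which is what lets the second thread skip the update.

```cpp
#include <cstddef>
#include <mutex>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical record; the real structure's fields are not shown in this excerpt.
struct BlockDeviceInfo {
  std::string name;
};

// Hypothetical cache illustrating the two locked steps around an unlocked resolution.
class InfoCache {
 public:
  // First lock: return the cached entry, if there is one.
  std::optional<BlockDeviceInfo> lookup(std::string const& file_path) const
  {
    std::lock_guard<std::mutex> lock(mtx_);
    auto it = file_path_to_info_map_.find(file_path);
    if (it == file_path_to_info_map_.end()) { return std::nullopt; }
    return it->second;
  }

  // Second lock: store the result only if the path is still absent.
  // Returns true if this caller's result was stored, false if it was skipped.
  bool publish(std::string const& file_path, BlockDeviceInfo info)
  {
    std::lock_guard<std::mutex> lock(mtx_);
    return file_path_to_info_map_.try_emplace(file_path, std::move(info)).second;
  }

  std::size_t size() const
  {
    std::lock_guard<std::mutex> lock(mtx_);
    return file_path_to_info_map_.size();
  }

 private:
  mutable std::mutex mtx_;
  std::unordered_map<std::string, BlockDeviceInfo> file_path_to_info_map_;
};
```

If two callers both miss in `lookup` and both resolve the same path, the first `publish` returns true and the second returns false, leaving a single cache entry. The cost is one redundant resolution, which is the trade made for keeping the resolution step outside the lock.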
/merge


Summary
We have identified the root cause of #850: With a single global thread pool, I/O tasks for different drives could cluster on submission in a sequential fashion, preventing true hardware parallelism. This PR addresses the problem by creating a dedicated thread pool for each physical block device, so that each drive receives concurrent I/O requests independently and bandwidth scales as expected across multiple drives.
This PR resolves file paths to their underlying block devices and creates a new thread pool for each distinct block device used by the I/O tasks. For efficiency, KvikIO first searches for the file path in a file path --> thread pool cache and, if it is not found, resolves the device and searches for the block device in a block device --> thread pool mapping.
Usage
The new feature of per-block-device thread pool can be queried and set programmatically in C++:
It can also be controlled using the environment variable KVIKIO_THREAD_POOL_PER_BLOCK_DEVICE=1/true/yes/on (case insensitive; disabled by default).
Limitation
KVIKIO_NTHREADS now means the number of threads per pool. There is currently no way to limit the total number of threads across all pools.
Closes #850
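For reference, the case-insensitive values that enable KVIKIO_THREAD_POOL_PER_BLOCK_DEVICE (1/true/yes/on) could be parsed along the following lines. This is a minimal sketch, not KvikIO's parser: `is_truthy` and `env_flag` are hypothetical helpers, and treating every other value as "disabled" is an assumption, since the description does not say how other values are handled.

```cpp
#include <algorithm>
#include <cctype>
#include <cstdlib>
#include <string>

// Hypothetical helper: true only for 1/true/yes/on, compared case-insensitively.
// Assumption: any other value counts as disabled.
bool is_truthy(std::string value)
{
  std::transform(value.begin(), value.end(), value.begin(), [](unsigned char c) {
    return static_cast<char>(std::tolower(c));
  });
  return value == "1" || value == "true" || value == "yes" || value == "on";
}

// Hypothetical helper: read a boolean flag from the environment,
// falling back to default_value when the variable is not set.
bool env_flag(char const* name, bool default_value)
{
  char const* raw = std::getenv(name);
  if (raw == nullptr) { return default_value; }
  return is_truthy(raw);
}
```

A call such as `env_flag("KVIKIO_THREAD_POOL_PER_BLOCK_DEVICE", false)` reflects the stated default of disabled.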