Add per-block-device thread pools for multi-drive I/O parallelism #880

Merged
rapids-bot[bot] merged 10 commits into rapidsai:main from kingcrimsontianyu:per-drive-pool on Dec 10, 2025

kingcrimsontianyu (Contributor) commented Nov 27, 2025

Summary

We have identified the root cause of #850: with a single global thread pool, I/O tasks for different drives can cluster in submission order and run one drive after another, which prevents true hardware parallelism. This PR addresses the problem by creating a dedicated thread pool for each block device, so that each drive receives concurrent I/O requests independently of the others and bandwidth scales as expected across multiple drives.

This PR resolves each file path to its underlying block device and creates a new thread pool for each distinct block device used by the I/O tasks. For efficiency, KvikIO first looks up the file path in a file path --> thread pool cache. On a cache miss, it resolves the block device and looks it up in a block device --> thread pool map.
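
The two-level lookup described above can be sketched roughly as follows. This is a minimal, single-threaded illustration and not the KvikIO implementation: ThreadPool, DeviceId, PoolRegistry, and the resolve callback are all hypothetical names introduced here, and locking is omitted for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for a real thread pool; it only records its size.
struct ThreadPool {
  explicit ThreadPool(std::size_t n) : num_threads(n) {}
  std::size_t num_threads;
};

// Hypothetical block device identifier: (major, minor) device numbers.
using DeviceId = std::pair<unsigned, unsigned>;

class PoolRegistry {
 public:
  // `resolve` stands in for the file path -> block device resolution step.
  template <typename Resolver>
  std::shared_ptr<ThreadPool> get(std::string const& file_path,
                                  Resolver&& resolve,
                                  std::size_t nthreads_per_pool)
  {
    assert(nthreads_per_pool > 0);

    // Level 1: file path -> thread pool cache. A hit skips device resolution.
    auto cached = path_to_pool_.find(file_path);
    if (cached != path_to_pool_.end()) { return cached->second; }

    // Level 2: resolve the device, then consult the device -> thread pool map.
    DeviceId const dev = resolve(file_path);
    std::shared_ptr<ThreadPool>& pool = dev_to_pool_[dev];
    if (!pool) { pool = std::make_shared<ThreadPool>(nthreads_per_pool); }

    // Remember the path so the next lookup takes the fast path.
    path_to_pool_.emplace(file_path, pool);
    return pool;
  }

  std::size_t num_pools() const { return dev_to_pool_.size(); }

 private:
  std::map<std::string, std::shared_ptr<ThreadPool>> path_to_pool_;
  std::map<DeviceId, std::shared_ptr<ThreadPool>> dev_to_pool_;
};
```

Two files on the same device share one pool, while a file on another device gets its own, and a repeated path never triggers a second resolution.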

Usage

The per-block-device thread pool setting can be queried and set programmatically in C++:

kvikio::defaults::thread_pool_per_block_device();
kvikio::defaults::set_thread_pool_per_block_device(flag);

It can also be enabled by setting the environment variable KVIKIO_THREAD_POOL_PER_BLOCK_DEVICE to 1, true, yes, or on (case insensitive). The feature is disabled by default.
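
As a rough illustration of the case-insensitive handling, the sketch below parses the four enabling values listed above. The helper names is_truthy and per_block_device_enabled_from_env are hypothetical, and treating every other value as "disabled" is an assumption of this sketch. KvikIO's own parser may accept further values or reject invalid ones.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstdlib>
#include <string>

// Hypothetical helper: true only for 1/true/yes/on, ignoring case.
inline bool is_truthy(std::string value)
{
  std::transform(value.begin(), value.end(), value.begin(), [](unsigned char c) {
    return static_cast<char>(std::tolower(c));
  });
  return value == "1" || value == "true" || value == "yes" || value == "on";
}

// Hypothetical helper: an unset variable means the feature stays disabled.
inline bool per_block_device_enabled_from_env()
{
  char const* raw = std::getenv("KVIKIO_THREAD_POOL_PER_BLOCK_DEVICE");
  return raw != nullptr && is_truthy(raw);
}
```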

Limitations

  • KVIKIO_NTHREADS now means the number of threads per pool. There is currently no way to limit the total number of threads across all pools.
  • For device-mapper devices (LVM, dm-crypt), block device resolution returns the dm device ID, not the underlying physical device(s). This may be suboptimal when multiple LVs share the same underlying physical drive (over-subscription) or when a single LV is striped across multiple drives (under-utilization).

Closes #850

@kingcrimsontianyu kingcrimsontianyu added improvement Improves an existing functionality non-breaking Introduces a non-breaking change c++ Affects the C++ API of KvikIO labels Nov 27, 2025

copy-pr-bot bot commented Nov 27, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

kingcrimsontianyu (Contributor, Author) commented Nov 30, 2025

Root cause

  • Main branch

    Tasks of the same color come from the same kvikio::FileHandle::pread call on the same file handle. With a single global pool, the I/O tasks executing concurrently at any point in time all come from a single file, so only one drive is busy and hardware-level parallelism is left unused.

    [image: I/O tasks colored by originating pread call, main branch]
  • This PR

    With per-block-device thread pools, I/O tasks from different files on different NVMe drives execute concurrently, leveraging hardware-level parallelism. In this example, each thread pool uses only 1 thread for direct I/O. With more than 1 thread per pool, thread-level parallelism is achieved as well, which further increases the overall bandwidth, as shown in the performance result below.

    [image: I/O tasks colored by originating pread call, this PR]

Performance result

System setup

  • 1 x H100 GPU
  • 4 x Micron_7450_MTFDKCB1T9TFR NVMes

KvikIO sequential read

  • Setting
    • Read four 4-GiB files to the device
    • Each file is on a separate NVMe
    • All pread calls are made on the main thread in sequence
    • KvikIO compatibility mode (no GDS)
    • File open/close per iteration
    • Cold page cache, direct I/O
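
| git branch | number of files | total number of threads | bandwidth [MiB/s] |
|------------|-----------------|-------------------------|-------------------|
| main       | 1               | 1                       | 3141.3071         |
| main       | 1               | 4                       | 6120.5389         |
| main       | 4               | 4                       | 6236.5598         |
| main       | 4               | 16                      | 6236.5131         |
| this PR    | 4               | 4                       | 11257.1692        |
| this PR    | 4               | 16                      | 24384.0260        |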
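
To put the table in perspective, the ratios can be worked out directly from its numbers. The sketch below does the arithmetic at compile time. The constant names are hypothetical, and comparing the 16-thread result against four times the single-drive, 4-thread figure is an interpretation made here, not a claim from the PR.

```cpp
#include <cassert>

// Bandwidths in MiB/s, copied from the table above.
constexpr double kMainOneFile4Threads = 6120.5389;   // main, 1 file, 4 threads
constexpr double kMain4Threads        = 6236.5598;   // main, 4 files, 4 threads
constexpr double kMain16Threads       = 6236.5131;   // main, 4 files, 16 threads
constexpr double kPr4Threads          = 11257.1692;  // this PR, 4 files, 4 threads
constexpr double kPr16Threads         = 24384.0260;  // this PR, 4 files, 16 threads

constexpr double ratio(double baseline, double candidate) { return candidate / baseline; }

// Speedup of this PR over main at the same total thread count.
constexpr double kSpeedup4  = ratio(kMain4Threads, kPr4Threads);    // roughly 1.8x
constexpr double kSpeedup16 = ratio(kMain16Threads, kPr16Threads);  // roughly 3.9x

// Fraction of ideal 4-drive scaling, taking the single-drive, 4-thread
// bandwidth on main as the per-drive reference (an assumption of this sketch).
constexpr double kScalingEfficiency = ratio(4.0 * kMainOneFile4Threads, kPr16Threads);

static_assert(kSpeedup4 > 1.80 && kSpeedup4 < 1.81, "about 1.8x with 4 threads");
static_assert(kSpeedup16 > 3.90 && kSpeedup16 < 3.92, "about 3.9x with 16 threads");
static_assert(kScalingEfficiency > 0.99 && kScalingEfficiency < 1.0, "close to linear");
```

On main, going from 4 to 16 threads leaves the bandwidth essentially unchanged, whereas with this PR the 16-thread run comes close to four times the single-drive figure.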

Block device resolution performance

  • get_thread_pool_per_block_device (uncached): ~940 $\mu$s
  • get_block_device_info (uncached): ~200 $\mu$s

@kingcrimsontianyu kingcrimsontianyu marked this pull request as ready for review December 2, 2025 06:34
@kingcrimsontianyu kingcrimsontianyu requested a review from a team as a code owner December 2, 2025 06:34
@kingcrimsontianyu kingcrimsontianyu requested a review from a team as a code owner December 2, 2025 19:14

KyleFromNVIDIA (Member) left a comment

Approved trivial CMake changes

madsbk (Member) left a comment

Looks good!

In a follow-up PR, I suggest adding Python bindings for get_block_device_info() and including block device details in the Python benchmarks, for example in pprint_sys_info().

kingcrimsontianyu (Contributor, Author) commented

Additional notes

The utility function kvikio::get_block_device_info was tested using paths on different file systems. The outcomes, shown below, are all as expected.

| file path provided | file system | outcome |
|--------------------|-------------|---------|
| /raid | ext4 RAID0 | resolved the block device: md127 |
| ${HOME} | nfs4 | threw an exception: sysfs path "/sys/dev/block/0:65" for file "/mnt/nfs" does not exist. The file may reside on a virtual file system (overlayfs, tmpfs) with no backing block device. Linux system/library function call error at: /home/coder/kvikio/cpp/src/file_utils.cpp:277: No such file or directory |
| /proc | pseudo fs | threw an exception: sysfs path "/sys/dev/block/0:64" for file "/proc" does not exist. The file may reside on a virtual file system (overlayfs, tmpfs) with no backing block device. Linux system/library function call error at: /home/coder/kvikio/cpp/src/file_utils.cpp:277: No such file or directory |
| /dev/shm | pseudo fs | threw an exception: sysfs path "/sys/dev/block/0:67" for file "/dev/shm" does not exist. The file may reside on a virtual file system (overlayfs, tmpfs) with no backing block device. Linux system/library function call error at: /home/coder/kvikio/cpp/src/file_utils.cpp:277: No such file or directory |
| /tmp (within docker) | overlayfs | threw an exception: sysfs path "/sys/dev/block/0:49" for file "/tmp" does not exist. The file may reside on a virtual file system (overlayfs, tmpfs) with no backing block device. Linux system/library function call error at: /home/coder/kvikio/cpp/src/file_utils.cpp:277: No such file or directory |
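
The error messages above suggest that resolution goes through the file's device number and a sysfs entry of the form /sys/dev/block/<major>:<minor>. A minimal Linux-only sketch of that idea follows. The helper names sysfs_block_path and has_backing_block_device are hypothetical, and the actual logic in file_utils.cpp may differ.

```cpp
#include <sys/stat.h>
#include <sys/sysmacros.h>

#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical helper: builds "/sys/dev/block/<major>:<minor>" from the
// device number (st_dev) of the file system that holds `file_path`.
inline std::string sysfs_block_path(std::string const& file_path)
{
  struct stat st {};
  if (::stat(file_path.c_str(), &st) != 0) {
    throw std::runtime_error("stat failed for \"" + file_path + "\"");
  }
  return "/sys/dev/block/" + std::to_string(major(st.st_dev)) + ":" +
         std::to_string(minor(st.st_dev));
}

// Hypothetical helper: true if the sysfs entry exists, which is taken here
// to mean that the file has a backing block device.
inline bool has_backing_block_device(std::string const& file_path)
{
  struct stat st {};
  return ::stat(sysfs_block_path(file_path).c_str(), &st) == 0;
}
```

Virtual file systems such as tmpfs or overlayfs report a device number with major 0, for which no sysfs block entry exists, which matches the exceptions in the table.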

auto block_dev_info = get_block_device_info(file_path);

// Check if we already have a thread pool for this block device
std::lock_guard lock(mtx);

kingcrimsontianyu (Contributor, Author) commented

This lock is held until the function returns from the try block.

try {
// Fast path: check if this exact file path has been seen before
{
std::lock_guard lock(mtx);

kingcrimsontianyu (Contributor, Author) commented

This lock is held until the block ends. It prevents a race condition in which one thread updates file_path_to_thread_pool_map while another thread reads from it.

}

// Resolve file path to its underlying block device
auto block_dev_info = get_block_device_info(file_path);

kingcrimsontianyu (Contributor, Author) commented

This part is deliberately left unlocked to permit concurrent execution.

{
KVIKIO_NVTX_FUNC_RANGE();

static std::mutex mtx;

kingcrimsontianyu (Contributor, Author) commented

Similar to the internal function get_thread_pool_per_block_device, the public API get_block_device_info locks the mutex twice to prevent race conditions. The block device resolution itself is not locked, which avoids serialization and permits concurrent execution. Two threads may concurrently resolve the same uncached file path, in which case only one thread updates the file_path_to_info_map cache, and the other thread, holding the same result, skips the update.
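
The lock-twice pattern described in this comment can be sketched as below. ResolveCache and its resolve callback are hypothetical names used only for illustration. The real functions use a function-local static mutex and KvikIO's own types.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical cache: lock, check, unlock, resolve, lock again, insert.
template <typename Info>
class ResolveCache {
 public:
  using Resolver = std::function<Info(std::string const&)>;

  explicit ResolveCache(Resolver resolve) : resolve_(std::move(resolve))
  {
    assert(resolve_ != nullptr);
  }

  Info get(std::string const& key)
  {
    {
      // First lock: fast path for keys that were seen before.
      std::lock_guard<std::mutex> lock(mtx_);
      auto it = cache_.find(key);
      if (it != cache_.end()) { return it->second; }
    }

    // Unlocked: the slow resolution may run concurrently in several threads.
    Info info = resolve_(key);

    // Second lock: try_emplace inserts only if the key is still absent, so a
    // thread that lost the race leaves the existing entry untouched.
    std::lock_guard<std::mutex> lock(mtx_);
    return cache_.try_emplace(key, std::move(info)).first->second;
  }

  std::size_t size() const
  {
    std::lock_guard<std::mutex> lock(mtx_);
    return cache_.size();
  }

 private:
  Resolver resolve_;
  mutable std::mutex mtx_;
  std::map<std::string, Info> cache_;
};
```

The trade-off is that two threads may occasionally resolve the same path twice, in exchange for never holding the mutex during the slow resolution step.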

kingcrimsontianyu (Contributor, Author) commented

/merge

@rapids-bot rapids-bot bot merged commit ef28f97 into rapidsai:main Dec 10, 2025
77 checks passed

Development

Successfully merging this pull request may close these issues.

[Discussion] What’s the appropriate KVIKIO_NTHREADS setting for GDS?

4 participants