Description
Motivation
Currently KvikIO uses related but separate approaches to manage bounce buffers for local and remote I/O:
| Backend | Buffer strategy | Limitations |
|---|---|---|
| Local I/O (pread/pwrite) | Single buffer from global pool, synchronous | Stream wait after every chunk |
| Remote I/O (easy-handle) | Accumulate small transfers, then H2D | Same single-buffer limitation |
| Remote I/O (multi-poll, #896) | Independent multi-buffer impl | Not integrated with existing pool |
This fragmentation scatters and duplicates logic across BounceBufferPool, BounceBufferManager, and BounceBufferH2D, and it prevents I/O from overlapping with host-device memory transfers.
Proposed solution
K-way Bounce Buffer Ring
A unified, direction-agnostic ring supporting configurable parallelism. With k buffers:
- buffer[i] can be filled while buffer[i-1] transfers to/from the GPU
- Synchronization only required on wrap-around (every k operations)
- Same abstraction serves H2D (reads) and D2H (writes)
See #520 for the original double-buffering discussion.
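The wrap-around behavior described above can be illustrated with a small host-only model. This is a minimal sketch, not the implementation in #913: the names `BounceBufferRing`, `acquire`, `submit`, `synchronize`, and `count_syncs` are hypothetical, and an in-flight counter stands in for the asynchronous CUDA copies and the stream synchronization.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-only model of a k-way bounce buffer ring.
// A real ring would issue cudaMemcpyAsync in submit() and wait on the
// stream (or on recorded events) in synchronize(); here we only count.
class BounceBufferRing {
 public:
  BounceBufferRing(std::size_t k, std::size_t buffer_size)
      : buffers_(k, std::vector<char>(buffer_size)) {
    assert(k > 0);
  }

  // Returns the next buffer to fill. Synchronizes only when all k buffers
  // still have a transfer in flight, i.e. on wrap-around.
  std::vector<char>& acquire() {
    if (in_flight_ == buffers_.size()) { synchronize(); }
    return buffers_[head_];
  }

  // Marks the current buffer as having a transfer in flight and advances.
  void submit() {
    ++in_flight_;
    head_ = (head_ + 1) % buffers_.size();
  }

  // Stand-in for waiting until all in-flight transfers have completed.
  void synchronize() {
    if (in_flight_ == 0) { return; }
    in_flight_ = 0;
    ++sync_count_;
  }

  std::size_t sync_count() const { return sync_count_; }

 private:
  std::vector<std::vector<char>> buffers_;
  std::size_t head_ = 0;
  std::size_t in_flight_ = 0;
  std::size_t sync_count_ = 0;
};

// Pushes n_chunks through a ring of k buffers and reports how many
// synchronizations were needed, including the final drain.
inline std::size_t count_syncs(std::size_t k, std::size_t n_chunks) {
  BounceBufferRing ring(k, 4096);
  for (std::size_t i = 0; i < n_chunks; ++i) {
    std::vector<char>& buf = ring.acquire();
    buf[0] = static_cast<char>(i);  // stand-in for filling the buffer via I/O
    ring.submit();
  }
  ring.synchronize();  // drain remaining transfers
  return ring.sync_count();
}
```

In this model, eight chunks through a ring with k=4 need two synchronizations, while k=1 (the current single-buffer behavior) needs eight, one per chunk.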
Global CUDA Event Pool
Enables efficient stream synchronization across thread pool workers without per-operation allocation overhead.
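One way such a pool could avoid per-operation allocation is a mutex-protected free list with RAII leases. The following is a rough sketch under assumptions, not the design in #919: `EventPool`, `Lease`, and `FakeEvent` are hypothetical names, and `FakeEvent` replaces `cudaEvent_t` so the example compiles without the CUDA toolkit.

```cpp
#include <cstddef>
#include <mutex>
#include <vector>

// Stand-in for cudaEvent_t (assumption, so the sketch is host-only).
struct FakeEvent {
  int id;
};

// Hypothetical global event pool shared by thread pool workers.
class EventPool {
 public:
  // RAII handle: returns the event to the pool when it goes out of scope.
  class Lease {
   public:
    Lease(EventPool& pool, FakeEvent ev) : pool_(&pool), event_(ev) {}
    Lease(const Lease&) = delete;
    Lease& operator=(const Lease&) = delete;
    Lease(Lease&& other) noexcept : pool_(other.pool_), event_(other.event_) {
      other.pool_ = nullptr;
    }
    ~Lease() {
      if (pool_ != nullptr) { pool_->release(event_); }
    }
    FakeEvent get() const { return event_; }

   private:
    EventPool* pool_;
    FakeEvent event_;
  };

  // Reuses an idle event if one exists; otherwise creates a new one.
  Lease acquire() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (free_.empty()) {
      // Real code would create the event here, e.g. with
      // cudaEventCreateWithFlags(&ev, cudaEventDisableTiming).
      return Lease(*this, FakeEvent{created_++});
    }
    FakeEvent ev = free_.back();
    free_.pop_back();
    return Lease(*this, ev);
  }

  // Total number of events ever created by this pool.
  int created() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return created_;
  }

  // Number of events currently idle in the pool.
  std::size_t idle() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return free_.size();
  }

 private:
  void release(FakeEvent ev) {
    std::lock_guard<std::mutex> lock(mutex_);
    free_.push_back(ev);
  }

  mutable std::mutex mutex_;
  std::vector<FakeEvent> free_;
  int created_ = 0;
};
```

After a warm-up phase, `acquire()` in this sketch only pops from the free list, so steady-state operations create no new events.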
Scope
This effort focuses on optimizing the read (H2D) path. The write (D2H) path will adopt the unified ring infrastructure but remain locked to k=1, preserving current behavior. Write-path optimization (pipelining D2H with I/O) is deferred to future work due to additional complexities.
Implementation plan
Foundation
- Stream cache fix: Fix per-thread, per-context stream race condition #917
- K-way bounce buffer ring: Implement a k-way bounce buffer ring, and unify bounce buffer management #913 (WIP)
- Global CUDA event pool: Implement CUDA event pool to minimize runtime resource allocation overhead #919 (WIP)
Local I/O (depends on foundation)
- pread/pwrite backend: Use bounce buffer ring to optimize local pread #921
- io_uring backend: [26.04+ experimental] Add io_uring backend to improve I/O performance in general #870
Remote I/O (depends on foundation)
- Easy-handle backend: [WIP] Optimize easy-handle remote I/O using bounce buffer ring #916 (WIP)
- Multi-handle poll-based backend: Add a new remote I/O backend based on libcurl poll-based multi API #896
Future work beyond the scope of this issue
- Write (D2H) path optimization with k > 1