Unified k-way bounce buffer infrastructure for local and remote read #914

@kingcrimsontianyu

Description

Motivation

Currently KvikIO uses related but separate approaches to manage bounce buffers for local and remote I/O:

| Backend | Buffer strategy | Limitations |
| --- | --- | --- |
| Local I/O (`pread`/`pwrite`) | Single buffer from global pool, synchronous | Stream wait after every chunk |
| Remote I/O (easy handle) | Accumulate small transfers, then H2D | Same single-buffer limitation |
| Remote I/O (multi-poll, #896) | Independent multi-buffer implementation | Not integrated with the existing pool |

This fragmentation scatters and duplicates logic across BounceBufferPool, BounceBufferManager, and BounceBufferH2D, and prevents overlapping I/O with host-device memory transfers.

Proposed solution

K-way Bounce Buffer Ring

A unified, direction-agnostic ring supporting configurable parallelism. With k buffers:

  • Buffer `i` can be filled while buffer `i-1` transfers to/from the GPU
  • Synchronization is required only on wrap-around (every k operations)
  • The same abstraction serves both H2D (reads) and D2H (writes)

See #520 for the original double-buffering discussion.
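
As a rough illustration, here is a minimal, self-contained sketch of such a ring. The names (`BounceBufferRing`, `Slot`, `acquire`, `mark_in_flight`) are hypothetical, not KvikIO's actual API; each slot pairs a pinned staging buffer with a CUDA event, so a caller blocks only when the ring wraps around to a slot whose previous transfer is still in flight:

```cpp
// Hypothetical sketch of a k-way bounce buffer ring (illustrative names,
// not KvikIO's actual API). Error checking is omitted for brevity.
#include <cuda_runtime.h>
#include <cstddef>
#include <vector>

// Each slot pairs a pinned host staging buffer with a CUDA event that
// marks completion of the last transfer issued from that slot.
struct Slot {
  void* host{nullptr};
  cudaEvent_t done{nullptr};
};

class BounceBufferRing {
 public:
  BounceBufferRing(std::size_t k, std::size_t slot_bytes) : slots_(k) {
    for (auto& s : slots_) {
      cudaHostAlloc(&s.host, slot_bytes, cudaHostAllocDefault);
      cudaEventCreateWithFlags(&s.done, cudaEventDisableTiming);
    }
  }

  ~BounceBufferRing() {
    for (auto& s : slots_) {
      cudaEventDestroy(s.done);
      cudaFreeHost(s.host);
    }
  }

  // Hand out the next slot in round-robin order. This blocks only if the
  // transfer issued from this slot k operations ago has not yet finished:
  // the wrap-around synchronization point.
  Slot& acquire() {
    Slot& s = slots_[next_++ % slots_.size()];
    cudaEventSynchronize(s.done);  // no-op until the ring wraps
    return s;
  }

  // After enqueueing a cudaMemcpyAsync from/to the slot on `stream`,
  // record its event so the next wrap-around waits for exactly that copy.
  void mark_in_flight(Slot& s, cudaStream_t stream) {
    cudaEventRecord(s.done, stream);
  }

 private:
  std::vector<Slot> slots_;
  std::size_t next_{0};
};
```

A read worker would then loop: acquire a slot, read the next chunk into `slot.host`, issue `cudaMemcpyAsync(dst, slot.host, n, cudaMemcpyHostToDevice, stream)`, and call `mark_in_flight(slot, stream)`. With k = 2 this reduces to classic double buffering; larger k absorbs more jitter in I/O completion times. In the proposed design, the per-slot events would presumably come from the global event pool described below.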

Global CUDA Event Pool

A global pool of reusable CUDA events enables efficient stream synchronization across thread-pool workers without per-operation event creation overhead.
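
One plausible shape for such a pool, as a minimal sketch with hypothetical names (not KvikIO's actual API): events are created lazily and recycled through a mutex-guarded free list, so steady-state operation performs no `cudaEventCreate`/`cudaEventDestroy` calls.

```cpp
// Hypothetical sketch of a global CUDA event pool (illustrative names,
// not KvikIO's actual API). Error checking is omitted for brevity.
#include <cuda_runtime.h>
#include <mutex>
#include <vector>

class CudaEventPool {
 public:
  // Borrow an event, reusing a previously released one when available.
  cudaEvent_t acquire() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      if (!free_.empty()) {
        cudaEvent_t e = free_.back();
        free_.pop_back();
        return e;
      }
    }
    cudaEvent_t e{};
    cudaEventCreateWithFlags(&e, cudaEventDisableTiming);
    return e;
  }

  // Return an event so any thread-pool worker can reuse it.
  void release(cudaEvent_t e) {
    std::lock_guard<std::mutex> lock(mutex_);
    free_.push_back(e);
  }

  ~CudaEventPool() {
    for (cudaEvent_t e : free_) cudaEventDestroy(e);
  }

 private:
  std::mutex mutex_;
  std::vector<cudaEvent_t> free_;
};
```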

Scope

This effort focuses on optimizing the read (H2D) path. The write (D2H) path will adopt the unified ring infrastructure but remain locked to k = 1, preserving current behavior. Write-path optimization (pipelining D2H transfers with I/O) is deferred to future work due to its additional complexity.

Implementation plan

  • Foundation
  • Local I/O (depends on foundation)
  • Remote I/O (depends on foundation)

Future work beyond the scope of this issue

  • Write (D2H) path optimization with k > 1
