Unified k-way bounce buffer infrastructure for local and remote read

## Motivation

Currently, KvikIO manages bounce buffers for local and remote I/O with related but separate approaches:

| Backend | Buffer strategy | Limitations |
|---|---|---|
| Local I/O (pread/pwrite) | Single buffer from global pool, synchronous | Stream wait after every chunk |
| Remote I/O (easy-handle) | Accumulate small transfers, then H2D | Same single-buffer limitation |
| Remote I/O (multi-poll, #896) | Independent multi-buffer impl | Not integrated with existing pool |

This fragmentation scatters and duplicates logic across `BounceBufferPool`, `BounceBufferManager`, and `BounceBufferH2D`, and it prevents I/O from overlapping with host-device memory transfers.

## Proposed solution

### K-way Bounce Buffer Ring
A unified, direction-agnostic ring with configurable parallelism. With k buffers:
- Buffer[i] can be filled while buffer[i-1] is transferred to or from the GPU
- Synchronization is required only on wrap-around (every k operations)
- The same abstraction serves H2D (reads) and D2H (writes)

See #520 for the original double-buffering discussion.
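
To make the wrap-around rule concrete, here is a minimal Python model of the ring. It is only an illustrative sketch: `BounceRingModel`, `submit_chunk`, `drain`, and `syncs_needed` are hypothetical names, not KvikIO APIs, and the "transfer" and "synchronize" steps are plain counters standing in for asynchronous copies and stream waits.

```python
class BounceRingModel:
    """Toy model of a k-way bounce buffer ring; no real I/O or CUDA calls."""

    def __init__(self, k: int) -> None:
        if k < 1:
            raise ValueError("k must be at least 1")
        self.k = k
        self.pending = [False] * k  # True while a slot's transfer is unsynchronized
        self.cursor = 0             # index of the next slot to fill
        self.sync_count = 0         # number of blocking waits performed

    def _synchronize(self) -> None:
        # Stand-in for a blocking wait on the stream (assumption of this model).
        self.sync_count += 1
        self.pending = [False] * self.k

    def submit_chunk(self) -> int:
        """Fill the current slot, enqueue its transfer, and advance the ring."""
        slot = self.cursor
        assert not self.pending[slot], "slot reused before its transfer finished"
        # Real code would fill the host buffer and enqueue an async copy here.
        self.pending[slot] = True
        self.cursor = (self.cursor + 1) % self.k
        if self.cursor == 0:
            # Wrap-around: the next fill would reuse slot 0, so wait first.
            self._synchronize()
        return slot

    def drain(self) -> None:
        """Wait for any transfers still pending after the last chunk."""
        if any(self.pending):
            self._synchronize()


def syncs_needed(num_chunks: int, k: int) -> int:
    """Count the blocking waits the model performs for num_chunks chunks."""
    ring = BounceRingModel(k)
    for _ in range(num_chunks):
        ring.submit_chunk()
    ring.drain()
    return ring.sync_count
```

In this model, 8 chunks cost 8 waits with k=1 (a wait after every chunk) and 2 waits with k=4. A real implementation could use finer-grained per-slot synchronization; the model only mirrors the "every k operations" rule stated above.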

### Global CUDA Event Pool
A global pool of reusable CUDA events lets thread pool workers synchronize streams efficiently, without per-operation allocation overhead.
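
The pooling idea can be sketched in a few lines of Python. This is a hedged illustration rather than the proposed design: `EventPool`, `acquire`, `release`, and `borrow` are hypothetical names, and the `factory` argument stands in for whatever creates a real CUDA event.

```python
import threading
from contextlib import contextmanager


class EventPool:
    """Thread-safe pool that reuses event objects instead of recreating them."""

    def __init__(self, factory) -> None:
        self._factory = factory        # hypothetical: creates one new event
        self._free = []                # events available for reuse
        self._lock = threading.Lock()  # guards _free and created
        self.created = 0               # how many events were ever created

    def acquire(self):
        """Return a pooled event, creating one only if the pool is empty."""
        with self._lock:
            if self._free:
                return self._free.pop()
            self.created += 1
        # Create outside the lock so other workers are not blocked.
        return self._factory()

    def release(self, event) -> None:
        """Hand an event back so a later operation can reuse it."""
        with self._lock:
            self._free.append(event)

    @contextmanager
    def borrow(self):
        """Acquire an event for one operation and always return it."""
        event = self.acquire()
        try:
            yield event
        finally:
            self.release(event)
```

With this shape, a worker would wrap each operation in `with pool.borrow() as event:`. Sequential operations reuse the same event, and the pool grows only to the peak number of events held at the same time.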

## Scope

This effort focuses on optimizing the read (H2D) path. The write (D2H) path will adopt the unified ring infrastructure but remain fixed at k=1, which preserves current behavior. Write-path optimization (pipelining D2H with I/O) is deferred to future work because it introduces additional complexities.

## Implementation plan

**Foundation**
- [x] Stream cache fix: #917
- [ ] K-way bounce buffer ring: #913 (WIP)
- [ ] Global CUDA event pool: #919 (WIP)

**Local I/O** (depends on foundation)
- [ ] pread/pwrite backend: #921
- [ ] io_uring backend: #870

**Remote I/O** (depends on foundation)
- [ ] Easy-handle backend: #916 (WIP)
- [ ] Multi-handle poll-based backend: #896

## Future work beyond the scope of this issue

Write (D2H) path optimization with k > 1

Unified k-way bounce buffer infrastructure for local and remote read #914