Description
Hi cudf Team,
I’m experimenting with I/O pipelining for read_parquet in multithreaded/multistream workflows, using cudf/cpp/examples/parquet_io/parquet_io_multithreaded.cpp as a starting point.
Before asking my questions, I reviewed #16936, so my questions focus on achieving efficient pipelining for the non-first read batches, since the first batch’s behavior seems hard to control given that issue.
- Also, in the profiling figure from issue [FEA] Add synchronization for IO between read_parquet calls on different threads (#16936), Batch 2 and Batch 3 appear to begin nvCOMP decompression immediately, with no visible I/O stage such as read calls.
What is your question?
My question builds on that existing issue, but focuses on pipelining read I/O with nvCOMP kernels. The attached profiling results show clear pipelining between I/O and nvCOMP computation.
Test setup: using the standard parquet_io_multithreaded.cpp, I ran: ./parquet_io_multithreaded SNAPPY.parquet 10 FILEPATH 1 3
This configures 3 threads/streams to call read_parquet 10 times in total, with thread 0 handling 4 reads and threads 1-2 handling 3 reads each. The uneven workload distribution appears to prevent effective I/O pipelining.
Profiling Observations:
In the attached nsys profile overview, you can see the 3 threads/streams.
- The first read batch (~1.0 s per thread) can be ignored due to issue [FEA] Add synchronization for IO between read_parquet calls on different threads (#16936).
- I’ll then highlight the I/O patterns in the second read batch to demonstrate where pipelining breaks.
Thread0
We can see the dense blue lines, which are cuFile reads from the SSD. Notably, thread 0 does not show any cuFile reads at the start of its read_parquet call.
Threads 1 & 2 -> Questions
Now we come to the key issue I want to ask about: threads 1 and 2 each have two cuFile read ranges appearing simultaneously. This suggests that KvikIO is handling I/O for both read_parquet calls at the same time, meaning there is no I/O pipelining, ordering, or priority between the two threads in their second read batch.
My question is: what is the standard approach to prevent such I/O overlap between multiple read_parquet calls? If I/O operations overlap, there is no exclusive ownership of I/O, which could lead to slower performance than a scenario where each thread has exclusive access. This also sounds related to issue #16936.
I should note that in most cases, I/O pipelining works as expected—after the initial read batch, only one thread performs I/O at a time. However, this profiling result is an uncommon case (possibly difficult to reproduce due to nondeterminism) that I encountered, and I wanted to ask about it.



