Description
Hi cudf Team,
I’m experimenting with I/O pipelining for read_parquet in multithreaded/multistream workflows, using cudf/cpp/examples/parquet_io/parquet_io_multithreaded.cpp as a starting point.
Before asking my questions, I reviewed #16936, so my questions focus on achieving efficient pipelining for the non-first read batches, since the first batch’s behavior seems hard to control given that issue.
- Also, in the profiling figure from issue [FEA] Add synchronization for IO between read_parquet calls on different threads (#16936), Batch 2 and Batch 3 appear to begin nvCOMP decompression immediately, with no visible I/O stage such as read calls.
What is your question?
My question builds on that existing issue, but focuses on pipelining read I/O with nvCOMP kernels. The attached profiling results show clear pipelining between I/O and nvCOMP computation.
Test setup: using the standard parquet_io_multithreaded.cpp, I ran: ./parquet_io_multithreaded SNAPPY.parquet 10 FILEPATH 1 3
This configures 3 threads/streams to call read_parquet 10 times in total, with thread 0 handling 4 reads and threads 1-2 handling 3 reads each. The uneven workload distribution appears to prevent effective I/O pipelining.
Profiling Observations:
In the attached nsys profile overview, you can see the 3 threads/streams.
- The first read batch (~1.0 s per thread) can be ignored due to issue [FEA] Add synchronization for IO between read_parquet calls on different threads (#16936).
- I’ll then highlight the I/O patterns in the second read batch to demonstrate where pipelining breaks.
Thread0
We can see the dense blue lines, which are cuFile reads from the SSD. Notably, thread 0 does not show any cuFile reads at the start of its read_parquet call.
Threads 1 & 2 -> Questions
Now we come to the key issue I want to ask about: threads 1 and 2 each have two cuFile read ranges appearing simultaneously. This suggests that KvikIO is handling I/O for both read_parquet calls at the same time, meaning there is no I/O pipelining, ordering, or priority between the two threads in their second read batch.
My question is: what is the standard approach to prevent such I/O overlap between multiple read_parquet calls? If I/O operations overlap, there is no exclusive ownership of I/O, which could lead to slower performance than a scenario where each thread has exclusive access. This also sounds related to issue #16936.
I should note that in most cases, I/O pipelining works as expected—after the initial read batch, only one thread performs I/O at a time. However, this profiling result is an uncommon case (possibly difficult to reproduce due to nondeterminism) that I encountered, and I wanted to ask about it.



