
[QST] How to achieve *stable* I/O Pipelining for read_parquet with Multithreaded/Multistream #17873

@JigaoLuo

Description


Hi cudf Team,

I’m experimenting with I/O pipelining for read_parquet in multithreaded/multistream workflows, using cudf/cpp/examples/parquet_io/parquet_io_multithreaded.cpp as a starting point.
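For context, the pattern I am following is roughly the one below: each worker thread issues its own read_parquet call on a stream taken from an `rmm::cuda_stream_pool`. This is only a minimal sketch of how I understand the example, not the example's actual code; the function name and argument handling are placeholders.

```cpp
#include <cudf/io/parquet.hpp>
#include <rmm/cuda_stream_pool.hpp>

#include <string>
#include <thread>
#include <vector>

// Each worker thread reads the same parquet file on its own CUDA stream.
void read_on_own_streams(std::string const& filepath, int num_threads)
{
  rmm::cuda_stream_pool stream_pool(num_threads);

  std::vector<std::thread> workers;
  for (int t = 0; t < num_threads; ++t) {
    workers.emplace_back([&stream_pool, &filepath, t] {
      auto stream  = stream_pool.get_stream(t);
      auto options =
        cudf::io::parquet_reader_options::builder(cudf::io::source_info{filepath}).build();
      // Decode work is enqueued on this thread's stream; the host-side I/O is
      // issued internally (KvikIO/cuFile), which is what shows up in the profile.
      [[maybe_unused]] auto result = cudf::io::read_parquet(options, stream);
      stream.synchronize();
    });
  }
  for (auto& w : workers) { w.join(); }
}
```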

Before asking, I reviewed #16936, so my questions focus on achieving efficient pipelining for the non-first read batches, since the first batch’s behavior seems hard to control given that issue.

What is your question?

My question builds on that existing issue, but focuses on pipelining read I/O with nvCOMP decompression kernels. The attached profiling results show clear pipelining between I/O and nvCOMP computation.

Test setup: using the standard parquet_io_multithreaded.cpp, I ran `./parquet_io_multithreaded SNAPPY.parquet 10 FILEPATH 1 3`.
This configures 3 threads/streams to call read_parquet 10 times in total, with thread 0 handling 4 reads and threads 1–2 handling 3 reads each (see the sketch below). The uneven workload distribution appears to prevent effective I/O pipelining.
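For clarity, the uneven split simply comes from distributing 10 reads round-robin over 3 threads; the tiny sketch below only illustrates that arithmetic (variable names are mine, not from the example).

```cpp
#include <cstdio>
#include <vector>

int main()
{
  int const num_reads   = 10;
  int const num_threads = 3;

  // Round-robin assignment: read i goes to thread i % num_threads.
  std::vector<int> reads_per_thread(num_threads, 0);
  for (int i = 0; i < num_reads; ++i) { ++reads_per_thread[i % num_threads]; }

  // Prints: thread 0 handles 4 reads, threads 1 and 2 handle 3 reads each.
  for (int t = 0; t < num_threads; ++t) {
    std::printf("thread %d handles %d reads\n", t, reads_per_thread[t]);
  }
  return 0;
}
```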

Profiling Observations:
In the attached nsys profile overview, you can see the 3 threads/streams.

Attached screenshots: Overview; Thread0; Thread0, Second Read Batch.

We can see the dense blue lines, which are cuFile reads from the SSD. Notably, thread 0 does not show any cuFile reads at the start of its read_parquet call.

Thread1 & Thread2 -> Questions

Attached screenshots: Thread1, Second Read Batch; Thread2, Second Read Batch.

Now we come to the key issue I want to ask about: in their second read batch, threads 1 and 2 each show a cuFile read range, and the two ranges appear simultaneously. This suggests that KvikIO is handling I/O for both read_parquet calls at the same time, meaning there is no I/O pipelining, ordering, or priority between the two threads.

My question is: what is the standard approach to prevent such I/O overlap between multiple read_parquet calls? When I/O operations overlap, neither thread has exclusive ownership of the I/O, which could lead to slower performance compared to a scenario where each thread gets exclusive access in turn. This also sounds related to issue #16936.
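For reference, the only application-side way I can see to force exclusive I/O ownership through the public API is something blunt like serializing the whole read_parquet call, as in the hypothetical sketch below (`read_mutex` and `serialized_read` are made-up names). That obviously also removes any compute/I/O pipelining between threads, which is exactly why I am asking what the standard approach is.

```cpp
#include <cudf/io/parquet.hpp>
#include <rmm/cuda_stream_view.hpp>

#include <mutex>
#include <string>

std::mutex read_mutex;  // made-up name, shared by all reader threads

// Serializes the whole read_parquet call, which trivially prevents overlapping
// cuFile reads but also prevents any overlap between one thread's decode and
// another thread's I/O.
cudf::io::table_with_metadata serialized_read(std::string const& filepath,
                                              rmm::cuda_stream_view stream)
{
  std::lock_guard<std::mutex> lock(read_mutex);
  auto options =
    cudf::io::parquet_reader_options::builder(cudf::io::source_info{filepath}).build();
  return cudf::io::read_parquet(options, stream);
}
```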

I should note that in most cases, I/O pipelining works as expected: after the initial read batch, only one thread performs I/O at a time. However, this profiling result is an uncommon case (possibly difficult to reproduce due to nondeterminism) that I encountered, and I wanted to ask about it.
