Hi cuIO team,
I'm writing to report a performance issue I discovered while building a pipeline for Parquet reading. I was struggling with this pipeline yesterday and raised some questions in #18268. Since then, I've dug deeper into the problem and revisited my earlier questions.
I have a setup with:
- NVIDIA A100-SXM4-40GB
The Problem with the Pipeline
One persistent issue I've noticed is the peculiar behavior of the pipeline: sometimes, when reading Parquet files, the operations within each stream cluster together, preventing a stable I/O and compute pipeline from forming.
Profiling and Experiments
To investigate this issue further, I conducted profiling. The main difference from my previous experiment in #18268 is that I used 3 threads/streams to better saturate the GPU. All other aspects of the experiment remained the same as in the previous issue.
I closely examined how the three threads interacted in the profiling data. What struck me as particularly strange was the large, prominent FileHandle::pread() call and the stream synchronization beneath it. This part of the code comes from KvikIO, and the function calls are highlighted in different colors for easier identification. For instance, in Thread 1, this section was quite noticeable:

I hypothesized that the long duration of this read operation was due to stream synchronization. So, I tried to identify other threads that were also performing synchronization at the same time. I discovered that Thread 3 was sorting data and synchronizing the default stream upon completion of the sorting operation:

The alignment of the stream synchronizations in Thread 1 and Thread 3 made me question why the FileHandle::pread() function needed to synchronize with the default stream.
Locating the Synchronization in the Code
Finally, I found the cause: cuDF's datasource calls KvikIO's pread implementation, FileHandle::pread, whose sync_default_stream parameter defaults to true, so the read first synchronizes with the default stream. In my case, this meant the I/O operation had to wait for the previous sorting operation to finish before it could start:
cudf/cpp/src/io/utilities/datasource.cpp (line 113 at 79109d4):

```cpp
return _kvikio_handle.pread(dst, read_size, offset);
```
I plan to submit a simple PR to address this issue. However, I'm not certain if it will improve performance in all cases, as the behavior of the pipeline doesn't seem to be entirely deterministic. Thank you for your attention to this matter. I look forward to your feedback and suggestions.