
[FEA] Unnecessary StreamSynchronization in reading hinders the pipeline establishment #18278

@JigaoLuo


Hi cuIO team,

I'm writing to report a performance issue I discovered while building a pipeline for Parquet reading. I was struggling with this pipeline yesterday and raised some questions in #18268; since then I've dug deeper into the problem and revisited my earlier inquiries.

I have a setup with:

  • NVIDIA A100-SXM4-40GB

The Problem with the Pipeline

One persistent issue I've noticed is the pipeline's peculiar behavior: sometimes, when reading Parquet files, the operations within each stream cluster together instead of overlapping, which prevents a stable I/O-and-compute pipeline from being established.
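To make the intended behavior concrete, here is a host-side analogy of the pipeline I'm trying to establish (plain C++ with std::async, not cuDF code; all names are illustrative): while chunk i is being computed, chunk i+1 is already being read on another thread.

```cpp
#include <future>
#include <numeric>
#include <vector>

// Stand-in for FileHandle::pread(): "reads" chunk i from storage.
std::vector<int> read_chunk(int i)
{
  return std::vector<int>(1024, i);
}

// Stand-in for the GPU compute stage.
long long compute(const std::vector<int>& v)
{
  return std::accumulate(v.begin(), v.end(), 0LL);
}

long long pipelined_sum(int num_chunks)
{
  long long total = 0;
  // Prefetch the first chunk, then keep one read in flight while computing.
  auto next = std::async(std::launch::async, read_chunk, 0);
  for (int i = 0; i < num_chunks; ++i) {
    auto chunk = next.get();
    if (i + 1 < num_chunks)
      next = std::async(std::launch::async, read_chunk, i + 1);  // overlaps compute below
    total += compute(chunk);
  }
  return total;
}
```

When a read is forced to wait on an unrelated synchronization point, this overlap disappears and the stages execute back-to-back instead.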

Profiling and Experiments

To investigate this issue further, I profiled the workload. The main difference from my previous experiment in #18268 is that I used 3 threads/streams to better saturate the GPU; all other aspects of the experiment remained the same as in the previous issue.

I closely examined how the 3 threads interact in the profiling data. What struck me as strange was the long, prominent FileHandle::pread() call and the stream synchronization beneath it. This part of the code comes from KvikIO, and the function calls are highlighted in different colors for easier identification. In Thread 1, for instance, this section was quite noticeable:
[Screenshot: Thread 1 timeline with a long FileHandle::pread() call and the stream synchronization beneath it]

I hypothesized that the long duration of this read operation was due to stream synchronization. So, I tried to identify other threads that were also performing synchronization at the same time. I discovered that Thread 3 was sorting data and synchronizing the default stream upon completion of the sorting operation:
[Screenshot: Thread 3 timeline showing the sort followed by a default-stream synchronization]

The alignment of the stream synchronizations in Thread 1 and Thread 3 made me question why FileHandle::pread() needs to synchronize with the default stream at all.
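To illustrate the coupling I suspect, here is a host-side analogy in plain C++ (not CUDA code; all names are illustrative): the default stream is modeled as a shared completion flag, and a read that synchronizes with it cannot start until the unrelated sort has finished.

```cpp
#include <condition_variable>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Models the serialization: Thread 3's "sort" signals a shared
// synchronization point, and Thread 1's "read" blocks on that same point
// before it may begin -- the coupling that syncing the default stream creates.
std::vector<std::string> trace_serialized_read()
{
  std::mutex m;
  std::condition_variable cv;
  bool sort_done = false;
  std::vector<std::string> events;

  std::thread sorter([&] {
    // Stand-in for the sort running on the default stream.
    {
      std::lock_guard<std::mutex> lk(m);
      events.push_back("sort_done");
      sort_done = true;
    }
    cv.notify_all();
  });

  std::thread reader([&] {
    // Stand-in for pread() when it synchronizes with the default stream:
    // it must wait for the sort even though the read does not depend on it.
    std::unique_lock<std::mutex> lk(m);
    cv.wait(lk, [&] { return sort_done; });
    events.push_back("read_start");
  });

  sorter.join();
  reader.join();
  return events;
}
```

In this model the read always starts after the sort completes, which matches the aligned synchronizations in the profile.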

Locating the Synchronization in the Code

Finally, I found that in both cuDF's call site and KvikIO's FileHandle::pread implementation, the sync_default_stream parameter defaults to true, so the read synchronizes with the default stream. In my case, this meant the I/O operation had to wait for the preceding sorting operation to finish before it could start:

return _kvikio_handle.pread(dst, read_size, offset);

https://github.com/rapidsai/kvikio/blob/3adbe7e6435c728dc8a03aba97094519f2a299c5/cpp/include/kvikio/file_handle.hpp#L256
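A minimal sketch of the change I have in mind, assuming the FileHandle::pread signature in the header linked above (the intermediate task_size and gds_threshold arguments must be passed explicitly to reach the trailing sync_default_stream parameter). This is an untested illustration, not the actual PR:

```cpp
#include <cstddef>
#include <future>
#include <kvikio/defaults.hpp>
#include <kvikio/file_handle.hpp>

// Hypothetical cuDF-side helper: opt out of the default-stream
// synchronization that FileHandle::pread() performs by default.
std::size_t pread_no_default_stream_sync(kvikio::FileHandle& handle,
                                         void* dst,
                                         std::size_t read_size,
                                         std::size_t offset)
{
  auto fut = handle.pread(dst,
                          read_size,
                          offset,
                          kvikio::defaults::task_size(),
                          kvikio::defaults::gds_threshold(),
                          /*sync_default_stream=*/false);
  return fut.get();  // waits for the read itself, not for unrelated kernels
}
```

The caller then becomes responsible for ordering: it must guarantee that any kernel producing data the read depends on has completed, which is part of why I'm not certain the change is safe in every case.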

I plan to submit a simple PR to address this issue. However, I'm not certain if it will improve performance in all cases, as the behavior of the pipeline doesn't seem to be entirely deterministic. Thank you for your attention to this matter. I look forward to your feedback and suggestions.
