Hi cuIO team,
I'm writing to report a performance issue I discovered while building a pipeline for Parquet reading. I was struggling with this pipeline yesterday and raised some questions in #18268. Since then, I've dug deeper into the problem and revisited my earlier questions.
I have a setup with:
- NVIDIA A100-SXM4-40GB
The Problem with the Pipeline
One persistent issue I've noticed is the peculiar behavior of the pipeline: sometimes, when reading Parquet files, the operations within each stream cluster together, preventing a stable I/O and compute pipeline from forming.
Profiling and Experiments
To investigate this issue further, I conducted profiling. The main difference from my previous experiment in #18268 is that I used 3 threads/streams to better saturate the GPU. All other aspects of the experiment remained the same as in the previous issue.
I closely examined how the three threads interacted in the profiling data. What struck me as particularly strange was the large, prominent FileHandle::pread() call and the stream synchronization beneath it. This part of the code comes from KvikIO, and the function calls are highlighted in different colors for easier identification. For instance, in Thread 1, this section was quite noticeable:

I hypothesized that the long duration of this read operation was due to stream synchronization. So, I tried to identify other threads that were also performing synchronization at the same time. I discovered that Thread 3 was sorting data and synchronizing the default stream upon completion of the sorting operation:

The alignment of the stream synchronizations in Thread 1 and Thread 3 made me question why the FileHandle::pread() function needed to synchronize with the default stream.
Locating the Synchronization in the Code
Finally, I found the cause: cuDF's datasource calls KvikIO's pread implementation, FileHandle::pread, whose sync_default_stream parameter defaults to true, so the read first synchronizes with the default stream. In my case, this meant the I/O operation had to wait for the previous sorting operation to finish before it could start:
cudf/cpp/src/io/utilities/datasource.cpp (line 113 at 79109d4):

```cpp
return _kvikio_handle.pread(dst, read_size, offset);
```
I plan to submit a simple PR to address this issue. However, I'm not certain if it will improve performance in all cases, as the behavior of the pipeline doesn't seem to be entirely deterministic. Thank you for your attention to this matter. I look forward to your feedback and suggestions.