[Parquet] Reduce one copy in SerializedPageReader
#8745
Conversation
alamb
left a comment
Thank you @XiangpengHao and @MikeWalrus ❤️
I don't think the parquet crate should be setting a global allocator, but otherwise this PR looks great to me
I think this change is justified on code simplification grounds alone even if we can't measure a performance difference
I am also going to explore the potential decompressor optimization in a follow-on PR
parquet/Cargo.toml
Outdated
rand = { version = "0.9", default-features = false, features = ["std", "std_rng", "thread_rng"] }
object_store = { version = "0.12.0", default-features = false, features = ["azure", "fs"] }
sysinfo = { version = "0.37.1", default-features = false, features = ["system"] }
mimalloc = { version = "*" }
I don't think we should add this in the parquet crate as it will conflict with downstream crates that want to use a different allocator
This is only a dev dependency, used to make the benchmarks more accurate
Let's get some benchmark results with the different allocator
However, I still don't think it is a good idea to use a non-system allocator for dev/benchmarks as it will make the benchmarks potentially farther from what the (average) user actually experiences
Makes sense, I've removed mimalloc
  let decompressed_size = uncompressed_page_size - offset;
  let mut decompressed = Vec::with_capacity(uncompressed_page_size);
- decompressed.extend_from_slice(&buffer.as_ref()[..offset]);
+ decompressed.extend_from_slice(&buffer[..offset]);
This seems an unrelated (but nice) cleanup
While looking at this code, it seems like it always copies the compressed bytes, even when it then decompresses them immediately. I'll make a small PR to see if I can remove that unnecessary copy
        read
    ));
}
let buffer = self.reader.get_bytes(data_start, data_len)?;
I confirm on review this can potentially avoid a copy if the underlying reader is already Bytes
🤖: Benchmark completed (Details)

🤖: Benchmark completed (Details)

🤖: Benchmark completed (Details)
Looks like we have some meaningful perf difference here. But I'm not sure if this is because of the reduced copy or because of mimalloc; should we run it again?

Will rerun
alamb
left a comment
Thank you again @XiangpengHao and @MikeWalrus
🤖: Benchmark completed (Details)

🤖: Benchmark completed (Details)
The regression is indeed related to memory allocation (brk, page faults, etc.), but I'm not sure if it's due to the read buffers being held. Those buffers are small and freed quickly. I traced that query and made a visualization. Interactive version. Script and raw traces. @XiangpengHao @alamb Any thoughts?
So in my mind the most recent benchmark results show a performance improvement for this branch.

In an ideal world, benchmark results would be 100% reproducible and free of noise. However, in the real world, especially on the "machine" I am using to benchmark (a VM), there are many sources of noise in the measurements.

So while I applaud our efforts here to be scientific, I also think this has passed the level of scrutiny needed. From my perspective, the code after this PR is clearly doing less work and shows improvements in the benchmarks (even if there is some noise), thus it is a net improvement over what is on main
I agree 100% with this

Let's merge this one in and keep making the code better going forward

@alamb I agree. Thanks!
Thank you for the (great!) find |






This was originally found by @MikeWalrus
Basically the ChunkReader for the async reader is ColumnChunkData:

arrow-rs/parquet/src/arrow/in_memory_row_group.rs, lines 282 to 292 in 2eabb59

which is itself backed by Bytes. The original implementation copied the data out of it, only to later wrap the copy in a new Bytes. This PR removes that copy.

Normally this should mean performance improvements across the board, but here are the nuances:

tldr: with mimalloc, it always improves performance, or is at least as fast as the original implementation, tested locally with arrow_reader_clickbench.

cc @tustvold and @alamb who might know this better