-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-36879][SQL][FOLLOWUP] Address comments and fix code style #35212
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@sunchao, @LuciferYang, @MaxGekk |
|
@parthchandra We can open a new JIRA for this. BTW did you use GitHub workflow to get the bench results? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for @sunchao comments. In addition, @parthchandra , you need to run Java 8/Java 11/Java 17 together and attach here. Currently only Java 11 result exists. Are the others updated already in the previous PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a regression from 81x to 66x. Is this expected?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto. Does this mean a regression from 80x to 40x? Two times slower than before?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the indent correct for these brackets?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not in this PR, but the table name looks like a general parquet table. If this is particular for delta binary, it seems better to have a specified table name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I think I will rename the case instead of the table. Instead of SQL Parquet MR (Delta Binary) make it SQL Parquet V2 MR. I can do that in a subsequent PR as I add the remaining delta encodings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should compare every scene for Parquet data page V1 and Parquet data page V2 in DataSourceReadBenchmark, if it is only for the data type 'delta binary', maybe we should add a new benchmark file
what do you think @sunchao @dongjoon-hyun @viirya @parthchandra
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not in this pr, but I used GA to run a comparison for parquet V1 and V2 and I found that in scenes Single INT Column and Single BIGINT Column, V2 is obviously slower than V1 now:
OpenJDK 64-Bit Server VM 1.8.0_312-b07 on Linux 5.11.0-1025-azure
Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
SQL Single INT Column Scan: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
SQL CSV 19203 19257 76 0.8 1220.9 1.0X
SQL Json 11971 11982 15 1.3 761.1 1.6X
SQL Parquet Vectorized: DataPageV1 195 201 8 80.7 12.4 98.5X
SQL Parquet MR: DataPageV1 2665 2680 20 5.9 169.4 7.2X
SQL Parquet Vectorized: DataPageV2 377 390 17 41.8 23.9 51.0X
SQL Parquet MR: DataPageV2 2326 2356 43 6.8 147.9 8.3X
SQL ORC Vectorized 302 316 9 52.0 19.2 63.5X
SQL ORC MR 2168 2178 14 7.3 137.8 8.9X
OpenJDK 64-Bit Server VM 1.8.0_312-b07 on Linux 5.11.0-1025-azure
Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
Parquet Reader Single INT Column Scan: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
---------------------------------------------------------------------------------------------------------------------------
ParquetReader Vectorized: DataPageV1 289 296 7 54.4 18.4 1.0X
ParquetReader Vectorized -> Row: DataPageV1 283 292 9 55.6 18.0 1.0X
ParquetReader Vectorized: DataPageV2 460 470 7 34.2 29.2 0.6X
ParquetReader Vectorized -> Row: DataPageV2 461 474 15 34.1 29.3 0.6X
OpenJDK 64-Bit Server VM 1.8.0_312-b07 on Linux 5.11.0-1025-azure
Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
SQL Single BIGINT Column Scan: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
SQL CSV 24759 24790 45 0.6 1574.1 1.0X
SQL Json 15345 15393 67 1.0 975.6 1.6X
SQL Parquet Vectorized: DataPageV1 304 314 12 51.8 19.3 81.5X
SQL Parquet MR: DataPageV1 2798 2824 38 5.6 177.9 8.9X
SQL Parquet Vectorized: DataPageV2 600 618 24 26.2 38.2 41.2X
SQL Parquet MR: DataPageV2 2423 2450 38 6.5 154.1 10.2X
SQL ORC Vectorized 376 381 3 41.9 23.9 65.9X
SQL ORC MR 2245 2260 21 7.0 142.7 11.0X
OpenJDK 64-Bit Server VM 1.8.0_312-b07 on Linux 5.11.0-1025-azure
Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
Parquet Reader Single BIGINT Column Scan: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
---------------------------------------------------------------------------------------------------------------------------
ParquetReader Vectorized: DataPageV1 372 423 96 42.2 23.7 1.0X
ParquetReader Vectorized -> Row: DataPageV1 376 382 7 41.8 23.9 1.0X
ParquetReader Vectorized: DataPageV2 678 694 23 23.2 43.1 0.5X
ParquetReader Vectorized -> Row: DataPageV2 673 686 13 23.4 42.8 0.6X
That's why I think we should compare every scene for V1 and V2, not just Delta Binary for V2, this may make the goal of subsequent optimization clearer
fa7f76f to
74a4a5d
Compare
91d939a to
f6681e8
Compare
Opened a new Jira - https://issues.apache.org/jira/browse/SPARK-37912 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto. This looks like roughly 3 times slower than before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto.
|
@dongjoon-hyun the benchmark results are not what we were hoping for. But it looks like Delta encoding is slower than RLE for the benchmark data. I tried the same with sorted data and delta encoding became 2x faster. However, with sorted data RLE also became marginally faster. In both cases RLE remained faster than Delta. |
|
Thank you for the confirmation, @parthchandra . Got it~ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For BooleanType, Parquet data page V2 will uses RLE encoding instead of Delta Binary encoding, so we shouldn't mark Delta Binary here.
|
PR 35226 supersedes this benchmark so I'm reverting the PR to just address the review comments. |
f6681e8 to
13fb899
Compare
|
Reverted the benchmark and results. This PR now only addresses the review comments left over from PR #34471 |
|
@parthchandra in this case, I think it's better to still reuse the original JIRA SPARK-36879 and mark this as a follow-up. We should also update the PR description accordingly. |
sunchao
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM pending on the PR title change.
13fb899 to
e350546
Compare
|
Changed the description in the PR as well as the commit |
|
@parthchandra could you update the PR description as well? |
…or the vectorized path
e350546 to
8b7e391
Compare
|
@sunchao done |
I think this doesn't add benchmark and results now. The PR description looks out of dated. |
+1 , should update the pr description @parthchandra |
LuciferYang
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM +1
|
@viirya, @LuciferYang Updated the description. |
|
Thanks @parthchandra , merged to master! |
What changes were proposed in this pull request?
Addresses some formatting changes that were requested in a previous PR (after it was merged).
Why are the changes needed?
Review comments addressed
Does this PR introduce any user-facing change?
no
How was this patch tested?
Not needed. Existing unit tests pass.