Skip to content

Conversation

@parthchandra
Copy link
Contributor

@parthchandra parthchandra commented Jan 14, 2022

What changes were proposed in this pull request?

Addresses some formatting changes that were requested in a previous PR (after it was merged).

Why are the changes needed?

Review comments addressed

Does this PR introduce any user-facing change?

no

How was this patch tested?

Not needed. Existing unit tests pass.

@github-actions github-actions bot added the SQL label Jan 14, 2022
@parthchandra
Copy link
Contributor Author

@sunchao, @LuciferYang, @MaxGekk
Added benchmark and addressed the PR comments that came in after the PR was merged.

@sunchao
Copy link
Member

sunchao commented Jan 14, 2022

@parthchandra We can open a new JIRA for this. BTW did you use GitHub workflow to get the bench results?

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for @sunchao comments. In addition, @parthchandra , you need to run Java 8/Java 11/Java 17 together and attach here. Currently only Java 11 result exists. Are the others updated already in the previous PR?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a regression from 81x to 66x. Is this expected?

Copy link
Member

@dongjoon-hyun dongjoon-hyun Jan 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto. Does this mean a regression from 80x to 40x? Two times slower than before?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-36879][SQL] Add benchmark for vectorized delta binary packed encoding [SPARK-36879][SQL][TESTS] Add benchmark for vectorized delta binary packed encoding Jan 14, 2022
Comment on lines 280 to 281
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the indent correct for these brackets?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not in this PR, but the table name looks like a general parquet table. If this is particular for delta binary, it seems better to have a specified table name.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I think I will rename the case instead of the table. Instead of SQL Parquet MR (Delta Binary) make it SQL Parquet V2 MR. I can do that in a subsequent PR as I add the remaining delta encodings.

Copy link
Contributor

@LuciferYang LuciferYang Jan 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should compare every scene for Parquet data page V1 and Parquet data page V2 in DataSourceReadBenchmark, if it is only for the data type 'delta binary', maybe we should add a new benchmark file

what do you think @sunchao @dongjoon-hyun @viirya @parthchandra

Copy link
Contributor

@LuciferYang LuciferYang Jan 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not in this pr, but I used GA to run a comparison for parquet V1 and V2 and I found that in scenes Single INT Column and Single BIGINT Column, V2 is obviously slower than V1 now:

OpenJDK 64-Bit Server VM 1.8.0_312-b07 on Linux 5.11.0-1025-azure
Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
SQL Single INT Column Scan:               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
SQL CSV                                           19203          19257          76          0.8        1220.9       1.0X
SQL Json                                          11971          11982          15          1.3         761.1       1.6X
SQL Parquet Vectorized: DataPageV1                  195            201           8         80.7          12.4      98.5X
SQL Parquet MR: DataPageV1                         2665           2680          20          5.9         169.4       7.2X
SQL Parquet Vectorized: DataPageV2                  377            390          17         41.8          23.9      51.0X
SQL Parquet MR: DataPageV2                         2326           2356          43          6.8         147.9       8.3X
SQL ORC Vectorized                                  302            316           9         52.0          19.2      63.5X
SQL ORC MR                                         2168           2178          14          7.3         137.8       8.9X

OpenJDK 64-Bit Server VM 1.8.0_312-b07 on Linux 5.11.0-1025-azure
Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
Parquet Reader Single INT Column Scan:       Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
---------------------------------------------------------------------------------------------------------------------------
ParquetReader Vectorized: DataPageV1                   289            296           7         54.4          18.4       1.0X
ParquetReader Vectorized -> Row: DataPageV1            283            292           9         55.6          18.0       1.0X
ParquetReader Vectorized: DataPageV2                   460            470           7         34.2          29.2       0.6X
ParquetReader Vectorized -> Row: DataPageV2            461            474          15         34.1          29.3       0.6X

OpenJDK 64-Bit Server VM 1.8.0_312-b07 on Linux 5.11.0-1025-azure
Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
SQL Single BIGINT Column Scan:            Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
SQL CSV                                           24759          24790          45          0.6        1574.1       1.0X
SQL Json                                          15345          15393          67          1.0         975.6       1.6X
SQL Parquet Vectorized: DataPageV1                  304            314          12         51.8          19.3      81.5X
SQL Parquet MR: DataPageV1                         2798           2824          38          5.6         177.9       8.9X
SQL Parquet Vectorized: DataPageV2                  600            618          24         26.2          38.2      41.2X
SQL Parquet MR: DataPageV2                         2423           2450          38          6.5         154.1      10.2X
SQL ORC Vectorized                                  376            381           3         41.9          23.9      65.9X
SQL ORC MR                                         2245           2260          21          7.0         142.7      11.0X

OpenJDK 64-Bit Server VM 1.8.0_312-b07 on Linux 5.11.0-1025-azure
Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
Parquet Reader Single BIGINT Column Scan:    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
---------------------------------------------------------------------------------------------------------------------------
ParquetReader Vectorized: DataPageV1                   372            423          96         42.2          23.7       1.0X
ParquetReader Vectorized -> Row: DataPageV1            376            382           7         41.8          23.9       1.0X
ParquetReader Vectorized: DataPageV2                   678            694          23         23.2          43.1       0.5X
ParquetReader Vectorized -> Row: DataPageV2            673            686          13         23.4          42.8       0.6X

That's why I think we should compare every scene for V1 and V2, not just Delta Binary for V2, this may make the goal of subsequent optimization clearer

@parthchandra parthchandra changed the title [SPARK-36879][SQL][TESTS] Add benchmark for vectorized delta binary packed encoding [SPARK-37912][SQL][TESTS] Add benchmark for vectorized delta binary packed encoding Jan 14, 2022
@parthchandra parthchandra force-pushed the SPARK-36879-PR2 branch 2 times, most recently from 91d939a to f6681e8 Compare January 14, 2022 21:36
@parthchandra
Copy link
Contributor Author

@parthchandra We can open a new JIRA for this. BTW did you use GitHub workflow to get the bench results?

Opened a new Jira - https://issues.apache.org/jira/browse/SPARK-37912

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto. This looks like roughly 3 times slower than before.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto.

@parthchandra
Copy link
Contributor Author

@dongjoon-hyun the benchmark results are not what we were hoping for. But it looks like Delta encoding is slower than RLE for the benchmark data. I tried the same with sorted data and delta encoding became 2x faster. However, with sorted data RLE also became marginally faster. In both cases RLE remained faster than Delta.

@dongjoon-hyun
Copy link
Member

Thank you for the confirmation, @parthchandra . Got it~

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For BooleanType, Parquet data page V2 will uses RLE encoding instead of Delta Binary encoding, so we shouldn't mark Delta Binary here.

@parthchandra
Copy link
Contributor Author

PR 35226 supersedes this benchmark so I'm reverting the PR to just address the review comments.

@parthchandra
Copy link
Contributor Author

Reverted the benchmark and results. This PR now only addresses the review comments left over from PR #34471

@parthchandra parthchandra changed the title [SPARK-37912][SQL][TESTS] Add benchmark for vectorized delta binary packed encoding [SPARK-36879][SQL] Address review comments for Delta Binary encoding Jan 18, 2022
@sunchao
Copy link
Member

sunchao commented Jan 18, 2022

@parthchandra in this case, I think it's better to still reuse the original JIRA SPARK-36879 and mark this as a follow-up. We should also update the PR description accordingly.

Copy link
Member

@sunchao sunchao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM pending on the PR title change.

@parthchandra parthchandra changed the title [SPARK-36879][SQL] Address review comments for Delta Binary encoding [SPARK-36879][SQL] Support Parquet v2 data page encodings for the vectorized path (followup) Jan 18, 2022
@parthchandra
Copy link
Contributor Author

Changed the description in the PR as well as the commit

@sunchao sunchao changed the title [SPARK-36879][SQL] Support Parquet v2 data page encodings for the vectorized path (followup) [SPARK-36879][SQL][FOLLOWUP] Support Parquet v2 data page encodings for the vectorized path Jan 18, 2022
@sunchao
Copy link
Member

sunchao commented Jan 18, 2022

@parthchandra could you update the PR description as well?

@parthchandra
Copy link
Contributor Author

@sunchao done

@viirya
Copy link
Member

viirya commented Jan 19, 2022

Adding benchmark and results for the delta binary packed encoding.

I think this doesn't add benchmark and results now. The PR description looks out of dated.

@LuciferYang
Copy link
Contributor

I think this doesn't add benchmark and results now. The PR description looks out of dated.

+1 , should update the pr description @parthchandra

Copy link
Contributor

@LuciferYang LuciferYang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM +1

@parthchandra
Copy link
Contributor Author

@viirya, @LuciferYang Updated the description.

@sunchao sunchao changed the title [SPARK-36879][SQL][FOLLOWUP] Support Parquet v2 data page encodings for the vectorized path [SPARK-36879][SQL][FOLLOWUP] Address comments and fix code style Jan 19, 2022
@sunchao sunchao closed this in c288b34 Jan 19, 2022
@sunchao
Copy link
Member

sunchao commented Jan 19, 2022

Thanks @parthchandra , merged to master!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants