[SPARK-40128][SQL] Make the VectorizedColumnReader recognize DELTA_LENGTH_BYTE_ARRAY as a standalone column encoding #37557
Conversation
…BYTE_ARRAY as a standalone column encoding. Otherwise, Spark currently throws an exception for DELTA_LENGTH_BYTE_ARRAY columns when vectorized reads are enabled: java.lang.UnsupportedOperationException: Unsupported encoding: DELTA_LENGTH_BYTE_ARRAY. Added an additional test case to ParquetIOSuite based on the associated test file from https://github.com/apache/parquet-testing.
I am wondering if your GitHub Actions are enabled...
@kazuyukitanimura Thanks for the reminder! I've enabled GitHub Actions now and verified that at least the "Build and test" and "Report test results" workflows are enabled; I'm not sure if it takes some time for them to trigger, though.
parthchandra left a comment
LGTM. Thank you for adding the test!
LGTM (non-binding) pending CI. cc @sunchao @LuciferYang @sadikovi
sadikovi left a comment
Seems like we just forgot to add the encoder to the list. Can you update the PR description accordingly? I don't quite understand what "standalone column encoding" means in this context.
```scala
checkAnswer(
  // "fruit" column in this file is encoded using DELTA_LENGTH_BYTE_ARRAY.
  // The file comes from https://github.com/apache/parquet-testing
  readResourceParquetFile("test-data/delta_length_byte_array.parquet"),
```
Is it possible to generate this file instead?
It looks a bit tricky to generate data using DELTA_LENGTH_BYTE_ARRAY since parquet-mr by default uses DELTA_BYTE_ARRAY for binary type. We can probably do this separately later.
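For illustration, here is a hedged sketch of what such an attempt looks like (the output path is an assumption, and the behavior described is parquet-mr's default encoding selection): even with parquet-mr's v2 writer enabled, which turns on the DELTA_* encodings, binary columns come out as DELTA_BYTE_ARRAY rather than DELTA_LENGTH_BYTE_ARRAY, which is why the test relies on the pre-built parquet-testing file.

```scala
// Ask parquet-mr for the v2 writer, which enables the DELTA_* encodings.
spark.sparkContext.hadoopConfiguration.set("parquet.writer.version", "v2")

// Write a string column; parquet-mr still picks DELTA_BYTE_ARRAY for it.
spark.range(1000)
  .selectExpr("concat('fruit_', id) AS fruit")
  .write
  .parquet("/tmp/delta-encoding-attempt") // hypothetical path

// Inspecting the footer (e.g. with `parquet-tools meta`) shows
// DELTA_BYTE_ARRAY, not DELTA_LENGTH_BYTE_ARRAY, for the "fruit" column.
```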
It is fine, let's keep it as is.
Yeah, it looks like there's unfortunately not much configurability for choosing encodings in typical writer libraries. I saw this discussion from a couple of years ago on a similar question about targeting specific encodings: https://www.mail-archive.com/[email protected]/msg11826.html
One of the reasons I didn't include this in the list is that I found no way of actually creating a file with the DELTA_LENGTH_BYTE_ARRAY encoding. So yeah, it's hard to do!
Merged, thanks!
Excellent, thanks for the quick turnaround!
+1, late LGTM. Thank you @sfc-gh-dhuo @kazuyukitanimura @sunchao @sadikovi @parthchandra
What changes were proposed in this pull request?
Add DELTA_LENGTH_BYTE_ARRAY as a recognized encoding in VectorizedColumnReader so that vectorized reads succeed for columns that use DELTA_LENGTH_BYTE_ARRAY as a standalone encoding (meaning a column uses only DELTA_LENGTH_BYTE_ARRAY for its contents, in contrast to "combined" encodings such as DELTA_BYTE_ARRAY, where common-prefix lengths are encoded with DELTA_BINARY_PACKED, followed by non-shared suffixes that use DELTA_LENGTH_BYTE_ARRAY under the hood).
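For context, a minimal Scala sketch of the kind of dispatch involved. VectorizedColumnReader itself is a Java class, and the reader class names below (VectorizedPlainValuesReader, VectorizedDeltaBinaryPackedReader, VectorizedDeltaByteArrayReader, VectorizedDeltaLengthByteArrayReader) exist in Spark's org.apache.spark.sql.execution.datasources.parquet package, but the method shape here is an illustrative assumption, not the actual patch:

```scala
import org.apache.parquet.column.Encoding
import org.apache.parquet.column.Encoding._

// Illustrative sketch only: map a Parquet page encoding to a vectorized
// values reader. The fix amounts to recognizing DELTA_LENGTH_BYTE_ARRAY
// here instead of falling through to the "unsupported" branch.
def getValuesReader(encoding: Encoding): VectorizedValuesReader = encoding match {
  case PLAIN => new VectorizedPlainValuesReader()
  case DELTA_BINARY_PACKED => new VectorizedDeltaBinaryPackedReader()
  case DELTA_BYTE_ARRAY => new VectorizedDeltaByteArrayReader()
  // New case: DELTA_LENGTH_BYTE_ARRAY used standalone for a column.
  case DELTA_LENGTH_BYTE_ARRAY => new VectorizedDeltaLengthByteArrayReader()
  case other =>
    throw new UnsupportedOperationException("Unsupported encoding: " + other)
}
```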
Why are the changes needed?
Spark currently throws an exception for DELTA_LENGTH_BYTE_ARRAY columns when vectorized reads are enabled and it tries to read delta_length_byte_array.parquet from https://github.com/apache/parquet-testing:
java.lang.UnsupportedOperationException: Unsupported encoding: DELTA_LENGTH_BYTE_ARRAY
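A minimal repro sketch (the SparkSession setup and the local file path are assumptions):

```scala
// Vectorized reads are on by default; set explicitly here for clarity.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "true")

// Before this fix, reading the parquet-testing file failed with:
//   java.lang.UnsupportedOperationException:
//     Unsupported encoding: DELTA_LENGTH_BYTE_ARRAY
spark.read.parquet("delta_length_byte_array.parquet").show()
```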
Does this PR introduce any user-facing change?
Yes. Previously Spark threw an UnsupportedOperationException for such columns; now it reads the encoding the same as when vectorized reads are disabled.
How was this patch tested?
Added a test case to ParquetIOSuite and made sure it fails without the fix to VectorizedColumnReader and passes with it.
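Roughly, the test takes the shape sketched below; `expectedRows` is a hypothetical placeholder for the file's actual contents, which the real ParquetIOSuite test asserts:

```scala
test("SPARK-40128 read DELTA_LENGTH_BYTE_ARRAY encoded strings") {
  checkAnswer(
    // "fruit" column in this file is encoded using DELTA_LENGTH_BYTE_ARRAY.
    // The file comes from https://github.com/apache/parquet-testing
    readResourceParquetFile("test-data/delta_length_byte_array.parquet"),
    expectedRows) // placeholder: the rows actually stored in the file
}
```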