Skip to content

Conversation

@sfc-gh-dhuo
Copy link
Contributor

@sfc-gh-dhuo sfc-gh-dhuo commented Aug 17, 2022

What changes were proposed in this pull request?

Add DELTA_LENGTH_BYTE_ARRAY as a recognized encoding in VectorizedColumnReader so that
vectorized reads succeed when there are columns using DELTA_LENGTH_BYTE_ARRAY as a standalone
encoding (meaning a column uses only DELTA_LENGTH_BYTE_ARRAY for its contents, compared to other "combined" encodings such as DELTA_BYTE_ARRAY, where a "common-prefix-length" is encoded with DELTA_BINARY_PACKED followed up non-shared "suffixes" that use DELTA_LENGTH_BYTE_ARRAY under the hood).

Why are the changes needed?

Spark currently throws an exception for DELTA_LENGTH_BYTE_ARRAY columns when vectorized
reads are enabled and trying to read delta_length_byte_array.parquet from https://github.com/apache/parquet-testing:

java.lang.UnsupportedOperationException: Unsupported encoding: DELTA_LENGTH_BYTE_ARRAY

Does this PR introduce any user-facing change?

Yes - previously throw UNSUPPORTED exception. Now reads the encoding same as if vectorized reads are disabled.

How was this patch tested?

Added test case to ParquetIOSuite; made sure it fails without the fix to VectorizedColumnReader and passes after.

…BYTE_ARRAY

as a standalone encong column encoding.

Otherwise, Spark currently throws an exception for DELTA_LENGTH_BYTE_ARRAY columns when vectorized
reads are enabled:

java.lang.UnsupportedOperationException: Unsupported encoding: DELTA_LENGTH_BYTE_ARRAY

Added additional test case to ParquetIOSuite based on the associated
test file from https://github.com/apache/parquet-testing
@github-actions github-actions bot added the SQL label Aug 17, 2022
@kazuyukitanimura
Copy link
Contributor

I am wondering if your GitHub actions are enabled...

@sfc-gh-dhuo
Copy link
Contributor Author

@kazuyukitanimura Thanks for the reminder! I've enabled github actions now and verified at least "Build and test" and "Report test results" are enabled workflows; not sure if it takes some time for them to trigger though.

Copy link
Contributor

@parthchandra parthchandra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thank you for adding the test!

@kazuyukitanimura
Copy link
Contributor

LGTM (non-binding) pending on CI. cc @sunchao @LuciferYang @sadikovi

@sunchao sunchao changed the title [SPARK-40128] [SQL] Make the VectorizedColumnReader recognize DELTA_LENGTH_… [SPARK-40128][SQL] Make the VectorizedColumnReader recognize DELTA_LENGTH_BYTE_ARRAY as a standalone column encoding Aug 17, 2022
Copy link
Contributor

@sadikovi sadikovi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like we just forgot to add the encoder to the list. Can you update the PR description accordingly? I don't quite understand what "standalone column encoding" means in this context.

checkAnswer(
// "fruit" column in this file is encoded using DELTA_LENGTH_BYTE_ARRAY.
// The file comes from https://github.com/apache/parquet-testing
readResourceParquetFile("test-data/delta_length_byte_array.parquet"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to generate this file instead?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks a bit tricky to generate data using DELTA_LENGTH_BYTE_ARRAY since parquet-mr by default uses DELTA_BYTE_ARRAY for binary type. We can probably do this separately later.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is fine, let's keep as is.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, looks like there's unfortunately not much configurability for choosing encodings in typical writer libraries; I saw this discussion from a couple years ago on that topic for a similar question about wanting to target specific encodings: https://www.mail-archive.com/[email protected]/msg11826.html

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of the reasons I didn't include this in the list is because I found no way of actually creating a file with a DELTA_LENGTH_BYTE_ARRAY encoding. So yeah, it's hard to do !

@sunchao sunchao closed this in 01f9d27 Aug 17, 2022
@sunchao
Copy link
Member

sunchao commented Aug 17, 2022

Merged, thanks!

@sfc-gh-dhuo
Copy link
Contributor Author

Excellent, thanks for the quick turnaround!

@LuciferYang
Copy link
Contributor

+1, late LGTM. Thank you @sfc-gh-dhuo @kazuyukitanimura @sunchao @sadikovi @parthchandra

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants