[SPARK-40128][SQL] Make the VectorizedColumnReader recognize DELTA_LENGTH_BYTE_ARRAY as a standalone column encoding #37557
Conversation
…BYTE_ARRAY as a standalone column encoding. Otherwise, Spark currently throws an exception for DELTA_LENGTH_BYTE_ARRAY columns when vectorized reads are enabled: java.lang.UnsupportedOperationException: Unsupported encoding: DELTA_LENGTH_BYTE_ARRAY. Added an additional test case to ParquetIOSuite based on the associated test file from https://github.com/apache/parquet-testing.
I am wondering if your GitHub Actions are enabled...
@kazuyukitanimura Thanks for the reminder! I've enabled GitHub Actions now and verified that at least the "Build and test" and "Report test results" workflows are enabled; I'm not sure if it takes some time for them to trigger, though.
parthchandra left a comment
LGTM. Thank you for adding the test!
LGTM (non-binding) pending CI. cc @sunchao @LuciferYang @sadikovi
sadikovi left a comment
Seems like we just forgot to add the encoder to the list. Can you update the PR description accordingly? I don't quite understand what "standalone column encoding" means in this context.
```scala
checkAnswer(
  // "fruit" column in this file is encoded using DELTA_LENGTH_BYTE_ARRAY.
  // The file comes from https://github.com/apache/parquet-testing
  readResourceParquetFile("test-data/delta_length_byte_array.parquet"),
```
Is it possible to generate this file instead?
It looks a bit tricky to generate data using DELTA_LENGTH_BYTE_ARRAY since parquet-mr by default uses DELTA_BYTE_ARRAY for binary type. We can probably do this separately later.
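For illustration, here is a hedged sketch of what such an attempt looks like (the output path is an assumption, and the behavior described is parquet-mr's default encoding selection): even with parquet-mr's v2 writer enabled, which turns on the DELTA_* encodings, binary columns come out as DELTA_BYTE_ARRAY rather than DELTA_LENGTH_BYTE_ARRAY, which is why the test relies on the pre-built parquet-testing file.

```scala
// Ask parquet-mr for the v2 writer, which enables the DELTA_* encodings.
spark.sparkContext.hadoopConfiguration.set("parquet.writer.version", "v2")

// Write a string column; parquet-mr still picks DELTA_BYTE_ARRAY for it.
spark.range(1000)
  .selectExpr("concat('fruit_', id) AS fruit")
  .write
  .parquet("/tmp/delta-encoding-attempt") // hypothetical path

// Inspecting the footer (e.g. with `parquet-tools meta`) shows
// DELTA_BYTE_ARRAY, not DELTA_LENGTH_BYTE_ARRAY, for the "fruit" column.
```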
It is fine, let's keep it as is.
Yeah, it looks like there's unfortunately not much configurability for choosing encodings in typical writer libraries. I saw this discussion from a couple of years ago on a similar question about targeting specific encodings: https://www.mail-archive.com/[email protected]/msg11826.html
One of the reasons I didn't include this in the list is that I found no way of actually creating a file with the DELTA_LENGTH_BYTE_ARRAY encoding. So yeah, it's hard to do!
Merged, thanks!
Excellent, thanks for the quick turnaround!
+1, late LGTM. Thank you @sfc-gh-dhuo @kazuyukitanimura @sunchao @sadikovi @parthchandra
What changes were proposed in this pull request?
Add DELTA_LENGTH_BYTE_ARRAY as a recognized encoding in VectorizedColumnReader so that vectorized reads succeed for columns that use DELTA_LENGTH_BYTE_ARRAY as a standalone encoding (meaning a column uses only DELTA_LENGTH_BYTE_ARRAY for its contents, in contrast to "combined" encodings such as DELTA_BYTE_ARRAY, where common-prefix lengths are encoded with DELTA_BINARY_PACKED, followed by non-shared suffixes that use DELTA_LENGTH_BYTE_ARRAY under the hood).
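For context, a minimal Scala sketch of the kind of dispatch involved. VectorizedColumnReader itself is a Java class, and the reader class names below (VectorizedPlainValuesReader, VectorizedDeltaBinaryPackedReader, VectorizedDeltaByteArrayReader, VectorizedDeltaLengthByteArrayReader) exist in Spark's org.apache.spark.sql.execution.datasources.parquet package, but the method shape here is an illustrative assumption, not the actual patch:

```scala
import org.apache.parquet.column.Encoding
import org.apache.parquet.column.Encoding._

// Illustrative sketch only: map a Parquet page encoding to a vectorized
// values reader. The fix amounts to recognizing DELTA_LENGTH_BYTE_ARRAY
// here instead of falling through to the "unsupported" branch.
def getValuesReader(encoding: Encoding): VectorizedValuesReader = encoding match {
  case PLAIN => new VectorizedPlainValuesReader()
  case DELTA_BINARY_PACKED => new VectorizedDeltaBinaryPackedReader()
  case DELTA_BYTE_ARRAY => new VectorizedDeltaByteArrayReader()
  // New case: DELTA_LENGTH_BYTE_ARRAY used standalone for a column.
  case DELTA_LENGTH_BYTE_ARRAY => new VectorizedDeltaLengthByteArrayReader()
  case other =>
    throw new UnsupportedOperationException("Unsupported encoding: " + other)
}
```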
Why are the changes needed?
Spark currently throws an exception for DELTA_LENGTH_BYTE_ARRAY columns when vectorized reads are enabled and it tries to read delta_length_byte_array.parquet from https://github.com/apache/parquet-testing:
java.lang.UnsupportedOperationException: Unsupported encoding: DELTA_LENGTH_BYTE_ARRAY
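A minimal repro sketch (the SparkSession setup and the local file path are assumptions):

```scala
// Vectorized reads are on by default; set explicitly here for clarity.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "true")

// Before this fix, reading the parquet-testing file failed with:
//   java.lang.UnsupportedOperationException:
//     Unsupported encoding: DELTA_LENGTH_BYTE_ARRAY
spark.read.parquet("delta_length_byte_array.parquet").show()
```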
Does this PR introduce any user-facing change?
Yes. Previously Spark threw an UnsupportedOperationException for such columns; now it reads the encoding the same as when vectorized reads are disabled.
How was this patch tested?
Added a test case to ParquetIOSuite and made sure it fails without the fix to VectorizedColumnReader and passes with it.
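Roughly, the test takes the shape sketched below; `expectedRows` is a hypothetical placeholder for the file's actual contents, which the real ParquetIOSuite test asserts:

```scala
test("SPARK-40128 read DELTA_LENGTH_BYTE_ARRAY encoded strings") {
  checkAnswer(
    // "fruit" column in this file is encoded using DELTA_LENGTH_BYTE_ARRAY.
    // The file comes from https://github.com/apache/parquet-testing
    readResourceParquetFile("test-data/delta_length_byte_array.parquet"),
    expectedRows) // placeholder: the rows actually stored in the file
}
```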