@@ -329,6 +329,8 @@ private ValuesReader getValuesReader(Encoding encoding) {
        return new VectorizedPlainValuesReader();
      case DELTA_BYTE_ARRAY:
        return new VectorizedDeltaByteArrayReader();
      case DELTA_LENGTH_BYTE_ARRAY:
        return new VectorizedDeltaLengthByteArrayReader();
      case DELTA_BINARY_PACKED:
        return new VectorizedDeltaBinaryPackedReader();
      case RLE:
Binary file not shown.
@@ -1307,6 +1307,16 @@ class ParquetIOSuite extends QueryTest with ParquetTest with SharedSparkSession
    }
  }

  test("SPARK-40128 read DELTA_LENGTH_BYTE_ARRAY encoded strings") {
    withAllParquetReaders {
      checkAnswer(
        // "fruit" column in this file is encoded using DELTA_LENGTH_BYTE_ARRAY.
        // The file comes from https://github.com/apache/parquet-testing
        readResourceParquetFile("test-data/delta_length_byte_array.parquet"),
Contributor

Is it possible to generate this file instead?

Member

It looks a bit tricky to generate data using DELTA_LENGTH_BYTE_ARRAY since parquet-mr by default uses DELTA_BYTE_ARRAY for binary type. We can probably do this separately later.

Contributor

It is fine, let's keep as is.

Contributor Author

Yeah, looks like there's unfortunately not much configurability for choosing encodings in typical writer libraries; I saw this discussion from a couple years ago on that topic for a similar question about wanting to target specific encodings: https://www.mail-archive.com/[email protected]/msg11826.html

Contributor

One of the reasons I didn't include this in the list is that I found no way of actually creating a file with a DELTA_LENGTH_BYTE_ARRAY encoding. So yeah, it's hard to do!
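
For readers following the thread, here is a rough sketch of what distinguishes the encoding under test. Per the Parquet format specification, DELTA_LENGTH_BYTE_ARRAY stores all value lengths first (themselves written with DELTA_BINARY_PACKED) and then the value bytes concatenated back to back, whereas DELTA_BYTE_ARRAY additionally factors out shared prefixes. The Java below is only an illustration under simplifying assumptions: the class `DeltaLengthSketch` and its `Encoded` holder are hypothetical names, the lengths are kept as a plain `int[]` rather than bit-packed deltas, and none of this is the parquet-mr or Spark reader code.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Simplified illustration only: real Parquet writes the lengths with
// DELTA_BINARY_PACKED (block/miniblock bit-packing), which is omitted here.
final class DeltaLengthSketch {

  // Hypothetical holder for the two logical sections of an encoded page.
  static final class Encoded {
    final int[] lengths; // one byte length per value, in order
    final byte[] data;   // all value bytes, concatenated back to back

    Encoded(int[] lengths, byte[] data) {
      this.lengths = lengths;
      this.data = data;
    }
  }

  // Split the values into a length section and a concatenated data section.
  static Encoded encode(List<String> values) {
    int[] lengths = new int[values.size()];
    ByteArrayOutputStream data = new ByteArrayOutputStream();
    for (int i = 0; i < values.size(); i++) {
      byte[] bytes = values.get(i).getBytes(StandardCharsets.UTF_8);
      lengths[i] = bytes.length;
      data.write(bytes, 0, bytes.length);
    }
    return new Encoded(lengths, data.toByteArray());
  }

  // Walk the data section, slicing out each value by its recorded length.
  static List<String> decode(Encoded encoded) {
    List<String> out = new ArrayList<>();
    int offset = 0;
    for (int length : encoded.lengths) {
      out.add(new String(encoded.data, offset, length, StandardCharsets.UTF_8));
      offset += length;
    }
    return out;
  }

  public static void main(String[] args) {
    // Same values the test in this PR expects from the "fruit" column.
    List<String> values = new ArrayList<>();
    for (int i = 0; i < 1000; i++) {
      values.add("apple_banana_mango" + Integer.toString(i * i));
    }
    Encoded encoded = encode(values);
    if (!decode(encoded).equals(values)) {
      throw new AssertionError("round trip failed");
    }
    System.out.println("values=" + values.size() + " dataBytes=" + encoded.data.length);
  }
}
```

Because the lengths live in their own section, a reader can compute every value's offset without scanning the data bytes, which is presumably why a dedicated reader is wired into `getValuesReader` rather than reusing the DELTA_BYTE_ARRAY one.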

        (0 to 999).map(i => Row("apple_banana_mango" + Integer.toString(i * i))))
    }
  }

  test("SPARK-12589 copy() on rows returned from reader works for strings") {
    withTempPath { dir =>
      val data = (1, "abc") ::(2, "helloabcde") :: Nil