Describe the bug
Originally reported in apache/datafusion#1441 and encountered again in #1110, ParquetFileArrowReader appears to read incorrect data for string columns that contain nulls.
In particular, the bug requires that the column is nullable, that it contains nulls, and that the file has multiple row groups.
To Reproduce
Read simple_strings.parquet.zip with the following code
#[test]
fn test_read_strings() {
    let testdata = arrow::util::test_util::parquet_test_data();
    let path = format!("{}/simple_strings.parquet", testdata);
    let parquet_file_reader =
        SerializedFileReader::try_from(File::open(&path).unwrap()).unwrap();
    let mut arrow_reader = ParquetFileArrowReader::new(Arc::new(parquet_file_reader));
    let record_batch_reader = arrow_reader
        .get_record_reader(60)
        .expect("Failed to read into array!");

    let batches = record_batch_reader
        .collect::<arrow::error::Result<Vec<_>>>()
        .unwrap();
    assert_eq!(batches.len(), 1);

    let batch = batches.into_iter().next().unwrap();
    assert_eq!(batch.num_rows(), 6);

    let strings = batch
        .column(0)
        .as_any()
        .downcast_ref::<StringArray>()
        .unwrap();
    let strings: Vec<_> = strings.iter().collect();

    assert_eq!(
        &strings,
        &[
            None,
            Some("-1685637712"),
            Some("512814980"),
            Some("868743207"),
            None,
            Some("-1001940778")
        ]
    )
}
Fails with
thread 'arrow::arrow_reader::tests::test_read_strings' panicked at 'assertion failed: `(left == right)`
left: `[None, Some("-1685637712"), Some("512814980"), Some("-1685637712"), None, Some("868743207")]`,
right: `[None, Some("-1685637712"), Some("512814980"), Some("868743207"), None, Some("-1001940778")]`', parquet/src/arrow/arrow_reader.rs:715:9
For comparison
$ python
> import duckdb
> duckdb.query("select * from 'simple_strings.parquet'").fetchall()
[(None,), ('-1685637712',), ('512814980',), ('868743207',), (None,), ('-1001940778',)]
The file consists of two row groups, each with 3 rows, and was generated using #1110.
Expected behavior
The test should pass