Unable to write non-null Arrow structs to Parquet #244

@nevi-me

Describe the bug

Nested structs are not written correctly when the struct itself is non-nullable.
I've noticed this behaviour before, but couldn't easily reproduce it.

To Reproduce

If we add the test case below (in parquet/src/arrow/arrow_writer.rs):

#[test]
fn arrow_writer_complex_mixed() {
    // define schema
    let offset_field = Field::new("offset", DataType::Int32, true);
    let partition_field = Field::new("partition", DataType::Int64, true);
    let topic_field = Field::new("topic", DataType::Utf8, true);
    let schema = Schema::new(vec![
        Field::new("some_nested_object", DataType::Struct(
            vec![
                offset_field.clone(),
                partition_field.clone(),
                topic_field.clone()
            ]
        ), false), // NOTE: this being false results in the array not being written correctly
    ]);

    // create some data
    let offset = Int32Array::from(vec![1, 2, 3, 4, 5]);
    let partition = Int64Array::from(vec![Some(1), None, None, Some(4), Some(5)]);
    let topic = StringArray::from(vec![Some("A"), None, Some("A"), Some(""), None]);

    let some_nested_object = StructArray::from(vec![
        (offset_field, Arc::new(offset) as ArrayRef),
        (partition_field, Arc::new(partition) as ArrayRef),
        (topic_field, Arc::new(topic) as ArrayRef),
    ]);

    // build a record batch
    let batch = RecordBatch::try_new(
        Arc::new(schema),
        vec![Arc::new(some_nested_object)],
    )
    .unwrap();

    roundtrip("test_arrow_writer_complex_mixed.parquet", batch);
}

We get a failure:

thread 'arrow::arrow_writer::tests::arrow_writer_complex_mixed' panicked at 'assertion failed: `(left == right)`
  left: `1`,
 right: `0`', parquet/src/util/bit_util.rs:332:9
test arrow::arrow_writer::tests::arrow_writer_complex_mixed ... FAILED

When the struct is nullable, the file is written correctly.

Expected behavior

The batch should be written without errors.

Additional context

The levels generated for the passing and failing scenarios look identical (https://www.diffchecker.com/89qWByeI). It looks like the bug is in how levels are generated for non-null structs.
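To illustrate why identical levels would be suspicious, here is a minimal standalone sketch. It assumes the usual Parquet (Dremel) rule that every optional node on a column path adds one to the maximum definition level, and that every struct slot is itself valid, as in the test above. The function names are hypothetical and are not part of the parquet crate.

```rust
/// Hypothetical helper (not from the parquet crate): maximum definition level
/// of a nullable leaf that sits directly inside one struct.
/// Each nullable node on the path contributes 1.
fn max_definition_level(struct_nullable: bool) -> i16 {
    let struct_contribution: i16 = if struct_nullable { 1 } else { 0 };
    struct_contribution + 1 // + 1 for the nullable leaf itself
}

/// Hypothetical helper: expected definition level per slot, assuming the
/// struct slots are all valid and only the leaf can be null.
fn definition_levels(struct_nullable: bool, leaf_validity: &[bool]) -> Vec<i16> {
    let struct_contribution: i16 = if struct_nullable { 1 } else { 0 };
    leaf_validity
        .iter()
        .map(|&valid| struct_contribution + if valid { 1 } else { 0 })
        .collect()
}

/// Number of bits needed to encode any level in 0..=max_level.
fn level_bit_width(max_level: i16) -> u32 {
    16 - (max_level as u16).leading_zeros()
}
```

For the `partition` column (`[Some(1), None, None, Some(4), Some(5)]`), this gives `[2, 1, 1, 2, 2]` with a nullable struct and `[1, 0, 0, 1, 1]` with a non-nullable one, so the two cases should not produce the same levels. If the writer emitted the nullable-struct levels while the column's maximum level is 1, a level of 2 would not fit in a 1-bit encoding. That may be what the assertion in bit_util.rs is catching, but this is a guess and has not been confirmed against the writer code.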

Labels: bug, parquet (Changes to the parquet crate)
