Incorrect min/max statistics for strings in parquet files #641

@alamb

Describe the bug
The statistics written by the arrow / parquet writer for String columns seem to be incorrect.

To Reproduce
Run this code:

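use std::fs::File;
use std::sync::Arc;

use arrow::array::StringArray;
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;

fn main() {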
    let input = vec![
        Some("andover"),
        Some("reading"),
        Some("bedford"),
        Some("tewsbury"),
        Some("lexington"),
        Some("lawrence"),
    ];

    let input: StringArray = input.into_iter().collect();
    println!("Starting to test with array {:?}", input);

    let record_batch = RecordBatch::try_from_iter(vec![
        ("city", Arc::new(input) as _)
    ]).unwrap();

    println!("Opening output file /tmp/test.parquet");
    let out_file = File::create("/tmp/test.parquet").unwrap();

    println!("Creating writer...");
    let mut writer = ArrowWriter::try_new(out_file, record_batch.schema(), None)
        .expect("creating writer");

    println!("writing...");
    writer.write(&record_batch).expect("writing");

    println!("closing...");
    writer.close().expect("closing");

    println!("done...");
}

Then examine the resulting parquet file and note the min/max values for the "city" column are:

min: "andover"
max: "lexington"
alamb@MacBook-Pro rust_parquet % parquet-tools dump  /tmp/test.parquet 
row group 0 
------------------------------------------------------------------------------------------------------------------------------
city:  BINARY UNCOMPRESSED DO:4 FPO:90 SZ:130/130/1.00 VC:6 ENC:RLE_DICTIONARY,PLAIN,RLE ST:[min: andover, max: lexi [more]...

    city TV=6 RL=0 DL=0 DS: 6 DE:PLAIN
    --------------------------------------------------------------------------------------------------------------------------
    page 0:                  DLE:RLE RLE:RLE VLE:RLE_DICTIONARY ST:[min: andover, max: lexington, num_nulls not defined] [more]... VC:6

BINARY city 
------------------------------------------------------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 6 *** 
value 1: R:0 D:0 V:andover
value 2: R:0 D:0 V:reading
value 3: R:0 D:0 V:bedford
value 4: R:0 D:0 V:tewsbury
value 5: R:0 D:0 V:lexington
value 6: R:0 D:0 V:lawrence

Expected behavior
The parquet file produced should have the following min/max statistics for the "city" column:

min: "andover"
max: "tewsbury"

The max should be "tewsbury" rather than "lexington", because 't' sorts after 'l'.
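As a sanity check on the expected values, here is a minimal sketch that computes the min and max of the same six strings using Rust's built-in `str` ordering, which compares byte by byte. The helper name `min_max` is made up for illustration and is not part of the arrow or parquet crates; the sketch also assumes that byte-wise ordering is the ordering the statistics are meant to follow for string columns.

```rust
/// Hypothetical helper (not part of arrow/parquet): returns the
/// (min, max) of `values` using Rust's byte-wise `str` ordering,
/// or `None` if the slice is empty.
fn min_max<'a>(values: &[&'a str]) -> Option<(&'a str, &'a str)> {
    // `min` / `max` return `None` on an empty iterator, so `?`
    // propagates the empty case to the caller.
    let min = values.iter().copied().min()?;
    let max = values.iter().copied().max()?;
    Some((min, max))
}
```

Calling `min_max` on the six city names from the reproduction returns `Some(("andover", "tewsbury"))`, which matches the expected statistics above rather than the ones the writer produced.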

Additional context

Since DataFusion now uses these statistics for pruning out row groups, this leads to incorrect results in DataFusion. I found this when investigating https://github.com/influxdata/influxdb_iox/issues/2153
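To illustrate why a wrong max leads to wrong query results, here is a simplified sketch of min/max-based pruning for an equality predicate such as `city = 'reading'`. The `ColumnStats` struct and `may_contain` function are hypothetical stand-ins written for this example; they are not DataFusion's actual pruning API, which is more general.

```rust
/// Hypothetical, simplified stand-in for a row group's string statistics.
struct ColumnStats<'a> {
    min: &'a str,
    max: &'a str,
}

/// Returns true if a row group with these statistics could contain `value`,
/// i.e. if a predicate like `col = value` cannot rule the row group out.
/// A reader that trusts the statistics skips the row group when this is false.
fn may_contain(stats: &ColumnStats<'_>, value: &str) -> bool {
    stats.min <= value && value <= stats.max
}
```

With the statistics as written (min "andover", max "lexington"), `may_contain` returns false for "reading", so the row group would be skipped even though it contains that value. With the expected max of "tewsbury" it returns true and the row group is scanned.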

Labels: bug, parquet (changes to the parquet crate)