Closed
Describe the bug
The statistics written by the arrow / parquet writer for String columns seem to be incorrect.
To Reproduce
Run this code:
use std::fs::File;
use std::sync::Arc;

use arrow::array::StringArray;
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;

fn main() {
    // Six city names; the lexicographically largest is "tewsbury".
    let input = vec![
        Some("andover"),
        Some("reading"),
        Some("bedford"),
        Some("tewsbury"),
        Some("lexington"),
        Some("lawrence"),
    ];
    let input: StringArray = input.into_iter().collect();
    println!("Starting to test with array {:?}", input);

    // Wrap the array in a single-column record batch named "city".
    let record_batch = RecordBatch::try_from_iter(vec![
        ("city", Arc::new(input) as _)
    ]).unwrap();

    println!("Opening output file /tmp/test.parquet");
    let out_file = File::create("/tmp/test.parquet").unwrap();

    println!("Creating writer...");
    let mut writer = ArrowWriter::try_new(out_file, record_batch.schema(), None)
        .expect("creating writer");

    println!("writing...");
    writer.write(&record_batch).expect("writing");

    // Closing the writer flushes the row group and writes the file footer.
    println!("closing...");
    writer.close().expect("closing");
    println!("done...");
}
Then examine the resulting parquet file and note that the min/max values written for the "city" column are:
min: "andover"
max: "lexington"
alamb@MacBook-Pro rust_parquet % parquet-tools dump /tmp/test.parquet
parquet-tools dump /tmp/test.parquet
row group 0
------------------------------------------------------------------------------------------------------------------------------
city: BINARY UNCOMPRESSED DO:4 FPO:90 SZ:130/130/1.00 VC:6 ENC:RLE_DICTIONARY,PLAIN,RLE ST:[min: andover, max: lexi [more]...
city TV=6 RL=0 DL=0 DS: 6 DE:PLAIN
--------------------------------------------------------------------------------------------------------------------------
page 0: DLE:RLE RLE:RLE VLE:RLE_DICTIONARY ST:[min: andover, max: lexington, num_nulls not defined] [more]... VC:6
BINARY city
------------------------------------------------------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 6 ***
value 1: R:0 D:0 V:andover
value 2: R:0 D:0 V:reading
value 3: R:0 D:0 V:bedford
value 4: R:0 D:0 V:tewsbury
value 5: R:0 D:0 V:lexington
value 6: R:0 D:0 V:lawrence
Expected behavior
The parquet file produced should have the following min/max statistics for the "city" column:
min: "andover"
max: "tewsbury"
This is because 't' sorts after 'l', so "tewsbury" is greater than "lexington".
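The expected values can be double-checked without Parquet at all. The following is a minimal sketch, assuming the statistics are meant to follow plain byte-wise lexicographic ordering of the UTF-8 strings (which is how Rust's `str` comparison works). The helper `expected_min_max` is a hypothetical name used only for this illustration and is not part of the arrow or parquet crates.

```rust
/// Returns the (min, max) of the given strings using Rust's `str` ordering,
/// which compares the underlying UTF-8 bytes lexicographically.
/// Hypothetical helper for illustration; not part of arrow or parquet.
fn expected_min_max<'a>(values: &[&'a str]) -> Option<(&'a str, &'a str)> {
    let min = values.iter().copied().min()?;
    let max = values.iter().copied().max()?;
    Some((min, max))
}

fn main() {
    // Same values as in the reproducer above.
    let cities = [
        "andover",
        "reading",
        "bedford",
        "tewsbury",
        "lexington",
        "lawrence",
    ];
    let (min, max) = expected_min_max(&cities).expect("non-empty input");
    // prints: min: "andover", max: "tewsbury"
    println!("min: {:?}, max: {:?}", min, max);
}
```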
Additional context
Since DataFusion now uses these statistics for pruning out row groups, this leads to incorrect results in DataFusion. I found this when investigating https://github.com/influxdata/influxdb_iox/issues/2153
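To illustrate why a too-small max is harmful, here is a simplified sketch of range-based pruning for an equality predicate. It is not DataFusion's actual pruning code; `row_group_may_contain` is a hypothetical function that only assumes a reader skips a row group when the target value falls outside `[min, max]`.

```rust
/// Hypothetical pruning check: a row group can only contain `target`
/// if `target` lies within the recorded [min, max] range.
/// This is an illustration, not DataFusion's implementation.
fn row_group_may_contain(min: &str, max: &str, target: &str) -> bool {
    min <= target && target <= max
}

fn main() {
    // Query: city = 'reading'. The value is present in the file.
    let target = "reading";

    // Statistics as written by the buggy writer.
    let kept_with_written = row_group_may_contain("andover", "lexington", target);
    // Statistics as expected.
    let kept_with_expected = row_group_may_contain("andover", "tewsbury", target);

    // With the written statistics the row group is skipped (false),
    // so the matching row is never read; with the expected ones it is kept (true).
    println!("written stats keep row group:  {}", kept_with_written);
    println!("expected stats keep row group: {}", kept_with_expected);
}
```

Under these assumptions, a query for "reading" or "tewsbury" would wrongly skip the row group, which matches the kind of incorrect result described above.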