feat: add row_group_is_[max/min]_value_exact to StatisticsConverter
#7574
alamb
left a comment
Thank you @CookiePieWw -- this looks really close
I think we just need to tweak the tests a bit and this PR will be good to go
3df9165 to
9c5c5c7
Compare
.build();
.set_statistics_enabled(EnabledStatistics::Page);
if matches!(scenario, Scenario::TruncatedUTF8) {
    // The same as default `column_index_truncate_length` to check both stats with one value
This will not be needed after #7578
(but that won't be merged for a while)
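For readers following the truncation discussion, here is a minimal sketch of why a truncated min/max statistic is no longer exact. It is not the parquet crate's implementation: `truncate_min` and `truncate_max` are hypothetical helpers, and they work on raw bytes rather than respecting UTF-8 character boundaries, which is only safe for ASCII inputs like the test strings in this PR.

```rust
/// Hypothetical helper (not a parquet API): truncate a min statistic to at
/// most `limit` bytes. A prefix always sorts at or before the full value, so
/// it remains a valid lower bound. Returns (bytes, is_exact).
fn truncate_min(value: &str, limit: usize) -> (Vec<u8>, bool) {
    let bytes = value.as_bytes();
    if bytes.len() <= limit {
        (bytes.to_vec(), true)
    } else {
        (bytes[..limit].to_vec(), false)
    }
}

/// Hypothetical helper (not a parquet API): truncate a max statistic to at
/// most `limit` bytes. The prefix alone would sort before the full value, so
/// the last byte that is not 0xFF is incremented to keep an upper bound.
fn truncate_max(value: &str, limit: usize) -> (Vec<u8>, bool) {
    let bytes = value.as_bytes();
    if bytes.len() <= limit {
        return (bytes.to_vec(), true);
    }
    let mut prefix = bytes[..limit].to_vec();
    // Drop trailing 0xFF bytes, then bump the first byte that can be bumped.
    while let Some(last) = prefix.pop() {
        if last < 0xFF {
            prefix.push(last + 1);
            return (prefix, false);
        }
    }
    // Nothing could be incremented: keep the full, exact value.
    (bytes.to_vec(), true)
}

fn main() {
    // Same shape as the test data: 64 repeated characters plus one more byte.
    let value = "c".repeat(64) + "3";
    let (min, min_exact) = truncate_min(&value, 64);
    let (max, max_exact) = truncate_max(&value, 64);
    println!("min: {} bytes, exact = {}", min.len(), min_exact);
    println!("max: {} bytes, exact = {}", max.len(), max_exact);
}
```

With a 65-byte value and a 64-byte limit, both bounds are reported as inexact, which is the situation the `TruncatedUTF8` scenario is meant to exercise.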
alamb
left a comment
Thank you @CookiePieWw -- this is looking close
make_utf8_batch(vec![
    Some(&("a".repeat(64) + "1")),
    Some(&("b".repeat(64) + "2")),
    Some(&("c".repeat(64) + "3")),
Some(&("c".repeat(64) + "3")),
None,
Try throwing a null in the first batch, that seems to clear up the issue with empty string showing up in the stats. It seems like the schema is inferred from the first batch, so all values are not nullable. Putting a null in the first batch fixes that.
Sounds reasonable to me. But I still wonder whether it's expected that a first batch without null values causes the Nones in the following batches to be converted into empty strings.
I'll have to study the API used here in more detail, but I would assume one would usually provide a schema prior to writing batches. The issue here AFAICT is that a null value is written, but the schema says values are not nullable, so the null is converted to an empty string. I think I'd prefer an error be thrown in this instance.
Ahh, with sleep I now see the schema is set in make_test_file_rg...the first batch is used.
arrow-rs/parquet/tests/arrow_reader/mod.rs
Lines 1037 to 1039 in 0ae9f66
let schema = batches[0].schema();
let mut writer = ArrowWriter::try_new(&mut output_file, schema, Some(props)).unwrap();
Perhaps a note should be added that if nulls are desired, they need to be included in the first batch.
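As a rough illustration of the behavior discussed in this thread, the sketch below models a schema whose nullability is taken from the first batch, plus the kind of up-front check that would turn the silent null-to-empty-string conversion into an error. `Batch`, `inferred_nullable`, and `check_against_first` are toy, hypothetical names; none of this is the arrow or parquet API.

```rust
/// Toy stand-in for a record batch with a single string column.
struct Batch {
    values: Vec<Option<String>>,
}

impl Batch {
    /// Assumption based on the thread: the column is treated as nullable
    /// only when the batch actually contains a null.
    fn inferred_nullable(&self) -> bool {
        self.values.iter().any(|v| v.is_none())
    }
}

/// Hypothetical check: take nullability from the first batch and reject any
/// later batch that carries nulls into a non-nullable column.
fn check_against_first(batches: &[Batch]) -> Result<(), String> {
    let Some(first) = batches.first() else {
        return Ok(());
    };
    let nullable = first.inferred_nullable();
    for (i, batch) in batches.iter().enumerate() {
        if !nullable && batch.inferred_nullable() {
            return Err(format!(
                "batch {i} contains nulls, but the schema from batch 0 is non-nullable"
            ));
        }
    }
    Ok(())
}

fn main() {
    // First batch has no nulls, so the later null is rejected.
    let batches = vec![
        Batch { values: vec![Some("a".to_string())] },
        Batch { values: vec![None] },
    ];
    println!("{:?}", check_against_first(&batches));
}
```

Putting a `None` in the first batch makes the modeled column nullable, so the same later batches pass the check, which matches the workaround suggested above.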
Ah thanks! So it's not a bug then.
2ee4746 to
b61a495
Compare
etseidl
left a comment
Thanks @CookiePieWw! Looks good now.
Hi @alamb, since #7578 will not be merged until the next major release, it seems we need to wait for it and then remove the code setting default
Separately, regarding the handling of None values in a non-nullable schema (thanks to @etseidl for catching this!), we currently convert None to empty strings implicitly. Would it be better to throw an error in such cases, or is this behavior by design?
I don't think it is by design per se -- it is due to a lack of error checking
However, given that we don't have a strong discipline around verifying the nullable flag in
Thanks again @CookiePieWw and @etseidl
Which issue does this PR close?
Rationale for this change
As described in apache/datafusion#15976 (comment), we can expose the `is_[max/min]_value_exact` flags in `StatisticsConverter` in order to determine whether the stats are exact.
What changes are included in this PR?
Add `row_group_is_[max/min]_value_exact` to `StatisticsConverter`, along with some changes in the corresponding test files.
Are there any user-facing changes?