Skip to content

Conversation

@CookiePieWw
Copy link
Contributor

@CookiePieWw CookiePieWw commented May 31, 2025

Which issue does this PR close?

Rationale for this change

As described in apache/datafusion#15976 (comment), we can expose the is_[max/min]_value_exact flags in StatisticsConverter in order to justify whether the stats are exact.

What changes are included in this PR?

Add row_group_is_[max/min]_value_exact to StatisticsConverter, also with some changes in the corresponding test files.

Are there any user-facing changes?

@github-actions github-actions bot added the parquet Changes to the parquet crate label May 31, 2025
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @CookiePieWw -- this looks really close

I think we just need to tweak the tests a bit and this PR will be good to go

@CookiePieWw CookiePieWw force-pushed the check-stat-max-min-value-exact branch from 3df9165 to 9c5c5c7 Compare June 1, 2025 15:00
.build();
.set_statistics_enabled(EnabledStatistics::Page);
if matches!(scenario, Scenario::TruncatedUTF8) {
// The same as default `column_index_truncate_length` to check both stats with one value
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will not be needed after #7578
(but that won't be merged for a while)

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @CookiePieWw -- this is looking close

make_utf8_batch(vec![
Some(&("a".repeat(64) + "1")),
Some(&("b".repeat(64) + "2")),
Some(&("c".repeat(64) + "3")),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Some(&("c".repeat(64) + "3")),
None,

Try throwing a null in the first batch, that seems to clear up the issue with empty string showing up in the stats. It seems like the schema is inferred from the first batch, so all values are not nullable. Putting a null in the first batch fixes that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds reasonable for me. But I still wonder if its as expected that a first batch without null values will cause the Nones in following batches are converted into empty strings.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll have to study the API used here in more detail, but I would assume one would usually provide a schema prior to writing batches. The issue here AFAICT is that a null value is written, but the schema says values are not nullable, so the null is converted to an empty string. I think I'd prefer an error be thrown in this instance.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh, with sleep I now see the schema is set in make_test_file_rg...the first batch is used.

let schema = batches[0].schema();
let mut writer = ArrowWriter::try_new(&mut output_file, schema, Some(props)).unwrap();

Perhaps a note should be added that if nulls are desired, they need to be included in the first batch.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah thanks! So it's not a bug then.

@CookiePieWw CookiePieWw force-pushed the check-stat-max-min-value-exact branch from 2ee4746 to b61a495 Compare June 3, 2025 15:16
Copy link
Contributor

@etseidl etseidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @CookiePieWw! Looks good now.

@CookiePieWw
Copy link
Contributor Author

Hi @alamb , since #7578 will not be merged till next major release, seems we need to wait for it and then remove the code setting default max_statistics_truncate_length here.

Separately, regarding the handling of None values in a non-nullable schema (thanks to @etseidl for catching this!), we currently convert None to empty strings implicitly. Would it be better to throw an error in such cases, or is this behavior by design?

@alamb
Copy link
Contributor

alamb commented Jun 6, 2025

Separately, regarding the handling of None values in a non-nullable schema (thanks to @etseidl for catching this!), we currently convert None to empty strings implicitly. Would it be better to throw an error in such cases, or is this behavior by design?

I don't think it is by design per se -- it is due to a lack of error checking

However, given that we don't have a strong discipline around verifying the nullable flag in Field I am somewhat wary about potentially adding additional checking

@alamb alamb merged commit 9e575bd into apache:main Jun 6, 2025
16 checks passed
@alamb
Copy link
Contributor

alamb commented Jun 6, 2025

Thanks again @CookiePieWw and @etseidl

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

parquet Changes to the parquet crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants