-
Notifications
You must be signed in to change notification settings - Fork 1.1k
fix NaN handling in parquet statistics #256
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
parquet/src/column/writer.rs
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we introduce some generic T::is_nan function into the data type abstraction layer (that basically returns False for every non-float/double type and calls .is_nan() for float/double) or is this local workaround good enough?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that's necessary. It'll just be adding overhead for us.
The compiler can figure out what we're trying to do, and will optimise out the branch for anything that's not a floating point number.
Look at this example: https://godbolt.org/z/z7M3Mfqzz
The is_nan::<i32>() always returns false, and the ucomisd does the floating point comparison
Codecov Report
@@ Coverage Diff @@
## master #256 +/- ##
==========================================
+ Coverage 82.52% 82.53% +0.01%
==========================================
Files 162 162
Lines 43672 43786 +114
==========================================
+ Hits 36039 36139 +100
- Misses 7633 7647 +14
Continue to review full report at Codecov.
|
acd4f52 to
fc907e5
Compare
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
Thanks @crepererum ! |
fc907e5 to
9c60e29
Compare
nevi-me
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Which issue does this PR close?
Closes #255.
Rationale for this change
Fixes NaN handling in parquet statistics. This is in line with the C++ stack:
https://github.com/apache/arrow/blob/b3e43987c47b2f01b204a2d954f882f7161616ef/cpp/src/parquet/statistics_test.cc#L1000-L1043
What changes are included in this PR?
Filter out NaN values from statistics + tests.
Are there any user-facing changes?
Yes: formally NaN were included in the stats but at "random" (i.e. when the data started with an NaN than the min/max values are NaN, otherwise min/max are non-NaN). Now the behavior is: NaN are excluded always. If the only NaN values are present, then min/max are unset.