Closed
Changes from 6 commits
15 changes: 15 additions & 0 deletions src/main/thrift/parquet.thrift
@@ -194,6 +194,15 @@ enum FieldRepetitionType {
/**
* Statistics per row group and per page
* All fields are optional.
*
* For BinaryStatistics in Parquet, we want to distinguish between the
* statistics derived from comparisons of signed or unsigned bytes. The min
* and max fields are deprecated for BinaryStatistics, instead relying on
* specification of {unsigned,signed}_{min,max}. The filter API should allow
* clients to specify which statistics and method of comparison should be used
* for filtering. To maintain backward format compatibility, when filtering
* based on signed statistics the signed_min and signed_max are checked first,
* and if they are unset it falls back to using the values in min and max.
Contributor

Let's clarify that min and max are always signed if present.

I wonder if any of this belongs in the README too?

*/
struct Statistics {
/** min and max value of the column, encoded in PLAIN encoding */
@@ -203,6 +212,12 @@ struct Statistics {
3: optional i64 null_count;
/** count of distinct values occurring */
4: optional i64 distinct_count;
/* Signed min and max for binary fields */
5: optional binary signed_max;
6: optional binary signed_min;
/* Unsigned min and max for binary fields */
7: optional binary unsigned_max;
8: optional binary unsigned_min;
}

/**
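For reference, the fallback described in the Statistics doc comment could be sketched roughly as below. This is only an illustration of the proposed semantics, not code from this PR or from any Parquet reader: the `Statistics` dataclass, `signed_key`, `signed_bounds`, and `unsigned_bounds` are hypothetical names, and applying the fallback per field (rather than only when both signed fields are unset) is an assumption the comment does not settle.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Statistics:
    """Hypothetical in-memory mirror of the Thrift struct (not a real Parquet API)."""
    max: Optional[bytes] = None           # deprecated for binary columns
    min: Optional[bytes] = None           # deprecated for binary columns
    signed_max: Optional[bytes] = None
    signed_min: Optional[bytes] = None
    unsigned_max: Optional[bytes] = None
    unsigned_min: Optional[bytes] = None


def signed_key(value: bytes) -> Tuple[int, ...]:
    """Sort key that compares each byte as a signed 8-bit integer."""
    return tuple(b - 256 if b >= 128 else b for b in value)


def signed_bounds(stats: Statistics) -> Tuple[Optional[bytes], Optional[bytes]]:
    """Prefer signed_min/signed_max; fall back to the legacy min/max.

    Assumption: the fallback is applied to each field independently.
    """
    lo = stats.signed_min if stats.signed_min is not None else stats.min
    hi = stats.signed_max if stats.signed_max is not None else stats.max
    return lo, hi


def unsigned_bounds(stats: Statistics) -> Tuple[Optional[bytes], Optional[bytes]]:
    """Unsigned statistics have no legacy fallback in this sketch."""
    return stats.unsigned_min, stats.unsigned_max


# The two orderings disagree as soon as a byte has its high bit set.
values = [b"\x7f", b"\x80"]
unsigned_min = min(values)                 # Python compares bytes as unsigned
signed_min = min(values, key=signed_key)   # 0x80 is -128 when read as signed

# A file written before this change only carries min/max.
legacy = Statistics(min=b"a", max=b"z")
```

With a legacy file, `signed_bounds(legacy)` returns the old `min` and `max`, while `unsigned_bounds(legacy)` returns `(None, None)`, so a reader filtering on unsigned order would have no statistics to prune with.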