Closed
9 changes: 9 additions & 0 deletions README.md
@@ -212,6 +212,15 @@ The format is explicitly designed to separate the metadata from the data. This
allows splitting columns into multiple files, as well as having a single metadata
file reference multiple parquet files.

## RowGroup Statistics
In Parquet, the metadata for each RowGroup contains Statistics, which clients can use
to filter data. An example implementation of the filtering logic can be found
in [parquet-mr](https://github.com/apache/parquet-mr). For primitive types, Statistics
include the minimum and maximum values. For binary data, the byte strings can additionally
be interpreted as either _signed_ or _unsigned_, and the two interpretations use different
comparison operations. The corresponding values are stored in the optional fields
`unsigned_min`, `unsigned_max`, `signed_min` and `signed_max`.
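
As a rough illustration of why the two interpretations need separate fields, the sketch below computes both sets of bounds for a handful of one-byte values. It is not taken from parquet-mr. The helper names `signed_key`, `unsigned_key` and `binary_stats` are hypothetical, and the comparison is assumed to be lexicographic, byte by byte.

```python
def signed_key(value: bytes) -> tuple:
    # Interpret each byte as a two's complement number in -128..127.
    return tuple(b - 256 if b >= 128 else b for b in value)


def unsigned_key(value: bytes) -> tuple:
    # Interpret each byte as an unsigned number in 0..255.
    return tuple(value)


def binary_stats(values):
    # Hypothetical helper: derive all four bounds from a list of byte strings.
    return {
        "unsigned_min": min(values, key=unsigned_key),
        "unsigned_max": max(values, key=unsigned_key),
        "signed_min": min(values, key=signed_key),
        "signed_max": max(values, key=signed_key),
    }


stats = binary_stats([b"\x01", b"\x7f", b"\x80", b"\xff"])
# Unsigned ordering: 0x01 < 0x7f < 0x80 < 0xff
# Signed ordering:   0x80 (-128) < 0xff (-1) < 0x01 < 0x7f (127)
```

Because Python compares tuples element by element and treats a shorter tuple as smaller when it is a prefix of a longer one, both key functions also order a byte string before any longer string that starts with it.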

## Configurations
- Row group size: Larger row groups allow for larger column chunks which makes it
possible to do larger sequential IO. Larger groups also require more buffering in
22 changes: 22 additions & 0 deletions src/main/thrift/parquet.thrift
@@ -194,6 +194,22 @@ enum FieldRepetitionType {
/**
* Statistics per row group and per page
* All fields are optional.
*
* Binaries are sorted lexicographically (byte by byte), treating each byte as
* an integer. The signed sorting treats each byte as a signed two's
* complement number, and the unsigned sorting treats each byte as an unsigned number.
* When one bytestring is a prefix of another, the containing bytestring is
* "greater than" the prefix.
*
* For BinaryStatistics in Parquet, we want to distinguish between the
* statistics derived from comparisons of signed or unsigned bytes. The min
* and max fields are deprecated for BinaryStatistics in favor of
* {unsigned,signed}_{min,max}. The filter API should allow
* clients to specify which statistics and method of comparison should be used
* for filtering. To maintain backward format compatibility, a filter based on
* signed statistics checks signed_min and signed_max first and, if they are
* unset, falls back to the values in min and max, treating them as signed
* bytestrings.
*/
struct Statistics {
/** min and max value of the column, encoded in PLAIN encoding */
@@ -203,6 +219,12 @@ struct Statistics {
3: optional i64 null_count;
/** count of distinct values occurring */
4: optional i64 distinct_count;
/** Signed min and max for binary fields */
5: optional binary signed_max;
6: optional binary signed_min;
/** Unsigned min and max for binary fields */
7: optional binary unsigned_max;
8: optional binary unsigned_min;
}

/**
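
The backward-compatibility rule in the `Statistics` comment can be sketched as follows. This is an illustration under stated assumptions, not the parquet-mr filter API. The `Statistics` dataclass is a hypothetical stand-in for the Thrift struct, and `signed_bounds` and `might_contain_signed` are hypothetical names. The sketch also assumes the fallback applies to the pair as a whole: if either `signed_min` or `signed_max` is unset, both bounds come from `min` and `max`.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Statistics:
    # Hypothetical stand-in for the Thrift struct; every field is optional.
    max: Optional[bytes] = None
    min: Optional[bytes] = None
    signed_max: Optional[bytes] = None
    signed_min: Optional[bytes] = None
    unsigned_max: Optional[bytes] = None
    unsigned_min: Optional[bytes] = None


def signed_key(value: bytes) -> tuple:
    # Interpret each byte as a two's complement number in -128..127.
    return tuple(b - 256 if b >= 128 else b for b in value)


def signed_bounds(stats: Statistics) -> Optional[Tuple[bytes, bytes]]:
    # Prefer the new signed fields when both are present.
    if stats.signed_min is not None and stats.signed_max is not None:
        return stats.signed_min, stats.signed_max
    # Fall back to the deprecated fields, read as signed byte strings.
    if stats.min is not None and stats.max is not None:
        return stats.min, stats.max
    return None


def might_contain_signed(stats: Statistics, value: bytes) -> bool:
    # Without usable bounds the row group cannot be skipped.
    bounds = signed_bounds(stats)
    if bounds is None:
        return True
    low, high = bounds
    return signed_key(low) <= signed_key(value) <= signed_key(high)


# A file written before the new fields existed carries only min and max.
legacy = Statistics(min=b"\x80", max=b"\x7f")
# A newer file carries explicit signed bounds, which take priority.
current = Statistics(signed_min=b"\x01", signed_max=b"\x05", min=b"\x01", max=b"\x7f")
```

Returning `True` when no bounds are available keeps the check conservative: a row group is skipped only when the statistics prove the value cannot be present.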