5 changes: 4 additions & 1 deletion LogicalTypes.md
@@ -254,7 +254,10 @@ Used in contexts where precision is traded off for smaller footprint and potenti

The primitive type is a 2-byte `FIXED_LEN_BYTE_ARRAY`.

The sort order for `FLOAT16` is signed (with special handling of NANs and signed zeros); it uses the same [logic](https://github.com/apache/parquet-format#sort-order) as `FLOAT` and `DOUBLE`.
The type-defined sort order for `FLOAT16` is signed (with special handling of NaNs and signed zeros),
as for `FLOAT` and `DOUBLE`. It is recommended that writers use IEEE754TotalOrder when writing columns
of this type for a well-defined handling of NaNs and signed zeros. See the `ColumnOrder` union in the
[Thrift definition](src/main/thrift/parquet.thrift) for details.
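
As an illustration of what IEEE 754 total order means for `FLOAT16`, here is a minimal sketch that compares two values directly on their 2-byte representation. The function names are hypothetical, and the sketch assumes the two bytes hold the IEEE binary16 bit pattern in little-endian order; it is not taken from any Parquet implementation.

```rust
/// Maps the raw bits of an IEEE 754 binary16 value to an `i16` key whose
/// signed integer order matches the totalOrder predicate.
/// (Hypothetical helper, for illustration only.)
pub fn f16_total_order_key(bits: u16) -> i16 {
    let v = bits as i16;
    // `v >> 15` is an arithmetic shift: all ones if the sign bit is set,
    // zero otherwise. Shifting that right by one as an unsigned value
    // gives a mask over the 15 exponent and mantissa bits.
    let flip_mask = (((v >> 15) as u16) >> 1) as i16;
    // For negative values, flip exponent and mantissa so that larger
    // magnitudes map to smaller keys. Positive values are unchanged.
    v ^ flip_mask
}

/// Returns true if x orders at or below y under total order.
/// Assumption: each array is the little-endian binary16 bit pattern.
pub fn f16_total_order(x: [u8; 2], y: [u8; 2]) -> bool {
    f16_total_order_key(u16::from_le_bytes(x)) <= f16_total_order_key(u16::from_le_bytes(y))
}
```

Under these assumptions, `-0.0` (bytes `[0x00, 0x80]`) orders below `+0.0` (bytes `[0x00, 0x00]`), and a NaN with a negative sign bit orders below negative infinity.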

## Temporal Types

4 changes: 3 additions & 1 deletion README.md
@@ -158,7 +158,9 @@ documented in [LogicalTypes.md][logical-types].
Parquet stores min/max statistics at several levels (such as Column Chunk,
Column Index, and Data Page). These statistics are according to a sort order,
which is defined for each column in the file footer. Parquet supports common
sort orders for logical and primitive types. The details are documented in the
sort orders for logical and primitive types and also special orders for types
where the common sort order is not unambiguously defined (e.g., NaN ordering
for floating point types). The details are documented in the
[Thrift definition](src/main/thrift/parquet.thrift) in the `ColumnOrder` union.
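
To make the NaN ambiguity concrete, the following sketch computes min/max statistics for a slice of `DOUBLE` values under IEEE 754 total order, using the standard library's `f64::total_cmp`. The function name is hypothetical and the sketch is not how any particular Parquet writer is implemented.

```rust
/// Returns the (min, max) of `values` under IEEE 754 total order,
/// or `None` for an empty slice.
/// (Hypothetical helper, for illustration only.)
pub fn total_order_min_max(values: &[f64]) -> Option<(f64, f64)> {
    // `total_cmp` defines an order for every bit pattern, so NaNs and
    // signed zeros need no special cases here.
    let min = values.iter().copied().min_by(|a, b| a.total_cmp(b))?;
    let max = values.iter().copied().max_by(|a, b| a.total_cmp(b))?;
    Some((min, max))
}
```

With this order, a chunk containing both `-0.0` and `+0.0` gets `-0.0` as its min and `+0.0` as its max, and a NaN with a positive sign bit becomes the max instead of leaving the statistics undefined.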

## Nested Encoding
59 changes: 57 additions & 2 deletions src/main/thrift/parquet.thrift
@@ -1030,6 +1030,9 @@ struct RowGroup {
/** Empty struct to signal the order defined by the physical or logical type */
struct TypeDefinedOrder {}

/** Empty struct to signal IEEE 754 total order for floating point types */
struct IEEE754TotalOrder {}

/**
* Union to specify the order used for the min_value and max_value fields for a
* column. This union takes the role of an enhanced enum that allows rich
@@ -1038,6 +1041,7 @@ struct TypeDefinedOrder {}
* Possible values are:
* * TypeDefinedOrder - the column uses the order defined by its logical or
* physical type (if there is no logical type).
* * IEEE754TotalOrder - the floating point column uses IEEE 754 total order.
*
* If the reader does not support the value of this union, min and max stats
* for this column should be ignored.
@@ -1082,8 +1086,12 @@ union ColumnOrder {
* BYTE_ARRAY - unsigned byte-wise comparison
* FIXED_LEN_BYTE_ARRAY - unsigned byte-wise comparison
*
* (*) Because the sorting order is not specified properly for floating
* point values (relations vs. total ordering) the following
* (*) Because the precise sorting order is ambiguous for floating
* point types due to underspecified handling of NaN and -0/+0,
* it is recommended that writers use IEEE_754_TOTAL_ORDER
* for these types.
*
* If TYPE_ORDER is used for floating point types, then the following
* compatibility rules should be applied when reading statistics:
* - If the min is a NaN, it should be ignored.
* - If the max is a NaN, it should be ignored.
@@ -1099,6 +1107,53 @@ union ColumnOrder {
* `-0.0` should be written into the min statistics field.
*/
1: TypeDefinedOrder TYPE_ORDER;

/*
* The floating point type is ordered according to the totalOrder predicate,
* as defined in section 5.10 of IEEE-754 (2008 revision). Only columns of
* physical type FLOAT or DOUBLE, or logical type FLOAT16 may use this ordering.
*
* Intuitively, this orders floats mathematically, but defines -0 to be less
* than +0, -NaN to be less than anything else, and +NaN to be greater than
* anything else. It also defines an order between different bit representations
* of the same value.
*
* The formal definition is as follows:
* a) If x<y, totalOrder(x, y) is true.
* b) If x>y, totalOrder(x, y) is false.
* c) If x=y:
* 1) totalOrder(−0, +0) is true.
* 2) totalOrder(+0, −0) is false.
* 3) If x and y represent the same floating-point datum:
* i) If x and y have negative sign, totalOrder(x, y) is true if and
* only if the exponent of x ≥ the exponent of y
* ii) otherwise totalOrder(x, y) is true if and only if the exponent
* of x ≤ the exponent of y.
* d) If x and y are unordered numerically because x or y is NaN:
* 1) totalOrder(−NaN, y) is true where −NaN represents a NaN with
* negative sign bit and y is a non-NaN floating-point number.
* 2) totalOrder(x, +NaN) is true where +NaN represents a NaN with
* positive sign bit and x is a non-NaN floating-point number.
* 3) If x and y are both NaNs, then totalOrder reflects a total ordering
* based on:
* i) negative sign orders below positive sign
* ii) signaling orders below quiet for +NaN, reverse for −NaN
* iii) lesser payload, when regarded as an integer, orders below
* greater payload for +NaN, reverse for −NaN.
*
* Note that this ordering can be implemented efficiently in software by bit-wise
* operations on the integer representation of the floating point values.
* E.g., this is a possible implementation for DOUBLE in Rust:
*
* pub fn totalOrder(x: f64, y: f64) -> bool {
* let mut x_int = x.to_bits() as i64;
* let mut y_int = y.to_bits() as i64;
* x_int ^= (((x_int >> 63) as u64) >> 1) as i64;
* y_int ^= (((y_int >> 63) as u64) >> 1) as i64;
Comment on lines +1151 to +1152
Member

Let's add a comment explaining those lines?

Suggested change
* x_int ^= (((x_int >> 63) as u64) >> 1) as i64;
* y_int ^= (((y_int >> 63) as u64) >> 1) as i64;
* // Turn sign+mantissa into 2's complement representation
* x_int ^= (((x_int >> 63) as u64) >> 1) as i64;
* y_int ^= (((y_int >> 63) as u64) >> 1) as i64;

Member

That's assuming the comment is right. Please double-check :)

* return x_int <= y_int;
* }
*/
2: IEEE754TotalOrder IEEE_754_TOTAL_ORDER;
}
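
The bit manipulation in the `totalOrder` example can be unpacked as follows. This is a sketch with hypothetical function names, written to make each step visible. The XOR flips the exponent and mantissa bits of negative values only, which gives the bit patterns the same relative order as two's complement integers. It is not an exact sign-magnitude to two's complement conversion, which would also add one: `-0.0` maps to `-1`, not to `0`.

```rust
/// Maps an `f64` to an `i64` key whose signed integer order matches
/// the IEEE 754 totalOrder predicate.
/// (Hypothetical helper, for illustration only.)
pub fn total_order_key(x: f64) -> i64 {
    let bits = x.to_bits() as i64;
    // Arithmetic shift: all ones if the sign bit is set, zero otherwise.
    let sign_mask = bits >> 63;
    // Logical shift by one clears the top bit, leaving a mask that covers
    // the exponent and mantissa but not the sign.
    let flip_mask = ((sign_mask as u64) >> 1) as i64;
    // Negative values get exponent and mantissa inverted, so that larger
    // magnitudes map to smaller keys. Positive values are unchanged.
    bits ^ flip_mask
}

/// Returns true if x orders at or below y under total order.
pub fn total_order(x: f64, y: f64) -> bool {
    total_order_key(x) <= total_order_key(y)
}
```

Rust's standard library offers `f64::total_cmp`, which can serve as a cross-check: for any pair of values, `total_order(x, y)` should agree with `x.total_cmp(&y) != Ordering::Greater`.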

struct PageLocation {