apache · JFinis · Nov 10, 2023 · pitrou · Jun 5, 2025 · pitrou
diff --git a/LogicalTypes.md b/LogicalTypes.md
@@ -254,7 +254,10 @@ Used in contexts where precision is traded off for smaller footprint and potenti
 
 The primitive type is a 2-byte `FIXED_LEN_BYTE_ARRAY`.
 
-The sort order for `FLOAT16` is signed (with special handling of NANs and signed zeros); it uses the same [logic](https://github.com/apache/parquet-format#sort-order) as `FLOAT` and `DOUBLE`.
+The type-defined sort order for `FLOAT16` is signed (with special handling of NaNs and signed zeros),
+as for `FLOAT` and `DOUBLE`. It is recommended that writers use `IEEE754TotalOrder` when writing columns
+of this type to get well-defined handling of NaNs and signed zeros. See the `ColumnOrder` union in the
+[Thrift definition](src/main/thrift/parquet.thrift) for details.
 
 ## Temporal Types
 

diff --git a/README.md b/README.md
@@ -158,7 +158,9 @@ documented in [LogicalTypes.md][logical-types].
 Parquet stores min/max statistics at several levels (such as Column Chunk,
 Column Index, and Data Page). These statistics are according to a sort order,
 which is defined for each column in the file footer. Parquet supports common
-sort orders for logical and primitve types. The details are documented in the
+sort orders for logical and primitive types and also special orders for types
+where the common sort order is not unambiguously defined (e.g., NaN ordering
+for floating point types). The details are documented in the
 [Thrift definition](src/main/thrift/parquet.thrift) in the `ColumnOrder` union.
 
 ## Nested Encoding

diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift
@@ -1030,6 +1030,9 @@ struct RowGroup {
 /** Empty struct to signal the order defined by the physical or logical type */
 struct TypeDefinedOrder {}
 
+/** Empty struct to signal IEEE 754 total order for floating point types */
+struct IEEE754TotalOrder {}
+
 /**
  * Union to specify the order used for the min_value and max_value fields for a
  * column. This union takes the role of an enhanced enum that allows rich
@@ -1038,6 +1041,7 @@ struct TypeDefinedOrder {}
  * Possible values are:
  * * TypeDefinedOrder - the column uses the order defined by its logical or
  *                      physical type (if there is no logical type).
+ * * IEEE754TotalOrder - the floating point column uses IEEE 754 total order.
  *
  * If the reader does not support the value of this union, min and max stats
  * for this column should be ignored.
@@ -1082,8 +1086,12 @@ union ColumnOrder {
    *   BYTE_ARRAY - unsigned byte-wise comparison
    *   FIXED_LEN_BYTE_ARRAY - unsigned byte-wise comparison
    *
-   * (*) Because the sorting order is not specified properly for floating
-   *     point values (relations vs. total ordering) the following
+   * (*) Because the precise sorting order is ambiguous for floating
+   *     point types due to underspecified handling of NaN and -0/+0,
+   *     it is recommended that writers use IEEE_754_TOTAL_ORDER
+   *     for these types.
+   *
+   *     If TYPE_ORDER is used for floating point types, then the following
    *     compatibility rules should be applied when reading statistics:
    *     - If the min is a NaN, it should be ignored.
    *     - If the max is a NaN, it should be ignored.
@@ -1099,6 +1107,53 @@ union ColumnOrder {
    *       `-0.0` should be written into the min statistics field.
    */
   1: TypeDefinedOrder TYPE_ORDER;
+
+  /**
+   * The floating point type is ordered according to the totalOrder predicate,
+   * as defined in section 5.10 of IEEE-754 (2008 revision). Only columns of
+   * physical type FLOAT or DOUBLE, or logical type FLOAT16 may use this ordering.
+   *
+   * Intuitively, this orders floats mathematically, but defines -0 to be less
+   * than +0, -NaN to be less than anything else, and +NaN to be greater than
+   * anything else. It also defines an order between different bit representations
+   * of the same value.
+   *
+   * The formal definition is as follows:
+   *   a) If x<y, totalOrder(x, y) is true.
+   *   b) If x>y, totalOrder(x, y) is false.
+   *   c) If x=y:
+   *     1) totalOrder(−0, +0) is true.
+   *     2) totalOrder(+0, −0) is false.
+   *     3) If x and y represent the same floating-point datum:
+   *        i) If x and y have negative sign, totalOrder(x, y) is true if and
+   *           only if the exponent of x ≥ the exponent of y
+   *       ii) otherwise totalOrder(x, y) is true if and only if the exponent
+   *           of x ≤ the exponent of y.
+   *   d) If x and y are unordered numerically because x or y is NaN:
+   *     1) totalOrder(−NaN, y) is true where −NaN represents a NaN with
+   *        negative sign bit and y is a non-NaN floating-point number.
+   *     2) totalOrder(x, +NaN) is true where +NaN represents a NaN with
+   *        positive sign bit and x is a non-NaN floating-point number.
+   *     3) If x and y are both NaNs, then totalOrder reflects a total ordering
+   *        based on:
+   *         i) negative sign orders below positive sign
+   *        ii) signaling orders below quiet for +NaN, reverse for −NaN
+   *       iii) lesser payload, when regarded as an integer, orders below
+   *            greater payload for +NaN, reverse for −NaN.
+   *
+   * Note that this ordering can be implemented efficiently in software by bit-wise
+   * operations on the integer representation of the floating point values.
+   * E.g., this is a possible implementation for DOUBLE in Rust:
+   *
+   *   pub fn totalOrder(x: f64, y: f64) -> bool {
+   *     let mut x_int = x.to_bits() as i64;
+   *     let mut y_int = y.to_bits() as i64;
+   *     // Flip non-sign bits of negatives so integer order matches totalOrder
+   *     x_int ^= (((x_int >> 63) as u64) >> 1) as i64;
+   *     y_int ^= (((y_int >> 63) as u64) >> 1) as i64;
+   *     return x_int <= y_int;
+   *   }
+   */
+  2: IEEE754TotalOrder IEEE_754_TOTAL_ORDER;
 }
 
 struct PageLocation {
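
The `DOUBLE` example in the `IEEE_754_TOTAL_ORDER` comment should carry over to the other two eligible types with narrower integers. The following is a minimal sketch, not part of this change: the names `total_order_f32`, `float16_key`, and `total_order_f16` are hypothetical, and the `FLOAT16` variant assumes the 2-byte `FIXED_LEN_BYTE_ARRAY` holds the IEEE 754 half-precision bits in little-endian order.

```rust
/// totalOrder(x, y) for FLOAT, using the same bit trick as the DOUBLE example.
pub fn total_order_f32(x: f32, y: f32) -> bool {
    let mut x_int = x.to_bits() as i32;
    let mut y_int = y.to_bits() as i32;
    // The arithmetic shift yields all ones for negative values and zero
    // otherwise; the logical shift then clears the sign bit of that mask,
    // so the XOR flips only the exponent and mantissa bits of negatives.
    x_int ^= (((x_int >> 31) as u32) >> 1) as i32;
    y_int ^= (((y_int >> 31) as u32) >> 1) as i32;
    x_int <= y_int
}

/// Hypothetical helper: maps raw half-precision bits to an i16 whose
/// signed integer order agrees with totalOrder.
pub fn float16_key(bits: u16) -> i16 {
    let v = bits as i16;
    v ^ ((((v >> 15) as u16) >> 1) as i16)
}

/// totalOrder(x, y) for FLOAT16 values given as 2-byte arrays.
/// Assumption: the bytes are stored in little-endian order.
pub fn total_order_f16(x: [u8; 2], y: [u8; 2]) -> bool {
    float16_key(u16::from_le_bytes(x)) <= float16_key(u16::from_le_bytes(y))
}
```

For `FLOAT` and `DOUBLE`, Rust's standard library offers `f32::total_cmp` and `f64::total_cmp`, which can serve as a cross-check: `total_order_f32(x, y)` is expected to agree with `x.total_cmp(&y).is_le()`.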