
Commit bef5438

Zoltan Ivanfi authored and rdblue committed
PARQUET-686: Clarifications about min-max stats.
Changed some descriptions to reflect code changes that happened during code review without updating the corresponding comments and documentation:

* Removed references to the `SIGNED` and `UNSIGNED` sort orders, which were removed in favour of a single `TYPE_ORDER`.
* Removed obsolete references to `column_orders`'s effect on the `min` and `max` values, since those were declared obsolete instead and `column_orders` only affects the new `min_value` and `max_value` fields.
* Clarified `ColumnOrder`'s purpose, since the purpose of a union containing a single empty struct was hard to grasp.

Author: Zoltan Ivanfi <[email protected]>

Closes #55 from zivanfi/master and squashes the following commits:

a499d86 [Zoltan Ivanfi] Comparison rules updates.
0c973f7 [Zoltan Ivanfi] PARQUET-686: Further clarifications.
f8fab0b [Zoltan Ivanfi] PARQUET-686: Minor improvements in Thrift comments.
c86090d [Zoltan Ivanfi] PARQUET-686: Clarifications about min-max stats.
1 parent 523d7b6 commit bef5438

2 files changed

Lines changed: 64 additions & 29 deletions


LogicalTypes.md

Lines changed: 15 additions & 11 deletions
@@ -37,7 +37,7 @@ may require additional metadata fields, as well as rules for those fields.
 `UTF8` may only be used to annotate the binary primitive type and indicates
 that the byte array should be interpreted as a UTF-8 encoded character string.

-The sort order used for `UTF8` strings is `UNSIGNED` byte-wise comparison.
+The sort order used for `UTF8` strings is unsigned byte-wise comparison.

 ## Numeric Types

@@ -57,7 +57,7 @@ allows.
 implied by the `int32` and `int64` primitive types if no other annotation is
 present and should be considered optional.

-The sort order used for signed integer types is `SIGNED`.
+The sort order used for signed integer types is signed.

 ### Unsigned Integers

@@ -74,7 +74,7 @@ allows.
 `UINT_8`, `UINT_16`, and `UINT_32` must annotate an `int32` primitive type and
 `UINT_64` must annotate an `int64` primitive type.

-The sort order used for unsigned integer types is `UNSIGNED`.
+The sort order used for unsigned integer types is unsigned.

 ### DECIMAL

@@ -104,8 +104,8 @@ integer. A precision too large for the underlying type (see below) is an error.
 A `SchemaElement` with the `DECIMAL` `ConvertedType` must also have both
 `scale` and `precision` fields set, even if scale is 0 by default.

-The sort order used for `DECIMAL` values is `SIGNED`. The order is equivalent
-to signed comparison of decimal values.
+The sort order used for `DECIMAL` values is signed comparison of the represented
+value.

 If the column uses `int32` or `int64` physical types, then signed comparison of
 the integer values produces the correct ordering. If the physical type is
@@ -121,39 +121,39 @@ comparison.
 annotate an `int32` that stores the number of days from the Unix epoch, 1
 January 1970.

-The sort order used for `DATE` is `SIGNED`.
+The sort order used for `DATE` is signed.

 ### TIME\_MILLIS

 `TIME_MILLIS` is used for a logical time type with millisecond precision,
 without a date. It must annotate an `int32` that stores the number of
 milliseconds after midnight.

-The sort order used for `TIME\_MILLIS` is `SIGNED`.
+The sort order used for `TIME\_MILLIS` is signed.

 ### TIME\_MICROS

 `TIME_MICROS` is used for a logical time type with microsecond precision,
 without a date. It must annotate an `int64` that stores the number of
 microseconds after midnight.

-The sort order used for `TIME\_MICROS` is `SIGNED`.
+The sort order used for `TIME\_MICROS` is signed.

 ### TIMESTAMP\_MILLIS

 `TIMESTAMP_MILLIS` is used for a combined logical date and time type, with
 millisecond precision. It must annotate an `int64` that stores the number of
 milliseconds from the Unix epoch, 00:00:00.000 on 1 January 1970, UTC.

-The sort order used for `TIMESTAMP\_MILLIS` is `SIGNED`.
+The sort order used for `TIMESTAMP\_MILLIS` is signed.

 ### TIMESTAMP\_MICROS

 `TIMESTAMP_MICROS` is used for a combined logical date and time type with
 microsecond precision. It must annotate an `int64` that stores the number of
 microseconds from the Unix epoch, 00:00:00.000000 on 1 January 1970, UTC.

-The sort order used for `TIMESTAMP\_MICROS` is `SIGNED`.
+The sort order used for `TIMESTAMP\_MICROS` is signed.

 ### INTERVAL

@@ -169,7 +169,7 @@ example, there is no requirement that a large number of days should be
 expressed as a mix of months and days because there is not a constant
 conversion from days to months.

-The sort order used for `INTERVAL` is `UNSIGNED`, produced by sorting by
+The sort order used for `INTERVAL` is unsigned, produced by sorting by
 the value of months, then days, then milliseconds with unsigned comparison.

 ## Embedded Types
@@ -184,6 +184,8 @@ string of valid JSON as defined by the [JSON specification][json-spec]

 [json-spec]: http://json.org/

+The sort order used for `JSON` is unsigned byte-wise comparison.
+
 ### BSON

 `BSON` is used for an embedded BSON document. It must annotate a `binary`
@@ -192,6 +194,8 @@ defined by the [BSON specification][bson-spec].

 [bson-spec]: http://bsonspec.org/spec.html

+The sort order used for `BSON` is unsigned byte-wise comparison.
+
 ## Nested Types

 This section specifies how `LIST` and `MAP` can be used to encode nested types
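
The difference between the orders named in this file's diff can be illustrated with a small Python sketch. It is not part of the commit. The helper names are hypothetical, and two layouts are assumptions taken from the Parquet logical type specification rather than from this diff: `DECIMAL` stored in a byte array is assumed to hold the unscaled value as big-endian two's complement, and `INTERVAL` is assumed to be a 12-byte value holding three little-endian unsigned 32-bit integers (months, days, milliseconds).

```python
import struct
from typing import List, Tuple


def unsigned_bytewise_key(data: bytes) -> bytes:
    # Python compares bytes objects element by element as values 0..255,
    # so the raw bytes already sort in unsigned byte-wise order.
    return data


def signed_bytewise_key(data: bytes) -> List[int]:
    # Reinterpret every byte as a signed 8-bit integer (-128..127).
    # This is the ordering a naive comparison over signed bytes would give.
    return [b - 256 if b >= 0x80 else b for b in data]


def decimal_unscaled(data: bytes) -> int:
    # Assumption: big-endian two's complement unscaled value. Within one
    # column the scale is fixed, so comparing unscaled values compares the
    # represented decimals.
    return int.from_bytes(data, byteorder="big", signed=True)


def interval_key(data: bytes) -> Tuple[int, int, int]:
    # Assumption: 12 bytes = months, days, milliseconds, each an unsigned
    # little-endian 32-bit integer. Tuples compare field by field.
    return struct.unpack("<III", data)


# UTF8: any non-ASCII character has bytes >= 0x80, which a signed
# comparison would wrongly place before ASCII characters.
strings = [s.encode("utf-8") for s in ("z", "é", "a")]
by_unsigned = sorted(strings, key=unsigned_bytewise_key)
by_signed = sorted(strings, key=signed_bytewise_key)

# DECIMAL in two bytes: -1 is ff ff, 1 is 00 01.
decimals = [b"\xff\xff", b"\x00\x01"]
by_value = sorted(decimals, key=decimal_unscaled)
```

With these inputs, `by_unsigned` places `é` last while `by_signed` places it first, and `by_value` places the encoding of -1 before the encoding of 1, which a plain byte-wise comparison would not.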

src/main/thrift/parquet.thrift

Lines changed: 49 additions & 18 deletions
@@ -28,17 +28,6 @@ namespace java org.apache.parquet.format
  * with the encodings to control the on disk storage format.
  * For example INT16 is not included as a type since a good encoding of INT32
  * would handle this.
- *
- * When a logical type is not present, the type-defined sort order of these
- * physical types are:
- * * BOOLEAN - false, true
- * * INT32 - signed comparison
- * * INT64 - signed comparison
- * * INT96 - signed comparison
- * * FLOAT - signed comparison
- * * DOUBLE - signed comparison
- * * BYTE_ARRAY - unsigned byte-wise comparison
- * * FIXED_LEN_BYTE_ARRAY - unsigned byte-wise comparison
  */
 enum Type {
   BOOLEAN = 0;
@@ -219,12 +208,12 @@ struct Statistics {
    * Values are encoded using PLAIN encoding, except that variable-length byte
    * arrays do not include a length prefix.
    *
-   * These fields encode min and max values determined by SIGNED comparison
+   * These fields encode min and max values determined by signed comparison
    * only. New files should use the correct order for a column's logical type
    * and store the values in the min_value and max_value fields.
    *
    * To support older readers, these may be set when the column order is
-   * SIGNED.
+   * signed.
    */
   1: optional binary max;
   2: optional binary min;
@@ -582,7 +571,9 @@ struct RowGroup {
 struct TypeDefinedOrder {}

 /**
- * Union to specify the order used for min, max, and sorting values in a column.
+ * Union to specify the order used for the min_value and max_value fields for a
+ * column. This union takes the role of an enhanced enum that allows rich
+ * elements (which will be needed for a collation-based ordering in the future).
  *
  * Possible values are:
  * * TypeDefinedOrder - the column uses the order defined by its logical or
@@ -592,6 +583,41 @@ struct TypeDefinedOrder {}
  * for this column should be ignored.
  */
 union ColumnOrder {
+
+  /**
+   * The sort orders for logical types are:
+   *   UTF8 - unsigned byte-wise comparison
+   *   INT8 - signed comparison
+   *   INT16 - signed comparison
+   *   INT32 - signed comparison
+   *   INT64 - signed comparison
+   *   UINT8 - unsigned comparison
+   *   UINT16 - unsigned comparison
+   *   UINT32 - unsigned comparison
+   *   UINT64 - unsigned comparison
+   *   DECIMAL - signed comparison of the represented value
+   *   DATE - signed comparison
+   *   TIME_MILLIS - signed comparison
+   *   TIME_MICROS - signed comparison
+   *   TIMESTAMP_MILLIS - signed comparison
+   *   TIMESTAMP_MICROS - signed comparison
+   *   INTERVAL - unsigned comparison
+   *   JSON - unsigned byte-wise comparison
+   *   BSON - unsigned byte-wise comparison
+   *   ENUM - unsigned byte-wise comparison
+   *   LIST - undefined
+   *   MAP - undefined
+   *
+   * In the absence of logical types, the sort order is determined by the physical type:
+   *   BOOLEAN - false, true
+   *   INT32 - signed comparison
+   *   INT64 - signed comparison
+   *   INT96 (only used for legacy timestamps) - unsigned comparison
+   *   FLOAT - signed comparison of the represented value
+   *   DOUBLE - signed comparison of the represented value
+   *   BYTE_ARRAY - unsigned byte-wise comparison
+   *   FIXED_LEN_BYTE_ARRAY - unsigned byte-wise comparison
+   */
   1: TypeDefinedOrder TYPE_ORDER;
 }

@@ -626,11 +652,16 @@ struct FileMetaData {
   6: optional string created_by

   /**
-   * Sort order used for each column in this file.
+   * Sort order used for the min_value and max_value fields of each column in
+   * this file. Each sort order corresponds to one column, determined by its
+   * position in the list, matching the position of the column in the schema.
+   *
+   * Without column_orders, the meaning of the min_value and max_value fields is
+   * undefined. To ensure well-defined behaviour, if min_value and max_value are
+   * written to a Parquet file, column_orders must be written as well.
    *
-   * If this list is not present, then the order for each column is assumed to
-   * be Signed. In addition, min and max values for INTERVAL or DECIMAL stored
-   * as fixed or bytes should be ignored.
+   * The obsolete min and max fields are always sorted by signed comparison
+   * regardless of column_orders.
    */
   7: optional list<ColumnOrder> column_orders;
 }
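
One way a reader might apply the rules in these comments is sketched below in Python. It is not part of the commit and not how any particular Parquet implementation behaves. The `Statistics` dataclass is a hypothetical, simplified stand-in for the Thrift struct, `usable_min_max` and its `type_order_is_signed` parameter are invented for illustration, and the fallback to the obsolete fields is an inference from the statement that `min` and `max` are always sorted by signed comparison.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# The only ColumnOrder member defined at this commit, modelled as a string.
TYPE_ORDER = "TYPE_ORDER"


@dataclass
class Statistics:
    # Hypothetical stand-in for the Thrift Statistics struct.
    min: Optional[bytes] = None        # obsolete, signed comparison only
    max: Optional[bytes] = None        # obsolete, signed comparison only
    min_value: Optional[bytes] = None  # meaning defined by column_orders
    max_value: Optional[bytes] = None  # meaning defined by column_orders


def usable_min_max(
    stats: Statistics,
    column_index: int,
    column_orders: Optional[List[str]],
    type_order_is_signed: bool,
) -> Optional[Tuple[bytes, bytes]]:
    """Return a (min, max) pair the reader may trust, or None."""
    # min_value/max_value are only well defined when column_orders has an
    # entry for this column (matched by position) with an order we know.
    if (
        column_orders is not None
        and column_index < len(column_orders)
        and column_orders[column_index] == TYPE_ORDER
        and stats.min_value is not None
        and stats.max_value is not None
    ):
        return stats.min_value, stats.max_value

    # Assumption: the obsolete fields are still meaningful when the order
    # defined for the column's type is itself signed comparison.
    if type_order_is_signed and stats.min is not None and stats.max is not None:
        return stats.min, stats.max

    # Otherwise the statistics cannot be interpreted safely.
    return None
```

For a column whose type-defined order is unsigned (for example `UTF8`) in a file without `column_orders`, this sketch returns `None`, so the reader would skip statistics-based filtering for that column.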
