Commit 3ce0760

MINOR: Small documentation fixes and deduplication (#491)
1 parent 3d65cc9 commit 3ce0760

2 files changed: 7 additions, 35 deletions

README.md

Lines changed: 4 additions & 33 deletions
@@ -155,40 +155,11 @@ documented in [LogicalTypes.md][logical-types].
 [logical-types]: LogicalTypes.md

 ### Sort Order
-
 Parquet stores min/max statistics at several levels (such as Column Chunk,
-Column Index and Data Page). Comparison for values of a type obey the
-following rules:
-
-1. Each logical type has a specified comparison order. If a column is
-annotated with an unknown logical type, statistics may not be used
-for pruning data. The sort order for logical types is documented in
-the [LogicalTypes.md][logical-types] page.
-2. For primitive types, the following rules apply:
-
-* BOOLEAN - false, true
-* INT32, INT64 - Signed comparison.
-* FLOAT, DOUBLE - Signed comparison with special handling of NaNs and
-signed zeros. The details are documented in the
-[Thrift definition](src/main/thrift/parquet.thrift) in the
-`ColumnOrder` union. They are summarized here but the Thrift definition
-is considered authoritative:
-* NaNs should not be written to min or max statistics fields.
-* If the computed max value is zero (whether negative or positive),
-`+0.0` should be written into the max statistics field.
-* If the computed min value is zero (whether negative or positive),
-`-0.0` should be written into the min statistics field.
-
-For backwards compatibility when reading files:
-* If the min is a NaN, it should be ignored.
-* If the max is a NaN, it should be ignored.
-* If the min is +0, the row group may contain -0 values as well.
-* If the max is -0, the row group may contain +0 values as well.
-* When looking for NaN values, min and max should be ignored.
-
-* BYTE_ARRAY and FIXED_LEN_BYTE_ARRAY - Lexicographic unsigned byte-wise
-comparison.
-
+Column Index, and Data Page). These statistics are according to a sort order,
+which is defined for each column in the file footer. Parquet supports common
+sort orders for logical and primitive types. The details are documented in the
+[Thrift definition](src/main/thrift/parquet.thrift) in the `ColumnOrder` union.

 ## Nested Encoding
 To encode nested columns, Parquet uses the Dremel encoding with definition and
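
The FLOAT/DOUBLE rules removed from the README above (and still described in the `ColumnOrder` comments in the Thrift definition, which remains authoritative) can be illustrated with a minimal Python sketch. The function names `float_min_max_for_stats` and `may_contain` are hypothetical and not part of any Parquet library, and returning `None` when every value is NaN is an assumption of this sketch rather than something the summary states.

```python
import math


def float_min_max_for_stats(values):
    """Writer side: return the (min, max) pair to store, or None.

    Hypothetical helper. Returning None for all-NaN input is an
    assumption of this sketch (meaning: write no min/max at all).
    """
    # NaNs should not be written to min or max statistics fields.
    non_nan = [v for v in values if not math.isnan(v)]
    if not non_nan:
        return None
    lo, hi = min(non_nan), max(non_nan)
    # A zero min is always written as -0.0, a zero max as +0.0.
    if lo == 0.0:
        lo = -0.0
    if hi == 0.0:
        hi = 0.0
    return lo, hi


def may_contain(stats, target):
    """Reader side: could a chunk with these stats hold `target`?

    Conservative: returns True whenever the stats cannot rule it out.
    """
    if stats is None or math.isnan(target):
        # When looking for NaN values, min and max should be ignored.
        return True
    lo, hi = stats
    # A NaN bound is ignored. IEEE comparison treats -0.0 == +0.0, so a
    # min of +0 or a max of -0 still admits zeros of the other sign.
    if not math.isnan(lo) and target < lo:
        return False
    if not math.isnan(hi) and target > hi:
        return False
    return True
```

For example, `float_min_max_for_stats([0.0, 1.5, float("nan")])` yields a min of `-0.0` and a max of `1.5`, and `may_contain` never prunes a chunk when the search value is NaN.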

src/main/thrift/parquet.thrift

Lines changed: 3 additions & 2 deletions
@@ -313,12 +313,12 @@ struct Statistics {

 /** Empty structs to use as logical type annotations */
 struct StringType {} // allowed for BYTE_ARRAY, must be encoded with UTF-8
-struct UUIDType {} // allowed for FIXED[16], must encoded raw UUID bytes
+struct UUIDType {} // allowed for FIXED[16], must be encoded as raw UUID bytes
 struct MapType {} // see LogicalTypes.md
 struct ListType {} // see LogicalTypes.md
 struct EnumType {} // allowed for BYTE_ARRAY, must be encoded with UTF-8
 struct DateType {} // allowed for INT32
-struct Float16Type {} // allowed for FIXED[2], must encoded raw FLOAT16 bytes
+struct Float16Type {} // allowed for FIXED[2], must be encoded as raw FLOAT16 bytes (see LogicalTypes.md)

 /**
 * Logical type to annotate a column that is always null.
@@ -1057,6 +1057,7 @@ union ColumnOrder {
 * UINT64 - unsigned comparison
 * DECIMAL - signed comparison of the represented value
 * DATE - signed comparison
+* FLOAT16 - signed comparison of the represented value (*)
 * TIME_MILLIS - signed comparison
 * TIME_MICROS - signed comparison
 * TIMESTAMP_MILLIS - signed comparison
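
The new FLOAT16 entry says values are compared by the number they represent, not by their raw bytes. A minimal Python sketch of why that distinction matters follows; it assumes the 2-byte value is stored as little-endian IEEE 754 half precision (LogicalTypes.md is the place to confirm this), and the helper names `decode_float16` and `float16_less_than` are hypothetical. The NaN and signed-zero handling marked by `(*)` is not covered here.

```python
import struct


def decode_float16(raw: bytes) -> float:
    """Decode a FIXED[2] value as an IEEE 754 half-precision float.

    Assumes little-endian byte order (an assumption of this sketch).
    """
    if len(raw) != 2:
        raise ValueError("FLOAT16 values must be exactly 2 bytes")
    return struct.unpack("<e", raw)[0]


def float16_less_than(a: bytes, b: bytes) -> bool:
    """Signed comparison of the represented values, not of the bytes."""
    return decode_float16(a) < decode_float16(b)


# Build two encoded values with the standard library instead of by hand.
neg_one = struct.pack("<e", -1.0)
two = struct.pack("<e", 2.0)
```

Here `float16_less_than(neg_one, two)` holds because -1.0 < 2.0, whereas the unsigned byte-wise order used for plain FIXED_LEN_BYTE_ARRAY columns would sort `neg_one` after `two`, since the sign bit makes its second byte larger.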
