-
Notifications
You must be signed in to change notification settings - Fork 467
PARQUET-686: Add Order to store the order used for min/max stats. #46
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 4 commits
4534062
faa9edb
9962df8
eed4d47
c6e43b0
ffbb60b
9447fb8
f878c34
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -37,6 +37,8 @@ may require additional metadata fields, as well as rules for those fields. | |
| `UTF8` may only be used to annotate the binary primitive type and indicates | ||
| that the byte array should be interpreted as a UTF-8 encoded character string. | ||
|
|
||
| The sort order used for `UTF8` strings must be `UNSIGNED` byte-wise comparison. | ||
|
|
||
| ## Numeric Types | ||
|
|
||
| ### Signed Integers | ||
|
|
@@ -55,6 +57,8 @@ allows. | |
| implied by the `int32` and `int64` primitive types if no other annotation is | ||
| present and should be considered optional. | ||
|
|
||
| The sort order used for signed integer types must be `SIGNED`. | ||
|
||
|
|
||
| ### Unsigned Integers | ||
|
|
||
| `UINT_8`, `UINT_16`, `UINT_32`, and `UINT_64` annotations can be used to | ||
|
|
@@ -70,6 +74,8 @@ allows. | |
| `UINT_8`, `UINT_16`, and `UINT_32` must annotate an `int32` primitive type and | ||
| `UINT_64` must annotate an `int64` primitive type. | ||
|
|
||
| The sort order used for unsigned integer types must be `UNSIGNED`. | ||
|
|
||
| ### DECIMAL | ||
|
|
||
| `DECIMAL` annotation represents arbitrary-precision signed decimal numbers of | ||
|
|
@@ -98,6 +104,15 @@ integer. A precision too large for the underlying type (see below) is an error. | |
| A `SchemaElement` with the `DECIMAL` `ConvertedType` must also have both | ||
| `scale` and `precision` fields set, even if scale is 0 by default. | ||
|
|
||
| The sort order used for `DECIMAL` values must be `SIGNED`. The order is | ||
|
||
| must be equivalent to signed comparison of decimal values. | ||
|
|
||
| If the column uses `int32` or `int64` physical types, then signed comparison of | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thanks for this clarification. |
||
| the integer values produces the correct ordering. If the physical type is | ||
| fixed, then the correct ordering can be produced by flipping the | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. fixed_len_byte_array?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes. Fixed is used interchangeably with the full name elsewhere as well.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Gotcha, I'm not totally familiar with some of these conventions. |
||
| most-significant bit in the first byte and then using unsigned byte-wise | ||
| comparison. | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What about decimals encoded into variable-length binary?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There is an optimization, but it is too long to describe here. The main point is that the order must match the signed order produced by comparing the decimal values. This paragraph only gives optimizations for easy types.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Makes sense, thanks for explaining. |
||
|
|
||
| ## Date/Time Types | ||
|
|
||
| ### DATE | ||
|
|
@@ -106,30 +121,40 @@ A `SchemaElement` with the `DECIMAL` `ConvertedType` must also have both | |
| annotate an `int32` that stores the number of days from the Unix epoch, 1 | ||
| January 1970. | ||
|
|
||
| The sort order used for `DATE` is `SIGNED`. | ||
|
|
||
| ### TIME\_MILLIS | ||
|
|
||
| `TIME_MILLIS` is used for a logical time type with millisecond precision, | ||
| without a date. It must annotate an `int32` that stores the number of | ||
| milliseconds after midnight. | ||
|
|
||
| The sort order used for `TIME\_MILLIS` is `SIGNED`. | ||
|
|
||
| ### TIME\_MICROS | ||
|
|
||
| `TIME_MICROS` is used for a logical time type with microsecond precision, | ||
| without a date. It must annotate an `int64` that stores the number of | ||
| microseconds after midnight. | ||
|
|
||
| The sort order used for `TIME\_MICROS` is `SIGNED`. | ||
|
|
||
| ### TIMESTAMP\_MILLIS | ||
|
|
||
| `TIMESTAMP_MILLIS` is used for a combined logical date and time type, with | ||
| millisecond precision. It must annotate an `int64` that stores the number of | ||
| milliseconds from the Unix epoch, 00:00:00.000 on 1 January 1970, UTC. | ||
|
|
||
| The sort order used for `TIMESTAMP\_MILLIS` is `SIGNED`. | ||
|
|
||
| ### TIMESTAMP\_MICROS | ||
|
|
||
| `TIMESTAMP_MICROS` is used for a combined logical date and time type with | ||
| microsecond precision. It must annotate an `int64` that stores the number of | ||
| microseconds from the Unix epoch, 00:00:00.000000 on 1 January 1970, UTC. | ||
|
|
||
| The sort order used for `TIMESTAMP\_MICROS` is `SIGNED`. | ||
|
|
||
| ### INTERVAL | ||
|
|
||
| `INTERVAL` is used for an interval of time. It must annotate a | ||
|
|
@@ -144,8 +169,13 @@ example, there is no requirement that a large number of days should be | |
| expressed as a mix of months and days because there is not a constant | ||
| conversion from days to months. | ||
|
|
||
| The sort order used for `INTERVAL` is `UNSIGNED`, produced by sorting by | ||
| the value of months, then days, then milliseconds with unsigned comparison. | ||
|
|
||
| ## Embedded Types | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It would be good to clarify what the expected sort order is for embedded types (or if we want to punt on that question, make it explicit that readers should ignore it).
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. In the latest commit, I clarified that there isn't a required ordering for embedded types. |
||
|
|
||
| Embedded types do not have type-specific orderings. | ||
|
|
||
| ### JSON | ||
|
|
||
| `JSON` is used for an embedded JSON document. It must annotate a `binary` | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -547,6 +547,58 @@ struct RowGroup { | |
| 4: optional list<SortingColumn> sorting_columns | ||
| } | ||
|
|
||
| /** Identifier for built-in sort order used to produce min and max values. */ | ||
| enum Order { | ||
| /** | ||
| * The signed ordering is the order produced by comparing single primitive | ||
| * values with a signed comparator, or the lexicographic ordering produced by | ||
| * comparing each byte of a binary or fixed using a signed comparator. | ||
| * | ||
| * (A signed comparator uses the most-significant bit as a sign bit; an | ||
| * unsigned comparator uses the most-significant bit as part of the value's | ||
| * magnitude. Note that unsigned comparison is not defined for floating | ||
| * point values.) | ||
| */ | ||
| SIGNED = 0; | ||
|
|
||
| /** | ||
| * The unsigned ordering is produced by comparing single primitive values | ||
| * with an unsigned comparison, or the lexicographic ordering produced by | ||
| * comparing each byte of a binary or fixed using an unsigned comparator. | ||
| * | ||
| * (A signed comparator uses the most-significant bit as a sign bit; an | ||
| * unsigned comparator uses the most-significant bit as part of the value's | ||
| * magnitude. Note that unsigned comparison is not defined for floating | ||
| * point values.) | ||
| */ | ||
| UNSIGNED = 1; | ||
|
|
||
| /** | ||
| * Identifiers for custom orderings, to be defined in the ColumnOrder struct. | ||
| */ | ||
| //CUSTOM = 2; | ||
|
||
| } | ||
|
|
||
| /** Descriptor for the order used for min, max, and sorting values in a column | ||
| */ | ||
| struct ColumnOrder { | ||
| /** The order used for this column */ | ||
|
||
| 1: required Order order; | ||
|
|
||
| /** | ||
| * A string that identifies the order for this column. This field should be | ||
| * set if the order is any value other than SIGNED or UNSIGNED and is used to | ||
| * identify the actual order used for min, max, and soring values. | ||
| * | ||
| * This identifier should follow one of the following formats: | ||
| * * 'icu54:<locale-keyword>' - ICU 54 ordering for the ICU 54 locale keyword | ||
|
||
| * | ||
| * To define order formats other than those listed above, contact the Parquet | ||
| * list. | ||
| */ | ||
| //2: optional string custom_order; | ||
| } | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. sorry for getting back at this. (I blame the jet lag for answering too fast on my previous comment) and naturally in the future we would add the following as a second type in the union: |
||
|
|
||
| /** | ||
| * Description for file metadata | ||
| */ | ||
|
|
@@ -576,5 +628,14 @@ struct FileMetaData { | |
| * e.g. impala version 1.0 (build 6cf94d29b2b7115df4de2c06e2ab4326d721eb55) | ||
| **/ | ||
| 6: optional string created_by | ||
|
|
||
| /** | ||
| * Sort order used for each column in this file. | ||
| * | ||
| * If this list is not present, then the order for each column is assumed to | ||
| * be SIGNED. In addition, min and max values for INTERVAL or DECIMAL stored | ||
| * as fixed or bytes should be ignored. | ||
| */ | ||
| 7: optional list<ColumnOrder> column_orders; | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. How do you know from 1 ColumnOrder which column it applies to? It's position in this list has to match the position in the
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There is one
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Another option is to add an optional column_order field in SchemaElement which would set only for primitive types.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. My reasoning for not adding this to the schema is that I don't think I'd add this to the string-based schema definition. I want to keep those roughly the same by not adding things to SchemaElement that aren't in the schema strings.
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. good point
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What's the plan for backwards/forwards compatibility. It seems like a newer implementation can assume that stats were written in the old way if column_orders is unset (this might be useful still when reading old data with int or floating point stats) or in the new way if column_orders is set. But is there a plan to guarantee that old implementations will ignore the new stats?
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, if column_orders is missing that means they are written the old way.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. +1 to making them a different field.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sorry, I think this was clear from previous discussion, but not from the PR. Readers must assume that the order is signed, unless this
Unsigned ints and intervals aren't used in any engine that I know about. Decimals are used, but I doubt anyone is filtering based on the stats because there isn't a way to pass a decimal to the filter code path in Parquet Java. (You'd have to pass an encoded decimal Binary; same problem for intervals.) That means that the only type with incorrect stats that is really a problem is UTF8. The reason why we didn't catch this problem sooner is because characters stored as a single byte without the msb set (including ASCII) have the same sort order using signed and unsigned comparison. This is why Parquet Java has a property to ignore the wrong sort order and use stats anyway. Using a separate field for the new stats would mean that old readers can't use stats for UTF8 fields, even though there are lots of cases where it is fine, and it is not really worse than what they do already -- behavior no one noticed was wrong for a year. It would also require more format changes and could prevent older readers from using stats for signed fields that are correct.
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do you want to add a note that there is exactly one ColumnOrder per column as defined by the schema (1 column per leaf node in the schema)? |
||
| } | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be helpful for implementers to note that this is equivalent to ordering by unicode code points, since that's a bit subtle. If you feel like that's overly verbose feel free to leave it out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this true? I thought the sort order for strings should be unsigned, from now on, but it's valid and possible for old files to have signed ordering. Should we clarify that both orderings are supported but one is preferred?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is stating that all UTF8 from now on should be written with UNSIGNED comparison. It is still the case that files without the SortOrder list used SIGNED.