Closed
Changes from 4 commits
30 changes: 30 additions & 0 deletions LogicalTypes.md
@@ -37,6 +37,8 @@ may require additional metadata fields, as well as rules for those fields.
`UTF8` may only be used to annotate the binary primitive type and indicates
that the byte array should be interpreted as a UTF-8 encoded character string.

The sort order used for `UTF8` strings must be `UNSIGNED` byte-wise comparison.
Contributor

It might be helpful for implementers to note that this is equivalent to ordering by unicode code points, since that's a bit subtle. If you feel like that's overly verbose feel free to leave it out.

Member

+1

Contributor

Is this true? I thought the sort order for strings should be unsigned, from now on, but it's valid and possible for old files to have signed ordering. Should we clarify that both orderings are supported but one is preferred?

Contributor Author

This is stating that all UTF8 from now on should be written with UNSIGNED comparison. It is still the case that files without the SortOrder list used SIGNED.
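
The equivalence mentioned earlier in this thread (unsigned byte-wise order of UTF-8 matches Unicode code point order, while signed byte-wise order does not) can be checked with a small sketch. The helper names `unsigned_key` and `signed_key` are hypothetical and only illustrate the two comparators; they are not part of any Parquet API.

```python
def unsigned_key(s: str) -> bytes:
    # Python compares bytes objects byte by byte as unsigned values (0..255).
    return s.encode("utf-8")


def signed_key(s: str) -> tuple:
    # Reinterpret every UTF-8 byte as a signed 8-bit value (-128..127),
    # which is what a signed byte-wise comparator effectively does.
    return tuple(b - 256 if b >= 128 else b for b in s.encode("utf-8"))


words = ["z", "a", "é", "日本", "😀"]

by_unsigned_bytes = sorted(words, key=unsigned_key)
by_code_point = sorted(words)  # Python str comparison is by code point
by_signed_bytes = sorted(words, key=signed_key)

ascii_words = ["b", "a", "C"]
ascii_signed = sorted(ascii_words, key=signed_key)
ascii_unsigned = sorted(ascii_words, key=unsigned_key)
```

With this input, the unsigned and code point orders agree, while the signed order moves every string that starts with a multi-byte character ahead of the ASCII strings. For ASCII-only input the two byte-wise orders are identical.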


## Numeric Types

### Signed Integers
@@ -55,6 +57,8 @@ allows.
implied by the `int32` and `int64` primitive types if no other annotation is
present and should be considered optional.

The sort order used for signed integer types must be `SIGNED`.
Contributor

So this means that writers are responsible for setting order to SIGNED when writing a column of this type, right? What is the expected behaviour if a reader encounters a logical type of signed integer but UNSIGNED order? Is this invalid according to the spec, or does the sort order setting override the default sort for the logical type?

Member

My interpretation is we should fail.
This is redundant with logical type but more explicit and deals with backward compatibility.

Contributor

Sounds reasonable to me.

Contributor Author

Any implementation should ignore stats that it cannot use. I think Lars mentioned in the recent hangout that there are transforms that can be done in some cases to use stats that are written with some compatible orderings.

A simple example is where a UINT32 field has min=10 and max=1000. Since both are positive and not in the top half of the UINT32 range, we know that the same min/max are produced by both signed and unsigned comparison and the implementation can use it.
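
A minimal sketch of that check, assuming the reader has the stored min and max available as raw unsigned bit patterns. The function name `same_min_max_under_both_orders` is hypothetical, and it only covers the case described above (both values in the lower half of the range).

```python
def same_min_max_under_both_orders(min_bits: int, max_bits: int, width: int = 32) -> bool:
    """Return True if signed and unsigned comparison yield the same min/max.

    min_bits and max_bits are the stored statistics read as raw unsigned
    bit patterns of the given width. If both fall below 2**(width - 1),
    the most-significant bit is clear for every value in [min, max], so
    it does not matter whether that bit is treated as a sign bit.
    """
    half = 1 << (width - 1)
    return 0 <= min_bits < half and 0 <= max_bits < half


usable = same_min_max_under_both_orders(10, 1000)
not_usable = same_min_max_under_both_orders(10, 0xFFFFFFF0)
```

A reader that gets `False` here would fall back to ignoring the statistics, as suggested above.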


### Unsigned Integers

`UINT_8`, `UINT_16`, `UINT_32`, and `UINT_64` annotations can be used to
@@ -70,6 +74,8 @@ allows.
`UINT_8`, `UINT_16`, and `UINT_32` must annotate an `int32` primitive type and
`UINT_64` must annotate an `int64` primitive type.

The sort order used for unsigned integer types must be `UNSIGNED`.

### DECIMAL

`DECIMAL` annotation represents arbitrary-precision signed decimal numbers of
@@ -98,6 +104,15 @@ integer. A precision too large for the underlying type (see below) is an error.
A `SchemaElement` with the `DECIMAL` `ConvertedType` must also have both
`scale` and `precision` fields set, even if scale is 0 by default.

The sort order used for `DECIMAL` values must be `SIGNED`. The order is
Contributor

"is" or "must be" seem superfluous. It seems like a logical conclusion, so I'd keep "is".

Contributor

+1

must be equivalent to signed comparison of decimal values.

If the column uses `int32` or `int64` physical types, then signed comparison of
Contributor

Thanks for this clarification.

the integer values produces the correct ordering. If the physical type is
fixed, then the correct ordering can be produced by flipping the
Contributor

fixed_len_byte_array?

Contributor Author

Yes. Fixed is used interchangeably with the full name elsewhere as well.

Contributor

Gotcha, I'm not totally familiar with some of these conventions.

most-significant bit in the first byte and then using unsigned byte-wise
comparison.
Contributor

What about decimals encoded into variable-length binary?

Contributor Author

There is an optimization, but it is too long to describe here. The main point is that the order must match the signed order produced by comparing the decimal values. This paragraph only gives optimizations for easy types.

Contributor

Makes sense, thanks for explaining.
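
The fixed-length optimization from the diff (flip the most-significant bit of the first byte, then compare bytes as unsigned) can be sketched as follows. This assumes the unscaled decimal value is stored as a big-endian two's-complement integer and that all values have the same byte length; `fixed_decimal_sort_key` is a hypothetical helper name.

```python
def fixed_decimal_sort_key(raw: bytes) -> bytes:
    # Flipping the sign bit maps two's-complement values onto an offset
    # encoding whose unsigned byte-wise order equals the signed numeric order.
    return bytes([raw[0] ^ 0x80]) + raw[1:]


# Unscaled values of a DECIMAL column stored in a 2-byte fixed array.
unscaled = [300, -1, 0, -32768, 1, -2, 32767]
encoded = [v.to_bytes(2, "big", signed=True) for v in unscaled]

# bytes objects compare as unsigned, byte by byte.
by_key = sorted(encoded, key=fixed_decimal_sort_key)
decoded = [int.from_bytes(b, "big", signed=True) for b in by_key]
```

Because every value in a column shares the same scale, ordering the unscaled integers is enough to order the decimal values. The sketch does not cover variable-length binary, where lengths can differ.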


## Date/Time Types

### DATE
@@ -106,30 +121,40 @@ A `SchemaElement` with the `DECIMAL` `ConvertedType` must also have both
annotate an `int32` that stores the number of days from the Unix epoch, 1
January 1970.

The sort order used for `DATE` is `SIGNED`.

### TIME\_MILLIS

`TIME_MILLIS` is used for a logical time type with millisecond precision,
without a date. It must annotate an `int32` that stores the number of
milliseconds after midnight.

The sort order used for `TIME_MILLIS` is `SIGNED`.

### TIME\_MICROS

`TIME_MICROS` is used for a logical time type with microsecond precision,
without a date. It must annotate an `int64` that stores the number of
microseconds after midnight.

The sort order used for `TIME_MICROS` is `SIGNED`.

### TIMESTAMP\_MILLIS

`TIMESTAMP_MILLIS` is used for a combined logical date and time type, with
millisecond precision. It must annotate an `int64` that stores the number of
milliseconds from the Unix epoch, 00:00:00.000 on 1 January 1970, UTC.

The sort order used for `TIMESTAMP_MILLIS` is `SIGNED`.

### TIMESTAMP\_MICROS

`TIMESTAMP_MICROS` is used for a combined logical date and time type with
microsecond precision. It must annotate an `int64` that stores the number of
microseconds from the Unix epoch, 00:00:00.000000 on 1 January 1970, UTC.

The sort order used for `TIMESTAMP_MICROS` is `SIGNED`.

### INTERVAL

`INTERVAL` is used for an interval of time. It must annotate a
@@ -144,8 +169,13 @@ example, there is no requirement that a large number of days should be
expressed as a mix of months and days because there is not a constant
conversion from days to months.

The sort order used for `INTERVAL` is `UNSIGNED`, produced by sorting by
the value of months, then days, then milliseconds with unsigned comparison.
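
As an illustration of this ordering, the sketch below assumes the usual `INTERVAL` layout of a 12-byte fixed array holding three little-endian unsigned 32-bit integers (months, days, milliseconds). The helper names `make_interval` and `interval_sort_key` are hypothetical.

```python
import struct


def make_interval(months: int, days: int, millis: int) -> bytes:
    # Three little-endian unsigned 32-bit integers, 12 bytes in total.
    return struct.pack("<III", months, days, millis)


def interval_sort_key(raw: bytes) -> tuple:
    # Compare months first, then days, then milliseconds, all unsigned.
    months, days, millis = struct.unpack("<III", raw)
    return (months, days, millis)


intervals = [
    make_interval(0, 40, 0),
    make_interval(1, 0, 0),
    make_interval(0, 40, 5),
    make_interval(0, 3, 999),
]
ordered = [interval_sort_key(raw) for raw in sorted(intervals, key=interval_sort_key)]
```

Note that 40 days sorts before 1 month: the comparison is purely field by field and makes no attempt to convert between units.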

## Embedded Types
Contributor

It would be good to clarify what the expected sort order is for embedded types (or if we want to punt on that question, make it explicit that readers should ignore it).

Contributor Author

In the latest commit, I clarified that there isn't a required ordering for embedded types.


Embedded types do not have type-specific orderings.

### JSON

`JSON` is used for an embedded JSON document. It must annotate a `binary`
61 changes: 61 additions & 0 deletions src/main/thrift/parquet.thrift
@@ -547,6 +547,58 @@ struct RowGroup {
4: optional list<SortingColumn> sorting_columns
}

/** Identifier for built-in sort order used to produce min and max values. */
enum Order {
/**
* The signed ordering is the order produced by comparing single primitive
* values with a signed comparator, or the lexicographic ordering produced by
* comparing each byte of a binary or fixed using a signed comparator.
*
* (A signed comparator uses the most-significant bit as a sign bit; an
* unsigned comparator uses the most-significant bit as part of the value's
* magnitude. Note that unsigned comparison is not defined for floating
* point values.)
*/
SIGNED = 0;

/**
* The unsigned ordering is produced by comparing single primitive values
* with an unsigned comparison, or the lexicographic ordering produced by
* comparing each byte of a binary or fixed using an unsigned comparator.
*
* (A signed comparator uses the most-significant bit as a sign bit; an
* unsigned comparator uses the most-significant bit as part of the value's
* magnitude. Note that unsigned comparison is not defined for floating
* point values.)
*/
UNSIGNED = 1;

/**
* Identifiers for custom orderings, to be defined in the ColumnOrder struct.
*/
//CUSTOM = 2;
Contributor

Timestamps are currently stored in INT96 by Impala and Hive and with INT96 deprecation on the horizon, can we enable custom ordering for binary / fixed types in this change?

Member

what would CUSTOM mean in that context?

Contributor

Impala and Hive store timestamps as a tuple (int64_t, uint32_t), where the first element is the time of the day in nanoseconds, and the second is the day according to the Julian calendar. Ordering these means to sort them by the second element first (the day), then by the first (the time). None of the other orderings seem to reflect this, so it would be good if we could at least set it to CUSTOM to signal that any validation will fail unless the client knows how to interpret them.

Does it make more sense to document something like IMPALA_TIMESTAMP as its own logical type?

Member

We definitely should change the situation as currently the values stored as INT96 aren't really interpretable as integers, better open a separate JIRA to untangle it a bit from the sorting discussion here.

@lekv I also tried to document the current state here: #49 would be nice if you also could have a short look if the formula is correct.

Contributor

@xhochy - I left some comments in #49. I agree that it's better to untangle / document Impala timestamps in a separate JIRA. Should we use PARQUET-861 for that, or shall I go ahead and make a new one?

@julienledem - Do you think it's better to defer adding "CUSTOM" orderings until adding the first LogicalType that needs one?

Contributor

I don't understand the claim that INT96 aren't interpretable as integers. They don't map to a common hardware integer type but they're still logically a twos-complement integer.

I think on the Impala side we may end up implementing min-max stats for INT96 ahead of doing a larger rework of the timestamp data type, so it would be good to have clarity about what other implementations should do when encountered with INT96 min-max stats.

My take is that they should either ignore INT96 stats (if they don't want to implement a deprecated type) or use signed ordering. Parquet-mr doesn't do this currently. Does everyone agree that https://issues.apache.org/jira/browse/PARQUET-840 is a bug in parquet-mr?

Contributor

I guess I'm assuming that signed INT96 ordering is equivalent to the logical IMPALA_TIMESTAMP ordering. I think this is true because the least significant 8 bytes of the INT96 are the unsigned nanoseconds component and the most significant 4 bytes of the INT96 are the signed Julian day.

Member

@timarmstrong if that's the case then we can just use SIGNED Ordering for them.
@lekv do you agree?

Contributor

@timarmstrong I assumed we are writing (nanoseconds, day) tuples, and the UNSIGNED order would not be equivalent to the logical ordering. Internally Impala stores them as (time, date) and we don't reverse those. Hive seems to do it the same way as Impala: https://github.com/Parquet/parquet-mr/blob/fa8957d7939b59e8d391fa17000b34e865de015d/parquet-column/src/main/java/parquet/example/data/simple/NanoTime.java#L58

Contributor

I think it works out because INT96 is little-endian - the most significant 4 bytes of the INT96 line up with the 4 bytes of the date
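
The argument in this thread can be checked with a short sketch. It assumes the layout described above: 8 bytes of nanoseconds within the day followed by 4 bytes of Julian day, both little-endian, so that the day occupies the most significant bytes of the 96-bit integer. The helper names `encode_int96_timestamp` and `int96_value` are hypothetical.

```python
import struct


def encode_int96_timestamp(julian_day: int, nanos_of_day: int) -> bytes:
    # Assumed layout: unsigned 64-bit nanos, then signed 32-bit Julian day,
    # both little-endian, 12 bytes in total.
    return struct.pack("<Qi", nanos_of_day, julian_day)


def int96_value(raw: bytes) -> int:
    # Interpret all 12 bytes as one little-endian two's-complement integer.
    return int.from_bytes(raw, "little", signed=True)


stamps = [
    (2457000, 5),
    (2456999, 86_399_999_999_999),
    (2457000, 0),
    (2457001, 1),
]
encoded = [encode_int96_timestamp(day, nanos) for day, nanos in stamps]

by_int96 = sorted(encoded, key=int96_value)
# unpack returns (nanos, day); reverse each tuple to get (day, nanos).
decoded = [struct.unpack("<Qi", raw)[::-1] for raw in by_int96]
```

Under that layout the integer value equals `julian_day * 2**64 + nanos_of_day`, so signed comparison of the 96-bit value orders by day first and then by time of day.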

}

/** Descriptor for the order used for min, max, and sorting values in a column
*/
struct ColumnOrder {
/** The order used for this column */
Contributor

Should we add to the comment that this has to be ascending?

Contributor Author

I don't think so. I think min and max are clear enough.

1: required Order order;

/**
* A string that identifies the order for this column. This field should be
* set if the order is any value other than SIGNED or UNSIGNED and is used to
* identify the actual order used for min, max, and sorting values.
*
* This identifier should follow one of the following formats:
* * 'icu54:<locale-keyword>' - ICU 54 ordering for the ICU 54 locale keyword
Contributor

what about something like:

struct ColumnOrder {
  1: required OrderType orderType
  2: optional string orderSubType
  3: required bool is_ascending
}
enum OrderType {
  SIGNED, UNSIGNED, ICU_54, OTHER // more can be added as needed
}

and orderSubType isn't set for SIGNED/UNSIGNED ?

Contributor

Actually, if we really want to avoid the confusion of mixing specific orders and classes of orders, we can use a union instead:

struct ColumnOrder {
  1: required Order order
  2: required bool is_ascending
}

struct Signed {}
struct Unsigned {}
struct ICU_54 {
  1: required string locale_keyword
}
struct Other {
  1: required string name
}
union Order {
  1: Signed signed
  2: Unsigned unsigned
  3: ICU_54 icu_54
  4: Other other
  // add more as needed
}

This will generate an Order struct that can only be one of {Signed / Unsigned / ICU_54(locale_keyword) / Other(name)}. It avoids having to deal with the fact that Signed / Unsigned require no additional info, and it also means we can add extra info to each type of ordering as needed.

Contributor Author

I've removed the other symbols. We can revisit this when we know more about the requirements of processing engines.

Member

I think we still need to finalize this aspect of the discussion.
The signed/unsigned aspect is what's blocking Impala/Spark SQL right now.
When they move to implement collation it will be easier to converge on the collation string spec.
Let's open a separate JIRA for that.

Contributor

In that case, should we remove the commented out fields + their docs?

Member

@rdblue It sounds like Union? +1 from me as well

Contributor Author

I don't think we've confirmed that the union is forward-compatible. I haven't had time to check, so if either of you wants to that would be helpful.

Member

Works for thrift 0.9.3. I don't have thrift 0.7 working on my laptop right now.
If you want to try with thrift 0.7:
apache/parquet-java#405

Member

related: we should just update parquet-format to thrift 0.9.3
#50

Member

@rdblue: works for thrift 0.7 per travis-ci: apache/parquet-java#405

*
* To define order formats other than those listed above, contact the Parquet
* list.
*/
//2: optional string custom_order;
}
Member

@julienledem Apr 6, 2017

Sorry for coming back to this (I blame the jet lag for answering too fast in my previous comment).
Actually, I'd suggest the following, to have a more union-like definition and less of an enum used as a union:

/** 
 * sorted according to the physical type (considered signed or not)
 * binary is in lexicographic order (assuming each byte is signed or unsigned)
 * numerical types are sorted according to their natural order (signed or unsigned)
 */
struct PhysicalTypeColumnOrder { 
  1: required bool signed;
}
/** Union containing the order used for min, max, and sorting values in a column */
 union ColumnOrder {
   1: PhysicalTypeColumnOrder physical;
 }

and naturally in the future we would add the following as a second type in the union:

/** the data is sorted according to the provided collation */
struct CollatedColumnOrder {
  1: required string collation_string;
}


/**
* Description for file metadata
*/
@@ -576,5 +628,14 @@ struct FileMetaData {
* e.g. impala version 1.0 (build 6cf94d29b2b7115df4de2c06e2ab4326d721eb55)
**/
6: optional string created_by

/**
* Sort order used for each column in this file.
*
* If this list is not present, then the order for each column is assumed to
* be SIGNED. In addition, min and max values for INTERVAL or DECIMAL stored
* as fixed or bytes should be ignored.
*/
7: optional list<ColumnOrder> column_orders;
Contributor

How do you know from one ColumnOrder which column it applies to? Does its position in this list have to match the position in the schema list above? If that's the case, I think we should add a comment for that here. I guess this has to be here to avoid repeating it in each Statistics object?

Contributor Author

There is one ColumnOrder for each column, in the same order as the column chunks (which is required to match the order in the schema).

Member

Another option is to add an optional column_order field in SchemaElement which would set only for primitive types.
10: optional ColumnOrder column_order;
https://github.com/rdblue/parquet-format/blob/9962df8e0ea85858cf032451d0ee83ec3f4d39fe/src/main/thrift/parquet.thrift#L260

Contributor Author

My reasoning for not adding this to the schema is that I don't think I'd add this to the string-based schema definition. I want to keep those roughly the same by not adding things to SchemaElement that aren't in the schema strings.

Member

good point

Contributor

What's the plan for backwards/forwards compatibility?

It seems like a newer implementation can assume that stats were written in the old way if column_orders is unset (this might be useful still when reading old data with int or floating point stats) or in the new way if column_orders is set. But is there a plan to guarantee that old implementations will ignore the new stats?

Member

Yes, if column_orders is missing that means they are written the old way.
In the current PR it looks like old versions of the library won't tell the difference between old and new stats. If we want to make sure they don't see the new stats, we need to put them in a different field in the Stats struct.
Do we want to do that?

Contributor

+1 to making them a different field.

Contributor Author

Sorry, I think this was clear from previous discussion, but not from the PR.

Readers must assume that the order is signed, unless this SortOrder is set because that is what the Java library currently writes. The original min and max stats fields are still used because the majority of the values are valid. There are only a few types that are currently stored incorrectly:

  • UTF8 strings
  • Decimals (stored as fixed or bytes)
  • Intervals
  • UINTs

Unsigned ints and intervals aren't used in any engine that I know about. Decimals are used, but I doubt anyone is filtering based on the stats because there isn't a way to pass a decimal to the filter code path in Parquet Java. (You'd have to pass an encoded decimal Binary; same problem for intervals.)

That means that the only type with incorrect stats that is really a problem is UTF8. The reason why we didn't catch this problem sooner is because characters stored as a single byte without the msb set (including ASCII) have the same sort order using signed and unsigned comparison. This is why Parquet Java has a property to ignore the wrong sort order and use stats anyway.

Using a separate field for the new stats would mean that old readers can't use stats for UTF8 fields, even though there are lots of cases where it is fine, and it is not really worse than what they do already -- behavior no one noticed was wrong for a year. It would also require more format changes and could prevent older readers from using stats for signed fields that are correct.

Member

Do you want to add a note that there is exactly one ColumnOrder per column as defined by the schema (1 column per leaf node in the schema)?

}
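
A reader-side sketch of the `column_orders` rules discussed above: one entry per leaf column, matched by position, with a `SIGNED` fallback when the list is absent. The tuple-based column descriptors and the function name `resolve_column_orders` are hypothetical and stand in for whatever schema representation an implementation uses. The fallback follows only the Thrift comment in the diff (ignore min and max for `INTERVAL`, and for `DECIMAL` stored as fixed or bytes).

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical leaf descriptor: (column path, physical type, logical type).
Leaf = Tuple[str, str, Optional[str]]


def resolve_column_orders(
    leaves: List[Leaf], column_orders: Optional[List[str]]
) -> Dict[str, Tuple[str, bool]]:
    """Map each leaf column path to (order, whether min/max may be used)."""
    if column_orders is None:
        # Old files: assume SIGNED, and ignore stats that signed byte-wise
        # comparison cannot have produced correctly.
        resolved = {}
        for path, physical, logical in leaves:
            byte_backed = physical in ("FIXED_LEN_BYTE_ARRAY", "BYTE_ARRAY")
            ignore = logical == "INTERVAL" or (logical == "DECIMAL" and byte_backed)
            resolved[path] = ("SIGNED", not ignore)
        return resolved
    if len(column_orders) != len(leaves):
        raise ValueError("expected exactly one ColumnOrder per leaf column")
    # Entries are matched to leaf columns by position.
    return {path: (order, True) for (path, _, _), order in zip(leaves, column_orders)}


leaves = [
    ("id", "INT64", None),
    ("name", "BYTE_ARRAY", "UTF8"),
    ("price", "FIXED_LEN_BYTE_ARRAY", "DECIMAL"),
]
old_file = resolve_column_orders(leaves, None)
new_file = resolve_column_orders(leaves, ["SIGNED", "UNSIGNED", "SIGNED"])
```

Whether a reader should also distrust old `UTF8` or unsigned integer stats is a policy choice raised in the discussion above and is left out of this sketch.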