22 changes: 22 additions & 0 deletions docs/source/format/CanonicalExtensions.rst
@@ -483,6 +483,28 @@ binary values look like.

.. _variant_primitive_type_mapping:

Timestamp With Offset
=====================

This type represents a timestamp column whose values may each carry a different timezone offset. Each value is stored as a UTC timestamp alongside its original timezone offset in minutes.

* Extension name: ``arrow.timestamp_with_offset``.

* The storage type of the extension is a ``Struct`` with 2 fields, in order:

* ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where ``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns).
Member:
Why explicitly say that it should be non-nullable?

Is that because the nullability can (should) be defined at the struct level, and you want to avoid having an inconsistency between the "timestamp" and "offset_minutes" fields? (e.g. the case where only the "offset_minutes" field would be null for a given row, what does that mean?)

I am just not sure how practical this limitation is. For example, when creating a struct from its individual fields, the fields themselves will typically contain nulls. Alternatively, we could also specify that if one is null, the other should be null as well?

felipecrv (Contributor), Oct 31, 2025:

Interesting points! We should expand the spec text here and clarify expectations.

Since many operations on this array won't care about both fields, having a validity buffer on the timestamp field could be a simplification in those cases. It would reduce the risk of computation being performed on garbage values when the struct's validity bitmap is ignored.

But a top-level validity buffer is necessary to keep generic code going through columns processing nulls correctly.

One way we can adapt to this reality is to make a recommendation against validity on the timestamp field and a warning that even when the offset field is not touched, the validity bitmap of the computation's result should come from the struct validity, or, if both have validity buffers, the & of the two bitmaps.

For the offset column we can recommend the absence of validity bitmap as well (non-nullable) but if a value is null, process it as if it were zero.
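The bitmap-combination recommendation above can be sketched with plain integers (the bitmap values here are hypothetical; real implementations operate on Arrow validity buffers, which pack one bit per row):

```python
# Hypothetical validity bitmaps for a batch of 8 rows (LSB = row 0).
# A row of the extension array is valid only if it is valid in BOTH the
# struct-level (top-level) bitmap and the timestamp field's bitmap.
struct_validity = 0b10110111   # top-level struct bitmap
field_validity = 0b11110101    # timestamp field's own bitmap

# When both bitmaps exist, the result validity is their bitwise AND.
combined = struct_validity & field_validity

assert combined == 0b10110101
```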

serramatutu (Author), Oct 31, 2025:

> Alternatively we could also specify that if one is null, the other should be null as well?

Yeah, that's more or less what I was thinking. In principle this type only has meaning if both fields are set. To relax these constraints we'd need to define what a null timestamp with a non-null offset would mean, and vice versa.

Could be:

* If timestamp is set and offset is null, assume offset=0, i.e. the timestamp is UTC
* If timestamp is null and offset is set, assume the whole value is null (a standalone offset floating around has no meaning)

Or, alternatively:

* If either field is null, assume the whole value is null as well

Member:

I think I prefer the current wording (require nullability to be handled at the struct level) instead of trying to assign semantics to the other combinations.

Member:

To be clear, I am not arguing for assigning a specific meaning to a certain combination of nullability, but just for allowing the fields to be null as well.

For example, we could say that if the element is null (top-level struct validity), the individual fields are allowed to contain a null as well.

Of course, when constructing a timestamp with offset from the individual fields, it is relatively straightforward to just drop the validity bitmaps of the individual fields, and ensure a union of both bitmaps is assigned to the struct.
(It is just that the current pyarrow APIs don't make this particularly easy... but that is something we can also improve in the exposed APIs.)

Contributor:

@jorisvandenbossche now I understand what you meant.

One complexity here is that some compute kernels might want to look at just the UTC timestamp field because they only care about the instant, so we should at least warn/recommend what to do when two bitmaps exist. If the spec requires that the top-level bitmap be at least as selective as the inner bitmap, kernels looking at just the timestamp would be allowed to grab the top-level bitmap and apply it to both the processing and the output.

@lidavidm I think a top-level bitmap is best, but inevitably someone will have to decide what to do when more than one bitmap exists, and recommendations in the spec could prevent divergence between implementations.

Member:

One thing we could do with a nullable offset_minutes field is to have nulls indicate timezone-naive timestamps. As per discussion on the Arrow call, cases with mixed timezone-naive and timezone-aware columns are probably not common, so I'm only bringing this up here for completeness.


* ``offset_minutes``: a non-nullable signed 16-bit integer (``Int16``) representing the offset in minutes from the UTC timezone. Negative offsets represent time zones west of UTC, while positive offsets represent east. Offsets range from -779 (-12:59) to +780 (+13:00).
rok (Member), Oct 30, 2025:

I believe (current) timezones in the wild cover a range of -12:00 to +14:00.

We could specify that offsets should preferably be multiples of 15 minutes, as suggested here:

> By convention, every inhabited place in the world has a UTC offset that is a multiple of 15 minutes but the majority of offsets are stated in whole hours. There are many cases where the national standard time uses a UTC offset that is not defined solely by longitude.

Alternatively - if we wanted to represent old sun time offsets - we'd have to consider fractions of seconds.

serramatutu (Author), Oct 30, 2025:

Hey! We will send a [DISCUSS] in the mailing list to discuss this shortly (next few days, still drafting it). Let's discuss it there! 😄

serramatutu (Author), Oct 30, 2025:

But...

> the main reason behind this proposal is compatibility with ANSI SQL TIMESTAMP WITH TIME ZONE, which is supported by multiple database systems (Snowflake, MS SQL Server, Oracle, Trino).

This is the reasoning behind why we're proposing an offset in minutes as a signed 16-bit int:

> In ANSI SQL, the time zone information is defined in terms of an "INTERVAL" offset ranging from "INTERVAL - '12:59' HOUR TO MINUTE" to "INTERVAL + '13:00' HOUR TO MINUTE". Since "MINUTE" is the smallest granularity with which you can represent a time zone offset, and the maximum minutes in the offset is 13*60 = 780, we believe it makes sense for the offset to be stored as a 16-bit integer in minutes.

> It is important to point out that some systems such as MS SQL Server do implement data types that can represent offsets with sub-minute granularity. We believe representing sub-minute granularity is out of scope for this proposal given that no current or past time zone standards have ever specified sub-minute offsets [9], and that is what we're trying to solve for. Furthermore, representing the offset in seconds rather than minutes would mean the maximum offset is 13*60*60 = 46800, which is greater than the maximum positive integer an int16 can represent (32767), and thus the offset type would need to be wider (int32).
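The width arithmetic in the paragraph above can be double-checked with a few lines of plain Python:

```python
# Maximum ANSI SQL offset is +13:00.
max_offset_minutes = 13 * 60        # 780: fits easily in an int16
max_offset_seconds = 13 * 60 * 60   # 46800: does NOT fit in an int16
int16_max = 2**15 - 1               # 32767

assert max_offset_minutes <= int16_max
assert max_offset_seconds > int16_max
```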

Contributor:

@rok minutes is coarse enough to fit in 16 bits. 15-minute blocks would give us the ability to use just 8 bits, but I'm not so comfortable with the promise of the 15-minute convention holding forever everywhere on the planet.

And it would create awkwardness when parsing inputs that contain non-15-minute-multiple offsets, as @serramatutu pointed out above.

Author:

Oh, and @rok, you're right in saying timezones can go up to +14:00 in the wild, even if that's not standard. Politics is weird... We should maybe take these hard limits out of the format spec.

Anyways, I digress. Let's discuss these things in the mailing list. Would love if you chimed in too @rok !

Member:

@serramatutu aligning with ANSI SQL seems like a good idea (and doesn't create a new convention); perhaps we could state this in the docs?

Out of curiosity: would the proposed memory layout match any existing system?

Hey @felipecrv! I was thinking about int8 for 15-minute offset blocks as well, but I'm not sure it's worth it. Politically, I would not expect new sub-60-minute offsets. But ANSI SQL does seem safer.

serramatutu (Author), Oct 31, 2025:

@rok we just sent this to the mailing list yesterday. The discussion thread has a more extensive argumentation around why we chose these constraints.

> Out of curiosity: would the proposed memory layout match any existing system?

The systems we're referencing are Snowflake, MS SQL Server, Oracle DB and Trino, of which only one (Trino) is open source. It's hard to know for a fact what the internal memory layout of proprietary systems is... We do know Oracle and Trino store IANA timezones instead of offsets, so the layout doesn't match there, and some Arrow conversion layer would need to resolve the timezone names to offsets.

This (resolving offsets on the server) is an explicit choice so that consumer systems don't need to mess with the IANA database or reason about daylight savings, etc. Arrow consumers just get the offset, add it to the timestamp, and voilà: you have the original timestamp in the original timezone.
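A minimal sketch of that consumer-side step using only Python's standard library (the `to_local` helper is hypothetical; a real Arrow consumer would read the two struct fields row by row):

```python
from datetime import datetime, timedelta, timezone

def to_local(utc_ts: datetime, offset_minutes: int) -> datetime:
    """Reattach the stored offset to a UTC timestamp.

    No IANA database or DST reasoning is needed: the producer already
    resolved the timezone to a fixed offset in minutes.
    """
    return utc_ts.astimezone(timezone(timedelta(minutes=offset_minutes)))

# offset_minutes = -420 is UTC-07:00
utc = datetime(2025, 1, 1, 7, 0, tzinfo=timezone.utc)
local = to_local(utc, -420)
assert local.isoformat() == "2025-01-01T00:00:00-07:00"
```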


* Extension type parameters:

* ``time_unit``: the time-unit of each of the stored UTC timestamps.

* Description of the serialization:

Extension metadata is an empty string.

When de/serializing to/from JSON, this type must be represented as an RFC3339 string, respecting the ``TimeUnit`` precision and time zone offset without loss of information. For example ``2025-01-01T00:00:00Z`` represents January 1st 2025 in UTC with second precision, and ``2025-01-01T00:00:00.000000001-07:00`` represents one nanosecond after January 1st 2025 in UTC-07.
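The split of an RFC3339 string into the storage pair can be illustrated with Python's standard library (the `to_storage` helper is hypothetical; note that `datetime` only reaches microsecond precision, so a real implementation would need its own parser to preserve nanosecond timestamps):

```python
from datetime import datetime, timezone

def to_storage(rfc3339: str):
    """Split an RFC3339 string into the extension's storage pair:
    (UTC timestamp, original offset in whole minutes)."""
    # datetime.fromisoformat only accepts a trailing "Z" from Python 3.11.
    dt = datetime.fromisoformat(rfc3339.replace("Z", "+00:00"))
    offset_minutes = int(dt.utcoffset().total_seconds()) // 60
    return dt.astimezone(timezone.utc), offset_minutes

utc, off = to_storage("2025-01-01T00:00:00.000001-07:00")
assert off == -420
assert utc == datetime(2025, 1, 1, 7, 0, 0, 1, tzinfo=timezone.utc)
```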

Primitive Type Mappings
-----------------------
