-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Add support for Arrow Duration type in Substrait #16503
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This commit adds support for Arrow Duration types in Substrait plans. Substrait has no equivalent to the Duration type, which only includes time-based information (i.e. some multiple of nanoseconds). However, the Substrait Interval Day type is very similar, it stores day and time-based information. This commit converts Arrow Duration types into Substrait Interval Day types, with a Duration specific type variation reference, so that it can round trip back to a Duration. An alternative approach would be to use a new Substrait user defined type. Resolves apache#16285
b30e0de to
8047a3e
Compare
|
Thank you @jkosh44 🙏 @gabotechs is there any chance you have time to review this PR? |
|
Sure! I'll get it done in a couple of days. Thanks for submitting this! |
gabotechs
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 nice, +1
| pub const DEFAULT_INTERVAL_DAY_TYPE_VARIATION: u32 = 0; | ||
| pub const DURATION_INTERVAL_DAY_TYPE_VARIATION: u32 = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice idea using the type variation capabilities, this looks good to me 👍.
Nit 1: maybe prefixing this constants with _REF as all the others?
Nit 2: maybe adding a comment explaining that the default variation maps to DataType::Interval(DayTime) and the other maps to DataType::Duration?
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the review @gabotechs -- I'll wait to see if @jkosh44 would like to make the suggested changes.
|
Just made the suggested updates. I don't 100% know how the substrait plans are used, but I am slightly worried about one thing with this approach. Would it be possible to construct a substrait plan that converts the Duration to an IntervalDay, execute some operation that adds some days to the interval, then attempts and fails to convert it back to a duration? |
I think you can always go back from Duration --> IntervalDay But in any case, I don't think this would be a substrait specific difference -- it would be in the DataFusion engine itself |
|
Thanks again @jkosh44 and @gabotechs |
|
@gabotechs This response from Substrait makes me a little nervous about this approach: substrait-io/substrait#822 (comment) Duration doesn't really map 1:1 with IntervalDay. Duration has a single logical component for time, but IntervalDay has two logical components, time and days. |
🤔 it still does not sound too bad to me. If you want to play it safe, one thing we can do is to take other types that are failing, like Float16, attempt to ship a solution based on Substrait UDTs, and if it turns out to be clean then port the same approach it to Duration/IntervalDay |
Which issue does this PR close?
Rationale for this change
Allows Datafusion plans with the Duration type to be converted to Substrait plans.
What changes are included in this PR?
This commit adds support for Arrow Duration types in Substrait plans. Substrait has no equivalent to the Duration type, which only includes time-based information (i.e. some multiple of nanoseconds). However, the Substrait Interval Day type is very similar, it stores day and time-based information. This commit converts Arrow Duration types into Substrait Interval Day types, with a Duration specific type variation reference, so that it can round trip back to a Duration.
An alternative approach would be to use a new Substrait user defined type.
Are these changes tested?
Yes
Are there any user-facing changes?
Substrait now support Duration types