refactor(rust): Adjust size of dt.* methods to Int64 #27411
dylanfair wants to merge 9 commits into pola-rs:main
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #27411 +/- ##
==========================================
- Coverage 81.19% 81.19% -0.01%
==========================================
Files 1832 1832
Lines 253247 253245 -2
Branches 3176 3176
==========================================
- Hits 205626 205613 -13
- Misses 46798 46809 +11
Partials 823 823
After looking more at the […] However, I'm seeing that without adjusting the dtypes of the […]

    df = df.with_columns(
        [
            (pl.col("timestamp").dt.hour() // 10 * 10).alias("errors"),
        ]
    )

Which displays the following: […]
This seems to be the more correct string to start pulling so as to try and keep the
Closing this pre-emptively as I think this is a heavy-handed approach to the problem and could produce backwards-compatibility issues for those with existing data pipelines (i.e. an unexpected memory spike from the i8 -> i64 jump).
Addresses #27162
No AI was used with this PR.
Local tests:

Wasn't sure what Angular convention to use so I landed on "refactor" since this isn't exactly a new feature nor an explicit bug fix, but happy to adjust the title accordingly.
Changes made
This PR adjusts the following `dt.*` methods to return an i64 value in order to minimize easy overflows when using them in expressions:
- `dt.hour()`
- `dt.minute()`
- `dt.second()`
- `dt.millisecond()`
- `dt.microsecond()`
- `dt.nanosecond()`
- `dt.year()`
- `dt.iso_year()`
- `dt.month()`
- `dt.day()`
- `dt.days_in_month()`
- `dt.ordinal_day()`
- `dt.week()`
- `dt.weekday()`
- `dt.millennium()`
- `dt.century()`

As it currently stands, the following MRE, which showcases `dt.hour()` used within an expression, will overflow like so:
Now we get the following result:
Alternatives
I'll admit a more nuanced approach such as checking for an overflow and upcasting accordingly might be more appropriate. I'm not nearly familiar enough with the codebase yet to know how to implement that, so I'm happy to defer to others if that's the better direction. I'm sure there are performance implications there as well that I'm fully unaware of even if I were to attempt it.
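As a rough illustration of the "check for an overflow and upcast accordingly" idea, here is a plain-Python sketch. It is not Polars code: `INT_RANGES` and `smallest_fitting_dtype` are hypothetical names made up for this example, and a real implementation would need to work on whole columns without materializing the result first.

```python
# Hypothetical sketch: pick the narrowest signed integer type that can hold
# every value of a computed result, starting from a given minimum width.
INT_RANGES = {
    "i8": (-2**7, 2**7 - 1),
    "i16": (-2**15, 2**15 - 1),
    "i32": (-2**31, 2**31 - 1),
    "i64": (-2**63, 2**63 - 1),
}


def smallest_fitting_dtype(values, start="i8"):
    """Return the narrowest dtype name (no narrower than `start`) that fits all values."""
    names = list(INT_RANGES)  # dicts keep insertion order, so this is narrow -> wide
    for name in names[names.index(start):]:
        lo, hi = INT_RANGES[name]
        if all(lo <= v <= hi for v in values):
            return name
    raise OverflowError("values do not fit in i64")


hours = [0, 9, 23]
# The raw hour values fit in i8.
hours_dtype = smallest_fitting_dtype(hours)
# Converting hours to seconds (23 * 3600 = 82800) needs more than 16 bits.
seconds_dtype = smallest_fitting_dtype([h * 3600 for h in hours])
print(hours_dtype, seconds_dtype)
```

The trade-off this makes visible is that the check has to look at every value, which is where the performance concern comes from.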
Another alternative might be to instead raise some kind of error if the overflow occurs, and prompt the user to upcast themselves like:
As it currently stands I think it's too easy for someone to not even realize the overflow has occurred if they are running an expression over a dataset with thousands of rows. This is a valid sharp edge someone can land on as mentioned in the issue. The lack of knowledge that the overflow has happened can be fairly problematic as the data at that point is no longer accurate, so any analysis using that data can draw incorrect conclusions.
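To make the silent nature of the problem concrete, the following plain-Python sketch simulates signed 8-bit arithmetic. It assumes two's-complement wraparound on overflow, and `wrap_i8` is a hypothetical helper written for this illustration, not a Polars function.

```python
def wrap_i8(value: int) -> int:
    """Reduce an integer to the signed 8-bit range [-128, 127] by wrapping."""
    return (value + 128) % 256 - 128


hours = [0, 5, 12, 13, 23]

# Exact arithmetic, as a 64-bit integer would give for these small values.
exact = [h * 10 for h in hours]

# The same multiplication if the result is forced back into 8 bits.
wrapped = [wrap_i8(h * 10) for h in hours]

for h, e, w in zip(hours, exact, wrapped):
    marker = "" if e == w else "  <- wrapped, no error raised"
    print(f"hour={h:2d} exact={e:4d} i8={w:5d}{marker}")
```

Only the rows with hour 13 and above differ, so a dataset that happens to contain mostly morning timestamps could hide the problem for a long time.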