Skip to content

refactor(rust): Adjust size of dt.* methods to Int64#27411

Closed
dylanfair wants to merge 9 commits intopola-rs:mainfrom
dylanfair:datetime-i64
Closed

refactor(rust): Adjust size of dt.* methods to Int64#27411
dylanfair wants to merge 9 commits intopola-rs:mainfrom
dylanfair:datetime-i64

Conversation

@dylanfair
Copy link
Copy Markdown

Addresses #27162

No AI was used with this PR.

Local tests:
image

Wasn't sure what Angular convention to use so I landed on "refactor" since this isn't exactly a new feature nor an explicit bug fix, but happy to adjust the title accordingly.

Changes made

This PR adjusts the following dt.* methods to return an i64 value in order to minimize easy overflows when using them in expressions:

  • dt.hour()
  • dt.minute()
  • dt.second()
  • dt.millisecond()
  • dt.microsecond()
  • dt.nanosecond()
  • dt.year()
  • dt.iso_year()
  • dt.month()
  • dt.day()
  • dt.days_in_month()
  • dt.ordinal_day()
  • dt.week()
  • dt.weekday()
  • dt.millennium()
  • dt.century()

As it currently stands the following MRE which showcases dt.hour() used within an expression will overflow like so:

import datetime
import polars as pl

df = pl.DataFrame({"timestamp": [datetime.datetime(2026, 1, 1, 4, 35)]})
df = df.with_columns(
    [
        (pl.col("timestamp").dt.hour().mul(60) + pl.col("timestamp").dt.hour()).alias(
            "calculate"
        ),
    ]
)

print(df)
shape: (1, 2)
┌─────────────────────┬───────────┐
│ timestamp           ┆ calculate │
│ ---                 ┆ ---       │
│ datetime[μs]        ┆ i8        │
╞═════════════════════╪═══════════╡
│ 2026-01-01 04:35:00 ┆ -12       │
└─────────────────────┴───────────┘

Now we get the following result:

shape: (1, 2)
┌─────────────────────┬───────────┐
│ timestamp           ┆ calculate │
│ ---                 ┆ ---       │
│ datetime[μs]        ┆ i64       │
╞═════════════════════╪═══════════╡
│ 2026-01-01 04:35:00 ┆ 244       │
└─────────────────────┴───────────┘

Alternatives

I'll admit a more nuanced approach such as checking for an overflow and upcasting accordingly might be more appropriate. I'm not nearly familiar enough with the codebase yet to know how to implement that, so I'm happy to defer to others if that's the better direction. I'm sure there are performance implications there as well that I'm fully unaware of even if I were to attempt it.

Another alternative might be to instead raise some kind of error if the overflow occurs, and prompt the user to upcast themselves like:

df = df.with_columns(
    [
        (
            pl.col("timestamp").dt.hour().cast(pl.Int64).mul(60)
            + pl.col("timestamp").dt.hour().cast(pl.Int64)
        )
        .cast(pl.Int64)
        .alias("calculate"),
    ]
)
print(df)

As it currently stands I think it's too easy for someone to not even realize the overflow has occurred if they are running an expression over a dataset with thousands of rows. This is a valid sharp edge someone can land on as mentioned in the issue. The lack of knowledge that the overflow has happened can be fairly problematic as the data at that point is no longer accurate, so any analysis using that data can draw incorrect conclusions.

@github-actions github-actions Bot added internal An internal refactor or improvement rust Related to Rust Polars first-contribution First contribution by user labels Apr 25, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 26, 2026

Codecov Report

❌ Patch coverage is 96.77419% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 81.19%. Comparing base (693c520) to head (a50afaf).

Files with missing lines Patch % Lines
crates/polars-time/src/chunkedarray/kernels.rs 50.00% 4 Missing ⚠️
crates/polars-time/src/series/mod.rs 95.23% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #27411      +/-   ##
==========================================
- Coverage   81.19%   81.19%   -0.01%     
==========================================
  Files        1832     1832              
  Lines      253247   253245       -2     
  Branches     3176     3176              
==========================================
- Hits       205626   205613      -13     
- Misses      46798    46809      +11     
  Partials      823      823              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@dylanfair
Copy link
Copy Markdown
Author

After looking more at the get_field() method in crates/polars-plan/src/plans/aexpr/function_expr/datetime.rs, I'm thinking it might not be necessary to alter the dtypes of the dt.* methods themselves. Just making changes there and nowhere else seems to cast expressions that use the dt.* methods into i64, leaving something like this with our original example:

shape: (1, 3)
┌─────────────────────┬──────┬───────────┐
│ timestamp           ┆ hour ┆ calculate │
│ ---                 ┆ ---  ┆ ---       │
│ datetime[μs]        ┆ i8   ┆ i64       │
╞═════════════════════╪══════╪═══════════╡
│ 2026-01-01 04:35:00 ┆ 4    ┆ 244       │
└─────────────────────┴──────┴───────────┘

However I'm seeing that without adjusting the dtypes of the dt.* methods to match (like they are in this PR), something like this then starts to error:

df = df.with_columns(
    [
        (pl.col("timestamp").dt.hour() // 10 * 10).alias("errors"),
    ]
)

Which displays the following:

The application panicked (crashed).
Message:  implementation error, cannot get ref Int8 from Int64
Location: crates/polars-core/src/series/mod.rs:1125

Backtrace omitted.

Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.
Traceback (most recent call last):
  File "/home/dylanfair/code_projects/polars/sandbox.py", line 16, in <module>
    df = df.with_columns(
         ^^^^^^^^^^^^^^^^
  File "/home/dylanfair/code_projects/polars/py-polars/src/polars/dataframe/frame.py", line 10515, in with_columns
    .collect(optimizations=QueryOptFlags._eager())
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dylanfair/code_projects/polars/py-polars/src/polars/_utils/deprecation.py", line 97, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dylanfair/code_projects/polars/py-polars/src/polars/lazyframe/opt_flags.py", line 343, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dylanfair/code_projects/polars/py-polars/src/polars/lazyframe/frame.py", line 2510, in collect
    return wrap_df(ldf.collect(engine, callback))
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pyo3_runtime.PanicException: implementation error, cannot get ref Int8 from Int64

This seems to be the more correct string to start pulling so as to try and keep the dt.* methods with their current dtypes while relaxing the dtype that comes out of their use in expressions - I might start looking into that instead.

@dylanfair
Copy link
Copy Markdown
Author

Closing this pre-emptively as I think this is a heavy-handed approach to the problem and could produce backwards compatible issues for those with existing data pipelines (i.e. unexpected memory spike from i8 -> i64 jump).

@dylanfair dylanfair closed this Apr 26, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

first-contribution First contribution by user internal An internal refactor or improvement rust Related to Rust Polars

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant