Improve scalar logic for timestamps #1025
Conversation
Codecov Report
@@ Coverage Diff @@
## main #1025 +/- ##
==========================================
+ Coverage 81.99% 82.23% +0.23%
==========================================
Files 78 78
Lines 4566 4571 +5
Branches 846 849 +3
==========================================
+ Hits 3744 3759 +15
+ Misses 639 625 -14
- Partials 183 187 +4
... and 1 file with indirect coverage changes
tests/integration/test_rex.py
Outdated
assert_eq(df1, expected_df)
# TODO: Fix seconds/nanoseconds conversion
# df2 = c.sql(f"SELECT EXTRACT(DAY FROM CAST({scalar1} AS TIMESTAMP)) AS day")
# assert_eq(df2, expected_df)
Not really sure what's going on in this case. When we use a column of scalars, everything behaves as expected, but when we use a bare scalar like in this query, the value doesn't appear to go through CastOperation or ExtractOperation, so I'm not sure where I can catch the integer and convert it to seconds.
Doesn't look like the changes from dask/dask#9881 are being pulled in yet. Once we upgrade to the newest Dask version, this PR should be good to go.
Ready for re-review
dask_sql/physical/rex/core/call.py
Outdated
def is_timestamp_nano(obj):
    return "int" in str(type(obj)) or "int" in str(getattr(obj, "dtype", ""))
Would it make sense to use something like pd.api.types.is_integer_dtype for this check, or is there a specific case I'm missing?
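For reference, a minimal sketch of what that suggestion might look like (hypothetical; assumes `obj` can be a Series/Index or a plain scalar, and keeps the helper name from the diff):

```python
import numpy as np
import pandas as pd

def is_timestamp_nano(obj):
    # Use pandas' dtype introspection instead of string matching on the type.
    dtype = getattr(obj, "dtype", None)
    if dtype is not None:
        # Covers Series/Index, NumPy arrays, and NumPy scalars; excludes bool dtypes.
        return pd.api.types.is_integer_dtype(dtype)
    # Plain Python integer scalars (bool is a subclass of int, so exclude it).
    return isinstance(obj, int) and not isinstance(obj, bool)
```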
dask_sql/physical/rex/core/call.py
Outdated
raise RuntimeError("Integer input does not accept a format argument")
return dd.to_datetime(df, unit="s")
if is_cudf_type(df):
    return df
Looking into the GPU failures, it seems like in some cases skipping over the to_datetime call here means we end up passing an integer series to convert_to_datetime and erroring.
Is there a particular case where we wouldn't want to call to_datetime here?
Yes, the case that fails is:
df = pd.DataFrame({"d1": [1203073300], "d2": [1503073700]})
c.create_table("df", df, gpu=gpu)
expected_df = pd.DataFrame({"dt": [3472]})
df1 = c.sql(
"SELECT TIMESTAMPDIFF(DAY, to_timestamp(d1), to_timestamp(d2)) AS dt FROM df"
)
df1.compute()
which returns 0 instead of 3472 in the GPU case when to_datetime is used here. I'm not really sure why this happens, and there wasn't an obvious type check to avoid this. But you're right, it makes more sense to call to_datetime either way, so for now we can just skip that case.
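For context, a minimal sketch of the direction settled on here, i.e. calling to_datetime unconditionally rather than short-circuiting for cudf-backed data (the function name and exact shape are hypothetical; the real logic lives in dask_sql/physical/rex/core/call.py):

```python
import dask.dataframe as dd

def integer_to_timestamp(df, format=None):
    # Integers are interpreted as seconds since the epoch, matching
    # Spark/PostgreSQL semantics.
    if format is not None:
        raise RuntimeError("Integer input does not accept a format argument")
    # Convert unconditionally so downstream code (e.g. convert_to_datetime)
    # never receives a raw integer series, on CPU or GPU.
    return dd.to_datetime(df, unit="s")
```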
Thanks @charlesbluca! Lmk what you think.
charlesbluca left a comment
Thanks @sarahyurick! This LGTM, just one minor comment around opening issues to track the remaining failures.
ayushdg left a comment
Minor question but lgtm!
dask_sql/physical/rex/core/call.py
Outdated
if is_cudf_type(operands[0]) and isinstance(operands[1], np.timedelta64):
    operands = (dd.to_datetime(operands[0], unit="s"), operands[1])
Curious if this is still needed. I stepped through the tests and didn't need this change, as the operands were already in unit "s" from to_timestamp.
Good point, thanks! I think the remaining gpuCI failures are just ML failures unrelated to this PR, as the same failures are appearing in #1197 right now too.
Merging in since the failures are unrelated.
Closes #982
After some investigation, I found inconsistencies in our timestamp logic regarding whether scalars are interpreted as seconds or as nanoseconds. Since Spark and PostgreSQL both interpret scalars as seconds, the changes in this PR reflect that.
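As a quick illustration of the discrepancy (pandas semantics; the epoch value is borrowed from the test case above):

```python
import pandas as pd

# The same integer yields very different timestamps depending on the
# assumed unit; Spark and PostgreSQL treat it as seconds since the epoch.
print(pd.Timestamp(1203073300, unit="s"))   # 2008-02-15 11:01:40
print(pd.Timestamp(1203073300, unit="ns"))  # 1970-01-01 00:00:01.203073300
```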