Conversation

@sarahyurick (Collaborator)

Closes #982

After some investigation, I found inconsistencies in our timestamp logic regarding whether scalars are interpreted as being in seconds versus in nanoseconds. Since Spark and PostgreSQL both interpret scalars as seconds, the changes in this PR reflect that.
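
As a quick illustration of the inconsistency (plain pandas, not dask-sql code), the same integer scalar yields very different timestamps depending on the assumed unit:

import pandas as pd

scalar = 1203073300
pd.to_datetime(scalar, unit="s")   # 2008-02-15 11:01:40 — interpreted as seconds, per Spark/PostgreSQL
pd.to_datetime(scalar, unit="ns")  # 1970-01-01 00:00:01.203073300 — interpreted as nanoseconds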

@codecov-commenter commented Feb 1, 2023

Codecov Report

Merging #1025 (b7bf2fc) into main (ca8f963) will increase coverage by 0.23%.
The diff coverage is 85.71%.

@@            Coverage Diff             @@
##             main    #1025      +/-   ##
==========================================
+ Coverage   81.99%   82.23%   +0.23%     
==========================================
  Files          78       78              
  Lines        4566     4571       +5     
  Branches      846      849       +3     
==========================================
+ Hits         3744     3759      +15     
+ Misses        639      625      -14     
- Partials      183      187       +4     
Impacted Files                        Coverage Δ
dask_sql/physical/rex/core/call.py    90.32% <85.71%> (+1.07%) ⬆️

... and 1 file with indirect coverage changes

assert_eq(df1, expected_df)
# TODO: Fix seconds/nanoseconds conversion
# df2 = c.sql(f"SELECT EXTRACT(DAY FROM CAST({scalar1} AS TIMESTAMP)) AS day")
# assert_eq(df2, expected_df)
@sarahyurick (Collaborator, Author)

Not really sure what's going on in this case. When we're using a column of scalars, everything behaves as expected. But when we're using a bare scalar, as in this query, the value doesn't appear to go through CastOperation or ExtractOperation, so I'm not sure where to catch the integer and convert it to seconds.
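
A minimal sketch of the two cases, using a concrete value in place of the test's {scalar1} (the Context setup here is illustrative):

from dask_sql import Context
import pandas as pd

c = Context()
c.create_table("df", pd.DataFrame({"d": [1203073300]}))

# Behaves as expected: the column flows through CastOperation, which can
# apply the seconds-based conversion.
c.sql("SELECT EXTRACT(DAY FROM CAST(d AS TIMESTAMP)) AS day FROM df")

# Problematic: the bare literal never appears to reach CastOperation or
# ExtractOperation, so nothing reinterprets it as seconds.
# c.sql("SELECT EXTRACT(DAY FROM CAST(1203073300 AS TIMESTAMP)) AS day")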

@sarahyurick sarahyurick marked this pull request as ready for review February 2, 2023 01:06
@sarahyurick sarahyurick changed the title Improve scalar logic for timestamps [BLOCKED] Improve scalar logic for timestamps Feb 10, 2023
@sarahyurick sarahyurick changed the title [BLOCKED] Improve scalar logic for timestamps Improve scalar logic for timestamps May 26, 2023
@sarahyurick sarahyurick requested a review from jdye64 as a code owner June 6, 2023 21:55
@sarahyurick (Collaborator, Author)

Doesn't look like the changes from dask/dask#9881 are being pulled in yet. Once we upgrade to the newest Dask version, this PR should be good to go.

cc @charlesbluca

@sarahyurick (Collaborator, Author)

Ready for re-review

Comment on lines 54 to 55
def is_timestamp_nano(obj):
    return "int" in str(type(obj)) or "int" in str(getattr(obj, "dtype", ""))
Collaborator

Would it make sense to use something like pd.api.types.is_integer_dtype for this check, or is there a specific case I'm missing?
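
For concreteness, a sketch of what that suggestion might look like (not necessarily the code as merged):

import pandas as pd

def is_timestamp_nano(obj):
    # Handles Python ints, NumPy integer scalars, and integer-dtyped
    # Series/arrays without string-matching on the type name.
    return pd.api.types.is_integer_dtype(getattr(obj, "dtype", type(obj)))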

        raise RuntimeError("Integer input does not accept a format argument")
    return dd.to_datetime(df, unit="s")
if is_cudf_type(df):
    return df
Collaborator

Looking into the GPU failures, it seems like in some cases skipping the to_datetime call here means we end up passing an integer series to convert_to_datetime and erroring.

Is there a particular case where we wouldn't want to call to_datetime here?

@sarahyurick (Collaborator, Author)

Yes, the case that fails is

import pandas as pd

# c is a dask_sql Context; gpu toggles a cudf-backed table
df = pd.DataFrame({"d1": [1203073300], "d2": [1503073700]})
c.create_table("df", df, gpu=gpu)
expected_df = pd.DataFrame({"dt": [3472]})
df1 = c.sql(
    "SELECT TIMESTAMPDIFF(DAY, to_timestamp(d1), to_timestamp(d2)) AS dt FROM df"
)
df1.compute()

which returns 0 instead of 3472 in the GPU case when to_datetime is used here. I'm not really sure why this happens, and there wasn't an obvious type check to avoid this. But you're right, it makes more sense to call to_datetime either way, so for now we can just skip that case.
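
The change then lands on something like the following sketch (hypothetical helper name and signature; the merged code may differ): apply the seconds-based conversion on both the CPU and GPU paths instead of returning cudf-backed data untouched.

import dask.dataframe as dd

def integer_to_timestamp(df, format=None):
    if format is not None:
        raise RuntimeError("Integer input does not accept a format argument")
    # Same conversion for pandas- and cudf-backed data: integers are
    # interpreted as seconds since the epoch, matching Spark/PostgreSQL.
    return dd.to_datetime(df, unit="s")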

@sarahyurick (Collaborator, Author)

Thanks @charlesbluca! Lmk what you think.

@charlesbluca (Collaborator) left a comment

Thanks @sarahyurick! This LGTM, just one minor comment around opening issues to track the remaining failures.

@ayushdg (Collaborator) left a comment

Minor question but lgtm!

Comment on lines 141 to 142
if is_cudf_type(operands[0]) and isinstance(operands[1], np.timedelta64):
    operands = (dd.to_datetime(operands[0], unit="s"), operands[1])
Collaborator

Curious if this is still needed. I stepped through the tests and didn't need this change, as the operands were already in unit 's' from to_timestamp.
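
That observation is easy to reproduce outside dask-sql: once both operands come out of to_timestamp with unit "s", plain datetime arithmetic already yields the expected difference, making the extra re-conversion redundant.

import pandas as pd

d1 = pd.to_datetime(1203073300, unit="s")
d2 = pd.to_datetime(1503073700, unit="s")
(d2 - d1).days  # 3472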

@sarahyurick (Collaborator, Author)

Good point, thanks! I think now the gpuCI failures are just ML failures unrelated to this PR, as the same failures are appearing in #1197 right now too.

@ayushdg (Collaborator) commented Jul 6, 2023

Merging in since the failures are unrelated.

@ayushdg ayushdg merged commit eab95ce into dask-contrib:main Jul 6, 2023
@sarahyurick sarahyurick deleted the timestamp_improvements branch February 28, 2025 18:55