Improve scalar logic for timestamps #1025
Conversation
Codecov Report
@@ Coverage Diff @@
## main #1025 +/- ##
==========================================
+ Coverage 81.99% 82.23% +0.23%
==========================================
Files 78 78
Lines 4566 4571 +5
Branches 846 849 +3
==========================================
+ Hits 3744 3759 +15
+ Misses 639 625 -14
- Partials 183 187 +4
... and 1 file with indirect coverage changes
tests/integration/test_rex.py
Outdated
assert_eq(df1, expected_df)
# TODO: Fix seconds/nanoseconds conversion
# df2 = c.sql(f"SELECT EXTRACT(DAY FROM CAST({scalar1} AS TIMESTAMP)) AS day")
# assert_eq(df2, expected_df)
Not really sure what's going on in this case. When we use a column of scalars, everything behaves as expected, but when we use a bare scalar like in this query, the value doesn't appear to go through CastOperation or ExtractOperation, so I'm not sure where I can catch the integer and convert it to seconds.
Doesn't look like the changes from dask/dask#9881 are being pulled in yet. Once we upgrade to the newest Dask version, this PR should be good to go.
Ready for re-review
dask_sql/physical/rex/core/call.py
Outdated
def is_timestamp_nano(obj):
    return "int" in str(type(obj)) or "int" in str(getattr(obj, "dtype", ""))
Would it make sense to use something like pd.api.types.is_integer_dtype for this check, or is there a specific case I'm missing?
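For reference, a minimal sketch of what that suggestion might look like (hypothetical; assumes `obj` can be a Series/Index or a plain scalar, and keeps the helper name from the diff):

```python
import numpy as np
import pandas as pd

def is_timestamp_nano(obj):
    # Use pandas' dtype introspection instead of string matching on the type.
    dtype = getattr(obj, "dtype", None)
    if dtype is not None:
        # Covers Series/Index, NumPy arrays, and NumPy scalars; excludes bool dtypes.
        return pd.api.types.is_integer_dtype(dtype)
    # Plain Python integer scalars (bool is a subclass of int, so exclude it).
    return isinstance(obj, int) and not isinstance(obj, bool)
```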
dask_sql/physical/rex/core/call.py
Outdated
raise RuntimeError("Integer input does not accept a format argument")
return dd.to_datetime(df, unit="s")
if is_cudf_type(df):
    return df
Looking into the GPU failures, it seems like in some cases skipping over the to_datetime call here means we end up passing an integer series to convert_to_datetime and erroring.
Is there a particular case where we wouldn't want to call to_datetime here?
Yes, the case that fails is:
df = pd.DataFrame({"d1": [1203073300], "d2": [1503073700]})
c.create_table("df", df, gpu=gpu)
expected_df = pd.DataFrame({"dt": [3472]})
df1 = c.sql(
"SELECT TIMESTAMPDIFF(DAY, to_timestamp(d1), to_timestamp(d2)) AS dt FROM df"
)
df1.compute()
which returns 0 instead of 3472 in the GPU case when to_datetime is used here. I'm not really sure why this happens, and there wasn't an obvious type check to avoid this. But you're right, it makes more sense to call to_datetime either way, so for now we can just skip that case.
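For context, a minimal sketch of the direction settled on here, i.e. calling to_datetime unconditionally rather than short-circuiting for cudf-backed data (the function name and exact shape are hypothetical; the real logic lives in dask_sql/physical/rex/core/call.py):

```python
import dask.dataframe as dd

def integer_to_timestamp(df, format=None):
    # Integers are interpreted as seconds since the epoch, matching
    # Spark/PostgreSQL semantics.
    if format is not None:
        raise RuntimeError("Integer input does not accept a format argument")
    # Convert unconditionally so downstream code (e.g. convert_to_datetime)
    # never receives a raw integer series, on CPU or GPU.
    return dd.to_datetime(df, unit="s")
```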
Thanks @charlesbluca! Lmk what you think.
charlesbluca left a comment
Thanks @sarahyurick! This LGTM, just one minor comment around opening issues to track the remaining failures.
ayushdg left a comment
Minor question but lgtm!
dask_sql/physical/rex/core/call.py
Outdated
if is_cudf_type(operands[0]) and isinstance(operands[1], np.timedelta64):
    operands = (dd.to_datetime(operands[0], unit="s"), operands[1])
Curious if this is still needed. I stepped through the tests and didn't need this change, as the operands were already in unit "s" from to_timestamp.
Good point, thanks! I think the remaining gpuCI failures are just ML failures unrelated to this PR, as the same failures are appearing in #1197 right now too.
Merging in since the failures are unrelated.
Closes #982
After some investigation, I found inconsistencies in our timestamp logic regarding whether scalars are interpreted as seconds or as nanoseconds. Since Spark and PostgreSQL both interpret scalars as seconds, the changes in this PR reflect that.
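As a quick illustration of the discrepancy (pandas semantics; the epoch value is borrowed from the test case above):

```python
import pandas as pd

# The same integer yields very different timestamps depending on the
# assumed unit; Spark and PostgreSQL treat it as seconds since the epoch.
print(pd.Timestamp(1203073300, unit="s"))   # 2008-02-15 11:01:40
print(pd.Timestamp(1203073300, unit="ns"))  # 1970-01-01 00:00:01.203073300
```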