Skip to content

Conversation

@charlesbluca
Copy link
Collaborator

@charlesbluca charlesbluca commented Dec 9, 2021

This is a continuation of #298 that also aims to implement CAST(... AS DATE) properly, as up until now that would work identically to CAST(... AS TIMESTAMP):

  • Casts DATE / TIMESTAMP scalars to np.datetime64 instead of datetime.datetime for compatibility with Pandas; for whatever reason datetime.datetimes work in cuDF, which should probably be documented somewhere
  • Casts DATE scalars to np.dtype("datetime64[D]") and TIMESTAMP scalars to np.dtype("datetime64[ns]"), so that DATE filters work as expected
  • Adds special support for CAST (... AS DATE) using the datetime column's dt.date attribute; this currently doesn't work in dask-cuDF as it is dependent on the datetime accessor of cuDF, which doesn't have a date attribute
    • In general, this is a hacky workaround for truncating timestamps that is required because both Pandas and cuDF don't allow truncation through astype("datetime64[D]"); in Pandas this only applies for timezoned datetimes, and in cuDF this applies for all datetimes

Closes #296
Closes #298

@charlesbluca charlesbluca added the blocked Blocked by work in another pull request label Dec 9, 2021
@beckernick
Copy link

@charlesbluca is rapidsai/cudf#7880 in the critical path for this PR, or are there relatively straightforward workarounds?

@charlesbluca
Copy link
Collaborator Author

Need to look through the code again to make sure this is accurate, but IIRC another path forward here would be to add support for casting pandas timezoned datetime / cudf datetime columns to datetime64[D] - if that were the case, we would be able to truncate the timestamps to dates with a single astype call and avoid the use of the DatetimeAccessor.

@codecov-commenter
Copy link

codecov-commenter commented Jan 27, 2022

Codecov Report

Merging #343 (e617a5c) into main (34fee74) will increase coverage by 0.16%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #343      +/-   ##
==========================================
+ Coverage   88.92%   89.09%   +0.16%     
==========================================
  Files          68       68              
  Lines        3333     3337       +4     
  Branches      651      653       +2     
==========================================
+ Hits         2964     2973       +9     
+ Misses        298      289       -9     
- Partials       71       75       +4     
Impacted Files Coverage Δ
dask_sql/mappings.py 100.00% <100.00%> (ø)
dask_sql/physical/rex/core/call.py 99.12% <100.00%> (+<0.01%) ⬆️
dask_sql/_version.py 34.00% <0.00%> (+1.44%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 34fee74...e617a5c. Read the comment docs.

@charlesbluca
Copy link
Collaborator Author

@ChrisJar I've checked and this should resolve #296 - if you get a chance, could you verify?

@ChrisJar
Copy link
Collaborator

@ChrisJar I've checked and this should resolve #296 - if you get a chance, could you verify?

I can confirm that this fixes all of the issues I was seeing. Thanks for working on this!

@charlesbluca charlesbluca removed the blocked Blocked by work in another pull request label Feb 14, 2022
@charlesbluca charlesbluca merged commit f601325 into dask-contrib:main Feb 14, 2022
@charlesbluca charlesbluca deleted the pandas-datetime-cast branch July 19, 2022 10:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] Selecting based on dates unexpectedly returns no results.

4 participants