
Conversation

@rajagurunath
Collaborator

Initial attempt to solve: #296

Explicitly handled the type conversion for the date column and the literal value. I'm wondering if there is a better way to handle this datetime casting. Happy to hear everyone's thoughts on this!

@codecov-commenter

codecov-commenter commented Nov 4, 2021

Codecov Report

Merging #298 (61cbca9) into main (5b8f8a9) will increase coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##             main     #298   +/-   ##
=======================================
  Coverage   95.89%   95.89%           
=======================================
  Files          65       65           
  Lines        2800     2804    +4     
  Branches      418      420    +2     
=======================================
+ Hits         2685     2689    +4     
  Misses         73       73           
  Partials       42       42           
Impacted Files                        Coverage Δ
dask_sql/mappings.py                  100.00% <100.00%> (ø)
dask_sql/physical/rex/core/call.py    99.11% <100.00%> (+<0.01%) ⬆️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5b8f8a9...61cbca9. Read the comment docs.

@quasiben
Contributor

quasiben commented Nov 5, 2021

This looks like a reasonable solution. Any ideas why cuDF works while pandas fails?

@rajagurunath
Collaborator Author

Hi @quasiben,

Thanks for reviewing the PR and pointing me in the right direction (I had missed the question of why it works for cuDF).

I dug into this and found the following: both pandas and cuDF work with the local timezone, but pandas does not work with the UTC timezone.

While executing the query, we convert the literal timestamp value to a datetime value with the following code:

from datetime import datetime, timezone

tz = literal_value.getTimeZone().getID()
assert str(tz) == "UTC", "The code can currently only handle UTC timezones"

# milliseconds since the epoch -> timezone-aware datetime in UTC
dt = datetime.fromtimestamp(
    int(literal_value.getTimeInMillis()) / 1000, timezone.utc
)

Example using the above function:
[screenshot: the resulting UTC-aware datetime value]

Since we currently accept only the UTC timezone, we may need to do the casting explicitly. What do you think?
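
For illustration, here is a minimal sketch of that mismatch outside dask-sql (the series and variable names are made up, but the comparison behaviour is standard pandas):

# Sketch: comparing a tz-naive pandas datetime column against a
# tz-aware (UTC) datetime literal fails, matching the pandas case above.
import pandas as pd
from datetime import datetime, timezone

s = pd.Series(pd.to_datetime(["2014-08-01"] * 3))  # dtype datetime64[ns], tz-naive
utc_literal = datetime(2014, 8, 1, tzinfo=timezone.utc)  # what the code above produces

try:
    s >= utc_literal
except TypeError as err:
    print(err)  # e.g. "Invalid comparison between dtype=datetime64[ns] and datetime"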

One more point in favour of handling the date casting explicitly: the current version of the code keeps the hh:mm:ss component even after casting a datetime column to a date column (for both the cuDF and pandas versions), i.e. the following test case fails on the current version of dask-sql (without the date casting):

def test_date_casting(c, datetime_table):
    # check date casting
    query = "SELECT cast(timezone as date) as date1, cast(utc_timezone as date) as date2 FROM datetime_table"
    result_df = c.sql(query).compute().astype(str)
    expected_df = pd.DataFrame(
        {"date1": ["2014-08-01"] * 3, "date2": ["2014-08-01"] * 3}
    )
    assert_frame_equal(result_df, expected_df, check_dtype=False)

Open to other ideas/suggestions from everyone!

@quasiben
Contributor

quasiben commented Nov 8, 2021

Thanks for digging into this @rajagurunath! I didn't know dask-sql sets everything to UTC if a TZ is not defined.

  1. We should make this clearer in the documentation
  2. If the user casts a column to date, we could also set the column to UTC:

ddf['d_date'] = ddf.d_date.dt.tz_localize('UTC')

I suspect this is slower than what you have.
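
As a rough sketch of option 2 (the sample frame and column name here are illustrative, not from the PR):

# Localize the column to UTC so it compares cleanly against a
# UTC-aware literal.
import dask.dataframe as dd
import pandas as pd
from datetime import datetime, timezone

df = pd.DataFrame({"d_date": pd.to_datetime(["2014-08-01"] * 3)})
ddf = dd.from_pandas(df, npartitions=1)

ddf["d_date"] = ddf.d_date.dt.tz_localize("UTC")
mask = ddf.d_date >= datetime(2014, 8, 1, tzinfo=timezone.utc)
print(mask.compute())  # tz-aware comparison now succeeds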

Collaborator

@charlesbluca charlesbluca left a comment


Thanks for the work here @rajagurunath 😄 just a few comments:

    int(literal_value.getTimeInMillis()) / 1000, timezone.utc
)

if sql_type == "DATE":

Would you mind adding a comment here linking to #296?

Suggested change
- if sql_type == "DATE":
+ # address datetime format mismatch
+ # https://github.com/dask-contrib/dask-sql/issues/296
+ if sql_type == "DATE":

if return_column is None:
    return operand
else:
    # handle datetime type specially

Same as above - just adding a link to #296

Suggested change
- # handle datetime type specially
+ # address datetime format mismatch
+ # https://github.com/dask-contrib/dask-sql/issues/296

assert_frame_equal(result_df, datetime_table)


def test_date_casting(c, datetime_table):

I think we don't actually need datetime_table here, since we only use the context?

Suggested change
- def test_date_casting(c, datetime_table):
+ def test_date_casting(c):

@ChrisJar
Collaborator

@rajagurunath Thanks for working on this! One thing I noticed was that when I perform a date comparison with this PR:

import dask.dataframe as dd
import pandas as pd
from dask_sql import Context

df = pd.DataFrame({'d_date':["2001-08-01", "2001-08-02", "2001-08-03"], 'val':[1,2,3]})
ddf = dd.from_pandas(df, npartitions=1)

c = Context()
c.create_table("df", ddf)

query = "SELECT val, d_date FROM df WHERE CAST(d_date as date) >= DATE '2001-08-02'"
c.sql(query).compute()

I get:

TypeError: '>=' not supported between instances of 'str' and 'datetime.date'

When I run the same code without this PR I get:

TypeError('Invalid comparison between dtype=datetime64[ns] and datetime')

Do you know what might be happening here?
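
(For context, the first error can be reproduced with plain Python, since the column holds strings; this is just an illustration, not dask-sql code:)

from datetime import date

# Python refuses to order a str against a datetime.date:
"2001-08-01" >= date(2001, 8, 2)
# TypeError: '>=' not supported between instances of 'str' and 'datetime.date'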

@charlesbluca
Collaborator

Working on a PR to supersede this one in #343
