
Conversation

@rajagurunath
Collaborator

Initial attempt to solve: #296

Explicitly handled the type conversion for the date column and the literal value. I'm wondering if there is a better way to handle this datetime casting. Happy to hear everyone's thoughts on this!

@codecov-commenter

codecov-commenter commented Nov 4, 2021

Codecov Report

Merging #298 (61cbca9) into main (5b8f8a9) will increase coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##             main     #298   +/-   ##
=======================================
  Coverage   95.89%   95.89%           
=======================================
  Files          65       65           
  Lines        2800     2804    +4     
  Branches      418      420    +2     
=======================================
+ Hits         2685     2689    +4     
  Misses         73       73           
  Partials       42       42           
Impacted Files                        Coverage Δ
dask_sql/mappings.py                  100.00% <100.00%> (ø)
dask_sql/physical/rex/core/call.py    99.11% <100.00%> (+<0.01%) ⬆️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5b8f8a9...61cbca9. Read the comment docs.

@quasiben
Contributor

quasiben commented Nov 5, 2021

This looks like a reasonable solution. Any ideas why cuDF works while pandas fails?

@rajagurunath
Collaborator Author

Hi @quasiben,

Thanks for reviewing the PR and pointing me in the right direction (I had missed the question of why it works for cuDF).

I dug into this and found the following: both pandas and cuDF work with the local timezone, but pandas does not work with the UTC timezone.

While executing the query, we convert the literal timestamp value to a datetime value with the following code:

from datetime import datetime, timezone

tz = literal_value.getTimeZone().getID()
assert str(tz) == "UTC", "The code can currently only handle UTC timezones"

# milliseconds since the epoch -> timezone-aware datetime in UTC
dt = datetime.fromtimestamp(
    int(literal_value.getTimeInMillis()) / 1000, timezone.utc
)

Example using the above function:
[screenshot: the resulting UTC-aware datetime value]

Since we currently accept only the UTC timezone, we may need to do the casting explicitly. What do you think?
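
For illustration, here is a minimal sketch of that mismatch outside dask-sql (the series and variable names are made up, but the comparison behaviour is standard pandas):

# Sketch: comparing a tz-naive pandas datetime column against a
# tz-aware (UTC) datetime literal fails, matching the pandas case above.
import pandas as pd
from datetime import datetime, timezone

s = pd.Series(pd.to_datetime(["2014-08-01"] * 3))  # dtype datetime64[ns], tz-naive
utc_literal = datetime(2014, 8, 1, tzinfo=timezone.utc)  # what the code above produces

try:
    s >= utc_literal
except TypeError as err:
    print(err)  # e.g. "Invalid comparison between dtype=datetime64[ns] and datetime"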

One more point in favour of handling the date casting explicitly: the current version of the code keeps the hh:mm:ss component even after casting a datetime column to a date column (for both the cuDF and pandas versions), i.e. the following test case fails on the current version of dask-sql (without the date casting):

def test_date_casting(c, datetime_table):
    # check date casting
    query = "SELECT cast(timezone as date) as date1, cast(utc_timezone as date) as date2 FROM datetime_table"
    result_df = c.sql(query).compute().astype(str)
    expected_df = pd.DataFrame(
        {"date1": ["2014-08-01"] * 3, "date2": ["2014-08-01"] * 3}
    )
    assert_frame_equal(result_df, expected_df, check_dtype=False)

Open to other ideas/suggestions from everyone!

@quasiben
Contributor

quasiben commented Nov 8, 2021

Thanks for digging into this @rajagurunath! I didn't know dask-sql sets everything to UTC if a TZ is not defined.

  1. We should make this clearer in the documentation
  2. If the user casts a column to date, we could also set the column to UTC:

ddf['d_date'] = ddf.d_date.dt.tz_localize('UTC')

I suspect this is slower than what you have.
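
As a rough sketch of option 2 (the sample frame and column name here are illustrative, not from the PR):

# Localize the column to UTC so it compares cleanly against a
# UTC-aware literal.
import dask.dataframe as dd
import pandas as pd
from datetime import datetime, timezone

df = pd.DataFrame({"d_date": pd.to_datetime(["2014-08-01"] * 3)})
ddf = dd.from_pandas(df, npartitions=1)

ddf["d_date"] = ddf.d_date.dt.tz_localize("UTC")
mask = ddf.d_date >= datetime(2014, 8, 1, tzinfo=timezone.utc)
print(mask.compute())  # tz-aware comparison now succeeds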

Collaborator

@charlesbluca charlesbluca left a comment


Thanks for the work here @rajagurunath 😄 just a few comments:

    int(literal_value.getTimeInMillis()) / 1000, timezone.utc
)

if sql_type == "DATE":

Would you mind adding a comment here linking to #296?

Suggested change
- if sql_type == "DATE":
+ # address datetime format mismatch
+ # https://github.com/dask-contrib/dask-sql/issues/296
+ if sql_type == "DATE":

if return_column is None:
    return operand
else:
    # handle datetime type specially

Same as above - just adding a link to #296

Suggested change
- # handle datetime type specially
+ # address datetime format mismatch
+ # https://github.com/dask-contrib/dask-sql/issues/296

assert_frame_equal(result_df, datetime_table)


def test_date_casting(c, datetime_table):

I think we don't actually need datetime_table here, since we only use the context?

Suggested change
- def test_date_casting(c, datetime_table):
+ def test_date_casting(c):

@ChrisJar
Collaborator

@rajagurunath Thanks for working on this! One thing I noticed was that when I perform a date comparison with this PR:

import dask.dataframe as dd
import pandas as pd
from dask_sql import Context

df = pd.DataFrame({'d_date':["2001-08-01", "2001-08-02", "2001-08-03"], 'val':[1,2,3]})
ddf = dd.from_pandas(df, npartitions=1)

c = Context()
c.create_table("df", ddf)

query = "SELECT val, d_date FROM df WHERE CAST(d_date as date) >= DATE '2001-08-02'"
c.sql(query).compute()

I get:

TypeError: '>=' not supported between instances of 'str' and 'datetime.date'

When I run the same code without this PR I get:

TypeError('Invalid comparison between dtype=datetime64[ns] and datetime')

Do you know what might be happening here?
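
(For context, the first error can be reproduced with plain Python, since the column holds strings; this is just an illustration, not dask-sql code:)

from datetime import date

# Python refuses to order a str against a datetime.date:
"2001-08-01" >= date(2001, 8, 2)
# TypeError: '>=' not supported between instances of 'str' and 'datetime.date'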

@charlesbluca
Collaborator

Working on a PR to supersede this one in #343
