Skip to content

Conversation

@MaxGekk
Copy link
Member

@MaxGekk MaxGekk commented Oct 18, 2019

What changes were proposed in this pull request?

  • Cast strings to intervals in interval - string. For example:
spark-sql> select interval '1 month' - '1 day 2 hours';
interval 1 months -1 days -2 hours
  • Cast strings to intervals in datetime + string or string + datetime. For example:
spark-sql> select timestamp'today' + '1 month';
2019-11-18 00:00:00
  • Cast strings to timestamps in datetime - string or string - datetime. For example:
spark-sql> select date'today' - '2019-10-01';
interval 2 weeks 3 days

Why are the changes needed?

To improve user experience with Spark SQL

Does this PR introduce any user-facing change?

Yes, previously the operations fails with the errors:

spark-sql> select timestamp'today' + '1 month';
Error in query: cannot resolve '(TIMESTAMP('2019-10-18 00:00:00') + CAST('1 month' AS DOUBLE))' due to data type mismatch: differing types in '(TIMESTAMP('2019-10-18 00:00:00') + CAST('1 month' AS DOUBLE))' (timestamp and double).; line 1 pos 7;
'Project [unresolvedalias((1571346000000000 + cast(1 month as double)), None)]
+- OneRowRelation

How was this patch tested?

  • Add tests to TypeCoercionSuite to check rules
  • End-to-end tests in dateTimeOperations.sql

@MaxGekk
Copy link
Member Author

MaxGekk commented Oct 18, 2019

@wangyum Please, review this PR.

@SparkQA
Copy link

SparkQA commented Oct 18, 2019

Test build #112279 has finished for PR 26165 at commit 9a2ce13.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum
Copy link
Member

wangyum commented Oct 18, 2019

I tried to add this feature before: #25190
cc @maropu @dongjoon-hyun

@MaxGekk
Copy link
Member Author

MaxGekk commented Oct 18, 2019

I tried to add this feature before: #25190

Why was it closed? This casting is useful independently from PostgreSQL. I believe this improves user experience with Spark SQL. Don't think it is better to fail on casting a string to DOUBLE.

@SparkQA
Copy link

SparkQA commented Oct 18, 2019

Test build #112282 has finished for PR 26165 at commit 01a42a4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Oct 18, 2019

Test build #112290 has finished for PR 26165 at commit b001177.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum
Copy link
Member

wangyum commented Oct 18, 2019

@MaxGekk Please see @maropu's comment:
#25190 (comment)

@MaxGekk
Copy link
Member Author

MaxGekk commented Oct 19, 2019

Can we consider the feature out of the scope of compatibility with PostgreSQL?

@MaxGekk
Copy link
Member Author

MaxGekk commented Oct 21, 2019

@srowen @cloud-fan @HyukjinKwon WDYT?

@srowen
Copy link
Member

srowen commented Oct 21, 2019

I'm just trying to think if this causes any surprising behavior, either because some string is unintentionally cast to a date or interval type, or because it doesn't fully work in all cases (see the other ticket). How much do we need to support this? If it's not a standard behavior across DBs, I'd kind of prefer to have people write more explicit casts for such a thing. Do we cast strings to other types implicitly though?

@cloud-fan
Copy link
Contributor

Is it really a useful feature? Users can simply put the INTERVAL keyword before the string literal, like timestamp'today' + interval'1 month'

For timestamp_col + string_col, I don't think it's reasonable to implicitly cast string_col to interval. People may get null result silently and hide the mistake(they may pick a wrong column to add).

@MaxGekk
Copy link
Member Author

MaxGekk commented Oct 22, 2019

Users can simply put the INTERVAL keyword before the string literal ...

You are right, users can put it before literals but what about string column. Users have to explicitly cast it, correct?

People may get null result silently and hide the mistake(they may pick a wrong column to add).

Yeh, though the same argument is applicable to other implicit cast from strings, for example:

spark-sql> select 2 * '3';
6.0

Why does Spark implicitly cast '3' to a numeric. Maybe an user wants to repeat '3' two times, and get '33'?

Let's consider proposed implicit casts one by one:

  1. select interval '1 month' - <string>. What else could we subtract from interval besides of another interval. Don't think casting to interval is ambiguous, here.
  2. select timestamp'today' + <string>. What else could we add to a timestamp or date? Definitely, not another timestamp or date. Maybe we could add a number but what is the number - days, seconds or microseconds? Obviously, a timestamp can be adjusted by an interval.
  3. select date'today' - <string>. Here, I follow PostgreSQL which cast the string to datetime implicitly:
maxim=# select date'today' - '2019-10-01';
 ?column?
----------
       21
(1 row)

@maropu
Copy link
Member

maropu commented Oct 23, 2019

Yea, I'm still worried that this feature might cause unexpected results in complicated SQL queries as others suggested..., btw, any dbms-like systems support this implicit casts?

@cloud-fan
Copy link
Contributor

Why does Spark implicitly cast '3' to a numeric

To be honest I think this is a mistake as well. This kind of implicit cast is really risky as Spark returns null for invalid cast.

Even if invalid cast can throw runtime exception, we should still not allow this kind of implicit cast to be safe. FYI this is the result of pgsql

cloud0fan=# select 1 + s from t;
ERROR:  operator does not exist: integer + character varying
LINE 1: select 1 + s from t;
                 ^
HINT:  No operator matches the given name and argument type(s). You might need to add explicit type casts.

AFAIK pgsql(and some other DBs) only apply this kind of implicit cast for string literal. We'd either update the type coercion module to handle string literal specially, or not do it at all.

@MaxGekk
Copy link
Member Author

MaxGekk commented Oct 25, 2019

I am closing this PR.

@MaxGekk MaxGekk closed this Oct 25, 2019
@MaxGekk MaxGekk deleted the string-to-interval-implicit-conv branch June 5, 2020 19:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants