
Conversation

@mgaido91
Contributor

@mgaido91 mgaido91 commented Jan 5, 2019

What changes were proposed in this pull request?

In Spark 2.x, literal values are parsed as DECIMALs. Many RDBMSs, instead, treat them as DOUBLEs; among them are Presto, Hive and MSSQL. The last two are particularly important for us, because they are the two we used as references for our implementation of decimal operations.

In the current scenario, specific constants - e.g. 1e10 - are parsed as DECIMALs with negative scale. This case is not handled properly by Spark at the moment, and there is an ongoing PR to fix the operation rules for it. Even though that PR does not completely forbid decimals with negative scale, it considerably reduces the cases in which they can occur, naturally resolving the problem mentioned above.

This PR introduces the config option spark.sql.legacy.literals.asDecimal, which can be used to restore the previous behavior.
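
A minimal sketch (Scala) of how the proposed flag could be exercised, assuming a Spark build with this change applied; the literal types it would report are exactly what is being changed here, so no particular output is claimed:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("literal-typing").getOrCreate()

// With the behavior proposed in this PR, a literal like 1e10 would be parsed as DOUBLE rather than DECIMAL.
spark.sql("SELECT 1e10 AS c").printSchema()

// The proposed flag (not part of released Spark, since this PR was closed) restores the Spark 2.x
// behavior of parsing such literals as DECIMAL.
spark.conf.set("spark.sql.legacy.literals.asDecimal", "true")
spark.sql("SELECT 1e10 AS c").printSchema()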

How was this patch tested?

modified/enhanced UTs

@SparkQA

SparkQA commented Jan 5, 2019

Test build #100796 has finished for PR 23468 at commit 2dde31f.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

In Spark 2.x, literal values are parsed as DECIMALs. Many RDBMSs, instead, treat them as DOUBLEs; among them are Presto, Hive and MSSQL.

This argument is not enough. For example, Presto even treats 1.2 as DOUBLE; see the commit prestodb/presto@dba2648. It is still different from the mainstream enterprise database DB2.

I would prefer documenting our current behavior first. Update our Spark SQL doc? Below are some references.

@mgaido91
Contributor Author

mgaido91 commented Jan 5, 2019

@gatorsmile thanks for your comment. Let me also cc @cloud-fan, @dilipbiswal and @rxin, who participated in the related discussion.

I agree on improving the doc on this. If you are fine with that, I'll create a JIRA and a PR for that ASAP.

The main argument for this is that we are basing our decimal implementation on what Hive and MSSQL do, but at the moment we have two main differences from them:

  1. We allow negative decimal scales (and at the moment we don't handle them properly; see the sketch after this list);
  2. We use decimals for literals, while Hive/MSSQL use DOUBLE (so we "enlarge" the issue). Using DOUBLE, when possible, seems safer, and it is what our references do for these operations.
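
A minimal sketch (plain Scala, using java.math.BigDecimal, which backs Spark's Decimal type for non-compact values) of how an exponent literal naturally carries a negative scale when kept as an exact decimal value, i.e. point 1 above:

// Parsing the string form of an exponent literal keeps the unscaled value small
// and pushes the magnitude into a negative scale.
val bd = new java.math.BigDecimal("1e10")
println(s"unscaledValue=${bd.unscaledValue}, precision=${bd.precision}, scale=${bd.scale}")
// prints: unscaledValue=1, precision=1, scale=-10  -> the decimal(1, -10) shape discussed above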

@dongjoon-hyun
Member

dongjoon-hyun commented Jan 5, 2019

Hi, @mgaido91 and @gatorsmile.

First of all, Hive has started to use DECIMAL by default. Also, this change would introduce TPC-H query result differences among Spark versions. We cannot do this; I'm -1 on this.

hive> select version();
OK
3.1.1 rf4e0529634b6231a0072295da48af466cf2f10b7
Time taken: 0.089 seconds, Fetched: 1 row(s)

hive> explain select 2.3;
OK
STAGE DEPENDENCIES:
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        TableScan
          alias: _dummy_table
          Row Limit Per Split: 1
          Statistics: Num rows: 1 Data size: 10 Basic stats: COMPLETE Column stats: COMPLETE
          Select Operator
            expressions: 2.3 (type: decimal(2,1))
            outputColumnNames: _col0
            Statistics: Num rows: 1 Data size: 112 Basic stats: COMPLETE Column stats: COMPLETE
            ListSink

@SparkQA

SparkQA commented Jan 6, 2019

Test build #100807 has finished for PR 23468 at commit 5d3d400.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91
Contributor Author

mgaido91 commented Jan 6, 2019

Thanks for your comment, @dongjoon-hyun. I wasn't aware of the behavior change in Hive. Then, if we don't want to do this, I think #21599 becomes even more important, and we cannot forbid negative scales in decimals, because we would no longer be able to represent numbers such as 1e40 (with the maximum decimal precision of 38 digits, such a value cannot be expressed at scale 0, but it fits with a negative scale).

We may instead try here to reduce the cases in which we use negative scales, so that 1e20 would be considered a decimal(20, 0) instead of a decimal(1, -20). What do you think about this?
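
A minimal sketch (Scala, in spark-shell, where the spark session is predefined) of how one can inspect the type the parser assigns to such a literal; the reported type depends on the Spark version and on which of the rules discussed in this thread is in effect, so no particular output is assumed:

// Print the name and data type of the single column produced by the literal-only query.
spark.sql("SELECT 1e20 AS c").schema.foreach(f => println(s"${f.name}: ${f.dataType}"))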

@dongjoon-hyun
Member

dongjoon-hyun commented Jan 6, 2019

Yes. Thank you for understanding, @mgaido91. +1 for transforming this PR or #21599 to handle the specific 1e20 case. For all cases, please also refer to Hive 3.1.

@cloud-fan
Contributor

If we are going to support negative scale, isn't decimal(1, -20) better than decimal(20, 0)?

@mgaido91
Contributor Author

mgaido91 commented Jan 7, 2019

@cloud-fan it depends on what we consider "better". It is better because it requires less precision, avoiding potential truncation in subsequent operations. It is worse because, if the output is written to a datasource which doesn't support decimals with negative scales, it doesn't work. A small sketch of the precision side of this trade-off follows below.
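
A minimal sketch (plain Scala, showing generic java.math.BigDecimal behavior, not tied to any Spark version) of that precision trade-off: the same value can be carried with a single digit of precision at a negative scale, while rescaling it to scale 0 inflates the precision that is required:

// 1 x 10^20 stored with unscaled value 1 and scale -20: one digit of precision is enough.
val negScale = new java.math.BigDecimal(java.math.BigInteger.ONE, -20)
println(s"scale=${negScale.scale}, precision=${negScale.precision}")   // scale=-20, precision=1

// The same value rescaled to scale 0: the unscaled value becomes 10^20, which needs 21 digits.
val zeroScale = negScale.setScale(0)
println(s"scale=${zeroScale.scale}, precision=${zeroScale.precision}") // scale=0, precision=21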

Let me close this and let's address #21599 first, then we can get back to this later if needed. Thanks.

@mgaido91 mgaido91 closed this Jan 7, 2019
