
Conversation

@mgaido91
Contributor

@mgaido91 mgaido91 commented Jan 5, 2019

What changes were proposed in this pull request?

In Spark 2.x, literal values are parsed as DECIMALs. Many RDBMSs, instead, treat them as DOUBLEs; among them are Presto, Hive and MSSQL. The last two are particularly important for us, because they are the two we used as references for our implementation of decimal operations.

In the current scenario, specific constants - e.g. 1e10 - are parsed as DECIMALs with negative scale. This case is not handled properly by Spark at the moment, and there is an ongoing PR to fix the operation rules for it. Even though that PR does not completely forbid decimals with negative scale, it considerably reduces the cases in which they can occur, naturally resolving the problem mentioned above.

This PR introduces the config option spark.sql.legacy.literals.asDecimal, which can be used to restore the previous behavior.
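
A minimal sketch (Scala) of how the proposed flag could be exercised, assuming a Spark build with this change applied; the literal types it would report are exactly what is being changed here, so no particular output is claimed:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("literal-typing").getOrCreate()

// With the behavior proposed in this PR, a literal like 1e10 would be parsed as DOUBLE rather than DECIMAL.
spark.sql("SELECT 1e10 AS c").printSchema()

// The proposed flag (not part of released Spark, since this PR was closed) restores the Spark 2.x
// behavior of parsing such literals as DECIMAL.
spark.conf.set("spark.sql.legacy.literals.asDecimal", "true")
spark.sql("SELECT 1e10 AS c").printSchema()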

How was this patch tested?

modified/enhanced UTs

@SparkQA

SparkQA commented Jan 5, 2019

Test build #100796 has finished for PR 23468 at commit 2dde31f.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

In Spark 2.x, literal values are parsed as DECIMALs. Many RDBMSs, instead, treat them as DOUBLEs; among them are Presto, Hive and MSSQL.

This argument is not enough. For example, Presto even treats 1.2 as DOUBLE; see the commit prestodb/presto@dba2648. It is still different from the mainstream enterprise database DB2.

I would prefer documenting our current behavior first. Update our Spark SQL doc? Below are some references.

@mgaido91
Contributor Author

mgaido91 commented Jan 5, 2019

@gatorsmile thanks for your comment. Let me also cc @cloud-fan, @dilipbiswal and @rxin, who participated in the related discussion.

I agree on improving the doc on this. If you are fine with that, I'll create a JIRA and a PR for that ASAP.

The main argument for this is that we are basing our decimal implementation on what Hive and MSSQL do, but at the moment we have two main differences from them:

  1. We allow negative decimal scales (and at the moment we don't handle them properly; see the sketch after this list);
  2. We use decimals for literals, while Hive/MSSQL use DOUBLE (so we "enlarge" the issue). Using DOUBLE, when possible, seems safer, and it is what our references do for these operations.
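
A minimal sketch (plain Scala, using java.math.BigDecimal, which backs Spark's Decimal type for non-compact values) of how an exponent literal naturally carries a negative scale when kept as an exact decimal value, i.e. point 1 above:

// Parsing the string form of an exponent literal keeps the unscaled value small
// and pushes the magnitude into a negative scale.
val bd = new java.math.BigDecimal("1e10")
println(s"unscaledValue=${bd.unscaledValue}, precision=${bd.precision}, scale=${bd.scale}")
// prints: unscaledValue=1, precision=1, scale=-10  -> the decimal(1, -10) shape discussed above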

@dongjoon-hyun
Member

dongjoon-hyun commented Jan 5, 2019

Hi, @mgaido91 and @gatorsmile.

First of all, Hive has started to use DECIMAL by default. Also, this change would introduce TPC-H query result differences among Spark versions. We cannot do this; I'm -1 on this.

hive> select version();
OK
3.1.1 rf4e0529634b6231a0072295da48af466cf2f10b7
Time taken: 0.089 seconds, Fetched: 1 row(s)

hive> explain select 2.3;
OK
STAGE DEPENDENCIES:
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        TableScan
          alias: _dummy_table
          Row Limit Per Split: 1
          Statistics: Num rows: 1 Data size: 10 Basic stats: COMPLETE Column stats: COMPLETE
          Select Operator
            expressions: 2.3 (type: decimal(2,1))
            outputColumnNames: _col0
            Statistics: Num rows: 1 Data size: 112 Basic stats: COMPLETE Column stats: COMPLETE
            ListSink

@SparkQA

SparkQA commented Jan 6, 2019

Test build #100807 has finished for PR 23468 at commit 5d3d400.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91
Contributor Author

mgaido91 commented Jan 6, 2019

Thanks for your comment, @dongjoon-hyun. I wasn't aware of the behavior change in Hive. Then, if we don't want to do this, I think #21599 becomes even more important, and we cannot forbid negative scales in decimals, because we would no longer be able to represent numbers such as 1e40 (with the maximum decimal precision of 38 digits, such a value cannot be expressed at scale 0, but it fits with a negative scale).

We may instead try here to reduce the cases in which we use negative scales, so that 1e20 would be considered a decimal(20, 0) instead of a decimal(1, -20). What do you think about this?
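
A minimal sketch (Scala, in spark-shell, where the spark session is predefined) of how one can inspect the type the parser assigns to such a literal; the reported type depends on the Spark version and on which of the rules discussed in this thread is in effect, so no particular output is assumed:

// Print the name and data type of the single column produced by the literal-only query.
spark.sql("SELECT 1e20 AS c").schema.foreach(f => println(s"${f.name}: ${f.dataType}"))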

@dongjoon-hyun
Member

dongjoon-hyun commented Jan 6, 2019

Yes. Thank you for understanding, @mgaido91. +1 for transforming this PR or #21599 to handle the specific 1e20 case. For all cases, please also refer to Hive 3.1.

@cloud-fan
Contributor

If we are going to support negative scale, isn't decimal(1, -20) better than decimal(20, 0)?

@mgaido91
Contributor Author

mgaido91 commented Jan 7, 2019

@cloud-fan it depends on what we consider "better". It is better because it requires less precision, avoiding potential truncation in subsequent operations. It is worse because, if the output is written to a datasource which doesn't support decimals with negative scales, it doesn't work. A small sketch of the precision side of this trade-off follows below.
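
A minimal sketch (plain Scala, showing generic java.math.BigDecimal behavior, not tied to any Spark version) of that precision trade-off: the same value can be carried with a single digit of precision at a negative scale, while rescaling it to scale 0 inflates the precision that is required:

// 1 x 10^20 stored with unscaled value 1 and scale -20: one digit of precision is enough.
val negScale = new java.math.BigDecimal(java.math.BigInteger.ONE, -20)
println(s"scale=${negScale.scale}, precision=${negScale.precision}")   // scale=-20, precision=1

// The same value rescaled to scale 0: the unscaled value becomes 10^20, which needs 21 digits.
val zeroScale = negScale.setScale(0)
println(s"scale=${zeroScale.scale}, precision=${zeroScale.precision}") // scale=0, precision=21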

Let me close this and let's address #21599 first, then we can get back to this later if needed. Thanks.

@mgaido91 mgaido91 closed this Jan 7, 2019
