
Conversation

@cloud-fan

What changes were proposed in this pull request?

An alternative fix of #22450

The issue is that Spark SQL accepts negative decimal scale by accident (I believe so), and has problems dealing with negative scale in some cases (e.g. decimal division, decimal rounding).
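For context, an illustrative note (not from the PR itself): java.math.BigDecimal, which Spark's decimal values build on, permits a negative scale, meaning the unscaled value carries trailing powers of ten. That is how negative-scale decimal types can appear at all.

// Illustrative only: in java.math.BigDecimal, value = unscaledValue * 10^(-scale),
// so a negative scale denotes trailing powers of ten.
import java.math.{BigDecimal => JBigDecimal, BigInteger}

val d = new JBigDecimal(BigInteger.valueOf(12), -2) // unscaled 12, scale -2, i.e. 1200
println(d)             // 1.2E+3
println(d.precision()) // 2
println(d.scale())     // -2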

The best solution is to officially support negative scale and make the behavior follow the SQL standard, but this is hard to do and will take a long time. Alternatively, we could officially forbid negative scale, but that is a breaking change and can't be shipped as a bug fix.

This PR proposes to avoid generating negative scale as much as possible, to work around this issue for the common cases.
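A minimal sketch of the idea, using hypothetical names and not the actual code in this PR: a value's negative scale can be folded into its precision, so that the resulting decimal type always has scale >= 0.

// Minimal sketch, not this PR's implementation: rescale so that scale >= 0,
// letting the precision absorb the trailing zeros.
import java.math.{BigDecimal => JBigDecimal}

def withNonNegativeScale(d: JBigDecimal): JBigDecimal =
  if (d.scale() < 0) d.setScale(0) else d // exact: increasing the scale never rounds

val v = new JBigDecimal("1.2E+3")         // precision 2, scale -2
val w = withNonNegativeScale(v)           // 1200: precision 4, scale 0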

How was this patch tested?

new tests

@cloud-fan

cloud-fan commented Sep 19, 2018

@dilipbiswal

@cloud-fan Could you please check CSVInferSchema::tryParseDecimal()? There is a condition there that checks for negative scale.

@SparkQA

SparkQA commented Sep 19, 2018

Test build #96271 has finished for PR 22470 at commit 3c05636.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91

@cloud-fan I think the main problem with this (and it is the reason why I haven't proposed it) is that the range of supported operations would shrink, so we may forbid operations that currently work. For instance, the following code has always worked on Spark: lit(BigDecimal(1e36)) * lit(BigDecimal(10000)). Today this becomes a decimal(6, -36). With your change, it becomes a decimal(42, 0), which is outside the range of supported values (i.e. an overflow would occur).
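For reference, a rough sketch of the arithmetic behind this example, using the common result-type rule for decimal multiplication (precision = p1 + p2 + 1, scale = s1 + s2) and a max precision of 38; this is illustrative, not Spark's DecimalPrecision code verbatim.

// Illustrative sketch of the multiplication result type, assuming
// precision = p1 + p2 + 1, scale = s1 + s2, and a max precision of 38.
val MaxPrecision = 38

def multiplyResultType(p1: Int, s1: Int, p2: Int, s2: Int): (Int, Int) =
  (p1 + p2 + 1, s1 + s2)

// With a negative scale, the trailing powers of ten stay out of the digit count,
// so the result precision stays small. Once the scale is clamped to 0, those
// trailing zeros count toward the precision and the result can exceed
// MaxPrecision, i.e. the overflow described above.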

I am not sure whether any user actually has something like this, but it is possible and I think we cannot rule it out. We could, though, narrow the condition under which it happens, i.e. avoid returning a negative scale only when parsing a literal. But the other fix would still be needed in that case, as we could still have to deal with negative scales, so IMO this would be of limited value.

I'd agree though about forbidding negative scale in 3.0.

@dilipbiswal

@mgaido91 @cloud-fan On the other hand, some use cases may work better :-). For example:
Before:

scala> spark.sql("create table dec as select (1e36 * 1) as col1")
org.apache.spark.SparkException: Cannot recognize hive type string: decimal(3,-36)
  at org.apache.spark.sql.hive.client.HiveClientImpl$.org$apache$spark$sql$hive$client$HiveClientImpl$$getSparkSQLDataType(HiveClientImpl.scala:883)
  at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$org$apache$spark$sql$hive$client$HiveClientImpl$$verifyColumnDataType$1.apply(HiveClientImpl.scala:905)

With this PR:

scala> spark.sql("create table dec as select (1e36 * 1) as col1")
18/09/19 14:29:29 WARN HiveMetaStore: Location: file:/user/hive/warehouse/dec specified for non-external table:dec
18/09/19 14:29:30 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
res0: org.apache.spark.sql.DataFrame = []
scala> spark.sql("describe table dec").show
+--------+-------------+-------+
|col_name|    data_type|comment|
+--------+-------------+-------+
|    col1|decimal(38,0)|   null|
+--------+-------------+-------+

We may also have issues writing out DataFrames containing decimals with negative scale to file-based data sources. I have not verified this, though.

@mgaido91

@dilipbiswal yes, I definitely think that in general we should eventually forbid negative scales. I only thought that this should be done in 3.0 rather than now, and that for now the safest option was to properly fix the one rule that was not working correctly in that case. I also agree with avoiding negative scales in places where they were not possible before 2.3, i.e. when parsing literals. But as there may be other cases where negative scales arise, this solves only part of the problem.

@dilipbiswal

@mgaido91 Makes sense. Actually, @cloud-fan had asked me to write some test cases for decimal values with negative scale in another PR. While I was playing around, I found this issue. It seemed to me that the whole negative-scale handling has not been thought through properly, which is why I wanted to share my findings here while you are discussing this :-).

@cloud-fan

@mgaido91 You are right, this still changes behavior if the intermediate result exceeds the max precision. Since most storage formats don't support negative scale (Hive, Parquet, etc.), I think we should eventually forbid it too. Let's move the discussion to #22450.

cloud-fan closed this Sep 20, 2018