
Conversation

@cloud-fan

What changes were proposed in this pull request?

An alternative fix of #22450

The issue is that Spark SQL accepts negative decimal scale by accident (I believe so), and has problems dealing with negative scale in some cases (e.g. decimal division, decimal rounding).
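For context, an illustrative note (not from the PR itself): java.math.BigDecimal, which Spark's decimal values build on, permits a negative scale, meaning the unscaled value carries trailing powers of ten. That is how negative-scale decimal types can appear at all.

// Illustrative only: in java.math.BigDecimal, value = unscaledValue * 10^(-scale),
// so a negative scale denotes trailing powers of ten.
import java.math.{BigDecimal => JBigDecimal, BigInteger}

val d = new JBigDecimal(BigInteger.valueOf(12), -2) // unscaled 12, scale -2, i.e. 1200
println(d)             // 1.2E+3
println(d.precision()) // 2
println(d.scale())     // -2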

The best solution is to officially support negative scale and make the behavior follow the SQL standard, but this is hard to do and will take a long time. Alternatively, we could officially forbid negative scale, but that is a breaking change and can't be shipped as a bug fix.

This PR proposes to avoid generating negative scale as much as possible, to work around this issue for the common cases.
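A minimal sketch of the idea, using hypothetical names and not the actual code in this PR: a value's negative scale can be folded into its precision, so that the resulting decimal type always has scale >= 0.

// Minimal sketch, not this PR's implementation: rescale so that scale >= 0,
// letting the precision absorb the trailing zeros.
import java.math.{BigDecimal => JBigDecimal}

def withNonNegativeScale(d: JBigDecimal): JBigDecimal =
  if (d.scale() < 0) d.setScale(0) else d // exact: increasing the scale never rounds

val v = new JBigDecimal("1.2E+3")         // precision 2, scale -2
val w = withNonNegativeScale(v)           // 1200: precision 4, scale 0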

How was this patch tested?

new tests

@cloud-fan

cloud-fan commented Sep 19, 2018

@dilipbiswal

@cloud-fan Could you please check CSVInferSchema::tryParseDecimal()? There is a condition there that checks for negative scale.

@SparkQA

SparkQA commented Sep 19, 2018

Test build #96271 has finished for PR 22470 at commit 3c05636.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91

@cloud-fan I think the main problem with this (and it is the reason why I haven't proposed it) is that the range of supported operations would shrink, so we may forbid operations that currently work. For instance, the following code has always worked on Spark: lit(BigDecimal(1e36)) * lit(BigDecimal(10000)). Today this becomes a decimal(6, -36). With your change, it becomes a decimal(42, 0), which is outside the range of supported values (i.e. an overflow would occur).
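For reference, a rough sketch of the arithmetic behind this example, using the common result-type rule for decimal multiplication (precision = p1 + p2 + 1, scale = s1 + s2) and a max precision of 38; this is illustrative, not Spark's DecimalPrecision code verbatim.

// Illustrative sketch of the multiplication result type, assuming
// precision = p1 + p2 + 1, scale = s1 + s2, and a max precision of 38.
val MaxPrecision = 38

def multiplyResultType(p1: Int, s1: Int, p2: Int, s2: Int): (Int, Int) =
  (p1 + p2 + 1, s1 + s2)

// With a negative scale, the trailing powers of ten stay out of the digit count,
// so the result precision stays small. Once the scale is clamped to 0, those
// trailing zeros count toward the precision and the result can exceed
// MaxPrecision, i.e. the overflow described above.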

I am not sure whether any user actually has something like this, but it is possible and I think we cannot rule it out. We could, though, narrow the condition under which it happens, i.e. avoid returning a negative scale only when parsing a literal. But the other fix would still be needed in that case, as we could still have to deal with negative scales, so IMO this would be of limited value.

I'd agree though about forbidding negative scale in 3.0.

@dilipbiswal

@mgaido91 @cloud-fan On the other hand, some use cases may work better :-). For example:
Before:

scala> spark.sql("create table dec as select (1e36 * 1) as col1")
org.apache.spark.SparkException: Cannot recognize hive type string: decimal(3,-36)
  at org.apache.spark.sql.hive.client.HiveClientImpl$.org$apache$spark$sql$hive$client$HiveClientImpl$$getSparkSQLDataType(HiveClientImpl.scala:883)
  at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$org$apache$spark$sql$hive$client$HiveClientImpl$$verifyColumnDataType$1.apply(HiveClientImpl.scala:905)

With this PR:

scala> spark.sql("create table dec as select (1e36 * 1) as col1")
18/09/19 14:29:29 WARN HiveMetaStore: Location: file:/user/hive/warehouse/dec specified for non-external table:dec
18/09/19 14:29:30 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
res0: org.apache.spark.sql.DataFrame = []
scala> spark.sql("describe table dec").show
+--------+-------------+-------+
|col_name|    data_type|comment|
+--------+-------------+-------+
|    col1|decimal(38,0)|   null|
+--------+-------------+-------+

We may also have issues writing out DataFrames containing decimals with negative scale to file-based data sources. I have not verified this, though.

@mgaido91

@dilipbiswal yes, I definitely think that in general we should eventually forbid negative scales. I only thought that this should be done in 3.0 rather than now, and that for now the safest option was to properly fix the one rule that was not working correctly in that case. I also agree with avoiding negative scales in places where they were not possible before 2.3, i.e. when parsing literals. But as there may be other cases where negative scales arise, this solves only part of the problem.

@dilipbiswal

@mgaido91 Makes sense. Actually, @cloud-fan had asked me to write some test cases for decimal values with negative scale in another PR. While I was playing around, I found this issue. It seemed to me that the whole negative-scale handling has not been thought through properly, which is why I wanted to share my findings here while you are discussing this :-).

@cloud-fan

@mgaido91 You are right, this still changes behavior if the intermediate result exceeds the max precision. Since most storage formats don't support negative scale (Hive, Parquet, etc.), I think we should eventually forbid it too. Let's move the discussion to #22450.

cloud-fan closed this Sep 20, 2018