[SPARK-16883][SparkR]: SQL decimal type is not properly cast to number when collecting SparkDataFrame #14613
Conversation
Test build #63648 has finished for PR 14613 at commit
@wangmiao1981 Thanks for the PR. Could we add a couple of test cases for this? It'll also help me understand what the expected behavior is -- one of them could be for
@shivaram Sure. I will add unit tests.
Test build #63710 has finished for PR 14613 at commit
R/pkg/R/DataFrame.R (outdated)
Maybe this should go into types.R? Could you add a documentation comment explaining what this does and why?
I will add comments. Thanks!
Test build #63775 has finished for PR 14613 at commit
Since this is the collect DataFrame test, perhaps you want to move these to the sql test or the createOrReplaceTempView test?
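A hedged sketch of what such a test could look like once moved into the SQL test suite; the test name and exact assertions are illustrative, not the ones from the patch:

```r
library(testthat)

test_that("collect() casts decimal columns to R doubles", {
  registerTempTable(createDataFrame(iris), "iris")
  df <- collect(sql("select cast('2' as decimal) as y from iris limit 5"))
  expect_true(is.numeric(df$y))   # a plain numeric vector, not a list column
  expect_equal(df$y, rep(2, 5))
})
```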
@felixcheung Let me think about your comments and I will get back soon.
@felixcheung I changed the tests according to your comments. For the
Test build #64297 has finished for PR 14613 at commit
R/pkg/R/types.R (outdated)
Should this be `"^decimal(.+)$"`?
Why do we switch on the first character? Why don't we regex on the full string?
In case there are other types we want to handle in the future. Right?
I'd just simplify this at this point. It would be easy to add the additional checks later.
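Illustrative only: the simpler full-string regex check the reviewers suggest, instead of switching on the first character. The function name `isDecimalType` is hypothetical, not from the patch:

```r
# Match "decimal" with an optional "(precision, scale)" suffix.
isDecimalType <- function(type) {
  grepl("^decimal(\\(\\d+, *\\d+\\))?$", type)
}

isDecimalType("decimal(10, 0)")  # TRUE
isDecimalType("decimal")         # TRUE
isDecimalType("double")          # FALSE
```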
Test build #64808 has finished for PR 14613 at commit
Test build #64834 has finished for PR 14613 at commit
… when collecting SparkDataFrame
## What changes were proposed in this pull request?
registerTempTable(createDataFrame(iris), "iris")
str(collect(sql("select cast('1' as double) as x, cast('2' as decimal) as y from iris limit 5")))
'data.frame': 5 obs. of 2 variables:
$ x: num 1 1 1 1 1
$ y:List of 5
..$ : num 2
..$ : num 2
..$ : num 2
..$ : num 2
..$ : num 2
The problem is that Spark returns a `decimal(10, 0)` column type instead of `decimal`, so `decimal(10, 0)` is not handled correctly; it should be handled as `double`.
As discussed in the JIRA thread, there are two potential fixes:
1) A Scala-side fix that adds a new case when writing the object back. However, I can't use spark.sql.types._ in Spark Core due to dependency issues, and I haven't found a way to do the type case match.
2) A SparkR-side fix: add a helper function that checks for special types like `"decimal(10, 0)"` and replaces them with `double`, which is a PRIMITIVE type. This helper is generic, so handling for new types can be added in the future (see the sketch below).
I opened this PR to discuss the pros and cons of both approaches. If we want the Scala-side fix, we need to find a way to match the DecimalType and StructType cases in Spark Core.
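A minimal sketch of the SparkR-side helper described in option 2, using the `"^decimal(.+)$"` pattern discussed in the review. The name `specialtypeshandle` and its placement in R/pkg/R/types.R are assumptions for illustration; the merged code may differ:

```r
# Spark reports decimal columns with precision and scale, e.g. "decimal(10,0)",
# which is absent from the primitive-type lookup; map any such type to "double".
specialtypeshandle <- function(type) {
  returntype <- NULL
  if (grepl("^decimal(.+)$", type)) {
    returntype <- "double"
  }
  returntype
}

specialtypeshandle("decimal(10,0)")  # "double"
specialtypeshandle("string")         # NULL
```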
## How was this patch tested?
Manual test:
> str(collect(sql("select cast('1' as double) as x, cast('2' as decimal) as y from iris limit 5")))
'data.frame': 5 obs. of 2 variables:
$ x: num 1 1 1 1 1
$ y: num 2 2 2 2 2
R unit tests.
Author: [email protected] <[email protected]>
Closes #14613 from wangmiao1981/type.
(cherry picked from commit 0f30cde)
Signed-off-by: Felix Cheung <[email protected]>
Merged to 2.0.1 and 2.1.0. Thanks!