[SPARK-16883][SparkR]: SQL decimal type is not properly cast to number when collecting SparkDataFrame #14613
Conversation
Test build #63648 has finished for PR 14613 at commit
@wangmiao1981 Thanks for the PR. Could we add a couple of test cases for this? It'll also help me understand what the expected behavior is -- one of them could be for
@shivaram Sure. I will add unit tests.
Test build #63710 has finished for PR 14613 at commit
R/pkg/R/DataFrame.R (outdated)
Maybe this should go into types.R? Could you add a documentation comment explaining what this does and why?
I will add comments. Thanks!
Test build #63775 has finished for PR 14613 at commit
Since this is the collect DataFrame test, perhaps you want to move these to the sql test or the createOrReplaceTempView test?
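A hedged sketch of what such a test could look like once moved into the SQL test suite; the test name and exact assertions are illustrative, not the ones from the patch:

```r
library(testthat)

test_that("collect() casts decimal columns to R doubles", {
  registerTempTable(createDataFrame(iris), "iris")
  df <- collect(sql("select cast('2' as decimal) as y from iris limit 5"))
  expect_true(is.numeric(df$y))   # a plain numeric vector, not a list column
  expect_equal(df$y, rep(2, 5))
})
```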
@felixcheung Let me think about your comments and I will get back soon.
@felixcheung I changed the tests according to your comments. For the
Test build #64297 has finished for PR 14613 at commit
R/pkg/R/types.R (outdated)
Should this be `"^decimal(.+)$"`?
Why do we switch on the first character? Why don't we regex on the full string?
In case there are other types we want to handle in the future. Right?
I'd just simplify this at this point. It would be easy to add the additional checks later.
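Illustrative only: the simpler full-string regex check the reviewers suggest, instead of switching on the first character. The function name `isDecimalType` is hypothetical, not from the patch:

```r
# Match "decimal" with an optional "(precision, scale)" suffix.
isDecimalType <- function(type) {
  grepl("^decimal(\\(\\d+, *\\d+\\))?$", type)
}

isDecimalType("decimal(10, 0)")  # TRUE
isDecimalType("decimal")         # TRUE
isDecimalType("double")          # FALSE
```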
Test build #64808 has finished for PR 14613 at commit
Test build #64834 has finished for PR 14613 at commit
… when collecting SparkDataFrame
## What changes were proposed in this pull request?
registerTempTable(createDataFrame(iris), "iris")
str(collect(sql("select cast('1' as double) as x, cast('2' as decimal) as y from iris limit 5")))
'data.frame': 5 obs. of 2 variables:
$ x: num 1 1 1 1 1
$ y:List of 5
..$ : num 2
..$ : num 2
..$ : num 2
..$ : num 2
..$ : num 2
The problem is that Spark returns a `decimal(10, 0)` column type instead of `decimal`, so `decimal(10, 0)` is not handled correctly; it should be handled as `double`.
As discussed in the JIRA thread, there are two potential fixes:
1) A Scala-side fix that adds a new case when writing the object back. However, I can't use spark.sql.types._ in Spark Core due to dependency issues, and I haven't found a way to do the type case match.
2) A SparkR-side fix: add a helper function that checks for special types like `"decimal(10, 0)"` and replaces them with `double`, which is a PRIMITIVE type. This helper is generic, so handling for new types can be added in the future (see the sketch below).
I opened this PR to discuss the pros and cons of both approaches. If we want the Scala-side fix, we need to find a way to match the DecimalType and StructType cases in Spark Core.
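A minimal sketch of the SparkR-side helper described in option 2, using the `"^decimal(.+)$"` pattern discussed in the review. The name `specialtypeshandle` and its placement in R/pkg/R/types.R are assumptions for illustration; the merged code may differ:

```r
# Spark reports decimal columns with precision and scale, e.g. "decimal(10,0)",
# which is absent from the primitive-type lookup; map any such type to "double".
specialtypeshandle <- function(type) {
  returntype <- NULL
  if (grepl("^decimal(.+)$", type)) {
    returntype <- "double"
  }
  returntype
}

specialtypeshandle("decimal(10,0)")  # "double"
specialtypeshandle("string")         # NULL
```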
## How was this patch tested?
Manual test:
> str(collect(sql("select cast('1' as double) as x, cast('2' as decimal) as y from iris limit 5")))
'data.frame': 5 obs. of 2 variables:
$ x: num 1 1 1 1 1
$ y: num 2 2 2 2 2
R unit tests.
Author: [email protected] <[email protected]>
Closes #14613 from wangmiao1981/type.
(cherry picked from commit 0f30cde)
Signed-off-by: Felix Cheung <[email protected]>
Merged to 2.0.1 and 2.1.0. Thanks!