
Conversation

@wangmiao1981
Contributor

What changes were proposed in this pull request?


registerTempTable(createDataFrame(iris), "iris")
str(collect(sql("select cast('1' as double) as x, cast('2' as decimal) as y from iris limit 5")))

'data.frame':	5 obs. of 2 variables:
 $ x: num 1 1 1 1 1
 $ y:List of 5
  ..$ : num 2
  ..$ : num 2
  ..$ : num 2
  ..$ : num 2
  ..$ : num 2

The problem is that Spark returns the column type `decimal(10, 0)` instead of `decimal`, so `decimal(10, 0)` is not handled correctly. It should be handled as `double`.

As discussed in the JIRA thread, we have two potential fixes:
1) Scala-side fix: add a new case when writing the object back. However, spark.sql.types._ can't be used in Spark core due to dependency issues, and I couldn't find another way to do a type case match.

2) SparkR-side fix: add a helper function that checks for special types like `"decimal(10, 0)"` and replaces them with `double`, which is a PRIMITIVE type. This helper is generic, so handling for new types can be added in the future (see the sketch below).

I opened this PR to discuss the pros and cons of both approaches. If we go with the Scala-side fix, we need to find a way to match on DecimalType and StructType in Spark core.
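
For illustration, a minimal sketch of what the SparkR-side helper could look like (the name `specialtypeshandle` and the regex `"^decimal(.+)$"` come up later in this review; the exact body here is an assumption, not the merged code):

```r
# Sketch: map special backend type strings that have no direct
# PRIMITIVE_TYPES entry to an R-compatible primitive type.
# Returns the substitute type, or NULL if no special handling applies.
specialtypeshandle <- function(type) {
  returntype <- NULL
  if (grepl("^decimal(.+)$", type)) {
    # decimal(p, s) should be collected as an R double
    returntype <- "double"
  }
  returntype
}

specialtypeshandle("decimal(10,0)")  # "double"
specialtypeshandle("string")         # NULL
```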

How was this patch tested?


Manual test:

str(collect(sql("select cast('1' as double) as x, cast('2' as decimal) as y from iris limit 5")))
'data.frame':	5 obs. of 2 variables:
 $ x: num 1 1 1 1 1
 $ y: num 2 2 2 2 2

R unit tests were added as well.

@SparkQA

SparkQA commented Aug 11, 2016

Test build #63648 has finished for PR 14613 at commit e95f557.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shivaram
Contributor

@wangmiao1981 Thanks for the PR. Could we add a couple of test cases for this? That would also help me understand the expected behavior -- one could cover collect with decimals, and another str on a SparkDataFrame that contains decimals.
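
A sketch of the kind of tests being asked for, in SparkR's testthat style (the query and the exact expectations are illustrative assumptions):

```r
test_that("collect() and str() handle decimal columns correctly", {
  registerTempTable(createDataFrame(iris), "iris")
  df <- sql("select cast('2' as decimal) as y from iris limit 5")

  # collect: the decimal column should come back as a numeric vector
  localDF <- collect(df)
  expect_true(is.numeric(localDF$y))

  # str: the column should print as 'num ...', not 'List of 5'
  out <- capture.output(str(localDF))
  expect_true(any(grepl("\\$ y: num", out)))
})
```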

@wangmiao1981
Contributor Author

@shivaram Sure. I will add unit tests.

@SparkQA

SparkQA commented Aug 13, 2016

Test build #63710 has finished for PR 14613 at commit b333cda.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

Maybe this should go into types.R?
Could you also add a documentation comment explaining what this does and why?

Contributor Author

I will add comments. Thanks!

@SparkQA

SparkQA commented Aug 15, 2016

Test build #63775 has finished for PR 14613 at commit 61b7a48.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

Since this is the collect DataFrame test, perhaps you want to move these to the sql test or the createOrReplaceTempView test?

@wangmiao1981
Contributor Author

@felixcheung Let me think about your comments and I will get back soon.

@wangmiao1981
Contributor Author

@felixcheung I changed the tests according to your comments. `specialtypeshandle` returns the corresponding PRIMITIVE_TYPES key for the original backend return type. In the caller, I adjust the types to make sure they are consistent with R native types; see the sketch below.
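
Roughly, the caller-side adjustment could look like this (a sketch reusing the `specialtypeshandle` sketch above; the `types` vector is a stand-in for the backend-reported column types):

```r
# Sketch: normalize special backend type strings before mapping
# them to R native types.
types <- c("double", "decimal(10,0)", "string")
for (i in seq_along(types)) {
  alt <- specialtypeshandle(types[i])
  if (!is.null(alt)) {
    types[i] <- alt  # "decimal(10,0)" becomes "double"
  }
}
types  # "double" "double" "string"
```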

@SparkQA

SparkQA commented Aug 23, 2016

Test build #64297 has finished for PR 14613 at commit abce7ff.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

R/pkg/R/types.R Outdated
Member

Should this be `"^decimal(.+)$"`?

Member

Why do we switch on the first character? Why don't we regex on the full string?

Contributor Author

In case there are other types that we want to handle in the future. Right?

Member

I'd just simplify this at this point. It would be easy to add the additional checks later.
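
To make the comparison concrete, a sketch of the two shapes being discussed (the function names `to_r_type_switch` and `to_r_type_simple` are hypothetical):

```r
# One possible shape of the first-character switch under review
to_r_type_switch <- function(type) {
  returntype <- NULL
  if (substr(type, 1, 1) == "d") {        # cheap pre-filter on first char
    if (grepl("^decimal(.+)$", type)) {
      returntype <- "double"
    }
  }
  returntype
}

# The simplified form: regex on the full string directly
to_r_type_simple <- function(type) {
  if (grepl("^decimal(.+)$", type)) "double" else NULL
}
```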

@felixcheung
Member

@sun-rui @shivaram thoughts?

@SparkQA

SparkQA commented Sep 1, 2016

Test build #64808 has finished for PR 14613 at commit 4d6d048.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 2, 2016

Test build #64834 has finished for PR 14613 at commit 4e9e403.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Sep 2, 2016
… when collecting SparkDataFrame


Author: [email protected] <[email protected]>

Closes #14613 from wangmiao1981/type.

(cherry picked from commit 0f30cde)
Signed-off-by: Felix Cheung <[email protected]>
asfgit closed this in 0f30cde on Sep 2, 2016
@felixcheung
Member

Merged to 2.0.1 and 2.1.0. Thanks!

wangmiao1981 deleted the type branch on September 26, 2016