Closed
2 changes: 1 addition & 1 deletion R/pkg/R/column.R
@@ -56,7 +56,7 @@ operators <- list(
"&" = "and", "|" = "or", #, "!" = "unary_$bang"
"^" = "pow"
)
-column_functions1 <- c("asc", "desc", "isNull", "isNotNull")
+column_functions1 <- c("asc", "desc", "isNaN", "isNull", "isNotNull")
column_functions2 <- c("like", "rlike", "startsWith", "endsWith", "getField", "getItem", "contains")

createOperator <- function(op) {
30 changes: 23 additions & 7 deletions R/pkg/R/functions.R
@@ -488,19 +488,35 @@ setMethod("initcap",
column(jc)
})

-#' isNaN
+#' isnan
#'
-#' Return true iff the column is NaN.
+#' Return true if the column is NaN.
#'
-#' @rdname isNaN
-#' @name isNaN
+#' @rdname isnan
+#' @name isnan
#' @family normal_funcs
#' @export
-#' @examples \dontrun{isNaN(df$c)}
-setMethod("isNaN",
+#' @examples \dontrun{isnan(df$c)}
+setMethod("isnan",
Member

should this be called (or have an alias) is.nan?
https://stat.ethz.ch/R-manual/R-devel/library/base/html/is.finite.html

Contributor

yes, it makes sense to add an alias as is.nan

signature(x = "Column"),
function(x) {
-jc <- callJStatic("org.apache.spark.sql.functions", "isNaN", x@jc)
+jc <- callJStatic("org.apache.spark.sql.functions", "isnan", x@jc)
column(jc)
})

+#' isnull
+#'
+#' Return true if the column is NULL.
+#'
+#' @rdname isnull
+#' @name isnull
+#' @family normal_funcs
+#' @export
+#' @examples \dontrun{isnull(df$c)}
+setMethod("isnull",
Contributor

is.null works on the object as a whole. Would is.na be better?

> is.null(list(1, NULL))
[1] FALSE
> is.na(list(1, NA))
[1] FALSE  TRUE

Contributor Author

Indeed, is.null works on the whole object, but is.na has different semantics from is.null. Should we still support the alias?

Contributor

In the context of a DataFrame column, "null" means a missing value, which I think is what NA means in R. When we read a column from a DataFrame into R, null is converted to NA; see the code at https://github.com/apache/spark/blob/master/R/pkg/R/deserialize.R#L115
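
For readers following the NULL versus NA point above, here is a minimal base R sketch. It does not use SparkR, and the variable names are only illustrative. It shows why NA, rather than NULL, is the natural R-side counterpart of a missing value inside a column: an atomic vector cannot hold NULL as an element at all.

```r
# Base R only; no SparkR session is assumed here.

# NULL cannot be an element of an atomic vector: c() simply drops it.
v <- c(1, NULL, 3)
length(v)

# NA is a real placeholder element and is tested element-wise.
w <- c(1, NA, 3)
length(w)
is.na(w)

# NaN and NA are distinct: is.na() is TRUE for both,
# while is.nan() is TRUE only for NaN.
x <- c(1, NaN, NA)
is.na(x)
is.nan(x)
```

The last two calls are also why an `is.nan` alias and an `is.na`-style alias would not be interchangeable in R.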

Member

this gets tricky... R's usage of NULL is not the same as the JVM's. In many cases JVM null might be closer to R's NA, though not exactly the same either.

e.g. you can't set a column in a data.frame to NULL in the conventional way:
http://www.cookbook-r.com/Manipulating_data/Adding_and_removing_columns_from_a_data_frame/

Maybe it's fine to leave it as isnull for distinction - we really need some explanation of what a NULL column is.
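
A small base R sketch of the data.frame behaviour mentioned above (plain R, no SparkR; the data frame and column names are made up for illustration). Assigning NULL to a column removes it instead of storing a "null" value, whereas assigning NA creates a column of missing values.

```r
# Base R only; df, a, b and c are illustrative names.
df <- data.frame(a = 1:3, b = c("x", "y", "z"))

# Assigning NULL does not store a "null" column; it deletes column b.
df$b <- NULL
names(df)

# Assigning NA adds a column whose entries are all missing values.
df$c <- NA
is.na(df$c)
```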

Member

Our replies crossed - it seems to me that saying Scala/Python isnull == R is.na is confusing.

In any case we could definitely use some explanation of NULL/NA in the SparkR programming guide; I opened JIRA https://issues.apache.org/jira/browse/SPARK-12071

Contributor

good. It's a place where we can discuss how we map R NA or NULL to Scala null.

+signature(x = "Column"),
+function(x) {
+jc <- callJStatic("org.apache.spark.sql.functions", "isnull", x@jc)
+column(jc)
+})

12 changes: 10 additions & 2 deletions R/pkg/R/generics.R
@@ -621,6 +621,10 @@ setGeneric("getField", function(x, ...) { standardGeneric("getField") })
#' @export
setGeneric("getItem", function(x, ...) { standardGeneric("getItem") })

+#' @rdname column
+#' @export
+setGeneric("isNaN", function(x) { standardGeneric("isNaN") })
Member

if isNaN is being removed, we shouldn't set this generic, right?

Member

also, isNaN has been around since Spark 1.5 - if we are taking it out, we would need to note this breaking change in the release doc with a JIRA

Member

sorry, rereading the description. it sounds like we have
isNaN for Column
isnan for DataFrame
?
That seems a bit confusing..

Contributor Author

Yes, indeed. I have sent #10056 to push the SQL side to change to a uniform interface.

Contributor Author

The SQL side won't unify the isNaN and isnan interfaces; please refer to the comments at #10056. Furthermore, isnull has different semantics in Scala and R. So let's narrow this PR to providing the same functions in SparkR as in Scala, which was my original motivation. I will add test cases for Column as @sun-rui suggested. Then I think this PR can be merged first, and we can start to discuss the corresponding aliases or explanations of the NULL/NA difference at SPARK-12071. @sun-rui @felixcheung @shivaram

Contributor

I don't understand the resolution here. Is isNull still exported after this PR? How is @felixcheung's example going to change after this PR?

Contributor Author

isNull and isNotNull are still exported after this PR because they have existed since Spark 1.5. isNull and isNotNull (with upper case) are functions of Column; I consider them Spark-specific functions, so I did not remove them. If we want to remove them, I can do it in a follow-up PR; that would be a breaking change that needs to be explained in the release notes. @shivaram

Contributor

Ok, I think I get it. Let me summarize the situation below; let me know if I am getting it right.

  1. We have isNaN, isNull, isNotNull for Column as defined in column.R. These mirror the Scala functions.
  2. We have added isnan and is.nan for Column in this PR. These call isnan in Scala. And I presume their behavior is the same as isNaN?
  3. In addition to this, we have some DataFrame operators called isNaN? I can't find that call in our unit test file, so I guess it doesn't exist in SparkR? Does this exist in Scala?
  4. We convert NA in R to null on the SparkSQL side.

I think the change looks fine to me, but I just want to understand the different things going on here.

Member

I believe that's correct. We could scope this PR to only isnan on Column, and track the others with JIRAs.

Contributor

@shivaram, to be clear, isNaN, isnan and is.nan are functions (in org.apache.spark.sql.functions) applied to Column; they are not DataFrame operators. The isNaN function is deprecated in favor of the isnan function. We add is.nan as an alias of isnan.
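
As a rough illustration of the aliasing idea described above, here is a self-contained sketch using a toy S4 class. `ToyColumn` and its `values` slot are hypothetical stand-ins for SparkR's `Column`, and the method bodies are assumptions for illustration only; the real methods call into the JVM through `callJStatic`.

```r
library(methods)

# Hypothetical stand-in for SparkR's Column; it just wraps a numeric vector.
setClass("ToyColumn", representation(values = "numeric"))

# A generic named like the one added in this PR.
setGeneric("isnan", function(x) { standardGeneric("isnan") })

# Toy implementation: the real method delegates to the Scala isnan function.
setMethod("isnan", signature(x = "ToyColumn"),
          function(x) { is.nan(x@values) })

# The alias: an S4 method on base R's is.nan that forwards to isnan.
setMethod("is.nan", signature(x = "ToyColumn"),
          function(x) { isnan(x) })

col <- new("ToyColumn", values = c(1, NaN, NA))
is.nan(col)
```

Because `is.nan` is an internal generic in base R, adding a method for a new class leaves its behaviour on ordinary vectors unchanged.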


#' @rdname column
#' @export
setGeneric("isNull", function(x) { standardGeneric("isNull") })
@@ -796,9 +800,13 @@ setGeneric("initcap", function(x) { standardGeneric("initcap") })
#' @export
setGeneric("instr", function(y, x) { standardGeneric("instr") })

-#' @rdname isNaN
+#' @rdname isnan
#' @export
-setGeneric("isNaN", function(x) { standardGeneric("isNaN") })
+setGeneric("isnan", function(x) { standardGeneric("isnan") })

+#' @rdname isnull
+#' @export
+setGeneric("isnull", function(x) { standardGeneric("isnull") })

#' @rdname kurtosis
#' @export
2 changes: 1 addition & 1 deletion R/pkg/inst/tests/test_sparkSQL.R
@@ -878,7 +878,7 @@ test_that("column functions", {
c2 <- avg(c) + base64(c) + bin(c) + bitwiseNOT(c) + cbrt(c) + ceil(c) + cos(c)
c3 <- cosh(c) + count(c) + crc32(c) + exp(c)
c4 <- explode(c) + expm1(c) + factorial(c) + first(c) + floor(c) + hex(c)
-c5 <- hour(c) + initcap(c) + isNaN(c) + last(c) + last_day(c) + length(c)
+c5 <- hour(c) + initcap(c) + isnan(c) + isnull(c) + last(c) + last_day(c) + length(c)
Contributor

please add test cases for "isNaN", "isNull", "isNotNull" for Column

c6 <- log(c) + (c) + log1p(c) + log2(c) + lower(c) + ltrim(c) + max(c) + md5(c)
c7 <- mean(c) + min(c) + month(c) + negate(c) + quarter(c)
c8 <- reverse(c) + rint(c) + round(c) + rtrim(c) + sha1(c)