Skip to content

Conversation

@wangmiao1981
Copy link
Contributor

What changes were proposed in this pull request?

SPARK-20307 Added handleInvalid option to RFormula for tree-based classification algorithms. We should add this parameter for other classification algorithms in SparkR.

This is a followup PR for SPARK-20307.

How was this patch tested?

New Unit tests are added.

@SparkQA
Copy link

SparkQA commented Jul 12, 2017

Test build #79546 has finished for PR 18605 at commit 77b04a3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangmiao1981
Copy link
Contributor Author

@felixcheung This is a follow-up PR of JIRA-20307.

@wangmiao1981
Copy link
Contributor Author

Trigger windows check.

@wangmiao1981
Copy link
Contributor Author

Reopen for windows check

@wangmiao1981 wangmiao1981 reopened this Jul 12, 2017
Copy link
Member

@felixcheung felixcheung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you update this to make it consistent with the earlier PR? I think it's mostly the param document wording

@wangmiao1981
Copy link
Contributor Author

Sure. I am reading the #18613 comments. Just come back from a business travel. Thanks!

@SparkQA
Copy link

SparkQA commented Jul 17, 2017

Test build #79678 has finished for PR 18605 at commit e59941e.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangmiao1981
Copy link
Contributor Author

wangmiao1981 commented Jul 17, 2017

@yanboliang after #18613, unit tests fails if "skip" is used.

For example,
data <- data.frame(clicked = base::sample(c(0, 1), 10, replace = TRUE),
someString = base::sample(c("this", "that"), 10, replace = TRUE),
stringsAsFactors = FALSE)
trainidxs <- base::sample(nrow(data), nrow(data) * 0.7)
traindf <- as.DataFrame(data[trainidxs, ])
testdf <- as.DataFrame(rbind(data[-trainidxs, ], c(0, "the other")))
model <- spark.mlp(traindf, clicked ~ ., layers = c(1, 3), handleInvalid = "skip")
predictions <- predict(model, testdf)
expect_equal(class(collect(predictions)$clicked[1]), "character")

It fails the as if "error" is used.

If I change "skip" to "keep", then the predictions$click[0] is NULL.

collect(predictions)
[1] clicked someString prediction
<0 rows> (or 0-length row.names)
collect(predictions)$click[1]
[[1]]
NULL

I am not sure whether this is expected or there is a bug.

Before, the units work fine.

@yanboliang
Copy link
Contributor

@wangmiao1981 This is expected, see my comment here . This uncovers an existing bug for forceIndexLabel. I will send a fix later, please use keep to test handleInvalid currently. Thanks.

@wangmiao1981
Copy link
Contributor Author

@yanboliang Thanks for your reply! I will change the unit tests now.

@SparkQA
Copy link

SparkQA commented Jul 18, 2017

Test build #79728 has finished for PR 18605 at commit 3d7c517.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 19, 2017

Test build #79734 has finished for PR 18605 at commit 3ebb5cd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangmiao1981
Copy link
Contributor Author

@yanboliang I have made changes accordingly. Thanks!

@wangmiao1981
Copy link
Contributor Author

@felixcheung Can you take a look? Thanks!

@felixcheung
Copy link
Member

I'll take a look

Copy link
Member

@felixcheung felixcheung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry about the delay, one comment otherwise LG

#' "error" (throw an error), "keep" (put invalid data in a special additional
#' bucket, at index numLabels). Default is "error".
#' @param handleInvalid How to handle invalid data (unseen labels or NULL values) in features and label
#' column of string type.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was only for classification though, so the original text in classification model. we should keep.
ditto for decisionTree and gbt in this .R file

@SparkQA
Copy link

SparkQA commented Jul 31, 2017

Test build #80081 has finished for PR 18605 at commit 18c69d2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Copy link
Member

merged to master

@asfgit asfgit closed this in 9570e81 Aug 1, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants