Skip to content

Conversation

@yanboliang
Copy link
Contributor

What changes were proposed in this pull request?

RFormula should handle invalid for both features and label column.
#18496 only handle invalid values in features column. This PR add handling invalid values for label column and test cases.

How was this patch tested?

Add test cases.

@yanboliang
Copy link
Contributor Author

cc @felixcheung @wangmiao1981

@SparkQA
Copy link

SparkQA commented Jul 12, 2017

Test build #79561 has finished for PR 18613 at commit 2df7e35.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

model <- spark.randomForest(traindf, clicked ~ ., type = "classification",
maxDepth = 10, maxBins = 10, numTrees = 10,
handleInvalid = "skip")
handleInvalid = "keep")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because of R always forceIndexLabel which will index label whether it is numeric or string type, this leads to 0.0 and 0 in R label are different. If we choose skip, it will make all labels unseen. I think this is a bug, maybe we should fix it in a separate PR.

@felixcheung
Copy link
Member

in #18496 we discuss the behavior of the output prediction (#18496 (comment)), similar in #18613 (comment), I'd suggest we step back and review how handleInvalid should work in Scala first.

I think we can still make progress in this PR and #18605, but likely need some changes in Scala.

@SparkQA
Copy link

SparkQA commented Jul 12, 2017

Test build #79566 has finished for PR 18613 at commit 9132640.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangmiao1981
Copy link
Contributor

@felixcheung I agree. We should make changes in Scala side.

@yanboliang
Copy link
Contributor Author

@felixcheung @wangmiao1981 In Scala, we set handleInvalid for both estimator and model, although it only takes effect for model prediction. The reason behind this is we should support pipeline training and transform, so we need to support set model param during estimator fitting.
For R side, why I propose to set handleInvalid for predict is there is no pipeline concept for R and in other native R algorithms like glm, we set handleInvalid for predict. I'm open to hear you thoughts.
BTW, this PR is not related to the above discussion, let move corresponding discussion to the other PR. This PR fix the bug to make label column also can handle invalid according users' setting. Would you mind to have a look at this one? Thanks.

assert(result1.collect() === expected1.collect())
assert(result2.collect() === expected2.collect())

// Handle unseen labels.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The following test cases is failed before this PR.

@felixcheung
Copy link
Member

felixcheung commented Jul 13, 2017

isn't it confusing to silently drop features?
also one might want different policy on how to handle invalid with features vs. label?

@yanboliang
Copy link
Contributor Author

@felixcheung We don't silently drop features, we use handleInvalid to let users decide how to handle invalid features or label. The behavior is consistent with Scala which supports to handle invalid features and label. BTW, this PR add support to handle invalid values for label, the original PR by @wangmiao1981 has already supported to handle invalid features.
With regards to different policy, this may should be discussed. We can provide two options for features and label respectively, but I think handling invalid values for features and label with the same policy may be the most common use case, and I don't want to make code much complicated before we can see users' strong requirements. And we can split it into two params to control features and label later when it's really necessary. However, I don't have strong opinions about this, and looking forward to hear your thoughts. Thanks.

@felixcheung
Copy link
Member

@yanboliang that's what I mean. to elaborate,
if handleInvalid = "skip", and with this applying to features, then that feature will just be ignored silently?

I get that part on #18496 - I asked actually https://github.com/apache/spark/pull/18496/files#r125154606 - thought it was confusing.

ok, I agree with your assessment on starting with the same policy.

@yanboliang
Copy link
Contributor Author

Merged into master. Thanks for all reviewing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants