Closed
@@ -18,7 +18,7 @@
package org.apache.spark.ml.evaluation

import org.apache.spark.annotation.{Experimental, Since}
-import org.apache.spark.ml.param.{Param, ParamMap, ParamValidators}
+import org.apache.spark.ml.param.{BooleanParam, Param, ParamMap, ParamValidators}
import org.apache.spark.ml.param.shared.{HasLabelCol, HasPredictionCol}
import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable}
import org.apache.spark.mllib.evaluation.RegressionMetrics
@@ -69,7 +69,27 @@ final class RegressionEvaluator @Since("1.4.0") (@Since("1.4.0") override val ui
@Since("1.4.0")
def setLabelCol(value: String): this.type = set(labelCol, value)

-setDefault(metricName -> "rmse")
/**
* Param for whether to drop rows where 'predictionCol' is NaN. NOTE - only set this to
* true if you are certain that NaN predictions should be ignored!
* (default: false)
*
* @group expertParam
*/
@Since("2.0.0")
val dropNaN: BooleanParam = new BooleanParam(this, "dropNaN",
Contributor

Making this a Boolean parameter called dropNaN makes it less extensible if we later want to support more than one NaN-handling behavior. If we don't envision adding any other behavior, then this is fine; otherwise we could make it a String param and limit its options to dropping or raising an error for now.
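
To make the suggestion concrete, here is a minimal plain-Python sketch of a string-valued strategy with two options. It is not Spark code: the function name `apply_nan_strategy` and the option names `"drop"` and `"error"` are hypothetical, chosen only to illustrate how a string option leaves room for further behaviors later.

```python
import math

# Hypothetical option names; the PR itself only defines a Boolean dropNaN.
SUPPORTED_STRATEGIES = ("drop", "error")


def apply_nan_strategy(pairs, strategy="error"):
    """Filter or validate (prediction, label) pairs according to `strategy`.

    "drop"  -> remove pairs whose prediction is NaN
    "error" -> raise ValueError if any prediction is NaN
    """
    if strategy not in SUPPORTED_STRATEGIES:
        raise ValueError(
            "Unknown strategy %r, expected one of %s"
            % (strategy, ", ".join(SUPPORTED_STRATEGIES)))
    if strategy == "drop":
        return [(p, l) for p, l in pairs if not math.isnan(p)]
    # strategy == "error"
    if any(math.isnan(p) for p, _ in pairs):
        raise ValueError("NaN found in predictions")
    return list(pairs)
```

Adding a third behavior later would then only mean extending `SUPPORTED_STRATEGIES` and adding a branch, without renaming or deprecating a parameter.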

Contributor Author

Let me see how nulls are handled given nullable input columns, and perhaps the possible strategies can be adjusted.

Contributor Author

Currently, if nulls are present, RegressionEvaluator throws a MatchError. I think we should either (a) disallow nullable columns explicitly with a schema check in evaluate, which can then also provide a more understandable error message; or (b) allow nulls, but ignore them for both the prediction and label columns.

I think nulls in the input for this case are unlikely and probably a result of bad data or user error somewhere along the line. So I'd prefer option (a). This then means the dropNaN setting will only apply to NaNs.
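
Option (a) can be illustrated with a small plain-Python analogue. This is only a sketch under stated assumptions: `Field` and `require_non_nullable` are hypothetical stand-ins for a schema entry and a schema check, not Spark APIs, and the error wording is invented for illustration.

```python
from typing import NamedTuple


class Field(NamedTuple):
    """Hypothetical stand-in for one schema entry: a column name and its nullability."""
    name: str
    nullable: bool


def require_non_nullable(schema, required_columns):
    """Raise a descriptive ValueError if a required column is missing or nullable."""
    fields_by_name = {field.name: field for field in schema}
    for column in required_columns:
        field = fields_by_name.get(column)
        if field is None:
            raise ValueError("Column '%s' does not exist" % column)
        if field.nullable:
            raise ValueError(
                "Column '%s' is nullable; null values are not supported" % column)
```

The point of failing at the schema level is that the user sees which column is at fault before any rows are processed, instead of an opaque MatchError during evaluation.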

"whether to drop rows where 'predictionCol' is NaN. NOTE - only set this to true if you are " +
"certain that NaN predictions should be ignored! (default: false)")

/** @group expertGetParam */
@Since("2.0.0")
def getDropNaN: Boolean = $(dropNaN)

/** @group expertSetParam */
@Since("2.0.0")
def setDropNaN(value: Boolean): this.type = set(dropNaN, value)

+setDefault(metricName -> "rmse", dropNaN -> false)

@Since("2.0.0")
override def evaluate(dataset: Dataset[_]): Double = {
@@ -86,8 +106,9 @@ final class RegressionEvaluator @Since("1.4.0") (@Since("1.4.0") override val ui

val predictionAndLabels = dataset
.select(col($(predictionCol)).cast(DoubleType), col($(labelCol)).cast(DoubleType))
-.rdd.
-map { case Row(prediction: Double, label: Double) =>
+.na.drop("any", if ($(dropNaN)) Seq($(predictionCol)) else Seq())
Contributor @sethah, Apr 21, 2016

This also drops null values. I'm not sure how likely this is to happen, but the documentation should probably note it drops NaN and null values. Also, should we add a test case to verify that null values are ignored?

Contributor Author

Good point, I will add null to the test cases. I don't think it's likely in practice. But if nulls do exist in the dataset, they are worse than NaN from a correctness point of view: either an NPE will be thrown, or the row will be treated as contributing 0 squared error while still counting toward the denominator of the mean. So MSE will be biased low.
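
The bias described here can be checked with a few lines of arithmetic. The sketch below is plain Python with made-up numbers (three valid squared errors and one null row); it only illustrates the effect of a row that adds 0 to the numerator but 1 to the denominator.

```python
# Hypothetical squared errors for three rows with valid predictions.
valid_squared_errors = [1.0, 9.0, 4.0]
# Hypothetical number of rows whose null ends up counted as 0 squared error.
null_row_count = 1

# MSE over the valid rows only.
mse_valid_only = sum(valid_squared_errors) / len(valid_squared_errors)

# MSE when each null row adds 0 to the sum but still increments the row count.
mse_null_as_zero = (sum(valid_squared_errors) + 0.0 * null_row_count) / (
    len(valid_squared_errors) + null_row_count)

print(mse_valid_only, mse_null_as_zero)
```

With these numbers the mean drops from 14/3 to 14/4, so the metric looks better than the model actually is.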

+.rdd
+.map { case Row(prediction: Double, label: Double) =>
(prediction, label)
}
val metrics = new RegressionMetrics(predictionAndLabels)
@@ -76,6 +76,19 @@ class RegressionEvaluatorSuite
assert(evaluator.evaluate(predictions) ~== 0.08399089 absTol 0.01)
}

test("support dropping NaNs from prediction column") {
val local = this.sqlContext
import local.implicits._
val dataset = Seq(
(5.0, 4.0), (1.0, 4.0), (2.0, Double.NaN), (3.0, 1.0), (4.0, Double.NaN)
).toDF("label", "prediction")

val evaluator = new RegressionEvaluator()
assert(evaluator.evaluate(dataset).isNaN)
evaluator.setDropNaN(true)
assert(evaluator.evaluate(dataset) ~== 2.16024 absTol 1e-2)
}
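
The expected value 2.16024 in this test can be reproduced by hand. The following plain-Python sketch (no Spark involved; the helper name `rmse` is ours) recomputes RMSE for the same five (label, prediction) rows, first with the NaN predictions left in and then with them dropped.

```python
import math

# (label, prediction) rows copied from the Scala test.
rows = [(5.0, 4.0), (1.0, 4.0), (2.0, float("nan")), (3.0, 1.0), (4.0, float("nan"))]


def rmse(pairs):
    """Root mean squared error over (label, prediction) pairs."""
    squared_errors = [(label - prediction) ** 2 for label, prediction in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# With NaN predictions left in, the NaN propagates through the sum.
rmse_all = rmse(rows)

# Dropping rows with a NaN prediction leaves (5, 4), (1, 4) and (3, 1),
# whose squared errors are 1, 9 and 4.
kept = [(label, prediction) for label, prediction in rows
        if not math.isnan(prediction)]
rmse_kept = rmse(kept)

print(rmse_all, rmse_kept)
```

The three remaining squared errors sum to 14, so the result is sqrt(14 / 3), which agrees with 2.16024 within the test's tolerance.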

test("read/write") {
val evaluator = new RegressionEvaluator()
.setPredictionCol("myPrediction")
38 changes: 33 additions & 5 deletions python/pyspark/ml/evaluation.py
@@ -187,6 +187,13 @@ class RegressionEvaluator(JavaEvaluator, HasLabelCol, HasPredictionCol):
0.993...
>>> evaluator.evaluate(dataset, {evaluator.metricName: "mae"})
2.649...
>>> scoreAndLabels = [(4.0, 5.0), (4.0, 1.0), (float('nan'), 2.0),
... (1.0, 3.0), (float('nan'), 4.0)]
>>> dataset = sqlContext.createDataFrame(scoreAndLabels, ["raw", "label"])
...
>>> evaluator = RegressionEvaluator(predictionCol="raw").setDropNaN(True)
>>> evaluator.evaluate(dataset)
Contributor @sethah, Apr 21, 2016

This doesn't quite mirror the Scala test, since the Scala test first checks that the result is NaN when dropNaN is false. Shall we do the same check here?

Contributor Author

Will do.

2.160...

.. versionadded:: 1.4.0
"""
@@ -197,18 +204,24 @@ class RegressionEvaluator(JavaEvaluator, HasLabelCol, HasPredictionCol):
"metric name in evaluation (mse|rmse|r2|mae)",
typeConverter=TypeConverters.toString)

dropNaN = Param(Params._dummy(), "dropNaN",
"whether to drop rows where 'predictionCol' is NaN. NOTE - only set this to " +
"True if you are certain that NaN predictions should be ignored! " +
"(default: False)",
typeConverter=TypeConverters.toBoolean)
Contributor

nit: move typeConverter up a line.


@keyword_only
def __init__(self, predictionCol="prediction", labelCol="label",
-metricName="rmse"):
+metricName="rmse", dropNaN=False):
"""
__init__(self, predictionCol="prediction", labelCol="label", \
-metricName="rmse")
+metricName="rmse", dropNaN=False)
"""
super(RegressionEvaluator, self).__init__()
self._java_obj = self._new_java_obj(
"org.apache.spark.ml.evaluation.RegressionEvaluator", self.uid)
self._setDefault(predictionCol="prediction", labelCol="label",
-metricName="rmse")
+metricName="rmse", dropNaN=False)
kwargs = self.__init__._input_kwargs
self._set(**kwargs)

@@ -227,13 +240,28 @@ def getMetricName(self):
"""
return self.getOrDefault(self.metricName)

@since("2.0.0")
def setDropNaN(self, value):
"""
Sets the value of :py:attr:`dropNaN`.
"""
self._set(dropNaN=value)
return self

@since("2.0.0")
def getDropNaN(self):
"""
Gets the value of dropNaN or its default value.
"""
return self.getOrDefault(self.dropNaN)

@keyword_only
@since("1.4.0")
def setParams(self, predictionCol="prediction", labelCol="label",
-metricName="rmse"):
+metricName="rmse", dropNaN=False):
"""
setParams(self, predictionCol="prediction", labelCol="label", \
-metricName="rmse")
+metricName="rmse", dropNaN=False)
Sets params for regression evaluator.
"""
kwargs = self.setParams._input_kwargs