-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-19781][ML] Handle NULLs as well as NaNs in Bucketizer when handleInvalid is on #17123
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Closed
Closed
Changes from 16 commits
Commits
Show all changes
17 commits
Select commit
Hold shift + click to select a range
2b07514
add support for null values in Bucketizer
crackcell b3f98b6
fix a typo
crackcell b452cc2
fix code style errors
crackcell ced8984
add unit tests
crackcell 12a0f7f
simplify code
cb6c338
use Option[Double] to work with NULL values
b017774
merge
83e2045
use java.lang.Double instead of Row to reduce serilization cost
2efbebd
simplify changes
bdbb3ee
merge latest master
crackcell 138c7f4
fix compile errors
crackcell 6f22b6e
all test cases passed
crackcell 3e023e7
update migration guides
crackcell c0800d6
remove useless code
crackcell b954526
simplify code
crackcell db2eb8e
change doc location
crackcell 063327c
up
crackcell File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -53,7 +53,8 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String | |
| * Values at -inf, inf must be explicitly provided to cover all Double values; | ||
| * otherwise, values outside the splits specified will be treated as errors. | ||
| * | ||
| * See also [[handleInvalid]], which can optionally create an additional bucket for NaN values. | ||
| * See also [[handleInvalid]], which can optionally create an additional bucket for NaN/NULL | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This sounds like a behavior change, we should add an item in migration guide of ML docs.
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @viirya done. |
||
| * values. | ||
| * | ||
| * @group param | ||
| */ | ||
|
|
@@ -163,7 +164,7 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String | |
|
|
||
| val (filteredDataset, keepInvalid) = { | ||
| if (getHandleInvalid == Bucketizer.SKIP_INVALID) { | ||
| // "skip" NaN option is set, will filter out NaN values in the dataset | ||
| // "skip" NaN/NULL option is set, will filter out NaN/NULL values in the dataset | ||
| (dataset.na.drop(inputColumns).toDF(), false) | ||
| } else { | ||
| (dataset.toDF(), getHandleInvalid == Bucketizer.KEEP_INVALID) | ||
|
|
@@ -177,8 +178,10 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String | |
| } | ||
|
|
||
| val bucketizers: Seq[UserDefinedFunction] = seqOfSplits.zipWithIndex.map { case (splits, idx) => | ||
| udf { (feature: Double) => | ||
| Bucketizer.binarySearchForBuckets(splits, feature, keepInvalid) | ||
| udf { (feature: java.lang.Double) => | ||
| Bucketizer.binarySearchForBuckets(splits, | ||
| if (feature == null) None else Option(feature), | ||
| keepInvalid) | ||
| }.withName(s"bucketizer_$idx") | ||
| } | ||
|
|
||
|
|
@@ -259,34 +262,34 @@ object Bucketizer extends DefaultParamsReadable[Bucketizer] { | |
| * Binary searching in several buckets to place each data point. | ||
| * @param splits array of split points | ||
| * @param feature data point | ||
| * @param keepInvalid NaN flag. | ||
| * Set "true" to make an extra bucket for NaN values; | ||
| * Set "false" to report an error for NaN values | ||
| * @param keepInvalid NaN/NULL flag. | ||
| * Set "true" to make an extra bucket for NaN/NULL values; | ||
| * Set "false" to report an error for NaN/NULL values | ||
| * @return bucket for each data point | ||
| * @throws SparkException if a feature is < splits.head or > splits.last | ||
| */ | ||
|
|
||
| private[feature] def binarySearchForBuckets( | ||
| splits: Array[Double], | ||
| feature: Double, | ||
| feature: Option[Double], | ||
| keepInvalid: Boolean): Double = { | ||
| if (feature.isNaN) { | ||
| if (feature.isEmpty || feature.get.isNaN) { | ||
| if (keepInvalid) { | ||
| splits.length - 1 | ||
| } else { | ||
| throw new SparkException("Bucketizer encountered NaN value. To handle or skip NaNs," + | ||
| " try setting Bucketizer.handleInvalid.") | ||
| throw new SparkException("Bucketizer encountered NaN/NULL values. " + | ||
| "To handle or skip NaNs/NULLs, try setting Bucketizer.handleInvalid.") | ||
| } | ||
| } else if (feature == splits.last) { | ||
| } else if (feature.get == splits.last) { | ||
| splits.length - 2 | ||
| } else { | ||
| val idx = ju.Arrays.binarySearch(splits, feature) | ||
| val idx = ju.Arrays.binarySearch(splits, feature.get) | ||
| if (idx >= 0) { | ||
| idx | ||
| } else { | ||
| val insertPos = -idx - 1 | ||
| if (insertPos == 0 || insertPos == splits.length) { | ||
| throw new SparkException(s"Feature value $feature out of Bucketizer bounds" + | ||
| throw new SparkException(s"Feature value ${feature.get} out of Bucketizer bounds" + | ||
| s" [${splits.head}, ${splits.last}]. Check your features, or loosen " + | ||
| s"the lower/upper bound constraints.") | ||
| } else { | ||
|
|
||
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, I think for skip,
dataset.na.dropdrops NULL before. We didn't change its behavior.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, you are right. :-p