@@ -103,15 +103,22 @@ final class QuantileDiscretizer(override val uid: String)

@Since("1.6.0")
object QuantileDiscretizer extends DefaultParamsReadable[QuantileDiscretizer] with Logging {

/**
* Minimum number of samples required for finding splits, regardless of number of bins. If
* the dataset has less rows than this value, the entire dataset column will be used.
*/
val minSamplesRequired: Int = 10000
Author:

The test will require creating a dataset at least as large as minSamplesRequired, so making its value excessively large could slow down testing.

Member:

less rows -> fewer rows
entire dataset column -> entire dataset?

10000 just isn't that big. A dummy data set in a test would be, what, a megabyte in memory? Am I missing a bigger problem there?

Author:

I don't think you're missing anything. It's my first time contributing, and I just wanted to be explicit for reviewers of the patch. I agree that 10K isn't that big, especially "in the wild". However, I wasn't sure whether there are standards for test time/memory consumption, so I added the line note to make more experienced reviewers aware, in case there are established or de facto testing standards.

I'll make the "->" changes you've indicated and push a new commit sometime today. Thanks.

Author:

Also, my original reason for asking about removing the hard-coded value of 10K was that the value was partly the cause of the bug, so a regression test needs to know it.

I could have hard-coded 10K into my test. However, if a developer later increased the value, say to 100K, without also increasing the hard-coded test value, the test would be rendered useless since it would always pass.

Member:

Yeah we have muuuch bigger tests that turn up whole small clusters. Making a little data set is fine. You can expose the 10000 figure as a private[spark] val in the object so you can reference it from test code.
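The `private[spark]` suggestion above can be sketched as follows. This is a hypothetical, self-contained illustration of the visibility pattern, not the merged code: the object `Constants` and helper `SuiteHelper` are invented names standing in for `QuantileDiscretizer` and its test suite.

```scala
// Sketch: a tuning constant exposed with package-qualified visibility so
// test code under the same org.apache.spark hierarchy can reference it,
// without making it part of the public API. (Names are hypothetical.)
package org.apache.spark.ml.feature {
  object Constants {
    // Visible anywhere under org.apache.spark, private outside it.
    private[spark] val minSamplesRequired: Int = 10000
  }
}

package org.apache.spark.ml.test {
  // Same top-level spark package, so the constant is visible here.
  // A test can size its data relative to the constant, and the test
  // stays meaningful if the threshold is ever raised.
  object SuiteHelper {
    def datasetSize: Int =
      org.apache.spark.ml.feature.Constants.minSamplesRequired + 1
  }
}

object VisibilitySketch {
  def main(args: Array[String]): Unit = {
    println(org.apache.spark.ml.test.SuiteHelper.datasetSize)  // 10001
  }
}
```

Because the test derives its dataset size from the constant rather than re-hard-coding 10000, a later change to the threshold is automatically picked up, addressing the concern raised above.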


/**
* Sampling from the given dataset to collect quantile statistics.
*/
private[feature] def getSampledInput(dataset: DataFrame, numBins: Int, seed: Long): Array[Row] = {
val totalSamples = dataset.count()
require(totalSamples > 0,
"QuantileDiscretizer requires non-empty input dataset but was given an empty input.")
-    val requiredSamples = math.max(numBins * numBins, 10000)
-    val fraction = math.min(requiredSamples / dataset.count(), 1.0)
+    val requiredSamples = math.max(numBins * numBins, minSamplesRequired)
+    val fraction = math.min(requiredSamples.toDouble / dataset.count(), 1.0)
dataset.sample(withReplacement = false, fraction, new XORShiftRandom(seed).nextInt()).collect()
}
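The `.toDouble` in the fix above is the crux of the bug: `requiredSamples` is an `Int` and `dataset.count()` returns a `Long`, so the original expression performed integer division, which truncates to zero whenever the dataset is larger than `requiredSamples`, leaving a sampling fraction of 0. A minimal standalone sketch of the pitfall (no Spark required; the variable names mirror `getSampledInput`):

```scala
object FractionBugSketch {
  def main(args: Array[String]): Unit = {
    val requiredSamples: Int = 10000   // math.max(numBins * numBins, minSamplesRequired)
    val totalSamples: Long = 10001L    // dataset.count() on a "relatively large" dataset

    // Buggy: Int / Long widens to Long division, which truncates toward
    // zero, so any dataset larger than requiredSamples yields fraction 0
    // and sample() returns (almost) nothing.
    val buggyFraction: Long = math.min(requiredSamples / totalSamples, 1L)

    // Fixed: converting the numerator first keeps the division in Double.
    val fixedFraction: Double =
      math.min(requiredSamples.toDouble / totalSamples, 1.0)

    println(buggyFraction)  // 0
    println(fixedFraction)  // ≈ 0.9999
  }
}
```

When the dataset has fewer rows than `requiredSamples`, the ratio exceeds 1 and both versions clamp to 1.0, which is why the bug only surfaced on datasets larger than the 10000 threshold, exactly what the new test below exercises.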

@@ -71,6 +71,26 @@ class QuantileDiscretizerSuite
}
}

test("Test splits on relatively large dataset") {
val sqlCtx = SQLContext.getOrCreate(sc)
import sqlCtx.implicits._

val datasetSize = QuantileDiscretizer.minSamplesRequired + 1
val numBuckets = 5
val df = sc.parallelize((1.0 to datasetSize by 1.0).map(Tuple1.apply)).toDF("input")
val discretizer = new QuantileDiscretizer()
.setInputCol("input")
.setOutputCol("result")
.setNumBuckets(numBuckets)
.setSeed(1)
Author:

This is the offending line


val result = discretizer.fit(df).transform(df)
val observedNumBuckets = result.select("result").distinct.count

assert(observedNumBuckets === numBuckets,
"Observed number of buckets does not equal expected number of buckets.")
}

test("read/write") {
val t = new QuantileDiscretizer()
.setInputCol("myInputCol")