Skip to content
Closed
Show file tree
Hide file tree
Changes from 6 commits
Commits
Show all changes
27 commits
Select commit Hold shift + click to select a range
2adebe8
add a chiSquare Selector based on False Positive Rate (FPR) test
Aug 10, 2016
04053ca
Merge remote-tracking branch 'origin/master' into fprChiSquare
Aug 11, 2016
7623563
Configure the ChiSqSelector to reuse ChiSqTestResult by numTopFeature…
Aug 16, 2016
3d6aecb
Config the ChiSqSelector to reuse the ChiSqTestResult by KBest, Perce…
Aug 17, 2016
026ac85
Merge branch 'master' into fprChiSquare2
Aug 17, 2016
5305709
add Since annotation
Aug 17, 2016
1e8d83a
Not reuse the ChiSqTestResult to be consistent with other methods
Aug 22, 2016
85a17dd
fix Percentile bugs, optimize the code
Aug 22, 2016
61b71c8
change the default value of Percentile
Aug 22, 2016
d7b2892
Add require for setAlpha value
Aug 23, 2016
6699396
rm isSorted function, change gtEq(0) to inRange(0,1) for percentile a…
Aug 23, 2016
b8986b5
Optimize fit function of ml ChiSqSelector
Aug 23, 2016
5c2e44c
Fpr to FPR, sort all cases in fit
Aug 24, 2016
0d3967a
Add Python API for ChiSqSelector
Aug 29, 2016
1dc6a8e
split the ChiSqSelector param to numTopFeateres, Percentile, Alpha in…
Sep 5, 2016
9908871
Add type check for Python ChiSqSelector
Sep 5, 2016
bbccac7
Change the exception type of value check
Sep 6, 2016
c35bcf1
change python code style
Sep 13, 2016
e8f03ed
change python code style
Sep 13, 2016
ec74dde
revert isSort to pass MiMa test
Sep 13, 2016
6398f4c
Change MimaExcludes
Sep 14, 2016
6cc4c92
Change Mima conflict
Sep 14, 2016
1d2f67f
add javadoc
Sep 18, 2016
6220dd5
fix mima conflict
Sep 18, 2016
ce3f8fb
Merge remote-tracking branch 'origin/master' into fprChiSquare
Sep 19, 2016
88d2143
change javadoc
Sep 19, 2016
24f26f2
fix mima conflict
Sep 20, 2016
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -54,6 +54,29 @@ private[feature] trait ChiSqSelectorParams extends Params

/** @group getParam */
def getNumTopFeatures: Int = $(numTopFeatures)

final val percentile = new DoubleParam(this, "percentile",
"Percentile of features that selector will select, ordered by statistics value descending.",
ParamValidators.gtEq(0))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use .inRange(0, 1) here? it needs to be <= 1 too.

setDefault(percentile -> 10)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand this means 10%, but I would expect it to be specified as 0.1. I wonder how it's done elsewhere? I think we've not used percents. At least we also want to validate that the value is <= 100% (or 1)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will update, thanks

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we still need to change this, because other functions like the SQL percentile function accept values in [0,1] instead of [0,100]

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I will update it. Thanks.


/** @group getParam */
def getPercentile: Double = $(percentile)

final val alpha = new DoubleParam(this, "alpha",
"The highest p-value for features to be kept.",
ParamValidators.gtEq(0))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same, only makes sense for this to be in [0,1]

setDefault(alpha -> 0.05)

/** @group getParam */
def getAlpha: Double = $(alpha)

final val selectorType = new Param[String](this, "selectorType",
"ChiSqSelector Type: KBest, Percentile, Fpr")
setDefault(selectorType -> "KBest")

/** @group getParam */
def getChiSqSelectorType: String = $(selectorType)
}

/**
Expand All @@ -67,9 +90,27 @@ final class ChiSqSelector @Since("1.6.0") (@Since("1.6.0") override val uid: Str
@Since("1.6.0")
def this() = this(Identifiable.randomUID("chiSqSelector"))

@Since("2.1.0")
var chiSqSelector: feature.ChiSqSelector = null

/** @group setParam */
@Since("1.6.0")
def setNumTopFeatures(value: Int): this.type = set(numTopFeatures, value)
@Since("2.1.0")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I just noticed this: it should still say "since 1.6.0"

def setNumTopFeatures(value: Int): this.type = {
set(selectorType, "KBest")
Copy link
Member

@srowen srowen Aug 18, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally, let's make these values like "KBest" some fields in a private object Edit: oh, this was done below. They can be used here? and ideally they are hidden as an implementation detail.

set(numTopFeatures, value)
}

@Since("2.1.0")
def setPercentile(value: Double): this.type = {
set(selectorType, "Percentile")
set(percentile, value)
}

@Since("2.1.0")
def setAlpha(value: Double): this.type = {
set(selectorType, "Fpr")
set(alpha, value)
}

/** @group setParam */
@Since("1.6.0")
Expand All @@ -91,8 +132,38 @@ final class ChiSqSelector @Since("1.6.0") (@Since("1.6.0") override val uid: Str
case Row(label: Double, features: Vector) =>
OldLabeledPoint(label, OldVectors.fromML(features))
}
val chiSqSelector = new feature.ChiSqSelector($(numTopFeatures)).fit(input)
copyValues(new ChiSqSelectorModel(uid, chiSqSelector).setParent(this))
$(selectorType) match {
case "KBest" =>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't these use the toString of the new enumeration or something like that? or just parse the string as an enum value and match that.

chiSqSelector = new feature.ChiSqSelector().setNumTopFeatures($(numTopFeatures))
case "Percentile" =>
chiSqSelector = new feature.ChiSqSelector().setPercentile($(percentile))
case "Fpr" =>
chiSqSelector = new feature.ChiSqSelector().setAlpha($(alpha))
case _ => throw new Exception("Unknown ChiSqSelector Type.")
}
val model = chiSqSelector.fit(input)
copyValues(new ChiSqSelectorModel(uid, model).setParent(this))
}

@Since("2.1.0")
def selectKBest(value: Int): ChiSqSelectorModel = {
require(chiSqSelector != null, "ChiSqSelector has not been created.")
val model = chiSqSelector.selectKBest(value)
copyValues(new ChiSqSelectorModel(uid, model).setParent(this))
}

@Since("2.1.0")
def selectPercentile(value: Double): ChiSqSelectorModel = {
require(chiSqSelector != null, "ChiSqSelector has not been created.")
val model = chiSqSelector.selectPercentile(value)
copyValues(new ChiSqSelectorModel(uid, model).setParent(this))
}

@Since("2.1.0")
def selectFpr(value: Double): ChiSqSelectorModel = {
require(chiSqSelector != null, "ChiSqSelector has not been created.")
val model = chiSqSelector.selectFpr(value)
copyValues(new ChiSqSelectorModel(uid, model).setParent(this))
}

@Since("1.6.0")
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -27,22 +27,27 @@ import org.apache.spark.annotation.Since
import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.mllib.stat.test.ChiSqTestResult
import org.apache.spark.mllib.util.{Loader, Saveable}
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext
import org.apache.spark.sql.{Row, SparkSession}

@Since("2.1.0")
object ChiSqSelectorType extends Enumeration {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can be private[spark]? I would expect this would be in spark.ml but I don't feel strong about it. It is referenced in both packages.

type SelectorType = Value
val KBest, Percentile, Fpr = Value
}

/**
* Chi Squared selector model.
*
* @param selectedFeatures list of indices to select (filter). Must be ordered asc
* @param selectedFeatures list of indices to select (filter).
*/
@Since("1.3.0")
class ChiSqSelectorModel @Since("1.3.0") (
@Since("1.3.0") val selectedFeatures: Array[Int]) extends VectorTransformer with Saveable {

require(isSorted(selectedFeatures), "Array has to be sorted asc")

protected def isSorted(array: Array[Int]): Boolean = {
var i = 1
val len = array.length
Expand All @@ -69,21 +74,22 @@ class ChiSqSelectorModel @Since("1.3.0") (
* Preserves the order of filtered features the same as their indices are stored.
* Might be moved to Vector as .slice
* @param features vector
* @param filterIndices indices of features to filter, must be ordered asc
* @param filterIndices indices of features to filter
*/
private def compress(features: Vector, filterIndices: Array[Int]): Vector = {
val orderedIndices = filterIndices.sorted
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be computed once and stored, rather than store unsorted indices and resort them.

features match {
case SparseVector(size, indices, values) =>
val newSize = filterIndices.length
val newSize = orderedIndices.length
val newValues = new ArrayBuilder.ofDouble
val newIndices = new ArrayBuilder.ofInt
var i = 0
var j = 0
var indicesIdx = 0
var filterIndicesIdx = 0
while (i < indices.length && j < filterIndices.length) {
while (i < indices.length && j < orderedIndices.length) {
indicesIdx = indices(i)
filterIndicesIdx = filterIndices(j)
filterIndicesIdx = orderedIndices(j)
if (indicesIdx == filterIndicesIdx) {
newIndices += j
newValues += values(i)
Expand All @@ -101,7 +107,7 @@ class ChiSqSelectorModel @Since("1.3.0") (
Vectors.sparse(newSize, newIndices.result(), newValues.result())
case DenseVector(values) =>
val values = features.toArray
Vectors.dense(filterIndices.map(i => values(i)))
Vectors.dense(orderedIndices.map(i => values(i)))
case other =>
throw new UnsupportedOperationException(
s"Only sparse and dense vectors are supported but got ${other.getClass}.")
Expand Down Expand Up @@ -171,14 +177,47 @@ object ChiSqSelectorModel extends Loader[ChiSqSelectorModel] {

/**
* Creates a ChiSquared feature selector.
* @param numTopFeatures number of features that selector will select
* (ordered by statistic value descending)
* Note that if the number of features is less than numTopFeatures,
* then this will select all features.
*/
@Since("1.3.0")
class ChiSqSelector @Since("1.3.0") (
@Since("1.3.0") val numTopFeatures: Int) extends Serializable {
@Since("2.1.0")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Likewise the class is still "since 1.3.0"

class ChiSqSelector @Since("2.1.0") () extends Serializable {
private var numTopFeatures: Int = 50
private var percentile: Double = 10.0
private var alpha: Double = 0.05
private var selectorType = ChiSqSelectorType.KBest
private var chiSqTestResult: Array[ChiSqTestResult] = _

@Since("1.3.0")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The existing constructor should still have javadoc maybe pointing to the setNumTopFeatures method to say that's the effect it has

def this(numTopFeatures: Int) {
this()
this.numTopFeatures = numTopFeatures
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should call setNumTopFeatures to set the type too?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not necessary, because the default selectorType is KBest

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. It seemed to split the logic a bit here but it's not bad. The default behavior needs to be documented then. Now there is effectively a default numTopFeatures.

}

@Since("2.1.0")
def setNumTopFeatures(value: Int): this.type = {
numTopFeatures = value
selectorType = ChiSqSelectorType.KBest
this
}

@Since("2.1.0")
def setPercentile(value: Double): this.type = {
percentile = value
selectorType = ChiSqSelectorType.Percentile
this
}

@Since("2.1.0")
def setAlpha(value: Double): this.type = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it need a require for alpha here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

require is added, thanks

alpha = value
selectorType = ChiSqSelectorType.Fpr
this
}

@Since("2.1.0")
def setChiSqSelectorType(value: ChiSqSelectorType.Value): this.type = {
selectorType = value
this
}

/**
* Returns a ChiSquared feature selector.
Expand All @@ -189,11 +228,35 @@ class ChiSqSelector @Since("1.3.0") (
*/
@Since("1.3.0")
def fit(data: RDD[LabeledPoint]): ChiSqSelectorModel = {
val indices = Statistics.chiSqTest(data)
.zipWithIndex.sortBy { case (res, _) => -res.statistic }
.take(numTopFeatures)
.map { case (_, indices) => indices }
.sorted
chiSqTestResult = Statistics.chiSqTest(data)
selectorType match {
case ChiSqSelectorType.KBest => selectKBest(numTopFeatures)
case ChiSqSelectorType.Percentile => selectPercentile(percentile)
case ChiSqSelectorType.Fpr => selectFpr(alpha)
case _ => throw new Exception("Unknown ChiSqSelector Type")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IllegalStateException is probably better and this can include the type in the message.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I will update

}
}

@Since("2.1.0")
def selectKBest(value: Int): ChiSqSelectorModel = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would think these would be private? fit is the entry point. They can receive the results as an argument?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The user can call selectKBest directly, because, use can fit one time, and generate models by selectKBest multi times.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I thought this was the discussion we had, that the model would still just expose fit and transform for consistency. It's not bad to expose these methods, just wondered if it were consistent. I thought that was the reasoning to not do it.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now, the user can use fit and transform only for consistent.
If the user wants to reuse ChiSqTestResult, call selectKBest, SelectPercentile will help them. If we don't consider reuse ChiSqTestResult, we can not expose this methods.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I thought the idea is that the model would be fixed, without parameters, though it would have all the information internally (the chi squared results) to select or re-select features to filter in based on different logic. And that the selector would expose the parameters. I'm not against your approach here just thought that's what you were saying would be inconsistent. I suppose you're right that the model would have to mutate its state to compute different sets of indices and that introduces a complication.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The chi squared test result should be the same no matter which type of selection you do, yes. I agree it's desirable to avoid computing it many times if one wanted to perform many variations on chi squared selection, but is that a common use case?

Still, I can also imagine going back to the idea of only finding the chi-squared values in the selector, and letting the model instance be parameterized. So its behavior as a transformer would be modified by setting parameters on it. I'm OK with that, I just thought you were saying this wasn't consistent with other model classes. Maybe I misunderstood that.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @srowen , maybe I misunderstood you. I think make the model instance be parameterized is the same as current solution (make selector instance be parameterized). This only difference of your suggestion is moving the logic from selector to model. Both of the solution is to avoid computing ChiSqTestResult many times when one wanted to perform different chi squared selection.

If we don't consider to avoid computing ChiSqTestResult many times, the problem will be very easy, and can be consistent exactly with other methods. Can we don't consider reuse ChiSqTestResult in this PR. Thanks

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with your second paragraph. I don't think it's very common that someone would perform different chi-squared selections on the same values but with different criteria. Therefore I'm not concerned about recomputing different models for different criteria. So I think indeed the model does not need parameters, and then it's consistent. Does that match your idea?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I also think so. I will update this PR. Thanks.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @srowen , I have updated this PR. Thanks.

val indices = chiSqTestResult.zipWithIndex.sortBy { case (res, _) => -res.statistic }
.take(numTopFeatures)
.map { case (_, indices) => indices }
new ChiSqSelectorModel(indices)
}

@Since("2.1.0")
def selectPercentile(value: Double): ChiSqSelectorModel = {
val indices = chiSqTestResult.zipWithIndex.sortBy { case (res, _) => -res.statistic }
.take((chiSqTestResult.length * percentile / 100).toInt)
.map { case (_, indices) => indices }
new ChiSqSelectorModel(indices)
}

@Since("2.1.0")
def selectFpr(value: Double): ChiSqSelectorModel = {
val indices = chiSqTestResult.zipWithIndex.filter{ case (res, _) => res.pValue < alpha }
.map { case (_, indices) => indices }
new ChiSqSelectorModel(indices)
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -49,16 +49,23 @@ class ChiSqSelectorSuite extends SparkFunSuite with MLlibTestSparkContext
.map(x => (x._1.label, x._1.features, x._2))
.toDF("label", "data", "preFilteredData")

val model = new ChiSqSelector()
val selector = new ChiSqSelector()
.setNumTopFeatures(1)
.setFeaturesCol("data")
.setLabelCol("label")
.setOutputCol("filtered")

model.fit(df).transform(df).select("filtered", "preFilteredData").collect().foreach {
selector.fit(df).transform(df).select("filtered", "preFilteredData").collect().foreach {
case Row(vec1: Vector, vec2: Vector) =>
assert(vec1 ~== vec2 absTol 1e-1)
}

selector.selectPercentile(34).transform(df)
.select("filtered", "preFilteredData").collect().foreach {
case Row(vec1: Vector, vec2: Vector) =>
assert(vec1 ~== vec2 absTol 1e-1)
}

}

test("ChiSqSelector read/write") {
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -65,6 +65,24 @@ class ChiSqSelectorSuite extends SparkFunSuite with MLlibTestSparkContext {
assert(filteredData == preFilteredData)
}

test("ChiSqSelector by FPR transform test (sparse & dense vector)") {
val labeledDiscreteData = sc.parallelize(
Seq(LabeledPoint(0.0, Vectors.sparse(4, Array((0, 8.0), (1, 7.0)))),
LabeledPoint(1.0, Vectors.sparse(4, Array((1, 9.0), (2, 6.0), (3, 4.0)))),
LabeledPoint(1.0, Vectors.dense(Array(0.0, 9.0, 8.0, 4.0))),
LabeledPoint(2.0, Vectors.dense(Array(8.0, 9.0, 5.0, 9.0)))), 2)
val preFilteredData =
Set(LabeledPoint(0.0, Vectors.dense(Array(0.0))),
LabeledPoint(1.0, Vectors.dense(Array(4.0))),
LabeledPoint(1.0, Vectors.dense(Array(4.0))),
LabeledPoint(2.0, Vectors.dense(Array(9.0))))
val model = new ChiSqSelector().setAlpha(0.1).fit(labeledDiscreteData)
val filteredData = labeledDiscreteData.map { lp =>
LabeledPoint(lp.label, model.transform(lp.features))
}.collect().toSet
assert(filteredData == preFilteredData)
}

test("model load / save") {
val model = ChiSqSelectorSuite.createModel()
val tempDir = Utils.createTempDir()
Expand Down