Commit a5d7de4

committed
add docs
1 parent 364fb83 commit a5d7de4

4 files changed

Lines changed: 202 additions & 0 deletions

docs/ml-features.md

Lines changed: 45 additions & 0 deletions
@@ -1009,6 +1009,51 @@ for more details on the API.
</div>
</div>


## RobustScaler

`RobustScaler` transforms a dataset of `Vector` rows by removing the median and scaling the data according to a specific quantile range (by default the IQR: interquartile range, the range between the 1st quartile and the 3rd quartile). Its behavior is similar to that of `StandardScaler`, but it uses the median and the quantile range instead of the mean and standard deviation, which makes it robust to outliers. It takes the following parameters:

* `lower`: 0.25 by default. Lower quantile used to calculate the quantile range, shared by all features.
* `upper`: 0.75 by default. Upper quantile used to calculate the quantile range, shared by all features.
* `withScaling`: True by default. Scales the data to the quantile range.
* `withCentering`: False by default. Centers the data with the median before scaling. It builds a dense output, so take care when applying it to sparse input.

`RobustScaler` is an `Estimator` which can be `fit` on a dataset to produce a `RobustScalerModel`; this amounts to computing quantile statistics. The model can then transform a `Vector` column in a dataset to have unit quantile range and/or zero median features.

Note that if the quantile range of a feature is zero, the model outputs the default value `0.0` for that feature in the resulting `Vector`.
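
The arithmetic described above can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not Spark's implementation: the helper names `quantile`, `fit`, and `transform` are hypothetical, dense lists stand in for `Vector` rows, and quantiles are computed with simple linear interpolation, whereas Spark's own quantile computation may differ (for example, it may be approximate).

```python
from statistics import median


def quantile(sorted_values, q):
    """Linear-interpolation quantile of an already sorted list.

    Assumption: this is a simple exact method chosen for illustration;
    Spark may compute quantiles differently.
    """
    position = q * (len(sorted_values) - 1)
    low = int(position)
    high = min(low + 1, len(sorted_values) - 1)
    fraction = position - low
    return sorted_values[low] + fraction * (sorted_values[high] - sorted_values[low])


def fit(rows, lower=0.25, upper=0.75):
    """Compute the per-feature median and quantile range (the 'model')."""
    columns = [sorted(column) for column in zip(*rows)]
    medians = [median(column) for column in columns]
    ranges = [quantile(column, upper) - quantile(column, lower) for column in columns]
    return medians, ranges


def transform(rows, medians, ranges, with_scaling=True, with_centering=False):
    """Apply optional centering, then optional scaling, to every row."""
    scaled_rows = []
    for row in rows:
        scaled = []
        for value, med, rng in zip(row, medians, ranges):
            if with_centering:
                value = value - med
            if with_scaling:
                # A feature with zero quantile range is mapped to 0.0.
                value = value / rng if rng != 0.0 else 0.0
            scaled.append(value)
        scaled_rows.append(scaled)
    return scaled_rows


rows = [
    [1.0, 5.0],
    [2.0, 5.0],
    [3.0, 5.0],
    [4.0, 5.0],
    [100.0, 5.0],  # outlier in the first feature
]
medians, ranges = fit(rows)
scaled = transform(rows, medians, ranges, with_centering=True)
print(medians, ranges)  # [3.0, 5.0] [2.0, 0.0]
print(scaled[0], scaled[4])  # [-1.0, 0.0] [48.5, 0.0]
```

In this sketch the outlier `100.0` does not affect the median or the quantile range of the first feature, and the constant second feature, whose quantile range is zero, is mapped to `0.0`.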

**Examples**

The following example demonstrates how to load a dataset in libsvm format and then scale each feature to have unit quantile range.

<div class="codetabs">
<div data-lang="scala" markdown="1">

Refer to the [RobustScaler Scala docs](api/scala/index.html#org.apache.spark.ml.feature.RobustScaler)
for more details on the API.

{% include_example scala/org/apache/spark/examples/ml/RobustScalerExample.scala %}
</div>

<div data-lang="java" markdown="1">

Refer to the [RobustScaler Java docs](api/java/org/apache/spark/ml/feature/RobustScaler.html)
for more details on the API.

{% include_example java/org/apache/spark/examples/ml/JavaRobustScalerExample.java %}
</div>

<div data-lang="python" markdown="1">

Refer to the [RobustScaler Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.RobustScaler)
for more details on the API.

{% include_example python/ml/robust_scaler_example.py %}
</div>
</div>


## MinMaxScaler

`MinMaxScaler` transforms a dataset of `Vector` rows, rescaling each feature to a specific range (often [0, 1]). It takes parameters:
Lines changed: 57 additions & 0 deletions
@@ -0,0 +1,57 @@
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.spark.examples.ml;

import org.apache.spark.sql.SparkSession;

// $example on$
import org.apache.spark.ml.feature.RobustScaler;
import org.apache.spark.ml.feature.RobustScalerModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
// $example off$

public class JavaRobustScalerExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession
      .builder()
      .appName("JavaRobustScalerExample")
      .getOrCreate();

    // $example on$
    Dataset<Row> dataFrame =
      spark.read().format("libsvm").load("data/mllib/sample_libsvm_data.txt");

    RobustScaler scaler = new RobustScaler()
      .setInputCol("features")
      .setOutputCol("scaledFeatures")
      .setWithScaling(true)
      .setWithCentering(false)
      .setLower(0.25)
      .setUpper(0.75);

    // Compute summary statistics by fitting the RobustScaler
    RobustScalerModel scalerModel = scaler.fit(dataFrame);

    // Transform each feature to have unit quantile range.
    Dataset<Row> scaledData = scalerModel.transform(dataFrame);
    scaledData.show();
    // $example off$
    spark.stop();
  }
}
Lines changed: 45 additions & 0 deletions
@@ -0,0 +1,45 @@
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

from __future__ import print_function

# $example on$
from pyspark.ml.feature import RobustScaler
# $example off$
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession\
        .builder\
        .appName("RobustScalerExample")\
        .getOrCreate()

    # $example on$
    dataFrame = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
    scaler = RobustScaler(inputCol="features", outputCol="scaledFeatures",
                          withScaling=True, withCentering=False,
                          lower=0.25, upper=0.75)

    # Compute summary statistics by fitting the RobustScaler
    scalerModel = scaler.fit(dataFrame)

    # Transform each feature to have unit quantile range.
    scaledData = scalerModel.transform(dataFrame)
    scaledData.show()
    # $example off$

    spark.stop()
Lines changed: 55 additions & 0 deletions
@@ -0,0 +1,55 @@
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

// scalastyle:off println
package org.apache.spark.examples.ml

// $example on$
import org.apache.spark.ml.feature.RobustScaler
// $example off$
import org.apache.spark.sql.SparkSession

object RobustScalerExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder
      .appName("RobustScalerExample")
      .getOrCreate()

    // $example on$
    val dataFrame = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

    val scaler = new RobustScaler()
      .setInputCol("features")
      .setOutputCol("scaledFeatures")
      .setWithScaling(true)
      .setWithCentering(false)
      .setLower(0.25)
      .setUpper(0.75)

    // Compute summary statistics by fitting the RobustScaler.
    val scalerModel = scaler.fit(dataFrame)

    // Transform each feature to have unit quantile range.
    val scaledData = scalerModel.transform(dataFrame)
    scaledData.show()
    // $example off$

    spark.stop()
  }
}
// scalastyle:on println
