docs/ml-classification-regression.md (14 additions, 14 deletions)
@@ -236,9 +236,9 @@ Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.classificat
  Multilayer perceptron classifier (MLPC) is a classifier based on the [feedforward artificial neural network](https://en.wikipedia.org/wiki/Feedforward_neural_network).
  MLPC consists of multiple layers of nodes.
- Each layer is fully connected to the next layer in the network. Nodes in the input layer represent the input data. All other nodes maps inputs to the outputs
- by performing linear combination of the inputs with the node's weights `$\wv$` and bias `$\bv$` and applying an activation function.
- It can be written in matrix form for MLPC with `$K+1$` layers as follows:
+ Each layer is fully connected to the next layer in the network. Nodes in the input layer represent the input data. All other nodes map inputs to outputs
+ by a linear combination of the inputs with the node's weights `$\wv$` and bias `$\bv$` and applying an activation function.
+ This can be written in matrix form for MLPC with `$K+1$` layers as follows:
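The per-layer computation in the corrected wording (a linear combination of inputs with weights and bias, then an activation function) can be sketched with NumPy. This is an illustration of the math only, not Spark's MLPC implementation, and all weights below are made up:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer(x, W, b, activation):
    """One fully connected layer: activation(W x + b)."""
    return activation(W @ x + b)

# Tiny 3 -> 2 -> 1 network with made-up weights.
x = np.array([1.0, 0.5, -1.0])
W1 = np.array([[0.2, -0.1, 0.4],
               [0.7,  0.3, -0.2]])
b1 = np.array([0.1, -0.3])
h = layer(x, W1, b1, sigmoid)   # hidden-layer activations
W2 = np.array([[0.5, -0.5]])
b2 = np.array([0.0])
y = layer(h, W2, b2, sigmoid)   # single output in (0, 1)
```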
- each vector represents the token counts of the document over the vocabulary.
+ Each vector represents the token counts of the document over the vocabulary.

  <div class="codetabs">
  <div data-lang="scala" markdown="1">
@@ -477,8 +477,7 @@ for more details on the API.
  ## StringIndexer

  `StringIndexer` encodes a string column of labels to a column of label indices.
- The indices are in `[0, numLabels)`, ordered by label frequencies.
- So the most frequent label gets index `0`.
+ The indices are in `[0, numLabels)`, ordered by label frequencies, so the most frequent label gets index `0`.
  If the input column is numeric, we cast it to string and index the string
  values. When downstream pipeline components such as `Estimator` or
  `Transformer` make use of this string-indexed label, you must set the input
@@ -585,7 +584,7 @@ for more details on the API.
  ## IndexToString

  Symmetrically to `StringIndexer`, `IndexToString` maps a column of label indices
- back to a column containing the original labels as strings. The common use case
+ back to a column containing the original labels as strings. A common use case
  is to produce indices from labels with `StringIndexer`, train a model with those
  indices and retrieve the original labels from the column of predicted indices
  with `IndexToString`. However, you are free to supply your own labels.
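The frequency-ordered indexing and its inverse can be illustrated with a small plain-Python sketch (not the Spark API; the alphabetical tie-breaking used here is an assumption for determinism, not documented behavior):

```python
from collections import Counter

def string_indexer_fit(labels):
    """Most frequent label gets index 0, the next most frequent gets 1, and so on."""
    counts = Counter(labels)
    # Ties broken alphabetically here; Spark's tie-breaking is an implementation detail.
    return sorted(counts, key=lambda lab: (-counts[lab], lab))

labels = ["a", "b", "c", "a", "a", "c"]
ordered = string_indexer_fit(labels)          # position in this list = assigned index
index = {lab: i for i, lab in enumerate(ordered)}
indexed = [index[lab] for lab in labels]      # "a" is most frequent, so it maps to 0

# IndexToString is the inverse mapping, using the stored label order.
restored = [ordered[i] for i in indexed]
```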
@@ -652,7 +651,7 @@ for more details on the API.
  ## OneHotEncoder

- [One-hot encoding](http://en.wikipedia.org/wiki/One-hot) maps a column of label indices to a column of binary vectors, with at most a single one-value. This encoding allows algorithms which expect continuous features, such as Logistic Regression, to use categorical features
+ [One-hot encoding](http://en.wikipedia.org/wiki/One-hot) maps a column of label indices to a column of binary vectors, with at most a single one-value. This encoding allows algorithms which expect continuous features, such as Logistic Regression, to use categorical features.
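A plain-Python sketch of the mapping (not the Spark API). The drop-last behavior modeled below is an assumption mirroring the encoder's default, and it is what makes the vector have *at most* a single one-value:

```python
def one_hot(index, num_categories, drop_last=True):
    """Binary vector with at most a single one-value.

    With drop_last (assumed here to mirror the encoder's default), the last
    category maps to the all-zeros vector.
    """
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if index < size:
        vec[index] = 1.0
    return vec

# 3 categories -> 2-element vectors; the last category becomes all zeros.
encoded = [one_hot(i, 3) for i in range(3)]
```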
  <div class="codetabs">
  <div data-lang="scala" markdown="1">
@@ -888,7 +887,7 @@ for more details on the API.
  * `splits`: Parameter for mapping continuous features into buckets. With n+1 splits, there are n buckets. A bucket defined by splits x,y holds values in the range [x,y) except the last bucket, which also includes y. Splits should be strictly increasing. Values at -inf, inf must be explicitly provided to cover all Double values; otherwise, values outside the splits specified will be treated as errors. Two examples of `splits` are `Array(Double.NegativeInfinity, 0.0, 1.0, Double.PositiveInfinity)` and `Array(0.0, 1.0, 2.0)`.

- Note that if you have no idea of the upper bound and lower bound of the targeted column, you would better add the `Double.NegativeInfinity` and `Double.PositiveInfinity` as the bounds of your splits to prevent a potential out of Bucketizer bounds exception.
+ Note that if you have no idea of the upper and lower bounds of the targeted column, you should add `Double.NegativeInfinity` and `Double.PositiveInfinity` as the bounds of your splits to prevent a potential out of Bucketizer bounds exception.

  Note also that the splits that you provide have to be in strictly increasing order, i.e. `s0 < s1 < s2 < ... < sn`.
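The bucketing rule in the `splits` description can be sketched in plain Python (an illustration of the rule, not Spark's `Bucketizer`):

```python
from bisect import bisect_right

def bucketize(value, splits):
    """Bucket i holds values in [splits[i], splits[i+1]); the last bucket also includes its upper bound."""
    if value == splits[-1]:
        return len(splits) - 2
    if not splits[0] <= value < splits[-1]:
        raise ValueError("value outside the provided splits")
    return bisect_right(splits, value) - 1

# Infinite end splits cover every Double value.
splits = [float("-inf"), 0.0, 1.0, float("inf")]
buckets = [bucketize(v, splits) for v in (-5.0, 0.5, 7.0)]
```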
@@ -976,7 +975,7 @@ for more details on the API.
  Currently we only support SQL syntax like `"SELECT ... FROM __THIS__ ..."`
  where `"__THIS__"` represents the underlying table of the input dataset.
  The select clause specifies the fields, constants, and expressions to display in
- the output, it can be any select clause that Spark SQL supports. Users can also
+ the output, and can be any select clause that Spark SQL supports. Users can also
  use Spark SQL built-in functions and UDFs to operate on these selected columns.
  For example, `SQLTransformer` supports statements like:
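The `__THIS__` substitution itself is easy to illustrate, here with SQLite standing in for Spark SQL (a sketch only, not Spark's `SQLTransformer`; the table and column names are made up):

```python
import sqlite3

def sql_transform(conn, table_name, statement):
    """Replace the __THIS__ placeholder with the real table name and run the query."""
    return conn.execute(statement.replace("__THIS__", table_name)).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE df (v1 REAL, v2 REAL)")
conn.executemany("INSERT INTO df VALUES (?, ?)", [(1.0, 3.0), (2.0, 5.0)])
rows = sql_transform(conn, "df", "SELECT v1, v2, v1 + v2 AS v3 FROM __THIS__")
```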
@@ -1121,7 +1120,7 @@ Assume that we have a DataFrame with the columns `id`, `hour`:
  ~~~

  `hour` is a continuous feature with `Double` type. We want to turn the continuous feature into
- categorical one. Given `numBuckets = 3`, we should get the following DataFrame:
+ a categorical one. Given `numBuckets = 3`, we should get the following DataFrame:

  ~~~
  id | hour | result
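The bucket boundaries come from quantiles of the column. A plain-Python sketch with a hypothetical `hour` column (Spark computes the quantiles approximately; this exact version is for illustration only):

```python
from bisect import bisect_right

def quantile_splits(values, num_buckets):
    """Exact quantile boundaries; Spark uses an approximate algorithm."""
    ordered = sorted(values)
    return [ordered[len(ordered) * k // num_buckets] for k in range(1, num_buckets)]

hours = [18.0, 19.0, 8.0, 5.0, 2.2]   # hypothetical `hour` column
splits = quantile_splits(hours, 3)     # two interior boundaries for 3 buckets
result = [bisect_right(splits, h) for h in hours]
```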
@@ -1171,19 +1170,19 @@ for more details on the API.
  `VectorSlicer` is a transformer that takes a feature vector and outputs a new feature vector with a
  sub-array of the original features. It is useful for extracting features from a vector column.

- `VectorSlicer` accepts a vector column with a specified indices, then outputs a new vector column
+ `VectorSlicer` accepts a vector column with specified indices, then outputs a new vector column
  whose values are selected via those indices. There are two types of indices,

- 1. Integer indices that represents the indices into the vector, `setIndices()`;
+ 1. Integer indices that represent the indices into the vector, `setIndices()`.

- 2. String indices that represents the names of features into the vector, `setNames()`.
+ 2. String indices that represent the names of features into the vector, `setNames()`.

  *This requires the vector column to have an `AttributeGroup` since the implementation matches on
  the name field of an `Attribute`.*

  Specification by integer and string are both acceptable. Moreover, you can use integer index and
  string name simultaneously. At least one feature must be selected. Duplicate features are not
  allowed, so there can be no overlap between selected indices and names. Note that if names of
- features are selected, an exception will be threw out when encountering with empty input attributes.
+ features are selected, an exception will be thrown if empty input attributes are encountered.

  The output vector will order features with the selected indices first (in the order given),
  followed by the selected names (in the order given).
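The selection and ordering rules above can be sketched in plain Python (not the Spark API; the `names` list stands in for the `AttributeGroup` feature names):

```python
def vector_slice(vec, names, indices=(), selected_names=()):
    """Integer-selected features come first (in order), then name-selected ones."""
    positions = list(indices) + [names.index(n) for n in selected_names]
    if not positions:
        raise ValueError("at least one feature must be selected")
    if len(set(positions)) != len(positions):
        raise ValueError("duplicate features are not allowed")
    return [vec[i] for i in positions]

user_features = [0.0, 10.0, 0.5]
feature_names = ["f1", "f2", "f3"]
by_index = vector_slice(user_features, feature_names, indices=(1, 2))
by_name = vector_slice(user_features, feature_names, selected_names=("f2", "f3"))
```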
@@ -1198,8 +1197,8 @@ Suppose that we have a DataFrame with the column `userFeatures`:
  [0.0, 10.0, 0.5]
  ~~~

- `userFeatures` is a vector column that contains three user features. Assuming that the first column
- of `userFeatures` are all zeros, so we want to remove it and only the last two columns are selected.
+ `userFeatures` is a vector column that contains three user features. Assume that the first column
+ of `userFeatures` are all zeros, so we want to remove it and select only the last two columns.
  The `VectorSlicer` selects the last two elements with `setIndices(1, 2)` then produces a new vector
  column named `features`:
@@ -1209,7 +1208,7 @@ column named `features`:
  [0.0, 10.0, 0.5] | [10.0, 0.5]
  ~~~

- Suppose also that we have a potential input attributes for the `userFeatures`, i.e.
+ Suppose also that we have potential input attributes for the `userFeatures`, i.e.
  `["f1", "f2", "f3"]`, then we can use `setNames("f2", "f3")` to select them.

  ~~~
@@ -1337,8 +1336,8 @@ id | features | clicked
  id | features | clicked
  9 | [1.0, 0.0, 15.0, 0.1] | 0.0
  ~~~

- If we use `ChiSqSelector` with a `numTopFeatures = 1`, then according to our label `clicked` the
- last column in our `features` chosen as the most useful feature:
+ If we use `ChiSqSelector` with `numTopFeatures = 1`, then according to our label `clicked` the
+ last column in our `features` is chosen as the most useful feature:
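The selection criterion can be sketched in plain Python: score each categorical feature column with a chi-squared statistic against the label and keep the top `numTopFeatures` columns. This is an illustration with made-up data, not Spark's implementation (Spark's handling of continuous features and its exact statistics differ):

```python
from collections import Counter

def chi2_stat(feature, labels):
    """Chi-squared statistic of one categorical feature column against the label."""
    n = len(labels)
    joint = Counter(zip(feature, labels))
    f_marg, l_marg = Counter(feature), Counter(labels)
    stat = 0.0
    for f in f_marg:
        for lab in l_marg:
            expected = f_marg[f] * l_marg[lab] / n
            stat += (joint.get((f, lab), 0) - expected) ** 2 / expected
    return stat

def select_top_features(rows, labels, num_top_features):
    """Rank feature columns by chi-squared score, keep the best column indices."""
    columns = list(zip(*rows))
    scores = [chi2_stat(col, labels) for col in columns]
    return sorted(range(len(columns)), key=lambda i: -scores[i])[:num_top_features]

rows = [(1, 0), (0, 0), (1, 1), (0, 1)]   # made-up data
labels = (1, 0, 1, 0)                      # column 0 matches the label exactly
top = select_top_features(rows, labels, 1)
```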