
Commit 564e918: grammar fixes

1 parent e8b79af

5 files changed

Lines changed: 45 additions & 46 deletions

docs/ml-classification-regression.md

Lines changed: 14 additions & 14 deletions
@@ -236,9 +236,9 @@ Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.classificat
236236

237237
Multilayer perceptron classifier (MLPC) is a classifier based on the [feedforward artificial neural network](https://en.wikipedia.org/wiki/Feedforward_neural_network).
238238
MLPC consists of multiple layers of nodes.
239-
Each layer is fully connected to the next layer in the network. Nodes in the input layer represent the input data. All other nodes maps inputs to the outputs
240-
by performing linear combination of the inputs with the node's weights `$\wv$` and bias `$\bv$` and applying an activation function.
241-
It can be written in matrix form for MLPC with `$K+1$` layers as follows:
239+
Each layer is fully connected to the next layer in the network. Nodes in the input layer represent the input data. All other nodes map inputs to outputs
240+
by a linear combination of the inputs with the node's weights `$\wv$` and bias `$\bv$` and applying an activation function.
241+
This can be written in matrix form for MLPC with `$K+1$` layers as follows:
242242
`\[
243243
\mathrm{y}(\x) = \mathrm{f_K}(...\mathrm{f_2}(\wv_2^T\mathrm{f_1}(\wv_1^T \x+b_1)+b_2)...+b_K)
244244
\]`
@@ -252,7 +252,7 @@ Nodes in the output layer use softmax function:
252252
\]`
253253
The number of nodes `$N$` in the output layer corresponds to the number of classes.
254254

255-
MLPC employs backpropagation for learning the model. We use logistic loss function for optimization and L-BFGS as optimization routine.
255+
MLPC employs backpropagation for learning the model. We use the logistic loss function for optimization and L-BFGS as an optimization routine.
256256

257257
**Example**
258258
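As a quick sketch of the classifier described above (the layer sizes are illustrative, and `train` and `test` are assumed DataFrames with `label` and `features` columns):

~~~scala
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier

// Layers: 4 input features, two hidden layers of 5 and 4 nodes, 3 output classes.
val layers = Array[Int](4, 5, 4, 3)

val trainer = new MultilayerPerceptronClassifier()
  .setLayers(layers)
  .setBlockSize(128)
  .setSeed(1234L)
  .setMaxIter(100)

// train/test: assumed DataFrames with "label" and "features" columns.
val model = trainer.fit(train)
val predictions = model.transform(test)
~~~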

@@ -311,9 +311,9 @@ Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.classificat
311311

312312
## Naive Bayes
313313

314-
[Naive Bayes](http://en.wikipedia.org/wiki/Naive_Bayes_classifier) are a family of simple
314+
[Naive Bayes classifiers](http://en.wikipedia.org/wiki/Naive_Bayes_classifier) are a family of simple
315315
probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence
316-
assumptions between the features. The spark.ml implementation currently supports both [multinomial
316+
assumptions between the features. The `spark.ml` implementation currently supports both [multinomial
317317
naive Bayes](http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html)
318318
and [Bernoulli naive Bayes](http://nlp.stanford.edu/IR-book/html/htmledition/the-bernoulli-model-1.html).
319319
More information can be found in the section on [Naive Bayes in MLlib](mllib-naive-bayes.html#naive-bayes-sparkmllib).
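A minimal sketch of the `spark.ml` API this paragraph refers to; `modelType` switches between the two variants, and `train`/`test` are assumed labeled DataFrames:

~~~scala
import org.apache.spark.ml.classification.NaiveBayes

// "multinomial" is the default; "bernoulli" expects binary (0/1) feature values.
val nb = new NaiveBayes()
  .setModelType("multinomial")
  .setSmoothing(1.0)

val model = nb.fit(train)
val predictions = model.transform(test)
~~~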
@@ -482,11 +482,11 @@ Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.regression.
482482

483483
In `spark.ml`, we implement the [Accelerated failure time (AFT)](https://en.wikipedia.org/wiki/Accelerated_failure_time_model)
484484
model which is a parametric survival regression model for censored data.
485-
It describes a model for the log of survival time, so it's often called
486-
log-linear model for survival analysis. Different from
485+
It describes a model for the log of survival time, so it's often called a
486+
log-linear model for survival analysis. Different from a
487487
[Proportional hazards](https://en.wikipedia.org/wiki/Proportional_hazards_model) model
488-
designed for the same purpose, the AFT model is more easily to parallelize
489-
because each instance contribute to the objective function independently.
488+
designed for the same purpose, the AFT model is easier to parallelize
489+
because each instance contributes to the objective function independently.
490490

491491
Given the values of the covariates $x^{'}$, for random lifetime $t_{i}$ of
492492
subjects i = 1, ..., n, with possible right-censoring,
@@ -501,10 +501,10 @@ assumes the form:
501501
\iota(\beta,\sigma)=\sum_{i=1}^{n}[-\delta_{i}\log\sigma+\delta_{i}\log{f_{0}}(\epsilon_{i})+(1-\delta_{i})\log{S_{0}(\epsilon_{i})}]
502502
\]`
503503
Where $S_{0}(\epsilon_{i})$ is the baseline survivor function,
504-
and $f_{0}(\epsilon_{i})$ is corresponding density function.
504+
and $f_{0}(\epsilon_{i})$ is the corresponding density function.
505505

506506
The most commonly used AFT model is based on the Weibull distribution of the survival time.
507-
The Weibull distribution for lifetime corresponding to extreme value distribution for
507+
The Weibull distribution for lifetime corresponds to the extreme value distribution for the
508508
log of the lifetime, and the $S_{0}(\epsilon)$ function is:
509509
`\[
510510
S_{0}(\epsilon_{i})=\exp(-e^{\epsilon_{i}})
@@ -513,7 +513,7 @@ the $f_{0}(\epsilon_{i})$ function is:
513513
`\[
514514
f_{0}(\epsilon_{i})=e^{\epsilon_{i}}\exp(-e^{\epsilon_{i}})
515515
\]`
516-
The log-likelihood function for AFT model with Weibull distribution of lifetime is:
516+
The log-likelihood function for AFT model with a Weibull distribution of lifetime is:
517517
`\[
518518
\iota(\beta,\sigma)= -\sum_{i=1}^n[\delta_{i}\log\sigma-\delta_{i}\epsilon_{i}+e^{\epsilon_{i}}]
519519
\]`
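In `spark.ml` this model is exposed as `AFTSurvivalRegression`; a minimal fitting sketch, assuming a DataFrame `training` with `label`, `censor`, and `features` columns, where `censor` is 1.0 for an observed event and 0.0 for right-censored data:

~~~scala
import org.apache.spark.ml.regression.AFTSurvivalRegression

val aft = new AFTSurvivalRegression()
  .setQuantileProbabilities(Array(0.3, 0.6))
  .setQuantilesCol("quantiles")

// Fitting maximizes the Weibull AFT log-likelihood above via L-BFGS.
val model = aft.fit(training)
println(s"Coefficients: ${model.coefficients} Intercept: ${model.intercept} Scale: ${model.scale}")
~~~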
@@ -529,7 +529,7 @@ The gradient functions for $\beta$ and $\log\sigma$ respectively are:
529529

530530
The AFT model can be formulated as a convex optimization problem,
531531
i.e. the task of finding a minimizer of a convex function $-\iota(\beta,\sigma)$
532-
that depends coefficients vector $\beta$ and the log of scale parameter $\log\sigma$.
532+
that depends on the coefficients vector $\beta$ and the log of scale parameter $\log\sigma$.
533533
The optimization algorithm underlying the implementation is L-BFGS.
534534
The implementation matches the result from R's survival function
535535
[survreg](https://stat.ethz.ch/R-manual/R-devel/library/survival/html/survreg.html)

docs/ml-clustering.md

Lines changed: 1 addition & 1 deletion
@@ -89,7 +89,7 @@ Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.clustering.
8989
## Latent Dirichlet allocation (LDA)
9090

9191
`LDA` is implemented as an `Estimator` that supports both `EMLDAOptimizer` and `OnlineLDAOptimizer`,
92-
and generates a `LDAModel` as the base models. Expert users may cast a `LDAModel` generated by
92+
and generates a `LDAModel` as the base model. Expert users may cast a `LDAModel` generated by
9393
`EMLDAOptimizer` to a `DistributedLDAModel` if needed.
9494

9595
<div class="codetabs">
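For illustration, a small Scala sketch of the cast described above, assuming `dataset` has a `features` column of token-count vectors:

~~~scala
import org.apache.spark.ml.clustering.{DistributedLDAModel, LDA}

val lda = new LDA().setK(10).setMaxIter(20).setOptimizer("em")

// fit() returns the base LDAModel; with the EM optimizer it can be cast if needed.
val model = lda.fit(dataset)
val distributedModel = model.asInstanceOf[DistributedLDAModel]
~~~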

docs/ml-collaborative-filtering.md

Lines changed: 3 additions & 3 deletions
@@ -60,7 +60,7 @@ best parameter learned from a sampled subset to the full dataset and expect simi
6060
<div class="codetabs">
6161
<div data-lang="scala" markdown="1">
6262

63-
In the following example, we load rating data from the
63+
In the following example, we load ratings data from the
6464
[MovieLens dataset](http://grouplens.org/datasets/movielens/), each row
6565
consisting of a user, a movie, a rating and a timestamp.
6666
We then train an ALS model which assumes, by default, that the ratings are
@@ -91,7 +91,7 @@ val als = new ALS()
9191

9292
<div data-lang="java" markdown="1">
9393

94-
In the following example, we load rating data from the
94+
In the following example, we load ratings data from the
9595
[MovieLens dataset](http://grouplens.org/datasets/movielens/), each row
9696
consisting of a user, a movie, a rating and a timestamp.
9797
We then train an ALS model which assumes, by default, that the ratings are
@@ -122,7 +122,7 @@ ALS als = new ALS()
122122

123123
<div data-lang="python" markdown="1">
124124

125-
In the following example, we load rating data from the
125+
In the following example, we load ratings data from the
126126
[MovieLens dataset](http://grouplens.org/datasets/movielens/), each row
127127
consisting of a user, a movie, a rating and a timestamp.
128128
We then train an ALS model which assumes, by default, that the ratings are

docs/ml-features.md

Lines changed: 23 additions & 24 deletions
@@ -26,7 +26,7 @@ to a document in the corpus. Denote a term by `$t$`, a document by `$d$`, and th
2626
Term frequency `$TF(t, d)$` is the number of times that term `$t$` appears in document `$d$`, while
2727
document frequency `$DF(t, D)$` is the number of documents that contains term `$t$`. If we only use
2828
term frequency to measure the importance, it is very easy to over-emphasize terms that appear very
29-
often but carry little information about the document, e.g., "a", "the", and "of". If a term appears
29+
often but carry little information about the document, e.g. "a", "the", and "of". If a term appears
3030
very often across the corpus, it means it doesn't carry special information about a particular document.
3131
Inverse document frequency is a numerical measure of how much information a term provides:
3232
`\[
@@ -50,7 +50,7 @@ A raw feature is mapped into an index (term) by applying a hash function. Then t
5050
are calculated based on the mapped indices. This approach avoids the need to compute a global
5151
term-to-index map, which can be expensive for a large corpus, but it suffers from potential hash
5252
collisions, where different raw features may become the same term after hashing. To reduce the
53-
chance of collision, we can increase the target feature dimension, i.e., the number of buckets
53+
chance of collision, we can increase the target feature dimension, i.e. the number of buckets
5454
of the hash table. Since a simple modulo is used to transform the hash function to a column index,
5555
it is advisable to use a power of two as the feature dimension, otherwise the features will
5656
not be mapped evenly to the columns. The default feature dimension is `$2^{18} = 262,144$`.
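For reference, a minimal TF-IDF sketch that follows the power-of-two advice above, assuming `sentenceData` has a string column `sentence`:

~~~scala
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

val wordsData = new Tokenizer()
  .setInputCol("sentence").setOutputCol("words")
  .transform(sentenceData)

// A power of two keeps hashed terms mapped evenly to columns; 2^18 is the default.
val featurized = new HashingTF()
  .setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(1 << 18)
  .transform(wordsData)

val rescaled = new IDF()
  .setInputCol("rawFeatures").setOutputCol("features")
  .fit(featurized)
  .transform(featurized)
~~~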
@@ -104,7 +104,7 @@ the [IDF Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.IDF) for mor
104104
`Word2Vec` is an `Estimator` which takes sequences of words representing documents and trains a
105105
`Word2VecModel`. The model maps each word to a unique fixed-size vector. The `Word2VecModel`
106106
transforms each document into a vector using the average of all words in the document; this vector
107-
can then be used for as features for prediction, document similarity calculations, etc.
107+
can then be used as features for prediction, document similarity calculations, etc.
108108
Please refer to the [MLlib user guide on Word2Vec](mllib-feature-extraction.html#word2vec) for more
109109
details.
110110
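A short sketch of the flow just described, assuming `documentDF` has a `text` column containing sequences of words:

~~~scala
import org.apache.spark.ml.feature.Word2Vec

val word2Vec = new Word2Vec()
  .setInputCol("text")       // each row: a Seq[String] of tokens
  .setOutputCol("result")
  .setVectorSize(3)
  .setMinCount(0)

val model = word2Vec.fit(documentDF)
// Each document is mapped to the average of its word vectors.
val result = model.transform(documentDF)
~~~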

@@ -140,12 +140,12 @@ for more details on the API.
140140

141141
`CountVectorizer` and `CountVectorizerModel` aim to help convert a collection of text documents
142142
to vectors of token counts. When an a-priori dictionary is not available, `CountVectorizer` can
143-
be used as an `Estimator` to extract the vocabulary and generates a `CountVectorizerModel`. The
143+
be used as an `Estimator` to extract the vocabulary, and generates a `CountVectorizerModel`. The
144144
model produces sparse representations for the documents over the vocabulary, which can then be
145145
passed to other algorithms like LDA.
146146

147147
During the fitting process, `CountVectorizer` will select the top `vocabSize` words ordered by
148-
term frequency across the corpus. An optional parameter "minDF" also affect the fitting process
148+
term frequency across the corpus. An optional parameter "minDF" also affects the fitting process
149149
by specifying the minimum number (or fraction if < 1.0) of documents a term must appear in to be
150150
included in the vocabulary.
151151

@@ -161,8 +161,8 @@ Assume that we have the following DataFrame with columns `id` and `texts`:
161161
~~~~
162162

163163
each row in `texts` is a document of type Array[String].
164-
Invoking fit of `CountVectorizer` produces a `CountVectorizerModel` with vocabulary (a, b, c),
165-
then the output column "vector" after transformation contains:
164+
Invoking fit of `CountVectorizer` produces a `CountVectorizerModel` with vocabulary (a, b, c).
165+
Then the output column "vector" after transformation contains:
166166

167167
~~~~
168168
id | texts | vector
@@ -171,7 +171,7 @@ then the output column "vector" after transformation contains:
171171
1 | Array("a", "b", "b", "c", "a") | (3,[0,1,2],[2.0,2.0,1.0])
172172
~~~~
173173

174-
each vector represents the token counts of the document over the vocabulary.
174+
Each vector represents the token counts of the document over the vocabulary.
175175

176176
<div class="codetabs">
177177
<div data-lang="scala" markdown="1">
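As a rough sketch of the Scala usage this tab introduces (the `vocabSize` and `minDF` values are illustrative, and `df` is assumed to be the `id`/`texts` DataFrame above):

~~~scala
import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel}

val cvModel: CountVectorizerModel = new CountVectorizer()
  .setInputCol("texts")
  .setOutputCol("vector")
  .setVocabSize(3)      // keep the top 3 terms by corpus frequency
  .setMinDF(2)          // a term must appear in at least 2 documents
  .fit(df)

// Each output vector holds the document's token counts over the learned vocabulary.
cvModel.transform(df).show(false)
~~~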
@@ -477,8 +477,7 @@ for more details on the API.
477477
## StringIndexer
478478

479479
`StringIndexer` encodes a string column of labels to a column of label indices.
480-
The indices are in `[0, numLabels)`, ordered by label frequencies.
481-
So the most frequent label gets index `0`.
480+
The indices are in `[0, numLabels)`, ordered by label frequencies, so the most frequent label gets index `0`.
482481
If the input column is numeric, we cast it to string and index the string
483482
values. When downstream pipeline components such as `Estimator` or
484483
`Transformer` make use of this string-indexed label, you must set the input
@@ -585,7 +584,7 @@ for more details on the API.
585584
## IndexToString
586585

587586
Symmetrically to `StringIndexer`, `IndexToString` maps a column of label indices
588-
back to a column containing the original labels as strings. The common use case
587+
back to a column containing the original labels as strings. A common use case
589588
is to produce indices from labels with `StringIndexer`, train a model with those
590589
indices and retrieve the original labels from the column of predicted indices
591590
with `IndexToString`. However, you are free to supply your own labels.
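A brief round-trip sketch of the use case just described, assuming `df` has a string column `category`:

~~~scala
import org.apache.spark.ml.feature.{IndexToString, StringIndexer}

val indexed = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .fit(df)
  .transform(df)            // most frequent label gets index 0

// Recover the original string labels from the index column.
val converted = new IndexToString()
  .setInputCol("categoryIndex")
  .setOutputCol("originalCategory")
  .transform(indexed)
~~~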
@@ -652,7 +651,7 @@ for more details on the API.
652651

653652
## OneHotEncoder
654653

655-
[One-hot encoding](http://en.wikipedia.org/wiki/One-hot) maps a column of label indices to a column of binary vectors, with at most a single one-value. This encoding allows algorithms which expect continuous features, such as Logistic Regression, to use categorical features
654+
[One-hot encoding](http://en.wikipedia.org/wiki/One-hot) maps a column of label indices to a column of binary vectors, with at most a single one-value. This encoding allows algorithms which expect continuous features, such as Logistic Regression, to use categorical features.
656655

657656
<div class="codetabs">
658657
<div data-lang="scala" markdown="1">
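A minimal sketch for this tab, pairing `StringIndexer` with `OneHotEncoder` (the DataFrame `df` and its `category` column are assumptions; in the API documented here `OneHotEncoder` is a plain `Transformer`):

~~~scala
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

val indexed = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .fit(df)
  .transform(df)

// Produces binary vectors with at most a single one-value per row.
val encoded = new OneHotEncoder()
  .setInputCol("categoryIndex")
  .setOutputCol("categoryVec")
  .transform(indexed)
~~~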
@@ -888,7 +887,7 @@ for more details on the API.
888887

889888
* `splits`: Parameter for mapping continuous features into buckets. With n+1 splits, there are n buckets. A bucket defined by splits x,y holds values in the range [x,y) except the last bucket, which also includes y. Splits should be strictly increasing. Values at -inf, inf must be explicitly provided to cover all Double values; Otherwise, values outside the splits specified will be treated as errors. Two examples of `splits` are `Array(Double.NegativeInfinity, 0.0, 1.0, Double.PositiveInfinity)` and `Array(0.0, 1.0, 2.0)`.
890889

891-
Note that if you have no idea of the upper bound and lower bound of the targeted column, you would better add the `Double.NegativeInfinity` and `Double.PositiveInfinity` as the bounds of your splits to prevent a potential out of Bucketizer bounds exception.
890+
Note that if you have no idea of the upper and lower bounds of the targeted column, you should add `Double.NegativeInfinity` and `Double.PositiveInfinity` as the bounds of your splits to prevent a potential out of Bucketizer bounds exception.
892891

893892
Note also that the splits that you provided have to be in strictly increasing order, i.e. `s0 < s1 < s2 < ... < sn`.
894893
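A sketch of the advice above, using infinite outer bounds so every Double lands in some bucket (the DataFrame `dataFrame` and its `features` column are assumptions):

~~~scala
import org.apache.spark.ml.feature.Bucketizer

// Splits must be strictly increasing; the infinite end points cover unknown ranges.
val splits = Array(Double.NegativeInfinity, -0.5, 0.0, 0.5, Double.PositiveInfinity)

val bucketed = new Bucketizer()
  .setInputCol("features")
  .setOutputCol("bucketedFeatures")
  .setSplits(splits)
  .transform(dataFrame)
~~~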

@@ -976,7 +975,7 @@ for more details on the API.
976975
Currently we only support SQL syntax like `"SELECT ... FROM __THIS__ ..."`
977976
where `"__THIS__"` represents the underlying table of the input dataset.
978977
The select clause specifies the fields, constants, and expressions to display in
979-
the output, it can be any select clause that Spark SQL supports. Users can also
978+
the output, and can be any select clause that Spark SQL supports. Users can also
980979
use Spark SQL built-in function and UDFs to operate on these selected columns.
981980
For example, `SQLTransformer` supports statements like:
982981
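For instance, a statement of the supported form might be applied like this (the columns `v1` and `v2` are hypothetical):

~~~scala
import org.apache.spark.ml.feature.SQLTransformer

// "__THIS__" stands in for the input dataset's underlying table.
val sqlTrans = new SQLTransformer().setStatement(
  "SELECT *, (v1 + v2) AS v3, (v1 * v2) AS v4 FROM __THIS__")

sqlTrans.transform(df).show()
~~~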

@@ -1121,7 +1120,7 @@ Assume that we have a DataFrame with the columns `id`, `hour`:
11211120
~~~
11221121

11231122
`hour` is a continuous feature with `Double` type. We want to turn the continuous feature into
1124-
categorical one. Given `numBuckets = 3`, we should get the following DataFrame:
1123+
a categorical one. Given `numBuckets = 3`, we should get the following DataFrame:
11251124

11261125
~~~
11271126
id | hour | result
@@ -1171,19 +1170,19 @@ for more details on the API.
11711170
`VectorSlicer` is a transformer that takes a feature vector and outputs a new feature vector with a
11721171
sub-array of the original features. It is useful for extracting features from a vector column.
11731172

1174-
`VectorSlicer` accepts a vector column with a specified indices, then outputs a new vector column
1173+
`VectorSlicer` accepts a vector column with specified indices, then outputs a new vector column
11751174
whose values are selected via those indices. There are two types of indices,
11761175

1177-
1. Integer indices that represents the indices into the vector, `setIndices()`;
1176+
1. Integer indices that represent the indices into the vector, `setIndices()`.
11781177

1179-
2. String indices that represents the names of features into the vector, `setNames()`.
1178+
2. String indices that represent the names of features into the vector, `setNames()`.
11801179
*This requires the vector column to have an `AttributeGroup` since the implementation matches on
11811180
the name field of an `Attribute`.*
11821181

11831182
Specification by integer and string are both acceptable. Moreover, you can use integer index and
11841183
string name simultaneously. At least one feature must be selected. Duplicate features are not
11851184
allowed, so there can be no overlap between selected indices and names. Note that if names of
1186-
features are selected, an exception will be threw out when encountering with empty input attributes.
1185+
features are selected, an exception will be thrown if empty input attributes are encountered.
11871186

11881187
The output vector will order features with the selected indices first (in the order given),
11891188
followed by the selected names (in the order given).
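A compact sketch of the two selection modes just listed, assuming a DataFrame `dataset` with a vector column `userFeatures`; selecting by name requires attribute metadata on the column:

~~~scala
import org.apache.spark.ml.feature.VectorSlicer

val slicer = new VectorSlicer()
  .setInputCol("userFeatures")
  .setOutputCol("features")
  .setIndices(Array(1, 2))   // or .setNames(Array("f2", "f3")) when attributes are present

// Selected indices come first in the output vector, followed by selected names.
val output = slicer.transform(dataset)
~~~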
@@ -1198,8 +1197,8 @@ Suppose that we have a DataFrame with the column `userFeatures`:
11981197
[0.0, 10.0, 0.5]
11991198
~~~
12001199

1201-
`userFeatures` is a vector column that contains three user features. Assuming that the first column
1202-
of `userFeatures` are all zeros, so we want to remove it and only the last two columns are selected.
1200+
`userFeatures` is a vector column that contains three user features. Assume that the first column
1201+
of `userFeatures` are all zeros, so we want to remove it and select only the last two columns.
12031202
The `VectorSlicer` selects the last two elements with `setIndices(1, 2)` then produces a new vector
12041203
column named `features`:
12051204

@@ -1209,7 +1208,7 @@ column named `features`:
12091208
[0.0, 10.0, 0.5] | [10.0, 0.5]
12101209
~~~
12111210

1212-
Suppose also that we have a potential input attributes for the `userFeatures`, i.e.
1211+
Suppose also that we have potential input attributes for the `userFeatures`, i.e.
12131212
`["f1", "f2", "f3"]`, then we can use `setNames("f2", "f3")` to select them.
12141213

12151214
~~~
@@ -1337,8 +1336,8 @@ id | features | clicked
13371336
9 | [1.0, 0.0, 15.0, 0.1] | 0.0
13381337
~~~
13391338

1340-
If we use `ChiSqSelector` with a `numTopFeatures = 1`, then according to our label `clicked` the
1341-
last column in our `features` chosen as the most useful feature:
1339+
If we use `ChiSqSelector` with `numTopFeatures = 1`, then according to our label `clicked` the
1340+
last column in our `features` is chosen as the most useful feature:
13421341

13431342
~~~
13441343
id | features | clicked | selectedFeatures
