Commit 1736057
VinceShieh committed
update document
Signed-off-by: VinceShieh <vincent.xie@intel.com>
1 parent 9a41745 commit 1736057

1 file changed

Lines changed: 20 additions & 3 deletions

docs/ml-features.md

@@ -502,7 +502,7 @@ for more details on the API.
 ## StringIndexer

 `StringIndexer` encodes a string column of labels to a column of label indices.
-The indices are in `[0, numLabels)`, ordered by label frequencies, so the most frequent label gets index `0`.
+The indices are in `[0, numLabels]`, ordered by label frequencies, so the most frequent label gets index `0`.
 If the input column is numeric, we cast it to string and index the string
 values. When downstream pipeline components such as `Estimator` or
 `Transformer` make use of this string-indexed label, you must set the input
@@ -542,12 +542,13 @@ column, we should get the following:
 "a" gets index `0` because it is the most frequent, followed by "c" with index `1` and "b" with
 index `2`.

-Additionally, there are two strategies regarding how `StringIndexer` will handle
+Additionally, there are three strategies regarding how `StringIndexer` will handle
 unseen labels when you have fit a `StringIndexer` on one dataset and then use it
 to transform another:

 - throw an exception (which is the default)
 - skip the row containing the unseen label entirely
+- map the unseen labels with indices [numLabels]

 **Examples**

@@ -561,6 +562,7 @@ Let's go back to our previous example but this time reuse our previously defined
  1 | b
  2 | c
  3 | d
+ 4 | e
 ~~~~

 If you've not set how `StringIndexer` handles unseen labels or set it to
@@ -576,7 +578,22 @@ will be generated:
  2 | c | 1.0
 ~~~~

-Notice that the row containing "d" does not appear.
+Notice that the rows containing "d" or "e" do not appear.
+
+If you had called `setHandleInvalid("keep")`, the following dataset
+will be generated:
+
+~~~~
+ id | category | categoryIndex
+----|----------|---------------
+ 0 | a | 0.0
+ 1 | b | 2.0
+ 2 | c | 1.0
+ 3 | d | 3.0
+ 4 | e | 3.0
+~~~~
+
+Notice that the rows containing "d" or "e" are mapped with indices "3.0"

 <div class="codetabs">
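
For reviewers who want to check the tables in this change by hand, here is a minimal pure-Python sketch of the three unseen-label strategies. It is not Spark's implementation and does not call the Spark API. The helper names `fit_labels` and `transform` are hypothetical, the training column is an assumption chosen to match the stated ordering ("a" most frequent, then "c", then "b"), and the alphabetical tie-breaking is also an assumption. The option strings `"error"`, `"skip"`, and `"keep"` mirror the values accepted by `setHandleInvalid`.

```python
from collections import Counter


def fit_labels(values):
    """Return distinct labels ordered by descending frequency.

    Ties are broken alphabetically; this is an assumption of the sketch.
    """
    counts = Counter(str(v) for v in values)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [label for label, _ in ordered]


def transform(rows, labels, handle_invalid="error"):
    """Index (id, category) rows against previously fitted labels."""
    index = {label: float(i) for i, label in enumerate(labels)}
    unseen_index = float(len(labels))  # the extra index, numLabels
    out = []
    for row_id, category in rows:
        if category in index:
            out.append((row_id, category, index[category]))
        elif handle_invalid == "keep":
            # Every unseen label shares the single index numLabels.
            out.append((row_id, category, unseen_index))
        elif handle_invalid == "skip":
            # Drop the whole row containing the unseen label.
            continue
        else:
            raise ValueError(f"Unseen label: {category}")
    return out


# Assumed training column: "a" x3, "c" x2, "b" x1.
labels = fit_labels(["a", "b", "c", "a", "a", "c"])

# The second dataset from the example, including unseen "d" and "e".
new_rows = [(0, "a"), (1, "b"), (2, "c"), (3, "d"), (4, "e")]
skipped = transform(new_rows, labels, "skip")
kept = transform(new_rows, labels, "keep")
```

With three fitted labels, `"keep"` assigns index `3.0` to both "d" and "e", so the output indices span `0` through `numLabels` inclusive, while `"skip"` leaves only the rows for "a", "b", and "c".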
