@@ -502,7 +502,7 @@ for more details on the API.
502502## StringIndexer
503503
504504`StringIndexer` encodes a string column of labels to a column of label indices.
505- The indices are in `[0, numLabels)`, ordered by label frequencies, so the most frequent label gets index `0`.
505+ The indices are in `[0, numLabels]`, ordered by label frequencies, so the most frequent label gets index `0`.
506506If the input column is numeric, we cast it to string and index the string
507507values. When downstream pipeline components such as `Estimator` or
508508`Transformer` make use of this string-indexed label, you must set the input
@@ -542,12 +542,13 @@ column, we should get the following:
542542"a" gets index `0` because it is the most frequent, followed by "c" with index `1` and "b" with
543543index `2`.
544544
545- Additionally, there are two strategies regarding how `StringIndexer` will handle
545+ Additionally, there are three strategies regarding how `StringIndexer` will handle
546546unseen labels when you have fit a `StringIndexer` on one dataset and then use it
547547to transform another:
548548
549549- throw an exception (which is the default)
550550- skip the row containing the unseen label entirely
551+ - map every unseen label to a single additional index, `numLabels`
551552
552553**Examples**
553554
@@ -561,6 +562,7 @@ Let's go back to our previous example but this time reuse our previously defined
561562 1 | b
562563 2 | c
563564 3 | d
565+ 4 | e
564566~~~~
565567
566568If you've not set how `StringIndexer` handles unseen labels or set it to
@@ -576,7 +578,22 @@ will be generated:
576578 2 | c | 1.0
577579~~~~
578580
579- Notice that the row containing "d" does not appear.
581+ Notice that the rows containing "d" or "e" do not appear.
582+
583+ If you had called `setHandleInvalid("keep")`, the following dataset
584+ will be generated:
585+
586+ ~~~~
587+ id | category | categoryIndex
588+ ----|----------|---------------
589+ 0 | a | 0.0
590+ 1 | b | 2.0
591+ 2 | c | 1.0
592+ 3 | d | 3.0
593+ 4 | e | 3.0
594+ ~~~~
595+
596+ Notice that the rows containing "d" or "e" are both mapped to index "3.0".
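
The three strategies can be illustrated with a minimal pure-Python sketch. This is not Spark's implementation: `fit_labels` and `transform` are hypothetical helpers, the training values `a, b, c, a, a, c` are assumed from the earlier example, and the alphabetical tie-break is an assumption made only to keep the sketch deterministic.

```python
from collections import Counter


def fit_labels(values):
    """Map each distinct label to an index, most frequent label first.

    Ties are broken alphabetically (an assumption of this sketch).
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


def transform(rows, index, handle_invalid="error"):
    """Index (id, label) rows, handling unseen labels per handle_invalid."""
    num_labels = len(index)
    out = []
    for row_id, label in rows:
        if label in index:
            out.append((row_id, label, index[label]))
        elif handle_invalid == "skip":
            continue  # drop the row containing the unseen label
        elif handle_invalid == "keep":
            # all unseen labels share one extra index, numLabels
            out.append((row_id, label, float(num_labels)))
        else:
            raise ValueError(f"Unseen label: {label}")
    return out


# Assumed training column from the earlier example.
index = fit_labels(["a", "b", "c", "a", "a", "c"])
rows = [(0, "a"), (1, "b"), (2, "c"), (3, "d"), (4, "e")]

skipped = transform(rows, index, handle_invalid="skip")
kept = transform(rows, index, handle_invalid="keep")
```

Under these assumptions, `skipped` holds only the rows for "a", "b", and "c", while `kept` also holds the rows for "d" and "e", both at index `3.0`. Calling `transform(rows, index)` with the default setting raises an error on the first unseen label.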
580597
581598<div class="codetabs">
582599