docs/source/en/generation_strategies.md

Assisted decoding assumes the main and assistant models have the same tokenizer.
Currently, only greedy search and sampling are supported with assisted decoding, and assisted decoding doesn't support batched inputs.
To learn more about assisted decoding, check [this blog post](https://huggingface.co/blog/assisted-generation).
To enable assisted decoding, set the `assistant_model` argument with a model.
```python
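>>> # sketch of the assisted decoding call; the pythia checkpoints are illustrative,
>>> # any main/assistant pair sharing a tokenizer works
>>> from transformers import AutoModelForCausalLM, AutoTokenizer

>>> prompt = "Alice and Bob"
>>> checkpoint = "EleutherAI/pythia-1.4b-deduped"
>>> assistant_checkpoint = "EleutherAI/pythia-160m-deduped"

>>> tokenizer = AutoTokenizer.from_pretrained(checkpoint)
>>> inputs = tokenizer(prompt, return_tensors="pt")

>>> model = AutoModelForCausalLM.from_pretrained(checkpoint)
>>> assistant_model = AutoModelForCausalLM.from_pretrained(assistant_checkpoint)

>>> # passing `assistant_model` is what switches generate() into assisted decoding
>>> outputs = model.generate(**inputs, assistant_model=assistant_model)
>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)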
['Alice and Bob are sitting in a bar. Alice is drinking a beer and Bob is drinking a']
```
If the main and assistant models have different tokenizers, use Universal Assisted Decoding.
When using assisted decoding with sampling methods, you can use the `temperature` argument to control the randomness,
just like in multinomial sampling. However, in assisted decoding, reducing the temperature may help improve the latency.
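
For example, a sketch along these lines (the checkpoints and seed are illustrative):

```python
>>> from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed

>>> set_seed(42)  # only for reproducible sampling

>>> prompt = "Alice and Bob"
>>> checkpoint = "EleutherAI/pythia-1.4b-deduped"
>>> assistant_checkpoint = "EleutherAI/pythia-160m-deduped"

>>> tokenizer = AutoTokenizer.from_pretrained(checkpoint)
>>> inputs = tokenizer(prompt, return_tensors="pt")

>>> model = AutoModelForCausalLM.from_pretrained(checkpoint)
>>> assistant_model = AutoModelForCausalLM.from_pretrained(assistant_checkpoint)

>>> # do_sample=True enables multinomial sampling on top of assisted decoding
>>> outputs = model.generate(**inputs, assistant_model=assistant_model, do_sample=True, temperature=0.5)
>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)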
['Alice and Bob, a couple of friends of mine, who are both in the same office as']
```
#### Universal Assisted Decoding
Universal Assisted Decoding (UAD) adds support for main and assistant models with different tokenizers.
To use it, simply pass the tokenizers using the `tokenizer` and `assistant_tokenizer` arguments (see below).
Internally, the main model input tokens are re-encoded into assistant model tokens, then candidate tokens are generated in the assistant encoding, which are
in turn re-encoded into main model candidate tokens. Validation then proceeds as explained above.
The re-encoding steps involve decoding token ids into text and then encoding the text using a different tokenizer.
Since re-encoding the tokens may result in tokenization discrepancies, UAD finds the longest common subsequence between the source and target encodings,
to ensure the new tokens include the correct prompt suffix.
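
For example, a sketch assuming illustrative checkpoints (any main/assistant pair with different tokenizers works):

```python
>>> from transformers import AutoModelForCausalLM, AutoTokenizer

>>> prompt = "Alice and Bob"
>>> checkpoint = "google/gemma-2-9b"
>>> assistant_checkpoint = "double7/vicuna-68m"

>>> assistant_tokenizer = AutoTokenizer.from_pretrained(assistant_checkpoint)
>>> tokenizer = AutoTokenizer.from_pretrained(checkpoint)
>>> inputs = tokenizer(prompt, return_tensors="pt")

>>> model = AutoModelForCausalLM.from_pretrained(checkpoint)
>>> assistant_model = AutoModelForCausalLM.from_pretrained(assistant_checkpoint)

>>> # both tokenizers are passed so candidate tokens can be re-encoded between the two vocabularies
>>> outputs = model.generate(**inputs, assistant_model=assistant_model, tokenizer=tokenizer, assistant_tokenizer=assistant_tokenizer)
>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)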
['Alice and Bob are sitting in a bar. Alice is drinking a beer and Bob is drinking a']
```
#### Prompt Lookup
Alternatively, you can also set the `prompt_lookup_num_tokens` to trigger n-gram based assisted decoding, as opposed
to model based assisted decoding. You can read more about it [here](https://twitter.com/joao_gante/status/1747322413006643259).
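
A minimal sketch (the checkpoint is illustrative; no assistant model is needed):

```python
>>> from transformers import AutoModelForCausalLM, AutoTokenizer

>>> prompt = "Alice and Bob"
>>> checkpoint = "EleutherAI/pythia-1.4b-deduped"

>>> tokenizer = AutoTokenizer.from_pretrained(checkpoint)
>>> inputs = tokenizer(prompt, return_tensors="pt")

>>> model = AutoModelForCausalLM.from_pretrained(checkpoint)

>>> # candidate tokens are drawn from n-grams already present in the prompt
>>> outputs = model.generate(**inputs, prompt_lookup_num_tokens=10)
>>> text = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
```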
#### Self-Speculative Decoding
An LLM can be trained to also use its language modeling head with earlier hidden states as input, effectively
skipping layers to yield a lower-quality output -- a technique called early exiting.
We use the lower-quality early exit output as an assistant output, and apply self-speculation to fix the output using the remaining layers. The final generation of that self-speculative solution is the same (or has the same distribution) as the original model's generation.
If the model you're using was trained to do early exit, you can pass
`assistant_early_exit` (integer). In this case, the assistant model will be the same model but exiting early, hence the
"self-speculative" name. Because the assistant model is a portion of the target model, caches and weights can be shared, which results in lower memory requirements. As in other assisted generation methods, the final generated result has the same quality asif no assistant had been used.