
Commit 98834fe

Update nm to rht in doc links + refine fp8 doc (#17678)
Signed-off-by: mgoin <[email protected]>
1 parent 90bd2ae commit 98834fe

File tree: 2 files changed, +16 -72 lines


docs/source/features/quantization/fp8.md

Lines changed: 15 additions & 71 deletions
@@ -19,24 +19,6 @@ FP8 computation is supported on NVIDIA GPUs with compute capability > 8.9 (Ada L
 FP8 models will run on compute capability > 8.0 (Ampere) as weight-only W8A16, utilizing FP8 Marlin.
 :::

-## Quick Start with Online Dynamic Quantization
-
-Dynamic quantization of an original precision BF16/FP16 model to FP8 can be achieved with vLLM without any calibration data required. You can enable the feature by specifying `--quantization="fp8"` in the command line or setting `quantization="fp8"` in the LLM constructor.
-
-In this mode, all Linear modules (except for the final `lm_head`) have their weights quantized down to FP8_E4M3 precision with a per-tensor scale. Activations have their minimum and maximum values calculated during each forward pass to provide a dynamic per-tensor scale for high accuracy. As a result, latency improvements are limited in this mode.
-
-```python
-from vllm import LLM
-model = LLM("facebook/opt-125m", quantization="fp8")
-# INFO 06-10 17:55:42 model_runner.py:157] Loading model weights took 0.1550 GB
-result = model.generate("Hello, my name is")
-print(result[0].outputs[0].text)
-```
-
-:::{warning}
-Currently, we load the model at original precision before quantizing down to 8-bits, so you need enough memory to load the whole model.
-:::
-
 ## Installation

 To produce performant FP8 quantized models with vLLM, you'll need to install the [llm-compressor](https://github.com/vllm-project/llm-compressor/) library:
@@ -45,12 +27,6 @@ To produce performant FP8 quantized models with vLLM, you'll need to install the
 pip install llmcompressor
 ```

-Additionally, install `vllm` and `lm-evaluation-harness` for evaluation:
-
-```console
-pip install vllm lm-eval==0.4.4
-```
-
 ## Quantization Process

 The quantization process involves three main steps:
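The steps themselves sit outside this hunk, but they boil down to loading the model, applying an FP8 recipe with `llmcompressor`, and saving the compressed checkpoint. A minimal sketch of that flow under those assumptions (the model ID and `SAVE_DIR` value are illustrative, and the `oneshot` import path can vary between `llmcompressor` releases):

```python
# Hedged sketch of an FP8 weight quantization pass with llmcompressor.
# The model ID and SAVE_DIR are assumptions, not the guide's exact snippet.
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot  # import path may differ by llmcompressor version

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
SAVE_DIR = "Meta-Llama-3-8B-Instruct-FP8-Dynamic"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Quantize every Linear layer's weights to FP8, keeping lm_head at original precision.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
oneshot(model=model, recipe=recipe)

model.save_pretrained(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)
```

The resulting directory can then be passed to `vllm` for the evaluation step covered in the next hunk.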
@@ -101,6 +77,12 @@ tokenizer.save_pretrained(SAVE_DIR)

 ### 3. Evaluating Accuracy

+Install `vllm` and `lm-evaluation-harness` for evaluation:
+
+```console
+pip install vllm lm-eval==0.4.4
+```
+
 Load and run the model in `vllm`:

 ```python
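The hunk ends right at the opening of that Python block, so the evaluation snippet itself is not visible here. As a rough sketch of what such an accuracy check can look like when driven through `lm-eval` 0.4.x's Python entry point (the task, few-shot count, and checkpoint path are assumptions rather than the guide's exact settings):

```python
# Hedged sketch: evaluate an FP8 checkpoint with lm-eval's Python API, using vLLM as the backend.
# The checkpoint path, task, and few-shot count are illustrative assumptions.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=Meta-Llama-3-8B-Instruct-FP8-Dynamic,add_bos_token=True",
    tasks=["gsm8k"],
    num_fewshot=5,
)
print(results["results"]["gsm8k"])
```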
@@ -137,58 +119,20 @@ Here's an example of the resulting scores:

 If you encounter any issues or have feature requests, please open an issue on the `vllm-project/llm-compressor` GitHub repository.

-## Deprecated Flow
-
-:::{note}
-The following information is preserved for reference and search purposes.
-The quantization method described below is deprecated in favor of the `llmcompressor` method described above.
-:::
-
-For static per-tensor offline quantization to FP8, please install the [AutoFP8 library](https://github.com/neuralmagic/autofp8).
-
-```bash
-git clone https://github.com/neuralmagic/AutoFP8.git
-pip install -e AutoFP8
-```
-
-This package introduces the `AutoFP8ForCausalLM` and `BaseQuantizeConfig` objects for managing how your model will be compressed.
-
-## Offline Quantization with Static Activation Scaling Factors
-
-You can use AutoFP8 with calibration data to produce per-tensor static scales for both the weights and activations by enabling the `activation_scheme="static"` argument.
-
-```python
-from datasets import load_dataset
-from transformers import AutoTokenizer
-from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig
-
-pretrained_model_dir = "meta-llama/Meta-Llama-3-8B-Instruct"
-quantized_model_dir = "Meta-Llama-3-8B-Instruct-FP8"
+## Online Dynamic Quantization

-tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
-tokenizer.pad_token = tokenizer.eos_token
-
-# Load and tokenize 512 dataset samples for calibration of activation scales
-ds = load_dataset("mgoin/ultrachat_2k", split="train_sft").select(range(512))
-examples = [tokenizer.apply_chat_template(batch["messages"], tokenize=False) for batch in ds]
-examples = tokenizer(examples, padding=True, truncation=True, return_tensors="pt").to("cuda")
-
-# Define quantization config with static activation scales
-quantize_config = BaseQuantizeConfig(quant_method="fp8", activation_scheme="static")
-
-# Load the model, quantize, and save checkpoint
-model = AutoFP8ForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
-model.quantize(examples)
-model.save_quantized(quantized_model_dir)
-```
+Dynamic quantization of an original precision BF16/FP16 model to FP8 can be achieved with vLLM without any calibration data required. You can enable the feature by specifying `--quantization="fp8"` in the command line or setting `quantization="fp8"` in the LLM constructor.

-Your model checkpoint with quantized weights and activations should be available at `Meta-Llama-3-8B-Instruct-FP8/`.
-Finally, you can load the quantized model checkpoint directly in vLLM.
+In this mode, all Linear modules (except for the final `lm_head`) have their weights quantized down to FP8_E4M3 precision with a per-tensor scale. Activations have their minimum and maximum values calculated during each forward pass to provide a dynamic per-tensor scale for high accuracy. As a result, latency improvements are limited in this mode.

 ```python
 from vllm import LLM
-model = LLM(model="Meta-Llama-3-8B-Instruct-FP8/")
-# INFO 06-10 21:15:41 model_runner.py:159] Loading model weights took 8.4596 GB
+model = LLM("facebook/opt-125m", quantization="fp8")
+# INFO 06-10 17:55:42 model_runner.py:157] Loading model weights took 0.1550 GB
 result = model.generate("Hello, my name is")
 print(result[0].outputs[0].text)
 ```
+
+:::{warning}
+Currently, we load the model at original precision before quantizing down to 8-bits, so you need enough memory to load the whole model.
+:::
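To make the per-tensor scaling described in the re-added section concrete, here is an illustrative sketch of the arithmetic, not vLLM's actual kernel code: the activation tensor's absolute maximum is mapped onto 448, the largest finite value of the E4M3 format, on every forward pass.

```python
# Illustrative only: dynamic per-tensor FP8 quantization of an activation tensor.
# vLLM performs this inside fused kernels; this sketch just shows the arithmetic.
# Requires a PyTorch build with float8 dtypes (>= 2.1).
import torch

FP8_E4M3_MAX = 448.0  # largest finite value of the E4M3 format

def dynamic_per_tensor_fp8(x: torch.Tensor):
    # One scale for the whole tensor, recomputed on every forward pass.
    scale = x.abs().max().float().clamp(min=1e-12) / FP8_E4M3_MAX
    x_fp8 = (x.float() / scale).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX).to(torch.float8_e4m3fn)
    return x_fp8, scale  # dequantize later as x_fp8.to(torch.float32) * scale

x = torch.randn(4, 4096, dtype=torch.bfloat16)
x_fp8, scale = dynamic_per_tensor_fp8(x)
print(x_fp8.dtype, scale.item())
```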

docs/source/serving/offline_inference.md

Lines changed: 1 addition & 1 deletion
@@ -95,7 +95,7 @@ You can convert the model checkpoint to a sharded checkpoint using <gh-file:exam

 Quantized models take less memory at the cost of lower precision.

-Statically quantized models can be downloaded from HF Hub (some popular ones are available at [Neural Magic](https://huggingface.co/neuralmagic))
+Statically quantized models can be downloaded from HF Hub (some popular ones are available at [Red Hat AI](https://huggingface.co/RedHatAI))
 and used directly without extra configuration.

 Dynamic quantization is also supported via the `quantization` option -- see [here](#quantization-index) for more details.
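As a hedged illustration of "used directly without extra configuration": a pre-quantized checkpoint loads like any other model because its quantization config ships inside the checkpoint. The repository name below only mimics the Red Hat AI naming scheme and is not a verified listing.

```python
# Hedged sketch: a statically quantized FP8 checkpoint needs no extra flags at load time.
# The repository name is illustrative, not a confirmed model listing.
from vllm import LLM

model = LLM("RedHatAI/Meta-Llama-3-8B-Instruct-FP8")
result = model.generate("Hello, my name is")
print(result[0].outputs[0].text)
```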
