# NVIDIA TensorRT Model Optimizer

The [NVIDIA TensorRT Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer) is a library designed to optimize models for inference with NVIDIA GPUs. It includes tools for Post-Training Quantization (PTQ) and Quantization Aware Training (QAT) of Large Language Models (LLMs), Vision Language Models (VLMs), and diffusion models.

We recommend installing the library with:

```console
pip install nvidia-modelopt
```

## Quantizing HuggingFace Models with PTQ

You can quantize HuggingFace models using the example scripts provided in the TensorRT Model Optimizer repository. The primary script for LLM PTQ is found in the repository's `examples/llm_ptq` directory.

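If you want to run those scripts, one way to obtain them is to clone the repository (the URL and directory layout below come from the project links above):

```console
git clone https://github.com/NVIDIA/TensorRT-Model-Optimizer.git
cd TensorRT-Model-Optimizer/examples/llm_ptq
```
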
Below is an example showing how to quantize a model directly with the modelopt PTQ API:

```python
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM

# Load the model from HuggingFace
model = AutoModelForCausalLM.from_pretrained("<path_or_model_id>")

# Select the quantization config, for example, FP8
config = mtq.FP8_DEFAULT_CFG

# Define a forward loop for calibration; `calib_set` is a user-provided
# iterable of tokenized input batches (one way to build it is sketched below)
def forward_loop(model):
    for data in calib_set:
        model(data)

# PTQ with in-place replacement of quantized modules
model = mtq.quantize(model, config, forward_loop)
```

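The example above assumes a `calib_set` iterable. As a minimal sketch of how to build one (the dataset, sample count, and sequence length here are illustrative assumptions, not requirements of modelopt), you could tokenize a few hundred samples from a public text corpus:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Illustrative choices: any representative text corpus and sample budget will do.
tokenizer = AutoTokenizer.from_pretrained("<path_or_model_id>")
dataset = load_dataset("cnn_dailymail", "3.0.0", split="train").select(range(512))

# Build a list of tokenized input batches on the same device as the model.
calib_set = [
    tokenizer(
        sample["article"],
        return_tensors="pt",
        truncation=True,
        max_length=512,
    ).input_ids.to(model.device)
    for sample in dataset
]
```
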
After the model is quantized, you can export it to a quantized checkpoint using the export API:

```python
import torch
from modelopt.torch.export import export_hf_checkpoint

export_dir = "<path_to_export_dir>"  # Directory where the exported files will be stored.

with torch.inference_mode():
    export_hf_checkpoint(
        model,       # The quantized model.
        export_dir,  # The export destination defined above.
    )
```

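A serving engine typically expects the tokenizer files to sit alongside the exported weights. If the export step does not copy them for you, one way to add them is to save the source model's tokenizer into the same directory (a convenience step, not part of the modelopt export API):

```python
from transformers import AutoTokenizer

# Save the tokenizer next to the exported quantized weights so the checkpoint
# directory can be loaded directly by an inference engine such as vLLM.
tokenizer = AutoTokenizer.from_pretrained("<path_or_model_id>")
tokenizer.save_pretrained(export_dir)
```
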
The quantized checkpoint can then be deployed with vLLM. As an example, the following code shows how to deploy `nvidia/Llama-3.1-8B-Instruct-FP8`, the FP8 quantized checkpoint derived from `meta-llama/Llama-3.1-8B-Instruct`:

```python
from vllm import LLM, SamplingParams

def main():
    model_id = "nvidia/Llama-3.1-8B-Instruct-FP8"
    # Ensure you specify quantization="modelopt" when loading the modelopt checkpoint
    llm = LLM(model=model_id, quantization="modelopt", trust_remote_code=True)

    sampling_params = SamplingParams(temperature=0.8, top_p=0.9)

    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]

    outputs = llm.generate(prompts, sampling_params)

    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

if __name__ == "__main__":
    main()
```
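
Alternatively, the same checkpoint can be served through vLLM's OpenAI-compatible server and queried with any OpenAI-compatible client. A minimal sketch, using default host, port, and sampling settings:

```console
vllm serve nvidia/Llama-3.1-8B-Instruct-FP8 --quantization modelopt
```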