Add NVIDIA TensorRT Model Optimizer in vLLM documentation #17561
Merged: simon-mo merged 4 commits into vllm-project:main from Edwardf0t1:zhiyu/add-modelopt-in-doc on May 2, 2025

# NVIDIA TensorRT Model Optimizer

The [NVIDIA TensorRT Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer) is a library designed to optimize models for inference with NVIDIA GPUs. It includes tools for Post-Training Quantization (PTQ) and Quantization Aware Training (QAT) of Large Language Models (LLMs), Vision Language Models (VLMs), and diffusion models.

We recommend installing the library with:

```console
pip install nvidia-modelopt
```
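
You can then confirm the installation by checking that the package imports cleanly (`modelopt` is the module name provided by the `nvidia-modelopt` wheel):

```console
python -c "import modelopt; print(modelopt.__version__)"
```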

## Quantizing HuggingFace Models with PTQ

You can quantize HuggingFace models using the example scripts provided in the TensorRT Model Optimizer repository. The primary script for LLM PTQ is typically found within the `examples/llm_ptq` directory.

Here's an example of how you might run the quantization script (refer to the [specific examples](https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/examples/llm_ptq) for exact arguments and usage):

```console
# Quantize meta-llama/Llama-3.1-8B-Instruct to FP8 and export a HuggingFace-format checkpoint
python hf_ptq.py --pyt_ckpt_path meta-llama/Llama-3.1-8B-Instruct --qformat fp8 --export_fmt hf --export_path <quantized_ckpt_path> --trust_remote_code
```

After quantization, the exported model can be deployed with vLLM.
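
If you prefer to call the quantization API directly rather than use the `hf_ptq.py` script, a minimal sketch of the flow might look like the following. The config name, calibration prompts, and export helper are assumptions based on the `modelopt.torch` API; check the Model Optimizer documentation for your installed version:

```python
# Minimal PTQ sketch using the modelopt.torch APIs (names assumed from the
# TensorRT Model Optimizer docs; verify against your installed version).
import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint  # assumed export helper
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

def forward_loop(model):
    # Run a small calibration set through the model so ModelOpt can collect
    # the activation statistics used to compute FP8 scaling factors.
    for text in ["Hello, my name is", "The capital of France is"]:
        inputs = tokenizer(text, return_tensors="pt").to(model.device)
        model(**inputs)

# Quantize weights and activations to FP8 in place.
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)

# Export a HuggingFace-style checkpoint that vLLM can load.
export_hf_checkpoint(model, export_dir="llama-3.1-8b-instruct-fp8")
```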

This process generates a quantized model checkpoint. As an example, the following code shows how to deploy `nvidia/Llama-3.1-8B-Instruct-FP8`, the FP8-quantized checkpoint of `meta-llama/Llama-3.1-8B-Instruct`, with vLLM.

```python
from vllm import LLM, SamplingParams

def main():
    model_id = "nvidia/Llama-3.1-8B-Instruct-FP8"
    sampling_params = SamplingParams(temperature=0.8, top_p=0.9)

    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]

    llm = LLM(model=model_id, quantization="modelopt")
    outputs = llm.generate(prompts, sampling_params)

    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

if __name__ == "__main__":
    main()
```
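
Beyond offline generation, the same checkpoint can be served through vLLM's OpenAI-compatible server. A minimal sketch follows; the port and request body are illustrative:

```console
# Serve the quantized checkpoint (listens on port 8000 by default)
vllm serve nvidia/Llama-3.1-8B-Instruct-FP8 --quantization modelopt

# Query it from another shell
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "nvidia/Llama-3.1-8B-Instruct-FP8", "prompt": "The capital of France is", "max_tokens": 16}'
```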