# TensorRT Model Optimizer

TensorRT Model Optimizer (ModelOpt, `nvidia-modelopt`) provides end-to-end model optimization for NVIDIA hardware, including quantization (real or simulated), knowledge distillation, pruning, speculative decoding, and more.
- Starts from a Hugging Face pretrained model checkpoint with on-the-fly conversion to the Megatron-LM checkpoint format.
- Supports all kinds of model parallelism (TP, EP, ETP, PP).
- Exports a unified checkpoint ready for TensorRT-LLM, vLLM, and SGLang.
| Model (conf/) | Quantization | EAGLE3 | Pruning (PP only) | Distillation |
|---|---|---|---|---|
| deepseek-ai/DeepSeek-R1 | ✅ | ✅ | - | - |
| meta-llama/Llama-{3.1-8B, 3.1-405B, 3.2-1B}-Instruct | ✅ | ✅ | ✅ | ✅ |
| meta-llama/Llama-4-{Scout,Maverick}-17B-{16,128}E-Instruct | ✅ | ✅ | - | - |
| moonshotai/Kimi-K2-Instruct | ✅ | ✅ | - | - |
| nvidia/NVIDIA-Nemotron-Nano-9B-v2 | ✅ | - | ✅ | ✅ |
| openai/gpt-oss-{20b, 120b} | ✅ | Online | ✅ | ✅ |
| Qwen/Qwen3-{0.6B, 8B} | ✅ | ✅ | ✅ | ✅ |
| Qwen/Qwen3-{30B-A3B, 235B-A22B} | WAR | ✅ | ✅ | ✅ |
Install `nvidia-modelopt` from PyPI:

```sh
pip install -U nvidia-modelopt
```

Alternatively, you can install from source to try our latest features.
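A source install might look like the following sketch. It assumes the upstream repository is `NVIDIA/TensorRT-Model-Optimizer` on GitHub and that an editable install is acceptable; adjust to the project's actual install instructions if they differ.

```sh
# Sketch of a source install (repository name is an assumption).
git clone https://github.com/NVIDIA/TensorRT-Model-Optimizer.git
cd TensorRT-Model-Optimizer
# Editable install so local changes take effect without reinstalling.
pip install -e .
```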
❗ IMPORTANT: The first positional argument (e.g. `meta-llama/Llama-3.2-1B-Instruct`) of each script is the config name used to match the supported model config in `conf/`. The pretrained HF checkpoint should be downloaded and provided through `${HF_MODEL_CKPT}`.
Provide the pretrained checkpoint path through the variable `${HF_MODEL_CKPT}`, and provide the variable `${MLM_MODEL_SAVE}`, which stores a resumable Megatron-LM distributed checkpoint. To export a Hugging Face-like quantized checkpoint for TensorRT-LLM, vLLM, or SGLang deployment, provide `${EXPORT_DIR}` to `export.sh`.
📙 NOTE: ModelOpt supports different quantization formats. By default, we simulate the low-precision numerical behavior (fake quantization), which can be run on GPUs with compute capability 8.0 or above. Real low-precision parameters (e.g. `E4M3` or `E2M1`) and low-precision compute (e.g. `FP8Linear`) are also supported, depending on GPU compute capability. See Advanced Topics for details.
```sh
TP=1 \
HF_MODEL_CKPT=<pretrained_model_name_or_path> \
MLM_MODEL_SAVE=/tmp/Llama-3.2-1B-Instruct_quant \
./quantize.sh meta-llama/Llama-3.2-1B-Instruct nvfp4
```
```sh
PP=1 \
HF_MODEL_CKPT=<pretrained_model_name_or_path> \
MLM_MODEL_CKPT=/tmp/Llama-3.2-1B-Instruct_quant \
EXPORT_DIR=/tmp/Llama-3.2-1B-Instruct_export \
./export.sh meta-llama/Llama-3.2-1B-Instruct
```

Online EAGLE3 training keeps both the target (frozen) and draft models in memory; the `hidden_states` required for training are generated on the fly. Periodically, the acceptance length (AL, the higher the better) is evaluated on MT-Bench prompts. Use the same `export.sh` script to export the EAGLE3 checkpoint for deployment.
```sh
TP=1 \
HF_MODEL_CKPT=<pretrained_model_name_or_path> \
MLM_MODEL_SAVE=/tmp/Llama-3.2-1B-Eagle3 \
./eagle3.sh meta-llama/Llama-3.2-1B-Instruct
```
```sh
PP=1 \
HF_MODEL_CKPT=<pretrained_model_name_or_path> \
MLM_MODEL_CKPT=/tmp/Llama-3.2-1B-Eagle3 \
EXPORT_DIR=/tmp/Llama-3.2-1B-Eagle3-Export \
./export.sh meta-llama/Llama-3.2-1B-Instruct
```

See Advanced Topics for a moonshotai/Kimi-K2-Instruct EAGLE3 training example using Slurm.
Check out the pruning getting-started section and the guidelines for configuring pruning parameters in the ModelOpt pruning README.
Pruning is supported for GPT and Mamba models in Pipeline Parallel mode. Available pruning dimensions are:
- `TARGET_FFN_HIDDEN_SIZE`
- `TARGET_HIDDEN_SIZE`
- `TARGET_NUM_ATTENTION_HEADS`
- `TARGET_NUM_QUERY_GROUPS`
- `TARGET_MAMBA_NUM_HEADS`
- `TARGET_MAMBA_HEAD_DIM`
- `TARGET_NUM_MOE_EXPERTS`
- `TARGET_MOE_FFN_HIDDEN_SIZE`
- `TARGET_MOE_SHARED_EXPERT_INTERMEDIATE_SIZE`
- `TARGET_NUM_LAYERS`
- `LAYERS_TO_DROP` (comma-separated, 1-indexed list of layer numbers to directly drop)
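A width-pruning invocation would follow the same pattern as the depth-pruning example below, setting one or more of the `TARGET_*` variables instead of `TARGET_NUM_LAYERS`. The target values here are illustrative only; consult the ModelOpt pruning README for guidance on choosing them.

```sh
# Hypothetical width-pruning sketch: target values are illustrative,
# not recommended settings for Qwen3-8B.
PP=1 \
TARGET_FFN_HIDDEN_SIZE=9216 \
TARGET_NUM_ATTENTION_HEADS=24 \
HF_MODEL_CKPT=<pretrained_model_name_or_path> \
MLM_MODEL_SAVE=Qwen3-8B-Width-Pruned \
./prune.sh Qwen/Qwen3-8B
```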
Example for depth pruning Qwen3-8B from 36 to 24 layers:

```sh
PP=1 \
TARGET_NUM_LAYERS=24 \
HF_MODEL_CKPT=<pretrained_model_name_or_path> \
MLM_MODEL_SAVE=Qwen3-8B-Pruned \
./prune.sh Qwen/Qwen3-8B
```

Tip
If the number of layers in the model is not divisible by the pipeline parallel size (PP), you can configure uneven PP by setting `MLM_EXTRA_ARGS="--decoder-first-pipeline-num-layers <X> --decoder-last-pipeline-num-layers <Y>"`.
Tip
You can reuse pruning scores when pruning the same model again to a different architecture by setting `PRUNE_ARGS="--pruning-scores-path <path_to_save_scores>"`.
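The two-pass flow might look like the following sketch; the scores path and the second target depth are illustrative, and it assumes the scores file written by the first run is picked up by the second.

```sh
# First pass: prune to 24 layers and save pruning scores (path is illustrative).
PP=1 TARGET_NUM_LAYERS=24 \
HF_MODEL_CKPT=<pretrained_model_name_or_path> \
MLM_MODEL_SAVE=Qwen3-8B-24L \
PRUNE_ARGS="--pruning-scores-path /tmp/qwen3-8b-scores" \
./prune.sh Qwen/Qwen3-8B

# Second pass: prune to a different depth, reusing the saved scores.
PP=1 TARGET_NUM_LAYERS=28 \
HF_MODEL_CKPT=<pretrained_model_name_or_path> \
MLM_MODEL_SAVE=Qwen3-8B-28L \
PRUNE_ARGS="--pruning-scores-path /tmp/qwen3-8b-scores" \
./prune.sh Qwen/Qwen3-8B
```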
Note
When loading a pruned Megatron-LM checkpoint for subsequent steps, make sure to overwrite the pruned parameters in the default `conf/` by setting `MLM_EXTRA_ARGS`. E.g., for loading the above pruned Qwen3-8B checkpoint for MMLU, set `MLM_EXTRA_ARGS="--num-layers 24"`.
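Putting the note together with the evaluation script used later in this README, evaluating the pruned checkpoint might look like this sketch (checkpoint path matches the pruning example above; verify the exact flags against your config):

```sh
# Evaluate the 24-layer pruned Qwen3-8B checkpoint on MMLU,
# overriding the default 36-layer config from conf/.
TP=1 \
MLM_MODEL_CKPT=Qwen3-8B-Pruned \
MLM_EXTRA_ARGS="--num-layers 24" \
./mmlu.sh Qwen/Qwen3-8B
```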
The saved Megatron-LM distributed checkpoint (the output of the scripts above) can be resumed for inference (generate or evaluate) or training (SFT or PEFT). To read more about these features, see Advanced Topics.
```sh
TP=1 \
MLM_MODEL_CKPT=/tmp/Llama-3.2-1B-Instruct_quant \
./generate.sh meta-llama/Llama-3.2-1B-Instruct
```
```sh
TP=1 \
MLM_MODEL_CKPT=/tmp/Llama-3.2-1B-Instruct_quant \
./mmlu.sh meta-llama/Llama-3.2-1B-Instruct
```
```sh
TP=1 \
MLM_MODEL_CKPT=/tmp/Llama-3.2-1B-Instruct_quant \
./finetune.sh meta-llama/Llama-3.2-1B-Instruct
```

TBD