# TensorRT Model Optimizer

TensorRT Model Optimizer (ModelOpt, `nvidia-modelopt`) provides end-to-end model optimization for NVIDIA hardware, including quantization (real or simulated), knowledge distillation, pruning, speculative decoding, and more.
- Starts from a Hugging Face pretrained model checkpoint with on-the-fly conversion to the Megatron-LM checkpoint format.
- Supports all kinds of model parallelism (TP, EP, ETP, PP).
- Exports a unified checkpoint ready for TensorRT-LLM, vLLM, and SGLang.
| Model (conf/) | Quantization | EAGLE3 | Pruning (PP only) | Distillation |
|---|---|---|---|---|
| deepseek-ai/DeepSeek-R1 | ✅ | ✅ | - | - |
| meta-llama/Llama-{3.1-8B, 3.1-405B, 3.2-1B}-Instruct | ✅ | ✅ | ✅ | ✅ |
| meta-llama/Llama-4-{Scout,Maverick}-17B-{16,128}E-Instruct | ✅ | ✅ | - | - |
| moonshotai/Kimi-K2-Instruct | ✅ | ✅ | - | - |
| nvidia/NVIDIA-Nemotron-Nano-9B-v2 | ✅ | - | ✅ | ✅ |
| openai/gpt-oss-{20b, 120b} | ✅ | Online | ✅ | ✅ |
| Qwen/Qwen3-{0.6B, 8B} | ✅ | ✅ | ✅ | ✅ |
| Qwen/Qwen3-{30B-A3B, 235B-A22B} | WAR | ✅ | ✅ | ✅ |
Install `nvidia-modelopt` from PyPI:

```sh
pip install -U nvidia-modelopt
```

Alternatively, you can install from source to try our latest features.
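A source install might look like the following sketch. It assumes the upstream repository is `NVIDIA/TensorRT-Model-Optimizer` on GitHub and that an editable install is acceptable; adjust to the project's actual install instructions if they differ.

```sh
# Sketch of a source install (repository name is an assumption).
git clone https://github.com/NVIDIA/TensorRT-Model-Optimizer.git
cd TensorRT-Model-Optimizer
# Editable install so local changes take effect without reinstalling.
pip install -e .
```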
❗ IMPORTANT: The first positional argument (e.g. `meta-llama/Llama-3.2-1B-Instruct`) of each script is the config name used to match the supported model config in `conf/`. The pretrained HF checkpoint should be downloaded and provided through `${HF_MODEL_CKPT}`.
Provide the pretrained checkpoint path through the variable `${HF_MODEL_CKPT}`, and provide the variable `${MLM_MODEL_SAVE}`, which stores a resumable Megatron-LM distributed checkpoint. To export a Hugging Face-like quantized checkpoint for TensorRT-LLM, vLLM, or SGLang deployment, provide `${EXPORT_DIR}` to `export.sh`.
📙 NOTE: ModelOpt supports different quantization formats. By default, we simulate the low-precision numerical behavior (fake quantization), which can be run on GPUs with compute capability 8.0 or above. Real low-precision parameters (e.g. `E4M3` or `E2M1`) and low-precision compute (e.g. `FP8Linear`) are also supported, depending on GPU compute capability. See Advanced Topics for details.
```sh
TP=1 \
HF_MODEL_CKPT=<pretrained_model_name_or_path> \
MLM_MODEL_SAVE=/tmp/Llama-3.2-1B-Instruct_quant \
./quantize.sh meta-llama/Llama-3.2-1B-Instruct nvfp4
```
```sh
PP=1 \
HF_MODEL_CKPT=<pretrained_model_name_or_path> \
MLM_MODEL_CKPT=/tmp/Llama-3.2-1B-Instruct_quant \
EXPORT_DIR=/tmp/Llama-3.2-1B-Instruct_export \
./export.sh meta-llama/Llama-3.2-1B-Instruct
```

Online EAGLE3 training keeps both the target (frozen) and draft models in memory; the `hidden_states` required for training are generated on the fly. Periodically, the acceptance length (AL, the higher the better) is evaluated on MT-Bench prompts. Use the same `export.sh` script to export the EAGLE3 checkpoint for deployment.
```sh
TP=1 \
HF_MODEL_CKPT=<pretrained_model_name_or_path> \
MLM_MODEL_SAVE=/tmp/Llama-3.2-1B-Eagle3 \
./eagle3.sh meta-llama/Llama-3.2-1B-Instruct
```
```sh
PP=1 \
HF_MODEL_CKPT=<pretrained_model_name_or_path> \
MLM_MODEL_CKPT=/tmp/Llama-3.2-1B-Eagle3 \
EXPORT_DIR=/tmp/Llama-3.2-1B-Eagle3-Export \
./export.sh meta-llama/Llama-3.2-1B-Instruct
```

See Advanced Topics for a moonshotai/Kimi-K2-Instruct EAGLE3 training example using Slurm.
Check out the pruning getting-started section and the guidelines for configuring pruning parameters in the ModelOpt pruning README.
Pruning is supported for GPT and Mamba models in Pipeline Parallel mode. Available pruning dimensions are:
- `TARGET_FFN_HIDDEN_SIZE`
- `TARGET_HIDDEN_SIZE`
- `TARGET_NUM_ATTENTION_HEADS`
- `TARGET_NUM_QUERY_GROUPS`
- `TARGET_MAMBA_NUM_HEADS`
- `TARGET_MAMBA_HEAD_DIM`
- `TARGET_NUM_MOE_EXPERTS`
- `TARGET_MOE_FFN_HIDDEN_SIZE`
- `TARGET_MOE_SHARED_EXPERT_INTERMEDIATE_SIZE`
- `TARGET_NUM_LAYERS`
- `LAYERS_TO_DROP` (comma-separated, 1-indexed list of layer numbers to directly drop)
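A width-pruning invocation would follow the same pattern as the depth-pruning example below, setting one or more of the `TARGET_*` variables instead of `TARGET_NUM_LAYERS`. The target values here are illustrative only; consult the ModelOpt pruning README for guidance on choosing them.

```sh
# Hypothetical width-pruning sketch: target values are illustrative,
# not recommended settings for Qwen3-8B.
PP=1 \
TARGET_FFN_HIDDEN_SIZE=9216 \
TARGET_NUM_ATTENTION_HEADS=24 \
HF_MODEL_CKPT=<pretrained_model_name_or_path> \
MLM_MODEL_SAVE=Qwen3-8B-Width-Pruned \
./prune.sh Qwen/Qwen3-8B
```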
Example for depth pruning Qwen3-8B from 36 to 24 layers:

```sh
PP=1 \
TARGET_NUM_LAYERS=24 \
HF_MODEL_CKPT=<pretrained_model_name_or_path> \
MLM_MODEL_SAVE=Qwen3-8B-Pruned \
./prune.sh Qwen/Qwen3-8B
```

Tip
If the number of layers in the model is not divisible by the pipeline parallel size (PP), you can configure uneven PP by setting `MLM_EXTRA_ARGS="--decoder-first-pipeline-num-layers <X> --decoder-last-pipeline-num-layers <Y>"`.
Tip
You can reuse pruning scores when pruning the same model again to a different architecture by setting `PRUNE_ARGS="--pruning-scores-path <path_to_save_scores>"`.
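The two-pass flow might look like the following sketch; the scores path and the second target depth are illustrative, and it assumes the scores file written by the first run is picked up by the second.

```sh
# First pass: prune to 24 layers and save pruning scores (path is illustrative).
PP=1 TARGET_NUM_LAYERS=24 \
HF_MODEL_CKPT=<pretrained_model_name_or_path> \
MLM_MODEL_SAVE=Qwen3-8B-24L \
PRUNE_ARGS="--pruning-scores-path /tmp/qwen3-8b-scores" \
./prune.sh Qwen/Qwen3-8B

# Second pass: prune to a different depth, reusing the saved scores.
PP=1 TARGET_NUM_LAYERS=28 \
HF_MODEL_CKPT=<pretrained_model_name_or_path> \
MLM_MODEL_SAVE=Qwen3-8B-28L \
PRUNE_ARGS="--pruning-scores-path /tmp/qwen3-8b-scores" \
./prune.sh Qwen/Qwen3-8B
```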
Note
When loading a pruned Megatron-LM checkpoint for subsequent steps, make sure to overwrite the pruned parameters in the default `conf/` by setting `MLM_EXTRA_ARGS`. E.g., for loading the above pruned Qwen3-8B checkpoint for MMLU, set `MLM_EXTRA_ARGS="--num-layers 24"`.
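Putting the note together with the evaluation script used later in this README, evaluating the pruned checkpoint might look like this sketch (checkpoint path matches the pruning example above; verify the exact flags against your config):

```sh
# Evaluate the 24-layer pruned Qwen3-8B checkpoint on MMLU,
# overriding the default 36-layer config from conf/.
TP=1 \
MLM_MODEL_CKPT=Qwen3-8B-Pruned \
MLM_EXTRA_ARGS="--num-layers 24" \
./mmlu.sh Qwen/Qwen3-8B
```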
The saved Megatron-LM distributed checkpoint (the output of the scripts above) can be resumed for inference (generate or evaluate) or training (SFT or PEFT). To read more about these features, see Advanced Topics.
```sh
TP=1 \
MLM_MODEL_CKPT=/tmp/Llama-3.2-1B-Instruct_quant \
./generate.sh meta-llama/Llama-3.2-1B-Instruct
```
```sh
TP=1 \
MLM_MODEL_CKPT=/tmp/Llama-3.2-1B-Instruct_quant \
./mmlu.sh meta-llama/Llama-3.2-1B-Instruct
```
```sh
TP=1 \
MLM_MODEL_CKPT=/tmp/Llama-3.2-1B-Instruct_quant \
./finetune.sh meta-llama/Llama-3.2-1B-Instruct
```

TBD