An efficient tool for exporting large language models (LLMs) to ONNX and MNN formats, with support for quantization optimization and multimodal models.
- 🚀 Dynamic Shape Support: reworked the original export code to support dynamic input shapes
- 🚀 Model Optimization: reduces the constant parts of the graph for better inference performance
- 🚀 Automatic Optimization: Integrated OnnxSlim for ONNX model optimization, ~5% performance improvement (Thanks @inisis)
- 🚀 LoRA Support: Support for LoRA weight merging/splitting export
- 🚀 Quantization Methods: Support for AWQ, GPTQ, HQQ, and other quantization methods
- 🚀 Multimodal Support: Support for text, image, audio, and other multimodal models
- 🚀 Inference Frameworks: Provides MNN and ONNX inference code
Install:

```bash
# Install from PyPI (Recommended)
pip install llmexport

# Install latest version from GitHub
pip install git+https://github.com/wangzhaode/llm-export@master

# Local development installation
git clone https://github.com/wangzhaode/llm-export
cd llm-export
pip install -e .
```

Download a model:

```bash
# Using Hugging Face CLI
huggingface-cli download Qwen/Qwen2.5-1.5B-Instruct --local-dir Qwen2.5-1.5B-Instruct

# Or using ModelScope (Recommended for users in China)
modelscope download Qwen/Qwen2.5-1.5B-Instruct --local_dir Qwen2.5-1.5B-Instruct
```

Test the model:

```bash
# Text conversation testing
llmexport --path Qwen2.5-1.5B-Instruct --test "Hello, please introduce yourself"

# Multimodal testing (Image + Text)
llmexport --path Qwen2-VL-2B-Instruct --test "<img>image_url</img>Describe this image"
```

Export the model:

```bash
# Export to ONNX format
llmexport --path Qwen2.5-1.5B-Instruct --export onnx

# Export to MNN format (Default 4bit quantization)
llmexport --path Qwen2.5-1.5B-Instruct --export mnn

# Custom quantization parameters
llmexport --path Qwen2.5-1.5B-Instruct --export mnn --quant_bit 8 --quant_block 128
```

Export options:

- ONNX Export: use `--export onnx` to export to ONNX format
- MNN Export: use `--export mnn` to export to MNN format
- Model Optimization: OnnxSlim optimization is enabled by default; use `--onnx_slim` to enable it explicitly

Quantization options:

- Quantization Bits: `--quant_bit 4/8` (default 4bit)
- Quantization Block Size: `--quant_block 64/128` (default 64)
- LM Head Quantization: `--lm_quant_bit` sets the output layer's quantization separately
- Symmetric Quantization: `--sym` enables symmetric quantization (no zero point)
- AWQ Quantization: `--awq` enables AWQ quantization
- HQQ Quantization: `--hqq` enables HQQ quantization
- GPTQ Quantization: `--gptq_path` loads a GPTQ-quantized model
- Smooth Quantization: `--smooth` enables Smooth quantization

LoRA options:

- LoRA Merging: `--lora_path` specifies the LoRA weight path
- LoRA Splitting: `--lora_split` exports LoRA weights separately

Multimodal options:

- Visual Quantization: `--visual_quant_bit`, `--visual_quant_block` set visual-module quantization
- Visual Symmetric: `--visual_sym` enables symmetric quantization for the visual module

Other options:

- Verbose Output: `--verbose` shows detailed logs
- Performance Evaluation: `--ppl` gets logits for all tokens
- Custom Output: `--dst_path` specifies the output directory (default `./model`)
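These options can be combined in a single invocation. For example, a hypothetical command that merges LoRA weights and exports an 8-bit, symmetric MNN model (all flags are documented above; the LoRA path here is a placeholder):

```shell
llmexport --path Qwen2.5-1.5B-Instruct \
  --lora_path ./my-lora-adapter \
  --export mnn --quant_bit 8 --quant_block 128 --sym
```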
| Parameter | Type | Description |
|---|---|---|
| `--path` | Required | Model path; supports a local directory or a Hugging Face model ID |
| `--export` | Optional | Export format: `onnx` or `mnn` |
| `--test` | Optional | Test query string |
| `--dst_path` | Optional | Output directory (default `./model`) |
| `--verbose` | Flag | Show detailed logs |
| Parameter | Default | Description |
|---|---|---|
| `--quant_bit` | 4 | Quantization bits (4 or 8) |
| `--quant_block` | 64 | Quantization block size (0 means channel-wise) |
| `--lm_quant_bit` | Same as `quant_bit` | LM Head layer quantization bits |
| `--visual_quant_bit` | Model dependent | Visual module quantization bits |
| `--visual_quant_block` | Model dependent | Visual module quantization block size |
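To illustrate what `--quant_bit`, `--quant_block`, and `--sym` control, here is a minimal pure-Python sketch of block-wise weight quantization. This is an illustration of the general technique, not MNN's actual implementation; only the bit width, block size, and symmetric/asymmetric choice correspond to the tool's parameters.

```python
def quantize_block(block, bits=4, sym=False):
    """Quantize one block of floats to `bits`-bit integers.

    Asymmetric mode (default) stores a scale and a zero point per block;
    symmetric mode (cf. --sym) stores only a scale, zero point fixed at 0.
    """
    if sym:
        qmax = (1 << (bits - 1)) - 1                   # 7 for 4-bit
        scale = (max(abs(v) for v in block) / qmax) or 1.0
        q = [max(-qmax, min(qmax, round(v / scale))) for v in block]
        return q, scale, 0
    qmax = (1 << bits) - 1                             # 15 for 4-bit
    lo, hi = min(block), max(block)
    scale = ((hi - lo) / qmax) or 1.0
    zero = round(-lo / scale)
    q = [max(0, min(qmax, round(v / scale) + zero)) for v in block]
    return q, scale, zero


def quantize(weights, bits=4, block_size=64, sym=False):
    """Split `weights` into blocks of `block_size` (cf. --quant_block) and
    quantize each block with its own scale: smaller blocks track the local
    value range more closely, at the cost of more per-block metadata."""
    return [quantize_block(weights[i:i + block_size], bits, sym)
            for i in range(0, len(weights), block_size)]


def dequantize(blocks):
    """Recover approximate float weights from quantized blocks."""
    out = []
    for q, scale, zero in blocks:
        out.extend((v - zero) * scale for v in q)
    return out
```

The trade-off the table describes falls out directly: more bits or smaller blocks give smaller reconstruction error, while symmetric mode saves the zero point at some accuracy cost for skewed value ranges.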
| Parameter | Description |
|---|---|
| `--awq` | Enable AWQ quantization |
| `--hqq` | Enable HQQ quantization |
| `--smooth` | Enable Smooth quantization |
| `--sym` | Enable symmetric quantization (no zero point) |
| `--visual_sym` | Symmetric quantization for the visual module |
| Parameter | Description |
|---|---|
| `--lora_path` | LoRA weight path |
| `--lora_split` | Export LoRA weights separately |
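Conceptually, merging a LoRA adapter (cf. `--lora_path`) folds the low-rank update into the base weight as W' = W + (alpha / r) * (B @ A), so the exported model needs no extra adapter computation at inference time; `--lora_split` instead keeps the adapter weights separate. A minimal sketch with plain Python lists — names and shapes are illustrative, not llm-export internals:

```python
def matmul(a, b):
    """Multiply an (m x k) list-of-lists matrix by a (k x n) one."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def merge_lora(W, A, B, alpha, r):
    """Fold a LoRA adapter into the base weight: W' = W + (alpha / r) * (B @ A).

    W: (out x in) base weight; B: (out x r); A: (r x in);
    alpha / r is the standard LoRA scaling factor.
    """
    delta = matmul(B, A)
    s = alpha / r
    return [[W[i][j] + s * delta[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]
```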
| Parameter | Description |
|---|---|
| `--tokenizer_path` | Tokenizer path (defaults to `--path`) |
| `--gptq_path` | GPTQ quantized model path |
| `--mnnconvert` | Local MNNConvert path |
| `--onnx_slim` | Enable OnnxSlim optimization |
| `--ppl` | Get logits for all tokens |
| `--seperate_embed` | Keep the embedding layer separate to avoid quantizing it |
| `--calib_data` | Calibration data path |
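The `--ppl` flag exposes logits for every input token, which is exactly what an offline perplexity evaluation needs. A minimal sketch of that computation in pure Python (the list-of-lists logits layout here is an assumption for illustration, not llm-export's actual output format):

```python
import math


def perplexity(logits, token_ids):
    """Compute perplexity from per-position logits.

    logits[i] is the score vector predicting token_ids[i];
    perplexity = exp(-mean log p(correct token)).
    """
    total = 0.0
    for scores, tok in zip(logits, token_ids):
        m = max(scores)                                  # stabilize softmax
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += scores[tok] - log_z                     # log p(tok)
    return math.exp(-total / len(token_ids))
```

Lower perplexity on a held-out text means the (possibly quantized) model assigns higher probability to the true continuation, so this is a quick way to sanity-check quantization settings.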
Full command-line help:

```
usage: llmexport.py [-h] --path PATH [--type TYPE] [--tokenizer_path TOKENIZER_PATH] [--lora_path LORA_PATH] [--gptq_path GPTQ_PATH] [--dst_path DST_PATH]
                    [--verbose] [--test TEST] [--export EXPORT] [--onnx_slim] [--quant_bit QUANT_BIT] [--quant_block QUANT_BLOCK] [--lm_quant_bit LM_QUANT_BIT]
                    [--mnnconvert MNNCONVERT] [--ppl] [--awq] [--sym] [--tie_embed] [--lora_split]

llm_exporter

options:
  -h, --help            show this help message and exit
  --path PATH           path(`str` or `os.PathLike`):
                        Can be either:
                        - A string, the *model id* of a pretrained model like `THUDM/chatglm-6b`. [TODO]
                        - A path to a *directory* clone from repo like `../chatglm-6b`.
  --type TYPE           type(`str`, *optional*):
                        The pretrain llm model type.
  --tokenizer_path TOKENIZER_PATH
                        tokenizer path, defaut is `None` mean using `--path` value.
  --lora_path LORA_PATH
                        lora path, defaut is `None` mean not apply lora.
  --gptq_path GPTQ_PATH
                        gptq path, defaut is `None` mean not apply gptq.
  --dst_path DST_PATH   export onnx/mnn model to path, defaut is `./model`.
  --verbose             Whether or not to print verbose.
  --test TEST           test model inference with query `TEST`.
  --export EXPORT       export model to an onnx/mnn model.
  --onnx_slim           Whether or not to use onnx-slim.
  --quant_bit QUANT_BIT
                        mnn quant bit, 4 or 8, default is 4.
  --quant_block QUANT_BLOCK
                        mnn quant block, default is 0 mean channle-wise.
  --lm_quant_bit LM_QUANT_BIT
                        mnn lm_head quant bit, 4 or 8, default is `quant_bit`.
  --mnnconvert MNNCONVERT
                        local mnnconvert path, if invalid, using pymnn.
  --ppl                 Whether or not to get all logits of input tokens.
  --awq                 Whether or not to use awq quant.
  --sym                 Whether or not to using symmetric quant (without zeropoint), defualt is False.
  --tie_embed           Whether or not to using tie_embedding, defualt is False.
  --lora_split          Whether or not export lora split, defualt is False.
```
Currently supports the following model types:
- Qwen Series: Qwen2.5, Qwen2, Qwen1.5, Qwen-VL, etc.
- LLaMA Series: Llama-3.2, Llama-3, Llama-2, etc.
- ChatGLM Series: ChatGLM4, ChatGLM3, ChatGLM2, etc.
- Baichuan Series: Baichuan2-7B-Chat, etc.
- Yi Series: Yi-6B-Chat, etc.
- Others: InternLM, DeepSeek, Phi, Gemma, TinyLlama, etc.
- Vision Models: Qwen2-VL, Qwen2.5-VL, Llama-3.2-Vision, InternVL, etc.
- Audio Models: Qwen2-Audio, Qwen2.5-Omni, etc.
- Text Embedding: bge-large-zh, gte-multilingual, etc.
We provide optimized model downloads:
- Hugging Face: taobao-mnn
- ModelScope: MNN
Popular models:
| Model | Hugging Face | ModelScope |
|---|---|---|
| DeepSeek-R1-1.5B-Qwen | Q4_1 | Q4_1 |
| Qwen2.5-0.5B-Instruct | Q4_1 | Q4_1 |
| Qwen2.5-1.5B-Instruct | Q4_1 | Q4_1 |
| GPT-OSS-20B | Q4_1 | Q4_1 |
| Qwen3-4B-Instruct-2507 | Q4_1 | Q4_1 |
See the complete list for more models.
- MNN Inference: mnn-llm - LLM inference library for MNN framework
- ONNX Inference: onnx-llm, OnnxLLM - ONNX format inference libraries
- Model Optimization: OnnxSlim - ONNX model optimization tool
This project is licensed under the MIT License.