@@ -10,12 +10,6 @@ AutoDeploy is designed to simplify and accelerate the deployment of PyTorch mode
______________________________________________________________________

- ## Latest News 🔥
-
- - \[2025/02/14\] Initial experimental release of `auto_deploy` backend for TensorRT-LLM
-
- ______________________________________________________________________
-
## Motivation & Approach

Deploying large language models (LLMs) can be challenging, especially when balancing ease of use with high performance. Teams need simple, intuitive deployment solutions that reduce engineering effort, speed up the integration of new models, and support rapid experimentation without compromising performance.
@@ -52,7 +46,7 @@ The general entrypoint to run the auto-deploy demo is the `build_and_run_ad.py`
```bash
cd examples/auto_deploy
- python build_and_run_ad.py --config '{"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0"}'
+ python build_and_run_ad.py --model "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
```

______________________________________________________________________
@@ -74,7 +68,7 @@ Additionally, we have officially verified support for the following models:
| Model Series | HF Model Card | Model Factory | Precision | World Size | Runtime | Compile Backend ||| Attention Backend |||
| --------------| ----------------------| ----------------| -----------| ------------| ---------| -----------------| --------------------| --------------------| --------------------| ----------| ----------|
- | | | | | | | torch-simple | torch-compile | torch-opt | TritonWithFlattenedInputs | FlashInfer | MultiHeadLatentAttention |
+ | | | | | | | torch-simple | torch-compile | torch-opt | triton | flashinfer | MultiHeadLatentAttention |
| LLaMA | meta-llama/Llama-2-7b-chat-hf<br>meta-llama/Meta-Llama-3.1-8B-Instruct<br>meta-llama/Llama-3.1-70B-Instruct<br>codellama/CodeLlama-13b-Instruct-hf | AutoModelForCausalLM | BF16 | 1,2,4 | demollm, trtllm | ✅ | ✅ | ✅ | ✅ | ✅ | n/a |
| LLaMA-4 | meta-llama/Llama-4-Scout-17B-16E-Instruct<br>meta-llama/Llama-4-Maverick-17B-128E-Instruct | AutoModelForImageTextToText | BF16 | 1,2,4,8 | demollm, trtllm | ✅ | ✅ | ❌ | ✅ | ✅ | n/a |
| Nvidia Minitron | nvidia/Llama-3_1-Nemotron-51B-Instruct<br>nvidia/Llama-3.1-Minitron-4B-Width-Base<br>nvidia/Llama-3.1-Minitron-4B-Depth-Base | AutoModelForCausalLM | BF16 | 1,2,4 | demollm, trtllm | ✅ | ✅ | ✅ | ✅ | ✅ | n/a |
@@ -114,8 +108,8 @@ Optimize attention operations using different attention kernel implementations:
| `"attn_backend"` | Description |
| ----------------------| -------------|
- | `TritonWithFlattenedInputs` | Custom fused multi-head attention (MHA) with KV Cache kernels for efficient attention processing. |
- | `FlashInfer` | Uses off-the-shelf optimized attention kernels with KV Cache from the [`flashinfer`](https://github.com/flashinfer-ai/flashinfer.git) library. |
+ | `triton` | Custom fused multi-head attention (MHA) with KV Cache kernels for efficient attention processing. |
+ | `flashinfer` | Uses off-the-shelf optimized attention kernels with KV Cache from the [`flashinfer`](https://github.com/flashinfer-ai/flashinfer.git) library. |
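
For example, the attention backend for the quick-start model can be selected with the run script's `--args.attn_backend` option (documented under Advanced Usage below); this is only a usage sketch:

```bash
cd examples/auto_deploy
# Sketch: run the quick-start model with the flashinfer attention backend.
python build_and_run_ad.py \
  --model "TinyLlama/TinyLlama-1.1B-Chat-v1.0" \
  --args.attn_backend "flashinfer"
```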

### Precision Support

@@ -128,58 +122,56 @@ ______________________________________________________________________
## Advanced Usage

- ### Example Build Script ([`build_and_run_ad.py`](./build_and_run_ad.py))
-
- #### Base Command
+ ### Example Run Script ([`build_and_run_ad.py`](./build_and_run_ad.py))

- To build and run AutoDeploy example, use the following command with the [`build_and_run_ad.py`](./build_and_run_ad.py) script:
-
- In the below example:
-
- | Configuration Key | Description |
- | -------------------| -------------|
- | `"model"` | The HF model card or path to a HF checkpoint folder |
- | `"model_factory"` | Choose model factory implementation (`"hf"` or `"llama4"`) |
- | `"skip_loading_weights"` | Only load the architecture, not the weights |
- | `"customize_tokenizer"` | Use tokenizer from model factory (true) or from LLM API (false) |
- | `"model_kwargs"` | Extra kwargs for the model config class to customize the model config |
- | `"tokenizer_kwargs"` | Extra kwargs for the tokenizer class to customize the tokenizer |
- | `"world_size"` | The number of GPUs for Tensor Parallel |
- | `"runtime"` | Specifies which type of Engine to use during runtime |
- | `"compile_backend"` | Specifies how to compile the graph at the end |
- | `"attn_backend"` | Specifies kernel implementation for attention |
- | `"mla_backend"` | Specifies implementation for multi-head latent attention |
- | `"max_seq_len"` | Maximum sequence length for inference/cache |
- | `"max_batch_size"` | Maximum dimension for statically allocated KV cache |
- | `"attn_page_size"` | Page size for attention |
- | `"benchmark"` | Indicates whether to run the built-in benchmark for token generation |
-
- For default values and additional configuration options, refer to the [simple_config.py](./simple_config.py) file.
+ To build and run the AutoDeploy example, use the [`build_and_run_ad.py`](./build_and_run_ad.py) script:
```bash
cd examples/auto_deploy
- python build_and_run_ad.py \
-     --config '{"model": {HF_modelcard_or_path_to_local_folder}, "world_size": {num_GPUs}, "runtime": {"demollm"|"trtllm"}, "compile_backend": {"torch-simple"|"torch-opt"}, "attn_backend": {"TritonWithFlattenedInputs"|"FlashInfer"}, "benchmark": {true|false}}'
+ python build_and_run_ad.py --model "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
```

- #### Experiment Configuration
+ You can configure your experiment as needed. Use the `-h/--help` flag to see the available options:
+
+ ```bash
+ python build_and_run_ad.py --help
+ ```
- The experiment configuration `dataclass` is defined in
- [simple_config.py](./simple_config.py). Check it out for detailed documentation on each
- available configuration.
+ Below is a non-exhaustive list of common config options:

- Arguments can be overwritten during runtime by specifying the `--config` argument on the command
- line and providing a valid config dictionary in `json` format. For example, to run any experiment
- with benchmarking enabled, use:
+ | Configuration Key | Description |
+ | -------------------| -------------|
+ | `--model` | The HF model card or path to a HF checkpoint folder |
+ | `--args.model_factory` | Choose the model factory implementation (`"AutoModelForCausalLM"`, ...) |
+ | `--args.skip_loading_weights` | Only load the architecture, not the weights |
+ | `--args.model_kwargs` | Extra kwargs passed to the model initializer in the model factory (see the sketch after this table) |
+ | `--args.tokenizer_kwargs` | Extra kwargs passed to the tokenizer initializer in the model factory |
+ | `--args.world_size` | The number of GPUs for Tensor Parallel |
+ | `--args.runtime` | Specifies which type of Engine to use during runtime (`"demollm"` or `"trtllm"`) |
+ | `--args.compile_backend` | Specifies how to compile the graph at the end |
+ | `--args.attn_backend` | Specifies the kernel implementation for attention |
+ | `--args.mla_backend` | Specifies the implementation for multi-head latent attention |
+ | `--args.max_seq_len` | Maximum sequence length for inference/cache |
+ | `--args.max_batch_size` | Maximum dimension for statically allocated KV cache |
+ | `--args.attn_page_size` | Page size for attention |
+ | `--prompt.batch_size` | Number of queries to generate |
+ | `--benchmark.enabled` | Whether to run the built-in benchmark (true/false) |
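
The dict-valued options such as `--args.model_kwargs` and `--args.tokenizer_kwargs` are forwarded to the model factory. Below is a minimal usage sketch that combines them with the quick-start model; the JSON-style value syntax is an assumption (check `--help` for the exact format your version accepts):

```bash
cd examples/auto_deploy
# Sketch only: smoke-test the pipeline on a truncated 2-layer model without loading real weights.
# Flag names come from the table above; the JSON value format is assumed, not verified.
python build_and_run_ad.py \
  --model "TinyLlama/TinyLlama-1.1B-Chat-v1.0" \
  --args.model_kwargs '{"num_hidden_layers": 2}' \
  --args.skip_loading_weights True
```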
+
+ For default values and additional configuration options, refer to the `ExperimentConfig` class in the [build_and_run_ad.py](./build_and_run_ad.py) file.
+
+ Here is a more complete example of using the script:
```bash
cd examples/auto_deploy
- python build_and_run_ad.py --config '{"benchmark": true}'
+ python build_and_run_ad.py \
+     --model "TinyLlama/TinyLlama-1.1B-Chat-v1.0" \
+     --args.world_size 2 \
+     --args.runtime "demollm" \
+     --args.compile_backend "torch-compile" \
+     --args.attn_backend "flashinfer" \
+     --benchmark.enabled True
```

- The `model_kwargs` and `tokenizer_kwargs` dictionaries can be supplied on the command line via
- `--model-kwargs '{}'` and `--tokenizer-kwargs '{}'`.
-
#### Logging Level

Use the following env variable to specify the logging level of our built-in logger ordered by
@@ -222,7 +214,7 @@ Refer to [NVIDIA TensorRT Model Optimizer](https://github.com/NVIDIA/TensorRT-Mo
```bash
cd examples/auto_deploy
- python build_and_run_ad.py --config '{"world_size": 1, "model": "{<MODELOPT_CKPT_PATH>}"}'
+ python build_and_run_ad.py --model "<MODELOPT_CKPT_PATH>" --args.world_size 1
```

### Incorporating `auto_deploy` into your own workflow
@@ -235,18 +227,16 @@ Here is an example of how you can build an LLM object with AutoDeploy integratio
<summary>Click to expand the example</summary>

```
- from tensorrt_llm import LLM, TorchCompileConfig
+ from tensorrt_llm._torch.auto_deploy import LLM


# Construct the LLM high-level interface object with autodeploy as backend
llm = LLM(
    model=<HF_MODEL_CARD_OR_DIR>,
-   backend="_autodeploy",
-   tensor_parallel_size=<NUM_WORLD_RANK>,
-   use_cuda_graph=True, # set True if using "torch-opt" as compile backend
-   torch_compile_config=TorchCompileConfig(), # set this if using "torch-opt" as compile backend
-   model_kwargs={"use_cache": False}, # AutoDeploy uses its own cache implementation
-   attn_backend="TritonWithFlattenedInputs", # choose between "TritonWithFlattenedInputs" and "FlashInfer"
+   world_size=<NUM_WORLD_RANK>,
+   compile_backend="torch-compile",
+   model_kwargs={"num_hidden_layers": 2}, # test with smaller model configuration
+   attn_backend="flashinfer", # choose between "triton" and "flashinfer"
    attn_page_size=64, # page size for attention (tokens_per_block, should be == max_seq_len for triton)
    skip_loading_weights=False,
    model_factory="AutoModelForCausalLM", # choose appropriate model factory