
Commit 06a3207

lucaslie authored and dominicshanshan committed
[AutoDeploy] merge feat/ad-2025-06-13 (NVIDIA#5556)
Signed-off-by: Lucas Liebenwein <[email protected]>
1 parent 355a73a commit 06a3207

31 files changed: +774, -944 lines changed


examples/auto_deploy/.vscode/launch.json

Lines changed: 11 additions & 5 deletions
@@ -7,11 +7,17 @@
      "request": "launch",
      "program": "build_and_run_ad.py",
      "args": [
-        "--config",
-        "{\"batch_size\": 2, \"attn_page_size\": 16, \"world_size\": 2, \"compile_backend\": \"torch-simple\", \"attn_backend\": \"FlashInfer\",\"model_factory\": \"AutoModelForCausalLM\", \"model\": \"meta-llama/Meta-Llama-3.1-8B-Instruct\", \"benchmark\": false}",
-        "--model-kwargs",
-        "{}",
-        // "{\"num_hidden_layers\": 3}",
+        "--model=meta-llama/Meta-Llama-3.1-8B-Instruct",
+        "--args.world_size=2",
+        "--args.runtime=demollm",
+        "--args.compile_backend=torch-simple",
+        "--args.attn_page_size=16",
+        "--args.attn_backend=flashinfer",
+        "--args.model_factory=AutoModelForCausalLM",
+        "--benchmark.enabled=false",
+        "--prompt.batch_size=2",
+        "--args.model_kwargs",
+        "num_hidden_layers=3,num_attention_heads=32",
      ],
      "console": "integratedTerminal",
      "justMyCode": false,

examples/auto_deploy/README.md

Lines changed: 47 additions & 57 deletions
@@ -10,12 +10,6 @@ AutoDeploy is designed to simplify and accelerate the deployment of PyTorch mode

______________________________________________________________________

-## Latest News 🔥
-
-- \[2025/02/14\] Initial experimental release of `auto_deploy` backend for TensorRT-LLM
-
-______________________________________________________________________
-
## Motivation & Approach

Deploying large language models (LLMs) can be challenging, especially when balancing ease of use with high performance. Teams need simple, intuitive deployment solutions that reduce engineering effort, speed up the integration of new models, and support rapid experimentation without compromising performance.
@@ -52,7 +46,7 @@ The general entrypoint to run the auto-deploy demo is the `build_and_run_ad.py`

```bash
cd examples/auto_deploy
-python build_and_run_ad.py --config '{"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0"}'
+python build_and_run_ad.py --model "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
```

______________________________________________________________________
@@ -74,7 +68,7 @@ Additionally, we have officially verified support for the following models:

| Model Series | HF Model Card | Model Factory | Precision | World Size | Runtime | Compile Backend ||| Attention Backend |||
|--------------|----------------------|----------------|-----------|------------|---------|-----------------|--------------------|--------------------|--------------------|----------|----------|
-| | | | | | | torch-simple | torch-compile | torch-opt | TritonWithFlattenedInputs | FlashInfer | MultiHeadLatentAttention |
+| | | | | | | torch-simple | torch-compile | torch-opt | triton | flashinfer | MultiHeadLatentAttention |
| LLaMA | meta-llama/Llama-2-7b-chat-hf<br>meta-llama/Meta-Llama-3.1-8B-Instruct<br>meta-llama/Llama-3.1-70B-Instruct<br>codellama/CodeLlama-13b-Instruct-hf | AutoModelForCausalLM | BF16 | 1,2,4 | demollm, trtllm |||||| n/a |
| LLaMA-4 | meta-llama/Llama-4-Scout-17B-16E-Instruct<br>meta-llama/Llama-4-Maverick-17B-128E-Instruct | AutoModelForImageTextToText | BF16 | 1,2,4,8 | demollm, trtllm |||||| n/a |
| Nvidia Minitron | nvidia/Llama-3_1-Nemotron-51B-Instruct<br>nvidia/Llama-3.1-Minitron-4B-Width-Base<br>nvidia/Llama-3.1-Minitron-4B-Depth-Base | AutoModelForCausalLM | BF16 | 1,2,4 | demollm, trtllm |||||| n/a |
@@ -114,8 +108,8 @@ Optimize attention operations using different attention kernel implementations:

| `"attn_backend"` | Description |
|----------------------|-------------|
-| `TritonWithFlattenedInputs` | Custom fused multi-head attention (MHA) with KV Cache kernels for efficient attention processing. |
-| `FlashInfer` | Uses off-the-shelf optimized attention kernels with KV Cache from the [`flashinfer`](https://github.com/flashinfer-ai/flashinfer.git) library. |
+| `triton` | Custom fused multi-head attention (MHA) with KV Cache kernels for efficient attention processing. |
+| `flashinfer` | Uses off-the-shelf optimized attention kernels with KV Cache from the [`flashinfer`](https://github.com/flashinfer-ai/flashinfer.git) library. |

### Precision Support
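The renamed backend values can be selected at run time with the dotted CLI syntax introduced in this commit; a minimal sketch, using the model card from the README examples:

```bash
# Select the attention backend at run time; values per the table above.
python build_and_run_ad.py \
    --model "TinyLlama/TinyLlama-1.1B-Chat-v1.0" \
    --args.attn_backend "triton"   # or "flashinfer"
```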

@@ -128,58 +122,56 @@ ______________________________________________________________________

## Advanced Usage

-### Example Build Script ([`build_and_run_ad.py`](./build_and_run_ad.py))
-
-#### Base Command
+### Example Run Script ([`build_and_run_ad.py`](./build_and_run_ad.py))

-To build and run AutoDeploy example, use the following command with the [`build_and_run_ad.py`](./build_and_run_ad.py) script:
-
-In the below example:
-
-| Configuration Key | Description |
-|-------------------|-------------|
-| `"model"` | The HF model card or path to a HF checkpoint folder |
-| `"model_factory"` | Choose model factory implementation (`"hf"` or `"llama4"`) |
-| `"skip_loading_weights"` | Only load the architecture, not the weights |
-| `"customize_tokenizer"` | Use tokenizer from model factory (true) or from LLM API (false) |
-| `"model_kwargs"` | Extra kwargs for the model config class to customize the model config |
-| `"tokenizer_kwargs"` | Extra kwargs for the tokenizer class to customize the tokenizer |
-| `"world_size"` | The number of GPUs for Tensor Parallel |
-| `"runtime"` | Specifies which type of Engine to use during runtime |
-| `"compile_backend"` | Specifies how to compile the graph at the end |
-| `"attn_backend"` | Specifies kernel implementation for attention |
-| `"mla_backend"` | Specifies implementation for multi-head latent attention |
-| `"max_seq_len"` | Maximum sequence length for inference/cache |
-| `"max_batch_size"` | Maximum dimension for statically allocated KV cache |
-| `"attn_page_size"` | Page size for attention |
-| `"benchmark"` | Indicates whether to run the built-in benchmark for token generation |
-
-For default values and additional configuration options, refer to the [simple_config.py](./simple_config.py) file.
+To build and run the AutoDeploy example, use the [`build_and_run_ad.py`](./build_and_run_ad.py) script:

```bash
cd examples/auto_deploy
-python build_and_run_ad.py \
-    --config '{"model": {HF_modelcard_or_path_to_local_folder}, "world_size": {num_GPUs}, "runtime": {"demollm"|"trtllm"}, "compile_backend": {"torch-simple"|"torch-opt"}, "attn_backend": {"TritonWithFlattenedInputs"|"FlashInfer"}, "benchmark": {true|false} }'
+python build_and_run_ad.py --model "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
```

-#### Experiment Configuration
+You can arbitrarily configure your experiment. Use the `-h/--help` flag to see available options:
+
+```bash
+python build_and_run_ad.py --help
+```

-The experiment configuration `dataclass` is defined in
-[simple_config.py](./simple_config.py). Check it out for detailed documentation on each
-available configuration.
+Below is a non-exhaustive list of common config options:

-Arguments can be overwritten during runtime by specifying the `--config` argument on the command
-line and providing a valid config dictionary in `json` format. For example, to run any experiment
-with benchmarking enabled, use:
+| Configuration Key | Description |
+|-------------------|-------------|
+| `--model` | The HF model card or path to a HF checkpoint folder |
+| `--args.model_factory` | Choose model factory implementation (`"AutoModelForCausalLM"`, ...) |
+| `--args.skip_loading_weights` | Only load the architecture, not the weights |
+| `--args.model_kwargs` | Extra kwargs passed to the model initializer in the model factory |
+| `--args.tokenizer_kwargs` | Extra kwargs passed to the tokenizer initializer in the model factory |
+| `--args.world_size` | The number of GPUs for Tensor Parallel |
+| `--args.runtime` | Specifies which type of Engine to use during runtime (`"demollm"` or `"trtllm"`) |
+| `--args.compile_backend` | Specifies how to compile the graph at the end |
+| `--args.attn_backend` | Specifies the kernel implementation for attention |
+| `--args.mla_backend` | Specifies the implementation for multi-head latent attention |
+| `--args.max_seq_len` | Maximum sequence length for inference/cache |
+| `--args.max_batch_size` | Maximum dimension for statically allocated KV cache |
+| `--args.attn_page_size` | Page size for attention |
+| `--prompt.batch_size` | Number of queries to generate |
+| `--benchmark.enabled` | Whether to run the built-in benchmark (true/false) |
+
+For default values and additional configuration options, refer to the `ExperimentConfig` class in the [build_and_run_ad.py](./build_and_run_ad.py) file.
+
+Here is a more complete example of using the script:

```bash
cd examples/auto_deploy
-python build_and_run_ad.py --config '{"benchmark": true}'
+python build_and_run_ad.py \
+    --model "TinyLlama/TinyLlama-1.1B-Chat-v1.0" \
+    --args.world_size 2 \
+    --args.runtime "demollm" \
+    --args.compile_backend "torch-compile" \
+    --args.attn_backend "flashinfer" \
+    --benchmark.enabled True
```
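The removed `--model-kwargs`/`--tokenizer-kwargs` JSON flags (see the deleted lines below) appear to be superseded by the dotted syntax used in the updated `launch.json`; a minimal sketch, assuming the `key=value` form shown there:

```bash
# Hypothetical: pass extra model kwargs with the dotted CLI syntax
# (key=value form taken from the updated launch.json in this commit).
python build_and_run_ad.py \
    --model "TinyLlama/TinyLlama-1.1B-Chat-v1.0" \
    --args.model_kwargs num_hidden_layers=3,num_attention_heads=32
```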

-The `model_kwargs` and `tokenizer_kwargs` dictionaries can be supplied on the command line via
-`--model-kwargs '{}'` and `--tokenizer-kwargs '{}'`.
-
#### Logging Level

Use the following env variable to specify the logging level of our built-in logger ordered by
@@ -222,7 +214,7 @@ Refer to [NVIDIA TensorRT Model Optimizer](https://github.com/NVIDIA/TensorRT-Mo

```bash
cd examples/auto_deploy
-python build_and_run_ad.py --config '{"world_size": 1, "model": "{<MODELOPT_CKPT_PATH>}"}'
+python build_and_run_ad.py --model "<MODELOPT_CKPT_PATH>" --args.world_size 1
```

### Incorporating `auto_deploy` into your own workflow
@@ -235,18 +227,16 @@ Here is an example of how you can build an LLM object with AutoDeploy integratio
<summary>Click to expand the example</summary>

```
-from tensorrt_llm import LLM, TorchCompileConfig
+from tensorrt_llm._torch.auto_deploy import LLM


# Construct the LLM high-level interface object with autodeploy as backend
llm = LLM(
    model=<HF_MODEL_CARD_OR_DIR>,
-    backend="_autodeploy",
-    tensor_parallel_size=<NUM_WORLD_RANK>,
-    use_cuda_graph=True, # set True if using "torch-opt" as compile backend
-    torch_compile_config=TorchCompileConfig(), # set this if using "torch-opt" as compile backend
-    model_kwargs={"use_cache": False}, # AutoDeploy uses its own cache implementation
-    attn_backend="TritonWithFlattenedInputs", # choose between "TritonWithFlattenedInputs" and "FlashInfer"
+    world_size=<NUM_WORLD_RANK>,
+    compile_backend="torch-compile",
+    model_kwargs={"num_hidden_layers": 2}, # test with smaller model configuration
+    attn_backend="flashinfer", # choose between "triton" and "flashinfer"
    attn_page_size=64, # page size for attention (tokens_per_block, should be == max_seq_len for triton)
    skip_loading_weights=False,
    model_factory="AutoModelForCausalLM", # choose appropriate model factory

0 commit comments
