Merged
31 commits
1d3153b  drafting docs for cuda graph v1 (fhl2000, Sep 6, 2025)
6a830d1  fix typos and minor polish (fhl2000, Sep 7, 2025)
88d7346  fix broken table (fhl2000, Sep 7, 2025)
ccc44f6  address comments (fhl2000, Sep 8, 2025)
c12b82d  minor (fhl2000, Sep 8, 2025)
35a0c54  fix pre-commit (fhl2000, Sep 8, 2025)
631a8da  fix pre-commit again (fhl2000, Sep 8, 2025)
e75a642  replace two images (fhl2000, Sep 9, 2025)
c3f8115  replace one image (fhl2000, Sep 9, 2025)
614e126  Merge branch 'main' into cudagraph_mode_docs (fhl2000, Sep 10, 2025)
a02eb23  small fixing of torch_compile.md (fhl2000, Sep 15, 2025)
a582107  Move assets (hmellor, Sep 15, 2025)
f723640  Formatting and `CUDA Graphs` consistency (hmellor, Sep 15, 2025)
7ef8153  Comment link formatting (hmellor, Sep 15, 2025)
8752c24  Fix `pre-commit` (hmellor, Sep 15, 2025)
0dd161d  `pre-commit` again... (hmellor, Sep 15, 2025)
50a73cb  address comments (fhl2000, Sep 15, 2025)
bba2ba8  fix pre-commit (fhl2000, Sep 16, 2025)
06d56de  Merge branch 'main' into cudagraph_mode_docs (fhl2000, Sep 16, 2025)
8c2b392  fix links (fhl2000, Sep 19, 2025)
0db6e26  modify notes for attn_ops fusion (fhl2000, Sep 20, 2025)
08292de  update aiter_fa cudagraph_support (fhl2000, Sep 20, 2025)
7813f6d  add some recent updates (fhl2000, Sep 27, 2025)
9e549c8  small fix (fhl2000, Sep 27, 2025)
9a5adf9  small (fhl2000, Sep 27, 2025)
d53da8c  small (fhl2000, Sep 28, 2025)
6e7e01c  Merge branch 'main' into cudagraph_mode_docs (fhl2000, Oct 3, 2025)
99b3eb6  Update docs/design/cuda_graphs.md (fhl2000, Oct 7, 2025)
1d57323  Apply suggestions from code review (fhl2000, Oct 7, 2025)
f8dc933  adapt from review suggestions (fhl2000, Oct 7, 2025)
2f5586b  fix default (fhl2000, Oct 7, 2025)
Binary file added docs/assets/design/cuda_graphs/current_design.png
Binary file added docs/assets/design/cuda_graphs/wrapper_flow.png
241 changes: 241 additions & 0 deletions docs/design/cuda_graphs.md

Large diffs are not rendered by default.

9 changes: 5 additions & 4 deletions docs/design/torch_compile.md
@@ -4,6 +4,9 @@ In vLLM's V1 architecture, `torch.compile` is enabled by default and is a critic

Throughout the example, we will run a common Llama model using v1, and turn on debug level logging to show all the details. The command to be used is `VLLM_USE_V1=1 VLLM_LOGGING_LEVEL=DEBUG vllm serve meta-llama/Llama-3.2-1B`.

!!! note
    For more information and the latest progress of `torch.compile` integration, see this [Blog Post](https://blog.vllm.ai/2025/08/20/torch-compile.html).

## Compilation Cache

In the very verbose logs, we can see:
@@ -133,7 +136,7 @@ Unfortunately, because auto-tuning takes quite a long time (from seconds to minu

## Cudagraph Capture

vLLM's V1 architecture uses piecewise cudagraph. The full computation graph is split as mentioned above, and we only capture the cudagraph for the piece of graph between attention operations (including the first graph before any attention operation, and the last graph after all the attention operation). This is based on a common observation: computation between attentions are usually token-wise and easy to deal with for cudagraph; while the attention operation is non-trivial to be cudagraph compatible. Thus, by running the attention operation in eager mode while the rest operations in cudagraph, we keep the flexibility of the attention operation.
vLLM's V1 architecture uses a piecewise cudagraph that aligns with the piecewise compilation. The full computation graph is split as mentioned above, and we only capture cudagraphs for the pieces of the graph between attention operations (including the first piece before any attention operation, and the last piece after all the attention operations). This is based on a common observation: computations between attention operations are usually token-wise and easy to handle with cudagraph, while the attention operation itself is non-trivial to make cudagraph-compatible. Thus, by running the attention operation in eager mode and the remaining operations in cudagraph, we keep the flexibility of the attention operation.

The piecewise cudagraph also has fine-grained memory management. The purpose is to exclude only the attention kernel from the cudagraph, while keeping all the remaining modules and the memory allocation operations inside the cudagraph. This is why, in V1, the attention operation takes its output tensor as one of its inputs.
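As a minimal sketch of what this piecewise setup can look like in the offline API (the dict form of `compilation_config` and the `splitting_ops` / `cudagraph_capture_sizes` keys are assumptions about the current `CompilationConfig`; check the schema in your vLLM version):

```python
# Sketch only: piecewise CUDA graph capture, with the attention ops left
# outside the captured graphs. Field names are assumptions and may differ
# across vLLM versions.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.2-1B",
    compilation_config={
        # Split the traced graph at the attention ops so they run eagerly,
        # while the surrounding pieces are captured as CUDA graphs.
        "splitting_ops": [
            "vllm.unified_attention",
            "vllm.unified_attention_with_output",
        ],
        # Only capture graphs for these batch sizes; other sizes fall back
        # to eager execution.
        "cudagraph_capture_sizes": [1, 2, 4, 8],
    },
)

outputs = llm.generate(["Hello, my name is"])
print(outputs[0].outputs[0].text)
```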

@@ -150,6 +153,4 @@ Then it will only capture cudagraph for the specified sizes. It can be useful to

### Full Cudagraph capture

It is possible to include attention as part of the cudagraph if using an attention backend that is cudagraph compatible. This can improve performance in some cases such as decode speed for smaller models. Enable this using `--compilation-config '{"full_cuda_graph": true}'`.

Currently only FlashAttention 3 is compatible, and only when cascade attention is disabled.
It is possible to include attention as part of the cudagraph if using an attention backend that is cudagraph compatible. This can improve performance in some cases, such as decode speed for smaller models or MoE models. See [CUDA Graphs](cuda_graphs.md) for more details.
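For completeness, a hedged sketch of enabling this from the offline API, using the `full_cuda_graph` key shown in the CLI example above; newer releases may prefer the modes described in the new `cuda_graphs.md`, so treat the exact key as version-dependent:

```python
# Sketch only: capture the full graph, attention included, assuming the
# selected attention backend supports CUDA graph capture. The
# "full_cuda_graph" key mirrors the --compilation-config CLI example above
# and may be superseded by a cudagraph mode setting in newer versions.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.2-1B",
    compilation_config={"full_cuda_graph": True},
)

outputs = llm.generate(["The capital of France is"])
print(outputs[0].outputs[0].text)
```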