Merged
43 commits
- f32d157 trtllm gen mla initial commit (farazkh80, Jul 10, 2025)
- 3bca375 Unittest passing (farazkh80, Jul 11, 2025)
- 15f3b80 diff output vs flashinfer mla (farazkh80, Jul 14, 2025)
- cdd315e trtllm mla kernel working (farazkh80, Jul 15, 2025)
- 274fdbb refator code (farazkh80, Jul 15, 2025)
- ba1220d add utils.py modification (farazkh80, Jul 15, 2025)
- f67fe41 pre-commit (farazkh80, Jul 15, 2025)
- f69b3e0 some neat picks (farazkh80, Jul 21, 2025)
- 8a119cc neater interface (farazkh80, Jul 21, 2025)
- d772ccc precommit+rename (farazkh80, Jul 21, 2025)
- 00125a2 updated docs (farazkh80, Jul 22, 2025)
- cd0d566 kv-cache fix (farazkh80, Jul 22, 2025)
- a06f252 remove query concat (farazkh80, Jul 23, 2025)
- ab0df43 server level changes (farazkh80, Jul 23, 2025)
- 523293d update args and toml (farazkh80, Jul 24, 2025)
- a3a8784 remove check (farazkh80, Jul 24, 2025)
- be55b7b lint (farazkh80, Jul 24, 2025)
- e01890a bug fix (farazkh80, Jul 25, 2025)
- ef24a0b fix conflict (farazkh80, Jul 28, 2025)
- 3ee133b Merge branch 'main' into fkhoubsirat-trtllm_gen_mla_sglang (kushanam, Jul 29, 2025)
- cc21f0b Update python/pyproject.toml (merrymercy, Jul 29, 2025)
- 17af4bf Merge branch 'main' into fkhoubsirat-trtllm_gen_mla_sglang (merrymercy, Jul 29, 2025)
- edb792c Merge branch 'main' into fkhoubsirat-trtllm_gen_mla_sglang (merrymercy, Jul 29, 2025)
- 67c442f Merge branch 'main' into fkhoubsirat-trtllm_gen_mla_sglang (farazkh80, Jul 29, 2025)
- 39cfd7d some pr review fixes (farazkh80, Jul 29, 2025)
- 4e73a10 Merge branch 'main' into fkhoubsirat-trtllm_gen_mla_sglang (kushanam, Jul 29, 2025)
- 81bf922 Merge branch 'main' into fkhoubsirat-trtllm_gen_mla_sglang (kushanam, Jul 29, 2025)
- 6310457 Merge branch 'main' into fkhoubsirat-trtllm_gen_mla_sglang (kushanam, Jul 29, 2025)
- a39e817 add todo comment (farazkh80, Jul 29, 2025)
- a52db2a Merge branch 'main' into fkhoubsirat-trtllm_gen_mla_sglang (kushanam, Jul 30, 2025)
- 348c22a Merge branch 'main' into fkhoubsirat-trtllm_gen_mla_sglang (zhyncs, Jul 30, 2025)
- 67b73a2 Merge branch 'main' into fkhoubsirat-trtllm_gen_mla_sglang (kushanam, Jul 30, 2025)
- cd77760 Merge branch 'main' into fkhoubsirat-trtllm_gen_mla_sglang (merrymercy, Jul 30, 2025)
- f4746cd Merge branch 'main' into fkhoubsirat-trtllm_gen_mla_sglang (kushanam, Jul 30, 2025)
- aa9764f perm change (farazkh80, Jul 31, 2025)
- 240432e Merge branch 'main' into fkhoubsirat-trtllm_gen_mla_sglang_new_branch (kushanam, Jul 31, 2025)
- 9de1f90 Merge branch 'main' into fkhoubsirat-trtllm_gen_mla_sglang_new_branch (kushanam, Jul 31, 2025)
- fadc8a6 some pr changes (farazkh80, Jul 31, 2025)
- da8ab52 update doc (farazkh80, Jul 31, 2025)
- 049699d Merge branch 'main' into fkhoubsirat-trtllm_gen_mla_sglang_new_branch (zhyncs, Jul 31, 2025)
- f6bc5c3 add sm100 check (farazkh80, Jul 31, 2025)
- 5679c0f dito (farazkh80, Jul 31, 2025)
- d396e94 Merge branch 'main' into fkhoubsirat-trtllm_gen_mla_sglang_new_branch (kushanam, Jul 31, 2025)
9 changes: 9 additions & 0 deletions docs/backend/attention_backend.md
@@ -9,8 +9,12 @@
| **Triton** | ❌ | ✅ | ✅ | ✅ | ❌ |
| **Torch Native** | ❌ | ❌ | ❌ | ❌ | ❌ |
| **FlashMLA** | ✅ | ✅ | ✅ | ❌ | ❌ |
| **TRTLLM MLA** | ✅ | ❌ | ✅ | ✅ | ❌ |
| **Ascend** | ✅ | ❌ | ❌ | ❌ | ❌ |

**Notes:**
- TRTLLM MLA only implements decode operations. For prefill operations (including multimodal inputs), it falls back to the FlashInfer MLA backend.

Note: Every kernel backend accepts a page size > 1 via an argument such as `--page-size 16`, because a page size of 16 can be converted internally to a page size of 1 by the kernel backend. The "❌" and "✅" symbols under "Page Size > 1" in the table above indicate whether the kernel natively operates with a page size greater than 1, rather than treating a page size of 16 as a page size of 1.
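As a sketch of the behavior described above, any backend can be launched with an explicit page size (the model and backend choice below are illustrative only):

```bash
# Illustrative: expose a KV-cache page size of 16. Backends marked "❌"
# in the "Page Size > 1" column will internally treat this as page size 1.
python3 -m sglang.launch_server --model meta-llama/Meta-Llama-3.1-8B-Instruct \
    --attention-backend triton --page-size 16
```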
@@ -48,6 +52,11 @@ python3 -m sglang.launch_server --tp 8 --model deepseek-ai/DeepSeek-R1 --attenti
python3 -m sglang.launch_server --tp 8 --model deepseek-ai/DeepSeek-R1 --attention-backend flashmla --kv-cache-dtype fp8_e4m3 --trust-remote-code
```

- TRTLLM MLA (Optimized for Blackwell Architecture, e.g., B200)
```bash
python3 -m sglang.launch_server --tp 8 --model deepseek-ai/DeepSeek-R1 --attention-backend trtllm_mla --trust-remote-code
```

- Ascend
```bash
python3 -m sglang.launch_server --model meta-llama/Meta-Llama-3.1-8B-Instruct --attention-backend ascend
6 changes: 3 additions & 3 deletions docs/references/deepseek.md
@@ -90,7 +90,7 @@ Please refer to [the example](https://github.com/sgl-project/sglang/tree/main/be

- **Weight Absorption**: By applying the associative law of matrix multiplication to reorder computation steps, this method balances computation and memory access and improves efficiency in the decoding phase.

- **MLA Attention Backends**: Currently SGLang supports different optimized MLA attention backends, including [FlashAttention3](https://github.com/Dao-AILab/flash-attention), [Flashinfer](https://docs.flashinfer.ai/api/mla.html), [FlashMLA](https://github.com/deepseek-ai/FlashMLA), [CutlassMLA](https://github.com/sgl-project/sglang/pull/5390), and [Triton](https://github.com/triton-lang/triton) backends. The default FA3 provides good performance across wide workloads.
- **MLA Attention Backends**: Currently SGLang supports several optimized MLA attention backends, including [FlashAttention3](https://github.com/Dao-AILab/flash-attention), [Flashinfer](https://docs.flashinfer.ai/api/mla.html), [FlashMLA](https://github.com/deepseek-ai/FlashMLA), [CutlassMLA](https://github.com/sgl-project/sglang/pull/5390), **TRTLLM MLA** (optimized for Blackwell architecture), and [Triton](https://github.com/triton-lang/triton). The default FA3 backend provides good performance across a wide range of workloads.

- **FP8 Quantization**: W8A8 FP8 and KV Cache FP8 quantization enables efficient FP8 inference. Additionally, we have implemented Batched Matrix Multiplication (BMM) operator to facilitate FP8 inference in MLA with weight absorption.

@@ -104,7 +104,7 @@ Overall, with these optimizations, we have achieved up to **7x** acceleration in
<img src="https://lmsys.org/images/blog/sglang_v0_3/deepseek_mla.svg" alt="Multi-head Latent Attention for DeepSeek Series Models">
</p>

**Usage**: MLA optimization is enabled by default.
**Usage**: MLA optimization is enabled by default. For MLA models on Blackwell architecture (e.g., B200), the default backend is FlashInfer. To use the optimized TRTLLM MLA backend for decode operations, explicitly specify `--attention-backend trtllm_mla`. Note that TRTLLM MLA only optimizes decode operations; prefill operations (including multimodal inputs) fall back to FlashInfer MLA.
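For example, the launch command below (mirroring the one shown in the attention-backend docs) opts into TRTLLM MLA on a B200 node:

```bash
# Illustrative: explicitly select the TRTLLM MLA decode backend;
# prefill still runs through FlashInfer MLA as described above.
python3 -m sglang.launch_server --tp 8 --model deepseek-ai/DeepSeek-R1 \
    --attention-backend trtllm_mla --trust-remote-code
```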

**Reference**: Check [Blog](https://lmsys.org/blog/2024-09-04-sglang-v0-3/#deepseek-multi-head-latent-attention-mla-throughput-optimizations) and [Slides](https://github.com/sgl-project/sgl-learning-materials/blob/main/slides/lmsys_1st_meetup_deepseek_mla.pdf) for more details.

@@ -161,7 +161,7 @@ Add arguments `--speculative-algorithm`, `--speculative-num-steps`, `--speculati
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3-0324 --speculative-algorithm EAGLE --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2 --trust-remote-code --tp 8
```
- The best configuration for `--speculative-num-steps`, `--speculative-eagle-topk`, and `--speculative-num-draft-tokens` can be searched with the [bench_speculative.py](https://github.com/sgl-project/sglang/blob/main/scripts/playground/bench_speculative.py) script for a given batch size. The minimum configuration is `--speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2`, which can achieve speedup for larger batch sizes.
- FlashAttention3 FlashMLA and Triton backend fully supports MTP usage. For FlashInfer backend (`--attention-backend flashinfer`) with speculative decoding,`--speculative-eagle-topk` parameter should be set to `1`. MTP support for the CutlassMLA backend is still under development.
- The FlashAttention3, FlashMLA, and Triton backends fully support MTP usage. For the FlashInfer backend (`--attention-backend flashinfer`) with speculative decoding, the `--speculative-eagle-topk` parameter should be set to `1`. MTP support for the CutlassMLA and TRTLLM MLA backends is still under development.
- To enable DeepSeek MTP for large batch sizes (>32), some parameters should be changed (reference [this discussion](https://github.com/sgl-project/sglang/issues/4543#issuecomment-2737413756)):
  - Adjust `--max-running-requests` to a larger number. The default value is `32` for MTP; for larger batch sizes, increase it beyond the default.
  - Set `--cuda-graph-bs`, a list of batch sizes for cuda graph capture. The default captured batch sizes for speculative decoding are set [here](https://github.com/sgl-project/sglang/blob/49420741746c8f3e80e0eb17e7d012bfaf25793a/python/sglang/srt/model_executor/cuda_graph_runner.py#L126). You can add more batch sizes to it.
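Putting the adjustments above together, a sketch of a large-batch MTP launch (the request cap and captured batch sizes below are illustrative, not tuned values):

```bash
# Illustrative: raise the running-request cap beyond the MTP default of 32
# and extend the list of captured cuda graph batch sizes accordingly.
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3-0324 \
    --speculative-algorithm EAGLE --speculative-num-steps 1 \
    --speculative-eagle-topk 1 --speculative-num-draft-tokens 2 \
    --max-running-requests 64 --cuda-graph-bs 1 2 4 8 16 32 48 64 \
    --trust-remote-code --tp 8
```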