
Conversation

@benchislett (Collaborator) commented Oct 6, 2025

Purpose

The full CUDA graph support annotation was missing from the FlashInfer-MLA backend, even though the implementation already supports it.

Running DeepSeek-R1-FP4 on 4xB200 GPUs, I get 97 TPS:

VLLM_FLASHINFER_MOE_BACKEND=latency \
VLLM_USE_FLASHINFER_MOE_FP4=1 \
VLLM_ATTENTION_BACKEND=FLASHINFER_MLA \
vllm serve nvidia/DeepSeek-R1-FP4 \
  -tp 4 \
  --max-model-len 32768 \
  --max-num-seqs 128 \
  --no-enable-prefix-caching \
  --async-scheduling \
  --port 8049

I also tested on a local development branch for MTP containing #25984 and #25987.

On that branch, with 3 MTP speculative tokens, I get 165 TPS and passing GSM8k evals.

Test Plan

GSM8k run as follows:

lm_eval \
  --model local-completions \
  --tasks gsm8k \
  --model_args base_url=http://0.0.0.0:8049/v1/completions,model=nvidia/DeepSeek-R1-FP4,tokenized_requests=False,tokenizer_backend=None,num_concurrent=128,timeout=120,max_retries=5

Test Result

Matches the baseline:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9439|±  |0.0063|
|     |       |strict-match    |     5|exact_match|↑  |0.9439|±  |0.0063|

@mergify mergify bot added the v1 label Oct 6, 2025
@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request correctly enables full CUDA graph support for decode operations in the FlashInfer-MLA attention backend. The change is implemented by creating a new FlashInferMLAMetadataBuilder class that inherits from MLACommonMetadataBuilder and sets the cudagraph_support attribute to AttentionCGSupport.UNIFORM_BATCH. The FlashInferMLABackend is then updated to use this new builder. The approach is clean, follows the existing design patterns in the codebase, and seems to correctly enable the feature as described. The changes are minimal and well-targeted. I found no issues of high or critical severity.
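The change described in the review can be sketched as follows. The class and enum names (`FlashInferMLAMetadataBuilder`, `MLACommonMetadataBuilder`, `AttentionCGSupport.UNIFORM_BATCH`, `FlashInferMLABackend`) come from the review text above; the base-class bodies and the `get_builder_cls` wiring here are simplified stand-ins for illustration, not the real vLLM internals:

```python
from enum import Enum


# Stand-in for vLLM's AttentionCGSupport enum; the real enum lives in
# vLLM's v1 attention utilities and has additional members.
class AttentionCGSupport(Enum):
    NEVER = 0
    UNIFORM_BATCH = 1
    ALWAYS = 2


# Stand-in for MLACommonMetadataBuilder, which in vLLM builds the
# attention metadata shared by all MLA backends. By default it does
# not advertise CUDA graph support.
class MLACommonMetadataBuilder:
    cudagraph_support = AttentionCGSupport.NEVER


# The essence of the change: a builder subclass that overrides only the
# cudagraph_support annotation, declaring that uniform-shape decode
# batches may be captured into a full CUDA graph.
class FlashInferMLAMetadataBuilder(MLACommonMetadataBuilder):
    cudagraph_support = AttentionCGSupport.UNIFORM_BATCH


# Hypothetical backend wiring: the backend is pointed at the new
# builder so the runner picks up the annotation.
class FlashInferMLABackend:
    @staticmethod
    def get_builder_cls():
        return FlashInferMLAMetadataBuilder


print(FlashInferMLABackend.get_builder_cls().cudagraph_support.name)
# → UNIFORM_BATCH
```

Because the attribute is a class-level annotation, no runtime logic changes: existing metadata-building behavior is inherited unchanged, and only the capability flag the scheduler inspects is different.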

@chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.


@LucasWilkinson (Collaborator) left a comment
LGTM; thanks!

@LucasWilkinson LucasWilkinson enabled auto-merge (squash) October 6, 2025 21:17
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Oct 6, 2025
@mgoin (Member) left a comment

Nice

@LucasWilkinson LucasWilkinson merged commit f77df94 into vllm-project:main Oct 6, 2025
54 checks passed
southfreebird pushed a commit to southfreebird/vllm that referenced this pull request Oct 7, 2025
mrasquinha-g pushed a commit to mrasquinha-g/vllm that referenced this pull request Oct 9, 2025
@benchislett benchislett deleted the flashinfer-mla-cuda-graphs branch October 9, 2025 14:23
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025
lywa1998 pushed a commit to lywa1998/vllm that referenced this pull request Oct 20, 2025
alhridoy pushed a commit to alhridoy/vllm that referenced this pull request Oct 24, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 24, 2025
rtourgeman pushed a commit to rtourgeman/vllm that referenced this pull request Nov 10, 2025
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025

Labels

ready ONLY add when PR is ready to merge/full CI is needed v1

4 participants