@WoosukKwon (Collaborator) commented Sep 19, 2025

Key Changes

  • Remove persistent batch
    • No “reordering” or complex bookkeeping
    • Almost all CPU state lives in NumPy arrays → we can vectorize most of the Python loops in pre-/post-processing
    • Simpler handling for requests resumed from preemption
  • GPU-persistent block tables
    • The CPU does not hold the block tables at all; the GPU maintains the persistent block tables.
    • In every step, we send only the “diffs” to the GPU and use a Triton kernel to update the persistent block tables.
    • We also use another Triton kernel to build the ephemeral block tables used for each forward pass.
    • More scalable as max_model_len and num_kv_groups increase
  • Triton-native sampler
    • No -1 temperature hack for greedy sampling
    • Efficient support for per-request seeds
    • Efficient support for logprobs by only materializing the top-k logprobs instead of the whole vocab
    • Memory-efficient implementation of prompt logprobs
  • Simple implementation of DP
  • Simple CUDA graphs
  • Efficient support for structured outputs
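To make the “diff”-based block-table update concrete, here is a minimal CPU-side sketch in NumPy. This is a hypothetical illustration, not the PR's actual code: in the PR the persistent tables live on the GPU and the scatter is done by a Triton kernel, and the function name and array shapes here are assumptions for the example.

```python
import numpy as np

# Hypothetical sketch of the "diff" update: instead of re-sending full
# per-request block tables each step, only the changed (row, slot, value)
# triples are sent and scattered into the persistent table. In the PR this
# scatter runs on the GPU via a Triton kernel; NumPy emulates it here.

def apply_block_table_diffs(block_tables, req_indices, slot_indices, new_blocks):
    """Scatter newly allocated physical block IDs into the persistent table.

    block_tables: (max_num_reqs, max_blocks_per_req) int32 array
    req_indices:  (num_diffs,) row (request) of each change
    slot_indices: (num_diffs,) column (block slot) of each change
    new_blocks:   (num_diffs,) physical block IDs to write
    """
    # Fancy-index assignment writes all diffs in one vectorized scatter.
    block_tables[req_indices, slot_indices] = new_blocks
    return block_tables

# Example: requests 0 and 2 each grow by one KV-cache block this step.
tables = np.full((4, 8), -1, dtype=np.int32)  # -1 = unallocated slot
apply_block_table_diffs(
    tables,
    req_indices=np.array([0, 2]),
    slot_indices=np.array([0, 3]),
    new_blocks=np.array([17, 42]),
)
```

The payload sent to the GPU each step is proportional to the number of changed slots, not to `max_model_len`, which is why the approach scales as model lengths grow.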
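The logprobs point can also be sketched: materializing only a `(num_reqs, k)` slice instead of the full `(num_reqs, vocab_size)` log-prob matrix is the memory saving the Triton-native sampler targets. The NumPy version below is an assumed illustration of the idea, not the PR's kernel.

```python
import numpy as np

def topk_logprobs(logits, k):
    """Return (values, token_ids) of the top-k log-probs per request.

    Only a (num_reqs, k) result is materialized; the full
    (num_reqs, vocab_size) log-prob matrix is never stored long-term.
    """
    # Numerically stable log-softmax over the vocab dimension.
    m = logits.max(axis=-1, keepdims=True)
    logprobs = logits - m - np.log(np.exp(logits - m).sum(axis=-1, keepdims=True))
    # argpartition finds the k largest entries per row without a full sort.
    idx = np.argpartition(-logprobs, k - 1, axis=-1)[:, :k]
    vals = np.take_along_axis(logprobs, idx, axis=-1)
    # Sort just the k selected entries in descending order.
    order = np.argsort(-vals, axis=-1)
    return np.take_along_axis(vals, order, axis=-1), np.take_along_axis(idx, order, axis=-1)

# One request over a toy 4-token vocab.
logits = np.array([[2.0, 0.5, 1.0, -1.0]])
vals, ids = topk_logprobs(logits, k=2)
```

For a real vocabulary (tens of thousands of tokens) and small `k`, this cuts the logprobs buffer by orders of magnitude per request.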

Signed-off-by: Woosuk Kwon <[email protected]>
@mergify mergify bot removed the needs-rebase label Nov 20, 2025
@WoosukKwon WoosukKwon added the ready ONLY add when PR is ready to merge/full CI is needed label Nov 20, 2025
@WoosukKwon WoosukKwon merged commit 30b44a1 into main Nov 21, 2025
54 of 58 checks passed
@github-project-automation github-project-automation bot moved this to Done in NVIDIA Nov 21, 2025
@WoosukKwon WoosukKwon deleted the woosuk/model-runner-v2 branch November 21, 2025 16:20
@njhill (Member) left a comment

Approved :)

@DarkLight1337 (Member)

Looks like this is failing pre-commit on main

@WoosukKwon (Collaborator, Author)

Yeah, let me fix the error.

The PR somehow passed the pre-commit check in CI.

ywang96 pushed a commit to ywang96/vllm that referenced this pull request Nov 23, 2025
@Selkh commented Nov 24, 2025

Will V2 support async spec decoding in a different way, or pick up the implementation from V1?

lpapavassiliou pushed a commit to lpapavassiliou/vllm that referenced this pull request Nov 24, 2025
RunkaiTao pushed a commit to RunkaiTao/vllm that referenced this pull request Nov 24, 2025
bringlein pushed a commit to bringlein/vllm that referenced this pull request Nov 26, 2025
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025
kitaekatt pushed a commit to kitaekatt/vllm that referenced this pull request Dec 1, 2025
charlotte12l pushed a commit to charlotte12l/vllm that referenced this pull request Dec 5, 2025
Zhathw pushed a commit to Zhathw/vllm that referenced this pull request Dec 6, 2025

Labels

ci/build, documentation, nvidia, ready, v1

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

[RFC]: Redesigning Persistent Batch in vLLM