Merged
173 commits
33a3a26
wip
WoosukKwon Aug 17, 2025
699bd79
Merge branch 'main' into woosuk/input-prep
WoosukKwon Aug 18, 2025
c472982
merge
WoosukKwon Aug 22, 2025
79e5eb3
wip
WoosukKwon Aug 22, 2025
64c8cce
rename
WoosukKwon Aug 22, 2025
48bca9a
merge
WoosukKwon Aug 23, 2025
a1e3745
wip
WoosukKwon Aug 25, 2025
da9cd26
Merge branch 'main' into woosuk/input-prep
WoosukKwon Aug 25, 2025
7b4b72e
fix
WoosukKwon Aug 25, 2025
65f9369
merge
WoosukKwon Aug 25, 2025
b1d5273
fix
WoosukKwon Aug 25, 2025
a851aaa
simplify
WoosukKwon Aug 25, 2025
e570b0a
merge
WoosukKwon Aug 28, 2025
d6d719f
Merge branch 'main' into woosuk/input-prep
WoosukKwon Aug 28, 2025
b21393c
Merge branch 'main' into woosuk/input-prep
WoosukKwon Aug 28, 2025
efba25e
minor
WoosukKwon Aug 28, 2025
e451045
fix
WoosukKwon Aug 28, 2025
19c0dfc
minor
WoosukKwon Aug 28, 2025
4055781
minor
WoosukKwon Aug 28, 2025
9ee9d0e
fix
WoosukKwon Aug 28, 2025
efcb786
merge
WoosukKwon Aug 31, 2025
e696f78
minor
WoosukKwon Aug 31, 2025
c11d1e6
optimize spec
WoosukKwon Aug 31, 2025
22771e5
work
WoosukKwon Sep 1, 2025
ba1a58f
MAX_SPEC_LEN
WoosukKwon Sep 1, 2025
62d23b3
fix
WoosukKwon Sep 1, 2025
af7b6c5
fix
WoosukKwon Sep 1, 2025
01bf16e
fix
WoosukKwon Sep 1, 2025
cc340e2
top_p top_k
WoosukKwon Sep 1, 2025
4c2a337
merge
WoosukKwon Sep 1, 2025
b16e2d9
fix
WoosukKwon Sep 1, 2025
23eae07
merge
WoosukKwon Sep 5, 2025
ead95fe
merge
WoosukKwon Sep 6, 2025
8e6cb9a
minor
WoosukKwon Sep 6, 2025
0c56069
merge
WoosukKwon Sep 6, 2025
6283995
minor
WoosukKwon Sep 7, 2025
286eeb9
merge
WoosukKwon Sep 7, 2025
5f95309
rename
WoosukKwon Sep 7, 2025
787e596
wip
WoosukKwon Sep 8, 2025
7a50a54
Merge branch 'main' into woosuk/input-prep
WoosukKwon Sep 13, 2025
9314a83
Merge branch 'main' into woosuk/input-prep
WoosukKwon Sep 14, 2025
caf963f
fix
WoosukKwon Sep 14, 2025
5c133fc
reorder
WoosukKwon Sep 14, 2025
e47bb99
fix
WoosukKwon Sep 14, 2025
eb3742c
fix
WoosukKwon Sep 14, 2025
633f9f0
Merge branch 'main' into woosuk/input-prep
WoosukKwon Sep 14, 2025
9a6fcca
fix
WoosukKwon Sep 14, 2025
8b3c13c
wip
WoosukKwon Sep 15, 2025
67852c1
minor
WoosukKwon Sep 15, 2025
69b1789
chunked prefilling
WoosukKwon Sep 15, 2025
f1981db
minor
WoosukKwon Sep 15, 2025
e107680
wip
WoosukKwon Sep 15, 2025
9f2becd
merge
WoosukKwon Sep 16, 2025
dfc84b1
wip
WoosukKwon Sep 15, 2025
83d1137
wip
WoosukKwon Sep 16, 2025
c320a33
skip warmup
WoosukKwon Sep 16, 2025
9151026
task
WoosukKwon Sep 16, 2025
c1d83f2
merge
WoosukKwon Sep 18, 2025
9050087
update
WoosukKwon Sep 18, 2025
92f337f
minor
WoosukKwon Sep 18, 2025
cbdb47d
working
WoosukKwon Sep 18, 2025
3f50030
fix
WoosukKwon Sep 18, 2025
a496283
minor
WoosukKwon Sep 18, 2025
bc6463a
hash
WoosukKwon Sep 18, 2025
aabfaa0
fix
WoosukKwon Sep 18, 2025
330058f
fix
WoosukKwon Sep 18, 2025
82e591f
remove
WoosukKwon Sep 18, 2025
8407fa0
fix
WoosukKwon Sep 18, 2025
e171e5b
merge
WoosukKwon Sep 18, 2025
2bb2cb1
revert
WoosukKwon Sep 18, 2025
67d8c0c
fix
WoosukKwon Sep 18, 2025
a98eff0
minor
WoosukKwon Sep 18, 2025
323a05b
update
WoosukKwon Sep 18, 2025
82da219
Implement topk_logprobs
WoosukKwon Sep 18, 2025
efda084
minor
WoosukKwon Sep 18, 2025
86dade7
fix
WoosukKwon Sep 18, 2025
d2be623
fix
WoosukKwon Sep 18, 2025
31619ff
fix
WoosukKwon Sep 18, 2025
b9c7448
logprobs
WoosukKwon Sep 19, 2025
8deedfa
-inf
WoosukKwon Sep 19, 2025
52ca2f5
sample
WoosukKwon Sep 19, 2025
af65838
dummy run
WoosukKwon Sep 19, 2025
8af8798
fix
WoosukKwon Sep 19, 2025
b405d78
DP sampler
WoosukKwon Sep 19, 2025
0d3de9e
fix
WoosukKwon Sep 19, 2025
3367277
Merge branch 'main' into woosuk/model-runner-v2
WoosukKwon Sep 19, 2025
37478c1
async output
WoosukKwon Sep 19, 2025
9c75d89
minor
WoosukKwon Sep 19, 2025
d30c0d5
refactor
WoosukKwon Sep 19, 2025
4be2c66
fix
WoosukKwon Sep 19, 2025
a8e7071
minor
WoosukKwon Sep 19, 2025
c7f3e84
minor
WoosukKwon Sep 19, 2025
396bbe6
Merge branch 'main' into woosuk/model-runner-v2
WoosukKwon Sep 19, 2025
010e39e
minor
WoosukKwon Sep 19, 2025
6f038fc
Merge branch 'main' into woosuk/model-runner-v2
WoosukKwon Sep 19, 2025
a66aa37
minor:
WoosukKwon Sep 19, 2025
98ef239
minor
WoosukKwon Sep 19, 2025
158a468
random uuid
WoosukKwon Sep 20, 2025
913b8e9
Merge branch 'main' into woosuk/model-runner-v2
WoosukKwon Sep 20, 2025
8aee6e9
64-bit for gumbel seed
WoosukKwon Sep 20, 2025
42ffdd9
wip
WoosukKwon Sep 20, 2025
631b5b4
Merge branch 'main' into woosuk/model-runner-v2
WoosukKwon Sep 21, 2025
bc73f67
compute_logits
WoosukKwon Sep 21, 2025
fe5472d
Merge branch 'main' into woosuk/model-runner-v2
WoosukKwon Sep 22, 2025
72f0a71
assert
WoosukKwon Sep 22, 2025
17c2c10
Merge branch 'main' into woosuk/model-runner-v2
WoosukKwon Sep 23, 2025
42f9915
fix
WoosukKwon Sep 23, 2025
704def2
Merge branch 'main' into woosuk/model-runner-v2
WoosukKwon Sep 23, 2025
ad2cf80
Merge branch 'main' into woosuk/model-runner-v2
WoosukKwon Sep 24, 2025
866eef5
minor
WoosukKwon Sep 24, 2025
1107701
Merge branch 'main' into woosuk/model-runner-v2
WoosukKwon Oct 30, 2025
09e4b2f
update
WoosukKwon Oct 30, 2025
5666a25
fix
WoosukKwon Oct 30, 2025
5c8049d
fix
WoosukKwon Oct 30, 2025
1c5c866
uint64
WoosukKwon Oct 30, 2025
8f8aaa8
forward context
WoosukKwon Nov 6, 2025
e40e85b
merge
WoosukKwon Nov 9, 2025
013daed
Add sample_tokens
WoosukKwon Nov 9, 2025
608fec3
fix lora
WoosukKwon Nov 9, 2025
bf3992c
allow torch compile
WoosukKwon Nov 9, 2025
a1249af
minor
WoosukKwon Nov 9, 2025
3ce8a08
Add DP
WoosukKwon Nov 9, 2025
b9ebedb
fix
WoosukKwon Nov 9, 2025
8d82fac
fix
WoosukKwon Nov 9, 2025
af23897
fix
WoosukKwon Nov 9, 2025
83943cd
minor
WoosukKwon Nov 9, 2025
cbd90df
fix
WoosukKwon Nov 9, 2025
5b5fd19
minor
WoosukKwon Nov 9, 2025
484135c
minor
WoosukKwon Nov 9, 2025
8912870
Add structured outputs
WoosukKwon Nov 9, 2025
312affc
fix
WoosukKwon Nov 9, 2025
523f27a
fix
WoosukKwon Nov 9, 2025
de64ce7
async structured outputs
WoosukKwon Nov 9, 2025
8b44f99
flag
WoosukKwon Nov 9, 2025
d8a8279
fix
WoosukKwon Nov 9, 2025
fe97bf9
fix dp
WoosukKwon Nov 9, 2025
8240f3a
minor
WoosukKwon Nov 9, 2025
ebdee19
minor
WoosukKwon Nov 9, 2025
e75ded3
minor
WoosukKwon Nov 9, 2025
493b4d6
minor
WoosukKwon Nov 9, 2025
75ef5f4
fix for DP
WoosukKwon Nov 9, 2025
724593b
fix
WoosukKwon Nov 9, 2025
6dc3d83
minor
WoosukKwon Nov 9, 2025
2b51ecb
skip sync in dummy run
WoosukKwon Nov 9, 2025
ecb2932
minor
WoosukKwon Nov 10, 2025
63e4387
minor
WoosukKwon Nov 10, 2025
dd254ce
code owner
WoosukKwon Nov 10, 2025
f510b9e
code owner
WoosukKwon Nov 10, 2025
a505e71
minor
WoosukKwon Nov 11, 2025
fb0782c
minor
WoosukKwon Nov 11, 2025
645650c
remove filtering for negative token
WoosukKwon Nov 12, 2025
2326a8c
minor on cudagraph utils
WoosukKwon Nov 12, 2025
e284750
merge
WoosukKwon Nov 12, 2025
4085ce8
minor
WoosukKwon Nov 12, 2025
1d8a671
fix
WoosukKwon Nov 12, 2025
31580e9
merge
WoosukKwon Nov 13, 2025
a0c396b
merge
WoosukKwon Nov 15, 2025
6da659f
mypy
WoosukKwon Nov 15, 2025
ff9a1aa
readme
WoosukKwon Nov 15, 2025
197ed08
fix
WoosukKwon Nov 16, 2025
a72b07e
fix cudagraph
WoosukKwon Nov 16, 2025
a9b4fa3
revert
WoosukKwon Nov 16, 2025
3da2e77
Merge branch 'main' into woosuk/model-runner-v2
WoosukKwon Nov 16, 2025
ee2c3b0
minor
WoosukKwon Nov 16, 2025
995f1aa
simplify get_kv_cache_spec
WoosukKwon Nov 16, 2025
ed84190
support mla
WoosukKwon Nov 16, 2025
5ea5e7e
merge
WoosukKwon Nov 17, 2025
784371c
preempt
WoosukKwon Nov 17, 2025
1402b93
Optimize gumbel sampling
WoosukKwon Nov 17, 2025
4ee6bc4
Merge branch 'main' into woosuk/model-runner-v2
WoosukKwon Nov 20, 2025
e9152dd
nick's comment
WoosukKwon Nov 20, 2025
104b2fa
minor
WoosukKwon Nov 20, 2025
327c0e3
num_computed_tokens_cpu
WoosukKwon Nov 20, 2025
3 changes: 3 additions & 0 deletions .github/CODEOWNERS
@@ -35,6 +35,9 @@ CMakeLists.txt @tlrmchlsmth @LucasWilkinson
/vllm/v1/kv_cache_interface.py @heheda12345
/vllm/v1/offloading @ApostaC

# Model runner V2
/vllm/v1/worker/gpu @WoosukKwon

# Test ownership
/.buildkite/lm-eval-harness @mgoin @simon-mo
/tests/distributed/test_multi_node_assignment.py @youkaichao
5 changes: 4 additions & 1 deletion examples/offline_inference/basic/basic.py
@@ -16,7 +16,10 @@

def main():
# Create an LLM.
llm = LLM(model="facebook/opt-125m")
llm = LLM(
model="facebook/opt-125m",
compilation_config={"cudagraph_mode": "full_decode_only"},
)
# Generate texts from the prompts.
# The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
5 changes: 5 additions & 0 deletions vllm/envs.py
@@ -225,6 +225,7 @@
VLLM_SHARED_EXPERTS_STREAM_TOKEN_THRESHOLD: int = 256
VLLM_COMPILE_CACHE_SAVE_FORMAT: Literal["binary", "unpacked"] = "binary"
VLLM_FLAT_LOGPROBS: bool = False
VLLM_USE_V2_MODEL_RUNNER: bool = False


def get_default_cache_root():
@@ -1498,6 +1499,10 @@ def get_vllm_port() -> int | None:
# After enabled, PromptLogprobs and SampleLogprobs would populated as
# FlatLogprobs.
"VLLM_FLAT_LOGPROBS": lambda: bool(int(os.getenv("VLLM_FLAT_LOGPROBS", "0"))),
# Flag to enable v2 model runner.
"VLLM_USE_V2_MODEL_RUNNER": lambda: bool(
int(os.getenv("VLLM_USE_V2_MODEL_RUNNER", "0"))
),
}

# --8<-- [end:env-vars-definition]
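For reference, a minimal sketch of how the new flag could be enabled from user code. This is not part of the diff; only the VLLM_USE_V2_MODEL_RUNNER name comes from this PR, while the model, prompt, and sampling settings are illustrative assumptions.

# Sketch: opting into the experimental V2 model runner via the env var added above.
# Everything except the VLLM_USE_V2_MODEL_RUNNER name is an illustrative assumption.
import os

os.environ["VLLM_USE_V2_MODEL_RUNNER"] = "1"  # default is "0" (V2 runner disabled)

from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)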
3 changes: 3 additions & 0 deletions vllm/v1/attention/backends/flashinfer.py
@@ -455,6 +455,9 @@ def _get_workspace_buffer(self):
)
return self._workspace_buffer

def set_workspace_buffer(self, workspace_buffer: torch.Tensor):
self._workspace_buffer = workspace_buffer

def _get_prefill_wrapper(self):
if self._prefill_wrapper is None:
self._prefill_wrapper = BatchPrefillWithPagedKVCacheWrapper(
18 changes: 18 additions & 0 deletions vllm/v1/core/sched/output.py
@@ -36,6 +36,7 @@
class NewRequestData:
req_id: str
prompt_token_ids: list[int] | None
prefill_token_ids: list[int] | None

Check warning (GitHub Actions / bc_lint) on line 39 in vllm/v1/core/sched/output.py: Function NewRequestData: prefill_token_ids was added
mm_features: list[MultiModalFeatureSpec]
sampling_params: SamplingParams | None
pooling_params: PoolingParams | None
@@ -53,6 +54,7 @@
return cls(
req_id=request.request_id,
prompt_token_ids=request.prompt_token_ids,
prefill_token_ids=request._all_token_ids,
mm_features=request.mm_features,
sampling_params=request.sampling_params,
pooling_params=request.pooling_params,
@@ -175,6 +177,7 @@
# This can be used for cascade attention.
num_common_prefix_blocks: list[int]

preempted_req_ids: set[str]

Check warning (GitHub Actions / bc_lint) on line 180 in vllm/v1/core/sched/output.py: Function SchedulerOutput: preempted_req_ids was added
# Request IDs that are finished in between the previous and the current
# steps. This is used to notify the workers about the finished requests
# so that they can free the cached states for those requests.
@@ -193,6 +196,21 @@
# EC Cache Connector metadata
ec_connector_metadata: ECConnectorMetadata | None = None

@classmethod
def make_empty(cls) -> "SchedulerOutput":
return cls(
scheduled_new_reqs=[],
scheduled_cached_reqs=CachedRequestData.make_empty(),
num_scheduled_tokens={},
total_num_scheduled_tokens=0,
scheduled_spec_decode_tokens={},
scheduled_encoder_inputs={},
num_common_prefix_blocks=[],
preempted_req_ids=set(),
finished_req_ids=set(),
free_encoder_mm_hashes=[],
)


@dataclass
class GrammarOutput:
18 changes: 7 additions & 11 deletions vllm/v1/core/sched/scheduler.py
@@ -653,6 +653,9 @@ def schedule(self) -> SchedulerOutput:
)

# Construct the scheduler output.
scheduled_new_reqs = scheduled_new_reqs + scheduled_resumed_reqs
scheduled_resumed_reqs = []

new_reqs_data = [
NewRequestData.from_request(
req, req_to_new_blocks[req.request_id].get_block_ids()
@@ -680,6 +683,7 @@
scheduled_spec_decode_tokens=scheduled_spec_decode_tokens,
scheduled_encoder_inputs=scheduled_encoder_inputs,
num_common_prefix_blocks=num_common_prefix_blocks,
preempted_req_ids={req.request_id for req in preempted_reqs},
# finished_req_ids is an existing state in the scheduler,
# instead of being newly scheduled in this step.
# It contains the request IDs that are finished in between
@@ -754,9 +758,8 @@ def _make_cached_request_data(
all_token_ids: dict[str, list[int]] = {}
num_computed_tokens: list[int] = []
num_output_tokens: list[int] = []
resumed_req_ids = set()
resumed_req_ids = set[str]()

num_running_reqs = len(running_reqs)
for idx, req in enumerate(itertools.chain(running_reqs, resumed_reqs)):
req_id = req.request_id
req_ids.append(req_id)
@@ -773,14 +776,6 @@
req.num_computed_tokens : req.num_computed_tokens + num_tokens
]
new_token_ids.append(token_ids)
scheduled_in_prev_step = req_id in self.prev_step_scheduled_req_ids
if idx >= num_running_reqs:
assert not scheduled_in_prev_step
resumed_req_ids.add(req_id)
if not scheduled_in_prev_step:
all_token_ids[req_id] = req.all_token_ids[
: req.num_computed_tokens + num_tokens
]
new_block_ids.append(
req_to_new_blocks[req_id].get_block_ids(allow_none=True)
)
@@ -997,7 +992,8 @@ def update_from_output(
# to avoid expensive operations inside the loop.
stopped_running_reqs: set[Request] = set()
stopped_preempted_reqs: set[Request] = set()
for req_id, num_tokens_scheduled in num_scheduled_tokens.items():
for req_index, req_id in enumerate(model_runner_output.req_ids):
num_tokens_scheduled = num_scheduled_tokens[req_id]
assert num_tokens_scheduled > 0
if failed_kv_load_req_ids and req_id in failed_kv_load_req_ids:
# Skip requests that were recovered from KV load failure
Empty file.
76 changes: 76 additions & 0 deletions vllm/v1/worker/gpu/async_utils.py
@@ -0,0 +1,76 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
from contextlib import contextmanager

import numpy as np
import torch

from vllm.v1.outputs import (
    AsyncModelRunnerOutput,
    ModelRunnerOutput,
    SamplerOutput,
)


class AsyncOutput(AsyncModelRunnerOutput):
    def __init__(
        self,
        model_runner_output: ModelRunnerOutput,
        sampler_output: SamplerOutput,
        num_sampled_tokens: np.ndarray,
        copy_stream: torch.cuda.Stream,
    ):
        self.model_runner_output = model_runner_output
        self.sampler_output = sampler_output
        self.num_sampled_tokens = num_sampled_tokens
        self.copy_stream = copy_stream
        self.copy_event = torch.cuda.Event()

        default_stream = torch.cuda.current_stream()
        with torch.cuda.stream(self.copy_stream):
            self.copy_stream.wait_stream(default_stream)

            # NOTE(woosuk): We should keep the CPU tensors unfreed
            # until the copy completes.
Member
What does this mean?

I think we may need to keep a ref to the GPU tensors to ensure they aren't deallocated/reused before the copy completes.

Collaborator Author
            # NOTE(woosuk): We must ensure that CPU tensors are not freed
            # before the device-to-host copy is fully completed. For instance,
            # operations like
            # self.sampled_token_np = ...to("cpu", non_blocking=True).numpy()
            # are unsafe because the underlying CPU tensor can be prematurely freed and
            # reused by other tensors before the asynchronous copy finishes, potentially
            # causing race conditions. To prevent this, we delay freeing by holding
            # references until the copy event signals completion.
            # Likewise, we also need to keep the reference to the GPU tensors.
            # This is done by keeping the reference to sampler_output and
            # model_runner_output.

Added more detailed comments.

Member
       # operations like
       # self.sampled_token_np = ...to("cpu", non_blocking=True).numpy()
       # are unsafe because the underlying CPU tensor can be prematurely freed

Actually I think this is safe; the numpy array will retain a ref to the storage.

Also, the reason we need to keep a ref to the GPU tensors is that they were allocated in a different stream from the one we are copying them in; otherwise it wouldn't be necessary.

            self.sampled_token_ids = sampler_output.sampled_token_ids.to(
                "cpu", non_blocking=True
            )
            if sampler_output.logprobs_tensors is not None:
                self.logprobs_tensors = (
                    sampler_output.logprobs_tensors.to_cpu_nonblocking()
                )
            else:
                self.logprobs_tensors = None
            self.prompt_logprobs_dict = {}
            if self.model_runner_output.prompt_logprobs_dict:
                for k, v in self.model_runner_output.prompt_logprobs_dict.items():
                    self.prompt_logprobs_dict[k] = v.to_cpu_nonblocking()
            self.copy_event.record(self.copy_stream)

    def get_output(self) -> ModelRunnerOutput:
        self.copy_event.synchronize()

        # NOTE(woosuk): The following code ensures compatibility with OSS vLLM.
        # Going forward, we should keep the data structures as NumPy arrays
        # rather than Python lists.
        sampled_token_ids_np = self.sampled_token_ids.numpy()
        sampled_token_ids = sampled_token_ids_np.tolist()
        for i, tokens in enumerate(sampled_token_ids):
            del tokens[self.num_sampled_tokens[i] :]
        self.model_runner_output.sampled_token_ids = sampled_token_ids

Member
Could actually move the synchronize down to here because the data isn't accessed

Collaborator Author
num_sampled_tokens will be a GPU tensor (instead of np.ndarray) once we integrate spec decoding in a followup PR. So I'd like to keep the logic here.

Ideally, the output should only consist of two arrays: sampled_tokens: [num_reqs, num_spec_steps + 1] and num_sampled: [num_reqs].

        if self.logprobs_tensors is not None:
            self.model_runner_output.logprobs = self.logprobs_tensors.tolists()
        self.model_runner_output.prompt_logprobs_dict = self.prompt_logprobs_dict
        return self.model_runner_output


@contextmanager
def async_barrier(event: torch.cuda.Event | None):
    if event is not None:
        event.synchronize()
    try:
        yield
    finally:
        if event is not None:
            event.record()
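To make the review discussion above concrete, here is a small, self-contained sketch of the same device-to-host copy pattern: issue the copy on a dedicated stream, record an event, keep both the GPU source and the CPU destination referenced, and synchronize on the event before reading the CPU data. This is illustrative only (it assumes a CUDA device and uses made-up tensor shapes), not the PR's implementation.

import torch

assert torch.cuda.is_available()

copy_stream = torch.cuda.Stream()
# Stand-in for sampler output produced on the default stream.
sampled_gpu = torch.randint(0, 32000, (8, 4), device="cuda")

default_stream = torch.cuda.current_stream()
with torch.cuda.stream(copy_stream):
    # Order the copy after any kernels that produced `sampled_gpu`.
    copy_stream.wait_stream(default_stream)
    # For a truly asynchronous D2H copy the destination should be pinned;
    # the reference-holding pattern is the same either way.
    sampled_cpu = sampled_gpu.to("cpu", non_blocking=True)
    copy_event = torch.cuda.Event()
    copy_event.record(copy_stream)

# Keep `sampled_gpu` and `sampled_cpu` referenced until the event completes;
# dropping them earlier could let their memory be reused while the copy is in flight.
copy_event.synchronize()
print(sampled_cpu.tolist())  # safe to read now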