Merged
Changes from 28 commits
Commits
35 commits
904690e
add vLLM V1 User Guide Template
JenZhao Feb 25, 2025
c294f75
wip update v1 user guide
JenZhao Feb 27, 2025
2283371
update logprob desc
JenZhao Feb 27, 2025
1486cb8
update logprobs desc
JenZhao Feb 27, 2025
f9bb563
update
JenZhao Feb 27, 2025
5002550
fix
JenZhao Feb 27, 2025
b395576
update
JenZhao Feb 27, 2025
658957f
update
JenZhao Feb 27, 2025
eacf90a
Update v1_user_guide.md
JenZhao Feb 27, 2025
bf852c6
Merge branch 'vllm-project:main' into v1_guide
JenZhao Mar 1, 2025
560b267
update unsupported
JenZhao Mar 2, 2025
33d759f
address comments
JenZhao Mar 2, 2025
2c28989
Apply suggestions from code review
JenZhao Mar 2, 2025
4e978c2
address comments
JenZhao Mar 2, 2025
32887cd
remove merged pr
JenZhao Mar 2, 2025
fcffa3c
fix
JenZhao Mar 2, 2025
047f46d
link pr #13361
JenZhao Mar 2, 2025
d858c6a
link pr #13997
JenZhao Mar 2, 2025
54d51cf
update
ywang96 Mar 2, 2025
b934056
pre-commit
ywang96 Mar 2, 2025
79549c9
update
JenZhao Mar 4, 2025
aa7bed4
update
JenZhao Mar 4, 2025
f80562b
update
JenZhao Mar 4, 2025
781a224
Merge branch 'vllm-project:main' into v1_guide
JenZhao Mar 13, 2025
6021f71
[Doc] Update v1 guide (#4)
JenZhao Mar 13, 2025
b4691c7
update
JenZhao Mar 13, 2025
7c48862
remove notes
JenZhao Mar 13, 2025
e4c5e81
fix
JenZhao Mar 13, 2025
4090ac9
Apply suggestions from code review
JenZhao Mar 13, 2025
b1dad35
update structured output
JenZhao Mar 13, 2025
c5cc253
sort table, add benchmark placeholder
JenZhao Mar 13, 2025
5b1374d
minor
JenZhao Mar 13, 2025
35b7af7
add hardware support
JenZhao Mar 14, 2025
16d9ce5
address comments
JenZhao Mar 14, 2025
dedc925
update
JenZhao Mar 14, 2025
132 changes: 132 additions & 0 deletions docs/source/getting_started/v1_user_guide.md
@@ -0,0 +1,132 @@
# vLLM V1 User Guide

## Why vLLM V1?

vLLM V0 successfully supported a wide range of models and hardware, but as new features were developed independently, the system grew increasingly complex. This complexity made it harder to integrate new capabilities and introduced technical debt, revealing the need for a more streamlined and unified design.

Building on V0’s success, vLLM V1 retains the stable and proven components from V0
(such as the models, GPU kernels, and utilities). At the same time, it significantly
re-architects the core systems, covering the scheduler, KV cache manager, worker,
sampler, and API server, to provide a cohesive, maintainable framework that better
accommodates continued growth and innovation.

Specifically, V1 aims to:

- Provide a **simple, modular, and easy-to-hack codebase**.
- Ensure **high performance** with near-zero CPU overhead.
- **Combine key optimizations** into a unified architecture.
- Require **zero configs** by enabling features/optimizations by default.

For more details, check out the vLLM V1 blog post [vLLM V1: A Major
Upgrade to vLLM’s Core Architecture](https://blog.vllm.ai/2025/01/27/v1-alpha-release.html) (published Jan 27, 2025).

This living user guide outlines a few known **important changes and limitations** introduced by vLLM V1. The team is actively working to make V1 the default engine, so this guide will be updated continually as more features gain support in vLLM V1.

### Feature / Model Support Overview

| Feature / Model | Status |
|-------------------------------------------|-----------------------------------------------------------------------------------|
| **Logprobs Calculation** | <nobr>🟢 Functional</nobr> |
| **Prompt Logprobs with Prefix Caching** | <nobr>🟡 Planned ([RFC #13414](https://github.com/vllm-project/vllm/issues/13414))</nobr> |
| **LoRA** | <nobr>🟢 Functional ([PR #13096](https://github.com/vllm-project/vllm/pull/13096))</nobr> |
| **Spec Decode** | <nobr>🚧 WIP ([PR #13933](https://github.com/vllm-project/vllm/pull/13933))</nobr> |
| **FP8 KV Cache** | <nobr>🟡 Planned</nobr> |
| **Structured Generation Fallback** | <nobr>🔴 Deprecated</nobr> |
| **best_of** | <nobr>🔴 Deprecated ([RFC #13361](https://github.com/vllm-project/vllm/issues/13361))</nobr>|
| **Per-Request Logits Processors** | <nobr>🔴 Deprecated ([RFC #13360](https://github.com/vllm-project/vllm/pull/13360))</nobr> |
| **GPU <> CPU KV Cache Swapping** | <nobr>🔴 Deprecated</nobr> |
| **Embedding Models** | <nobr>🟡 Planned</nobr> |
| **Mamba Models** | <nobr>🟡 Planned</nobr> |
| **Encoder-Decoder Models** | <nobr>🟡 Planned</nobr> |

- **🚀 Optimized**: Nearly fully optimized, with no further work currently planned.
- **🟢 Functional**: Fully operational, with ongoing optimizations.
- **🚧 WIP**: Under active development.
- **🟡 Planned**: Scheduled for future implementation (some may have open PRs/RFCs).
- **🔴 Deprecated**: Not planned for v1 unless there is strong demand.

### Semantic Changes and Deprecated Features

#### Logprobs

vLLM V1 supports logprobs and prompt logprobs. However, there are some important semantic
differences compared to V0:

**Logprobs Calculation**

Logprobs in V1 are now returned immediately once computed from the model’s raw output (i.e.
before applying any logits post-processing such as temperature scaling or penalty
adjustments). As a result, the returned logprobs do not reflect the final adjusted
probabilities used during sampling.

Support for logprobs with post-sampling adjustments is in progress and will be added in future updates.
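
The gap between raw and sampling-adjusted logprobs can be illustrated with a plain softmax, outside of vLLM entirely (the logit values and temperature below are made up for illustration):

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of raw logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

# Hypothetical raw model output for a 3-token vocabulary.
raw_logits = [2.0, 1.0, 0.0]

# V1 returns logprobs computed directly from the raw logits ...
raw_logprobs = log_softmax(raw_logits)

# ... whereas sampling may first apply post-processing, e.g. temperature.
temperature = 0.5
adjusted_logprobs = log_softmax([x / temperature for x in raw_logits])

# The two distributions differ whenever temperature != 1.0, so the
# returned logprobs do not match the probabilities used to sample.
```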

**Prompt Logprobs with Prefix Caching**

Currently prompt logprobs are only supported when prefix caching is turned off via `--no-enable-prefix-caching`. In a future release, prompt logprobs will be compatible with prefix caching, but a recomputation will be triggered to recover the full prompt logprobs even upon a prefix cache hit. See details in [RFC #13414](https://github.com/vllm-project/vllm/issues/13414).
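
For example, a server that should return prompt logprobs must currently be launched with prefix caching disabled (a sketch; the model name, port, and the `prompt_logprobs` extension field are illustrative and may differ in your deployment):

```shell
# Prefix caching must be off for prompt logprobs in V1 today.
vllm serve meta-llama/Llama-3.1-8B-Instruct --no-enable-prefix-caching

# Request prompt logprobs through vLLM's OpenAI-compatible completions
# endpoint via the `prompt_logprobs` extra parameter.
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "prompt": "San Francisco is a",
        "max_tokens": 8,
        "prompt_logprobs": 2
      }'
```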

#### Deprecated Features

As part of the major architectural rework in vLLM V1, several legacy features have been deprecated.

**Sampling features**

- **best_of**: This feature has been deprecated due to limited usage. See details at [RFC #13361](https://github.com/vllm-project/vllm/issues/13361).
- **Per-Request Logits Processors**: In V0, users could pass custom
processing functions to adjust logits on a per-request basis. In vLLM V1, this
feature has been deprecated. Instead, the design is moving toward supporting **global logits
processors**, a feature the team is actively working on for future releases. See details at [RFC #13360](https://github.com/vllm-project/vllm/pull/13360).
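
For reference, a V0-style per-request logits processor was simply a callable taking the generated token ids and the logits, passed via `SamplingParams(logits_processors=[...])`. A minimal sketch of that now-deprecated pattern (plain Python lists stand in for tensors here):

```python
import math

def ban_token(banned_id):
    """Build a V0-style per-request logits processor that masks one token.

    V0 accepted callables of the form (generated_token_ids, logits) -> logits.
    vLLM V1 drops this per-request hook in favor of planned global logits
    processors (RFC #13360).
    """
    def processor(token_ids, logits):
        logits = list(logits)           # work on a copy
        logits[banned_id] = -math.inf   # token can never be sampled
        return logits
    return processor

processor = ban_token(2)
new_logits = processor([], [0.1, 0.4, 3.2, -1.0])
# Token 2 is now masked out; all other logits are unchanged.
```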

**KV Cache features**

- **GPU <> CPU KV Cache Swapping**: With the new simplified core architecture, vLLM V1 no longer requires KV cache swapping
to handle request preemptions.

### Feature & Model Support in Progress
> **Review comment (Collaborator):** One suggestion for this section: it might be clearer to have an overview support matrix of features vs. status, where status is one of the following:
>
> - Deprecated: no plan to support in V1 unless there is strong motivation.
> - Planned: planned, but work has not started.
> - WIP: support is in progress.
> - Functional (unoptimized): working, but still being optimized.
> - Optimized: almost fully optimized, with no other work planned at the moment.
>
> That way, the rest of this section can be feature-centric, describing the current status of each feature and pointing to the corresponding GitHub issue/PR/project.

> **Reply (Contributor Author):** will do

Although we have re-implemented and partially optimized many features and models from V0 in vLLM V1, optimization work is still ongoing for some, and others remain unsupported.

#### Features to be Optimized

These features are already supported in vLLM V1, but their optimization is still
in progress.

- **LoRA**: LoRA works functionally on vLLM V1, but its performance is
inferior to that of V0. The team is actively working on improving it
(e.g., see [PR #13096](https://github.com/vllm-project/vllm/pull/13096)).

- **Spec Decode**: Currently, only ngram-based spec decode is supported in V1. There
will be follow-up work to support other types of spec decode (e.g., see [PR #13933](https://github.com/vllm-project/vllm/pull/13933)). We will prioritize support for Eagle and MTP over draft-model-based spec decode.

#### Unsupported Features

- **FP8 KV Cache**: While vLLM V1 introduces new FP8 kernels for model weight quantization, support for an FP8 key–value cache is not yet available. Users must continue using FP16 (or other supported precisions) for the KV cache.
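
For reference, the flag that enables an FP8 KV cache on V0 (and is not yet honored by V1) looks like the following; the model name is a placeholder:

```shell
# Works on V0; V1 does not yet support an FP8 KV cache, so keep the
# default (auto / FP16) KV cache dtype there.
vllm serve meta-llama/Llama-3.1-8B-Instruct --kv-cache-dtype fp8
```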

- **Structured Generation Fallback**: For structured output tasks, V1 currently
supports only the `xgrammar:no_fallback` mode, meaning that it will error out if the output schema is unsupported by xgrammar.
Details about the structured generation can be found
[here](https://docs.vllm.ai/en/latest/features/structured_outputs.html).
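
As an illustration, a structured output request that stays within what xgrammar can compile might look like the sketch below, using vLLM's `guided_json` extension to the OpenAI-compatible server (model, port, and schema are placeholders). Under `xgrammar:no_fallback`, a schema xgrammar cannot handle produces an error rather than falling back to another backend:

```shell
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Give me a user profile."}],
        "guided_json": {
          "type": "object",
          "properties": {
            "name": {"type": "string"},
            "age": {"type": "integer"}
          },
          "required": ["name", "age"]
        }
      }'
```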

#### Unsupported Models

vLLM V1 currently excludes model architectures tagged with the `SupportsV0Only` protocol;
the majority fall into the categories below. V1 support for these models will be added eventually.

**Embedding Models**
vLLM V1 does not yet include a `PoolingModelRunner` to support embedding/pooling
models (e.g., `XLMRobertaModel`).

**Mamba Models**
Models using selective state-space mechanisms (instead of standard transformer attention)
are not yet supported (e.g., `MambaForCausalLM`, `JambaForCausalLM`).

**Encoder-Decoder Models**
vLLM V1 is currently optimized for decoder-only transformers. Models requiring
cross-attention between separate encoder and decoder are not yet supported (e.g., `BartForConditionalGeneration`, `MllamaForConditionalGeneration`).

For a complete list of supported models, see the [list of supported models](https://docs.vllm.ai/en/latest/models/supported_models.html).

## FAQ

TODO
2 changes: 2 additions & 0 deletions docs/source/index.md
Expand Up @@ -67,6 +67,8 @@ getting_started/quickstart
getting_started/examples/examples_index
getting_started/troubleshooting
getting_started/faq
getting_started/v1_user_guide

:::

% What does vLLM support?
Expand Down