Hi vLLM Community,
We are tackling the roadmap a bit differently this year: the planned objectives and milestones are organized by area. For each area, you will find more detailed tracking issues and lead committers to discuss with. Please continue to send feedback here!
Core Engine
Meeting Time/Link:
Channel: #sig-core
Members: @WoosukKwon
The team focuses on the vLLM Engine Core, including the Scheduler, KV Cache Manager, distributed execution, Model Runner, and KV Connector code paths.
- Turn async scheduling on by default
- Turn model runner V2 on by default: still missing dual batch overlap, pipeline parallelism, more attention backends (currently only FlashAttention), and more testing
- Data structure cleanup: improve the efficiency of data structures that grow with the number of tokens and requests (e.g., list[int] -> numpy arrays, removing dictionaries); see the sketch after this list
- Turn on the CPU KV cache by default with zero performance cost when there is no cache hit; bet on async KV cache transfer.
- Process structure simplification/flattening @zhuohan123
- Fix memory profiling accuracy @zhuohan123
- Attention backend re-design @___
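To make the data structure cleanup item concrete, here is a minimal sketch of the kind of change it describes: replacing a per-request list[int] token buffer with a preallocated numpy array. The class and field names are illustrative, not actual vLLM internals.

```python
# Illustrative only: the class and field names below are hypothetical, not
# actual vLLM internals. The point is the data-layout change: a Python
# list[int] that grows per token vs. a preallocated numpy buffer.
import numpy as np


class TokenBufferList:
    """Grows a Python list per generated token (per-element PyObject overhead)."""

    def __init__(self) -> None:
        self.token_ids: list[int] = []

    def append(self, token_id: int) -> None:
        self.token_ids.append(token_id)


class TokenBufferArray:
    """Writes into a preallocated numpy array; no per-token allocation."""

    def __init__(self, max_tokens: int) -> None:
        self.token_ids = np.zeros(max_tokens, dtype=np.int32)
        self.num_tokens = 0

    def append(self, token_id: int) -> None:
        self.token_ids[self.num_tokens] = token_id
        self.num_tokens += 1

    def view(self) -> np.ndarray:
        return self.token_ids[: self.num_tokens]
```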
One goal this SIG will work on is interface stability and decoupling. Right now, we only provide backward compatibility for the public APIs (LLM, AsyncLLM, LLMEngine); beyond these, we provide zero backward compatibility. We will gradually refactor the codebase, starting with a stable model implementation API. A more stable API surface improves the overall plugin ecosystem and modularity of vLLM (a usage sketch of the stable public API follows the list below).
- Stable model implementation API
- Model config refactor @zhuohan123
- Refactor Weight Loading / Distributed Linear / Quantization ___
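For reference, the backward-compatible surface mentioned above is deliberately small. A minimal sketch of the offline LLM entry point (AsyncLLM and LLMEngine are its online/lower-level counterparts); the model name is just an example.

```python
# Minimal sketch of the public, backward-compatible API surface; the model
# name is an arbitrary example.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")
params = SamplingParams(temperature=0.8, max_tokens=64)

outputs = llm.generate(["What is vLLM?"], params)
for out in outputs:
    print(out.outputs[0].text)
```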
Finally, low-latency serving with speculative decoding is in scope as well, led by @benchislett (a configuration sketch follows the list below):
- Battle-test DeepSeek MTP so it runs without crashes under DP/EP and CUDA graphs.
- Support MTP for Qwen3-Next
- Support and test EAGLE-3 thoroughly.
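A hedged sketch of how speculative decoding is typically enabled through the offline API, assuming the speculative_config dict interface of recent releases; the target and draft model paths are placeholders.

```python
# Hedged sketch: assumes the speculative_config dict interface of recent
# vLLM releases; target/draft model names below are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",   # target model (example)
    speculative_config={
        "method": "eagle3",                      # or an MTP-style method for MTP models
        "model": "path/to/eagle3-draft-head",    # placeholder draft weights
        "num_speculative_tokens": 4,
    },
)
out = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```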
Large Scale Serving
Meeting Time/Link: See Channel
Channel: #sig-large-scale-serving
Members: @tlrmchlsmth
The team focuses on pushing vLLM to the speed of light in disaggregated, wide-EP, and elastic settings on clusters of H200, B200, and, most importantly, GB200 (a wide-EP launch sketch follows the list below). The team is also responsible for interfacing with ecosystem projects such as llm-d and Dynamo, and with the AMD team.
- Publish reproducible recipes for DeepSeek architecture on GB200 with vllm-router
- Finish the FusedMoE refactor; remove/deprecate unnecessary GEMM and comm kernels.
- P/D and Wide EP recipes on AMD ROCm
- Evaluate the need to place vllm-router between API server and core engine.
- Achieve SoTA GB200 results on DeepSeek architecture by exploiting NVLink, CPU unified memory, FP4, and multi-stream concurrency.
- Publish GB300 recipes and roadmap
- Elastic EP in beta (ready for external testing)
- Minimize the overhead of EPLB expert rearrangement and use async EPLB by default when EPLB is enabled.
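As a rough illustration of the wide-EP setup the recipes target, here is a hedged single-node sketch; the flag names (enable_expert_parallel, enable_eplb) are assumed to behave as in recent releases, and the sizes are placeholders.

```python
# Hedged sketch of a wide-EP deployment on a single node; engine argument
# names are assumptions based on recent releases, sizes are placeholders.
from vllm import LLM

llm = LLM(
    model="deepseek-ai/DeepSeek-V3",
    tensor_parallel_size=8,        # shard attention/dense layers across GPUs
    enable_expert_parallel=True,   # place MoE experts across ranks (wide EP)
    enable_eplb=True,              # expert-parallel load balancing (EPLB)
)
```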
Speed of Light
Meeting Time/Link: Tuesday 11:30AM PT
Channel: #sig-model-bash
Members: @robertgshaw2-redhat @simon-mo
The team focuses on pure performance and reliability engineering within vLLM. The work involves capturing performance traces, enabling the right set of kernels by default, and continuously monitoring performance (a profiling sketch follows the list below). The work also covers monitoring and logging for production stability. This is a combination of the #sig-model-bash and performance dashboard efforts.
- Performance dashboard and model bash for high-priority models (DSV3.2, K2, gpt-oss, Qwen3-Next, Gemma3) on popular hardware (GB200, H200, MI355).
- Profiling tooling
- Python overhead reduction via dummy model on GB200
- Replicate InferenceMax and coordinate efforts for further improvements
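For the profiling tooling item, a hedged sketch of the existing torch-profiler hooks (VLLM_TORCH_PROFILER_DIR plus LLM.start_profile/stop_profile), assuming they behave as in recent releases; the model name is an example.

```python
# Hedged sketch: assumes the torch-profiler hooks of recent vLLM releases
# (VLLM_TORCH_PROFILER_DIR plus LLM.start_profile/stop_profile).
import os

# Must be set before the engine is created.
os.environ["VLLM_TORCH_PROFILER_DIR"] = "/tmp/vllm_traces"

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")
llm.start_profile()
llm.generate(["Profile this request."], SamplingParams(max_tokens=32))
llm.stop_profile()  # trace files land in VLLM_TORCH_PROFILER_DIR
```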
Torch Compile
Meeting Time: Thursday 12:30 ET (9:30 PT)
Meeting Notes: torch.compile SIG (also includes joining instructions)
Channel: #sig-torch-compile
Members: @ProExpertProg @zou3519
The team focuses on improving performance, portability, and developer productivity via PyTorch compilation integration. Work includes custom compile & fusion passes, vLLM IR for kernel registration, reducing compile time via caching, improving developer UX with torch.compile, and co-development of new torch.compile features.
- Enable more optimizations by default using optimization levels (-O2, -O3); see the sketch after this list
- Migrate CustomOps to vLLM IR
- Integrate Helion into vLLM
- Improve cold and warm compilation time
- Unwrap wrapped custom ops (MLA, Fused MoE)
- Improved perf dashboard to track compile speedups and break down warm and cold start times.
- torch.compile x nvsymmetric memory integration
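A hedged sketch of opting into higher optimization levels from the offline API, assuming the compilation_config interface of recent releases; the -O2/-O3 CLI flags for vllm serve map to similar settings.

```python
# Hedged sketch: assumes compilation_config accepts a dict as in recent
# releases; the model name is an example.
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    compilation_config={"level": 3},  # roughly what -O3 enables on the CLI
)
```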
Frontend
Meeting Time/Link: TBD
Channel: #sig-frontend
Members:
The team focuses on the OpenAI-compatible API server, as well as various other protocols. Its scope also covers the implementation of the renderer and tool parser, the components responsible for interfacing input and output formats with the engine core (a client sketch follows the list below).
- Use structural tags for the tool parser; overall refactoring of tool parsing logic for simplicity and robustness
- Responses API
- Renderer, disaggregate everything workstream
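For orientation, a hedged sketch of a client talking to the OpenAI-compatible server, assuming a local `vllm serve <model>` instance on the default port.

```python
# Hedged sketch: assumes a local `vllm serve Qwen/Qwen2.5-0.5B-Instruct`
# instance listening on the default port 8000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # must match the served model name
    messages=[{"role": "user", "content": "What is vLLM?"}],
)
print(resp.choices[0].message.content)
```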
RL
Meeting Time/Link: TBD
Channel: #sig-post-training
Members: @youkaichao @robertgshaw2-redhat
The team focuses on delivering the best engine features in vLLM for RL rollout, including weight sync, KV cache reset, and ease of modification (a rollout-loop sketch follows the list below).
- Modular weight sync [RFC]: Native Weight Syncing APIs #31848
- Continue enhancement of test cases
- Publication of reproduction runs with SOTA open source RL techniques, collaborating with open source RL frameworks
- Harden external launcher mode
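A hypothetical sketch of the rollout loop these features serve; the update_weights worker method is not a finalized vLLM API (that is what the weight-sync RFC above is for) and is shown only to illustrate the intended flow: sync weights, reset caches, roll out.

```python
# Hypothetical sketch of an RL rollout loop. "update_weights" is NOT a
# finalized vLLM API (see the weight-sync RFC above); it stands in for
# whatever the modular weight-sync interface ends up exposing.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct", enforce_eager=True)


def rollout(prompts: list[str], new_state_dict: dict) -> list[str]:
    # Push trainer weights into the rollout engine (hypothetical RPC name).
    llm.collective_rpc("update_weights", args=(new_state_dict,))
    # Drop stale prefix/KV cache so rollouts reflect the new weights.
    llm.reset_prefix_cache()
    outputs = llm.generate(prompts, SamplingParams(max_tokens=64))
    return [o.outputs[0].text for o in outputs]
```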
MultiModality
Meeting Time/Link: Every two weeks, see channel.
Channel: #sig-multi-modality
Members: @ywang96 @DarkLight1337
The team supports the abstractions, model support, and optimizations for multi-modal input (a minimal input sketch follows the list below).
- Streaming inputs: [Feature] add session based streaming input support to v1 #28973
- Input processing
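A minimal sketch of the multi-modal input path these items build on, using the multi_modal_data format; the model name and prompt template are examples (the image placeholder token is model-specific).

```python
# Minimal sketch of multi-modal offline inference; the model name and
# prompt template are examples (the image placeholder is model-specific).
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(model="llava-hf/llava-1.5-7b-hf")
image = Image.open("example.jpg")

outputs = llm.generate(
    {
        "prompt": "USER: <image>\nDescribe this image. ASSISTANT:",
        "multi_modal_data": {"image": image},
    },
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```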
Model Acceleration (Quantization and Speculators)
Meeting Time/Link: Every two weeks
Channel: TBD
Members: @mgoin @dsikka
vLLM’s core quantization and speculative decoding integrations, including LLM Compressor and speculators, along with the ModelOpt integration (a loading sketch follows the list below).
- Remove the deprecated quantization schemes
- Fix up kernel integration
- nvfp4 + mxfp4 recipes + algorithms
- For speculators, release all frontier model speculators on HF
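A hedged sketch of the two loading paths these integrations cover: a checkpoint pre-quantized with a tool like LLM Compressor, and on-the-fly FP8 quantization of an unquantized model; the model names are placeholders.

```python
# Hedged sketch; model names are placeholders.
from vllm import LLM

# Pre-quantized checkpoint (e.g. produced by LLM Compressor): the
# quantization scheme is picked up from the model's config.
llm_prequant = LLM(model="org/Llama-3.1-8B-Instruct-quantized.w4a16")

# On-the-fly dynamic FP8 quantization of an unquantized checkpoint.
llm_fp8 = LLM(model="meta-llama/Llama-3.1-8B-Instruct", quantization="fp8")
```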
Documentation, Recipes, Blog
Channel: #sig-docs, #blogs, #recipes
The team will focus on lowering the learning curve for vLLM and enhancing usability through materials and educational content.
- Enhanced recipes for all popular models
- Technical blog posts on vLLM’s optimizations and technical deep dives specific to different models
- Educational material for developers on architecture, internals, and meetups
CI, Build, and Release
Meeting Time/Link: Tuesday 11AM Pacific
Channel: #sig-ci
Members: @khluu
The team focuses on developing world class infrastructure for vLLM’s CI system and ensuring we have a secure and reliable build and release process.
- Meet a two-week release cadence (there should be six releases in Q1!)
- Time to first test in 10 minutes and E2E CI time to signal in 30 minutes.
- Release nightly wheels covering SOTA hardware support (e.g., GB300).
- Automatic quarantine for flaky tests
- Automatic test target determination
- Auto-bisect workflow
- CI dashboard
The following are semi-open programs that focus on iterations of vLLM and handle potentially sensitive information.
Committer Development Program
This program, led by the lead maintainers, focuses on continuing to cultivate new committers and enhancing the contribution experience of vLLM:
- Publish Reviewer Guideline (quality and speed choose two, each PR should bring clear improvements, higher PR bar)
- Iterate on community PR maintenance policy
- Iterate on issue triage
- Continue to develop active contributors into committers
Model Support Program
We will work on streamlining our model support process with model and hardware vendors. All frontier model releases should be accuracy-validated on day 0 with an automated suite for widely used configs, reach basic performance (no synchronization points and fused ops enabled) within week 1, and have mature support by month 1.
- Automation and tracking around model support
- Develop a new model authoring tool/framework to ease model porting and reduce human errors (related RFC) with the feedback from model vendors
- Model testing pipeline
- Standardize marketing promotions around new model release and distribution
- Improve recipes for ease of modifications and performance result
Ecosystem Project Roadmap
- vLLM Omni
- Semantic Router