Hi vLLM Community,
We are tackling the roadmap a bit differently this year: the planned objectives and milestones are organized by area. For each area, you will find more detailed tracking issues and lead committers to discuss with. Please continue to send feedback here!
Core Engine
Meeting Time/Link:
Channel: #sig-core
Members: @WoosukKwon
The team focuses on the vLLM Engine Core, including the Scheduler, KV Cache Manager, distributed execution, Model Runner, and KV Connector code paths.
- Turn async scheduling on by default
- Turn model runner V2 on by default: still missing dual batch overlap, pipeline parallelism, more attention backends (currently only FlashAttention), and more testing
- Data structure cleanup: improve the efficiency of data structures that grow with the number of tokens and requests (e.g., list[int] -> numpy arrays, removing dictionaries); see the sketch after this list
- Turn on the CPU KV cache by default with zero performance cost when there is no cache hit; bet on async KV cache transfer.
- Process structure simplification/flattening @zhuohan123
- Fix memory profiling accuracy @zhuohan123
- Attention backend re-design @___
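To make the data structure cleanup item concrete, here is a minimal sketch of the kind of change it describes: replacing a per-request list[int] token buffer with a preallocated numpy array. The class and field names are illustrative, not actual vLLM internals.

```python
# Illustrative only: the class and field names below are hypothetical, not
# actual vLLM internals. The point is the data-layout change: a Python
# list[int] that grows per token vs. a preallocated numpy buffer.
import numpy as np


class TokenBufferList:
    """Grows a Python list per generated token (per-element PyObject overhead)."""

    def __init__(self) -> None:
        self.token_ids: list[int] = []

    def append(self, token_id: int) -> None:
        self.token_ids.append(token_id)


class TokenBufferArray:
    """Writes into a preallocated numpy array; no per-token allocation."""

    def __init__(self, max_tokens: int) -> None:
        self.token_ids = np.zeros(max_tokens, dtype=np.int32)
        self.num_tokens = 0

    def append(self, token_id: int) -> None:
        self.token_ids[self.num_tokens] = token_id
        self.num_tokens += 1

    def view(self) -> np.ndarray:
        return self.token_ids[: self.num_tokens]
```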
One goal this SIG will work on is interface stability and decoupling. Right now, we only provide backward compatibility for the public APIs (LLM, AsyncLLM, LLMEngine); beyond these, we provide zero backward compatibility. We will gradually refactor the codebase, starting with a stable model implementation API. A more stable API surface improves the overall plugin ecosystem and modularity of vLLM (a usage sketch of the stable public API follows the list below).
- Stable model implementation API
- Model config refactor @zhuohan123
- Refactor Weight Loading / Distributed Linear / Quantization ___
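For reference, the backward-compatible surface mentioned above is deliberately small. A minimal sketch of the offline LLM entry point (AsyncLLM and LLMEngine are its online/lower-level counterparts); the model name is just an example.

```python
# Minimal sketch of the public, backward-compatible API surface; the model
# name is an arbitrary example.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")
params = SamplingParams(temperature=0.8, max_tokens=64)

outputs = llm.generate(["What is vLLM?"], params)
for out in outputs:
    print(out.outputs[0].text)
```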
Finally, low-latency serving with speculative decoding is in scope as well, led by @benchislett (a configuration sketch follows the list below):
- Battle-test DeepSeek MTP so it runs without crashes under DP/EP and CUDA graphs.
- Support MTP for Qwen3-Next
- Support and test EAGLE-3 thoroughly.
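A hedged sketch of how speculative decoding is typically enabled through the offline API, assuming the speculative_config dict interface of recent releases; the target and draft model paths are placeholders.

```python
# Hedged sketch: assumes the speculative_config dict interface of recent
# vLLM releases; target/draft model names below are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",   # target model (example)
    speculative_config={
        "method": "eagle3",                      # or an MTP-style method for MTP models
        "model": "path/to/eagle3-draft-head",    # placeholder draft weights
        "num_speculative_tokens": 4,
    },
)
out = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```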
Large Scale Serving
Meeting Time/Link: See Channel
Channel: #sig-large-scale-serving
Members: @tlrmchlsmth
The team focuses on pushing vLLM to the speed of light in disaggregated, wide-EP, and elastic settings on clusters of H200, B200, and, most importantly, GB200 (a wide-EP launch sketch follows the list below). The team is also responsible for interfacing with ecosystem projects such as llm-d and Dynamo, and with the AMD team.
- Publish reproducible recipes for DeepSeek architecture on GB200 with vllm-router
- Finish the FusedMoE refactor; remove/deprecate unnecessary GEMM and comm kernels.
- P/D and Wide EP recipes on AMD ROCm
- Evaluate the need to place vllm-router between API server and core engine.
- Achieve SoTA GB200 results on DeepSeek architecture by exploiting NVLink, CPU unified memory, FP4, and multi-stream concurrency.
- Publish GB300 recipes and roadmap
- Elastic EP in beta (ready for external testing)
- Minimize the overhead of EPLB expert rearrangement and use async EPLB by default when EPLB is enabled.
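As a rough illustration of the wide-EP setup the recipes target, here is a hedged single-node sketch; the flag names (enable_expert_parallel, enable_eplb) are assumed to behave as in recent releases, and the sizes are placeholders.

```python
# Hedged sketch of a wide-EP deployment on a single node; engine argument
# names are assumptions based on recent releases, sizes are placeholders.
from vllm import LLM

llm = LLM(
    model="deepseek-ai/DeepSeek-V3",
    tensor_parallel_size=8,        # shard attention/dense layers across GPUs
    enable_expert_parallel=True,   # place MoE experts across ranks (wide EP)
    enable_eplb=True,              # expert-parallel load balancing (EPLB)
)
```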
Speed of Light
Meeting Time/Link: Tuesday 11:30AM PT
Channel: #sig-model-bash
Members: @robertgshaw2-redhat @simon-mo
The team focuses on pure performance and reliability engineering within vLLM. The work involves capturing performance traces, enabling the right set of kernels by default, and continuously monitoring performance (a profiling sketch follows the list below). The work also covers monitoring and logging for production stability. This is a combination of the #sig-model-bash and performance dashboard efforts.
- Performance dashboard and model bash for high-priority models (DSV3.2, K2, gpt-oss, Qwen3-Next, Gemma3) on popular hardware (GB200, H200, MI355).
- Profiling tooling
- Python overhead reduction via dummy model on GB200
- Replicate InferenceMax and coordinate efforts for further improvements
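For the profiling tooling item, a hedged sketch of the existing torch-profiler hooks (VLLM_TORCH_PROFILER_DIR plus LLM.start_profile/stop_profile), assuming they behave as in recent releases; the model name is an example.

```python
# Hedged sketch: assumes the torch-profiler hooks of recent vLLM releases
# (VLLM_TORCH_PROFILER_DIR plus LLM.start_profile/stop_profile).
import os

# Must be set before the engine is created.
os.environ["VLLM_TORCH_PROFILER_DIR"] = "/tmp/vllm_traces"

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")
llm.start_profile()
llm.generate(["Profile this request."], SamplingParams(max_tokens=32))
llm.stop_profile()  # trace files land in VLLM_TORCH_PROFILER_DIR
```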
Torch Compile
Meeting Time: Thursday 12:30 ET (9:30 PT)
Meeting Notes: torch.compile SIG (also includes joining instructions)
Channel: #sig-torch-compile
Members: @ProExpertProg @zou3519
The team focuses on improving performance, portability, and developer productivity via PyTorch compilation integration. Work includes custom compile & fusion passes, vLLM IR for kernel registration, reducing compile time via caching, improving developer UX with torch.compile, and co-development of new torch.compile features.
- Enable more optimizations by default using optimization levels (-O2, -O3); see the sketch after this list
- Migrate CustomOps to vLLM IR
- Integrate Helion into vLLM
- Improve cold and warm compilation time
- Unwrap wrapped custom ops (MLA, Fused MoE)
- Improved perf dashboard to track compile speedups and break down warm and cold start times.
- torch.compile x nvsymmetric memory integration
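A hedged sketch of opting into higher optimization levels from the offline API, assuming the compilation_config interface of recent releases; the -O2/-O3 CLI flags for vllm serve map to similar settings.

```python
# Hedged sketch: assumes compilation_config accepts a dict as in recent
# releases; the model name is an example.
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    compilation_config={"level": 3},  # roughly what -O3 enables on the CLI
)
```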
Frontend
Meeting Time/Link: TBD
Channel: #sig-frontend
Members:
The team focuses on the OpenAI-compatible API server, as well as various other protocols. Its scope also covers the implementation of the renderer and tool parser, the components responsible for interfacing input and output formats with the engine core (a client sketch follows the list below).
- Use structural tags for the tool parser; overall refactoring of tool parsing logic for simplicity and robustness
- Responses API
- Renderer, disaggregate everything workstream
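For orientation, a hedged sketch of a client talking to the OpenAI-compatible server, assuming a local `vllm serve <model>` instance on the default port.

```python
# Hedged sketch: assumes a local `vllm serve Qwen/Qwen2.5-0.5B-Instruct`
# instance listening on the default port 8000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # must match the served model name
    messages=[{"role": "user", "content": "What is vLLM?"}],
)
print(resp.choices[0].message.content)
```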
RL
Meeting Time/Link: TBD
Channel: #sig-post-training
Members: @youkaichao @robertgshaw2-redhat
The team focuses on delivering the best engine features in vLLM for RL rollout, including weight sync, KV cache reset, and ease of modification (a rollout-loop sketch follows the list below).
- Modular weight sync [RFC]: Native Weight Syncing APIs #31848
- Continue enhancement of test cases
- Publication of reproduction runs with SOTA open source RL techniques, collaborating with open source RL frameworks
- Harden external launcher mode
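A hypothetical sketch of the rollout loop these features serve; the update_weights worker method is not a finalized vLLM API (that is what the weight-sync RFC above is for) and is shown only to illustrate the intended flow: sync weights, reset caches, roll out.

```python
# Hypothetical sketch of an RL rollout loop. "update_weights" is NOT a
# finalized vLLM API (see the weight-sync RFC above); it stands in for
# whatever the modular weight-sync interface ends up exposing.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct", enforce_eager=True)


def rollout(prompts: list[str], new_state_dict: dict) -> list[str]:
    # Push trainer weights into the rollout engine (hypothetical RPC name).
    llm.collective_rpc("update_weights", args=(new_state_dict,))
    # Drop stale prefix/KV cache so rollouts reflect the new weights.
    llm.reset_prefix_cache()
    outputs = llm.generate(prompts, SamplingParams(max_tokens=64))
    return [o.outputs[0].text for o in outputs]
```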
MultiModality
Meeting Time/Link: Every two weeks, see channel.
Channel: #sig-multi-modality
Members: @ywang96 @DarkLight1337
The team supports the abstractions, model support, and optimizations for multi-modal input (a minimal input sketch follows the list below).
- Streaming inputs: [Feature] add session based streaming input support to v1 #28973
- Input processing
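A minimal sketch of the multi-modal input path these items build on, using the multi_modal_data format; the model name and prompt template are examples (the image placeholder token is model-specific).

```python
# Minimal sketch of multi-modal offline inference; the model name and
# prompt template are examples (the image placeholder is model-specific).
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(model="llava-hf/llava-1.5-7b-hf")
image = Image.open("example.jpg")

outputs = llm.generate(
    {
        "prompt": "USER: <image>\nDescribe this image. ASSISTANT:",
        "multi_modal_data": {"image": image},
    },
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```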
Model Acceleration (Quantization and Speculators)
Meeting Time/Link: Every two weeks
Channel: TBD
Members: @mgoin @dsikka
vLLM’s core quantization and speculative decoding integrations, including LLM Compressor and speculators, along with the ModelOpt integration (a loading sketch follows the list below).
- Remove the deprecated quantization schemes
- Fix up kernel integration
- nvfp4 + mxfp4 recipes + algorithms
- For speculators, release all frontier model speculators on HF
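A hedged sketch of the two loading paths these integrations cover: a checkpoint pre-quantized with a tool like LLM Compressor, and on-the-fly FP8 quantization of an unquantized model; the model names are placeholders.

```python
# Hedged sketch; model names are placeholders.
from vllm import LLM

# Pre-quantized checkpoint (e.g. produced by LLM Compressor): the
# quantization scheme is picked up from the model's config.
llm_prequant = LLM(model="org/Llama-3.1-8B-Instruct-quantized.w4a16")

# On-the-fly dynamic FP8 quantization of an unquantized checkpoint.
llm_fp8 = LLM(model="meta-llama/Llama-3.1-8B-Instruct", quantization="fp8")
```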
Documentation, Recipes, Blog
Channel: #sig-docs, #blogs, #recipes
The team will focus on lowering the learning curve for vLLM and enhancing usability through materials and educational content.
- Enhanced recipes for all popular models
- Technical blog posts on vLLM’s optimizations and technical deep dives specific to different models
- Educational material for developers on architecture, internals, and meetups
CI, Build, and Release
Meeting Time/Link: Tuesday 11AM Pacific
Channel: #sig-ci
Members: @khluu
The team focuses on developing world class infrastructure for vLLM’s CI system and ensuring we have a secure and reliable build and release process.
- Meet a two-week release cadence (there should be six releases in Q1!)
- Time to first test in 10 minutes and E2E CI time to signal in 30 minutes.
- Release nightly wheels covering SOTA hardware support (e.g., GB300).
- Automatic quarantine for flaky tests
- Automatic test target determination
- Auto-bisect workflow
- CI dashboard
The following are semi-open programs that focus on iterations of vLLM and handle potentially sensitive information.
Committer Development Program
This program, led by the lead maintainers, focuses on continuing to cultivate new committers and enhancing the contribution experience of vLLM:
- Publish Reviewer Guideline (quality and speed choose two, each PR should bring clear improvements, higher PR bar)
- Iterate on community PR maintenance policy
- Iterate on issue triage
- Continue to develop active contributors into committers
Model Support Program
We will work on streamlining our model support process with model and hardware vendors. All frontier model releases should be accuracy-validated on day 0 with an automated suite for widely used configs, reach basic performance (no synchronization points and fused ops enabled) within week 1, and have mature support by month 1.
- Automation and tracking around model support
- Develop a new model authoring tool/framework to ease model porting and reduce human errors (related RFC) with the feedback from model vendors
- Model testing pipeline
- Standardize marketing promotions around new model release and distribution
- Improve recipes for ease of modifications and performance result
Ecosystem Project Roadmap
- vLLM Omni
- Semantic Router