Closed

Commits
3199 commits
61a01b2
[V1] Delay all xgrammar usage until needed (#14616)
russellb Mar 11, 2025
d374f04
Fix run_tpu_test (#14641)
richardsliu Mar 11, 2025
863d315
[V1][TPU] Pad the block_table.shape[1] so the ragged paged attention …
vanbasten23 Mar 11, 2025
b706d89
[Bugfix][V1][PP] Only warmup sampler at last PP rank (#14643)
comaniac Mar 11, 2025
9f583e3
[release] Add commands to clean up logs on TPU release node (#14642)
khluu Mar 12, 2025
36e0c8f
[Feature] Add `vllm bench` CLI (#13993)
randyjhc Mar 12, 2025
47532cd
[core][V1] pluggable scheduler (#14466)
joerunde Mar 12, 2025
4a42b9f
[Doc] Update benchmarks README (#14646)
JenZhao Mar 12, 2025
80e78d0
[Model] Extend Ultravox to accept audio longer than 30s (#13631)
farzadab Mar 12, 2025
77a318b
[V1][Core] Support MistralTokenizer for Structured Output (#14625)
aarnphm Mar 12, 2025
e392d85
[Core] Refactor `QKVCrossParallelLinear` implementation to support BN…
Isotr0py Mar 12, 2025
e22ee1e
[Kernel] GGUF MoE kernel (#14613)
SzymonOzog Mar 12, 2025
5c538c3
[V1][Bugfix][Spec Decode] Fix incorrect outputs in V1 speculative dec…
benchislett Mar 12, 2025
debd6bb
[Kernel] Add ModelOpt FP4 Checkpoint Support (#12520)
pavanimajety Mar 12, 2025
ff47aab
[CPU] Upgrade CPU backend to torch-2.6 (#13381)
bigPYJ1151 Mar 12, 2025
45f3f3f
[ROCm][Bugfix] Ensure that the moe_wna16_gemm kernel is not built on …
SageMoore Mar 12, 2025
c0c25e2
[Model] Add support for Gemma 3 (#14660)
WoosukKwon Mar 12, 2025
4a754fc
[Bugfix] Missing thumbnail from NVLM-D processor (#14633)
ameyanjarlekar Mar 12, 2025
d9f83d6
[ROCm] Enable chunked prefill/paged attention in MLA on ROCm (#14316)
SageMoore Mar 12, 2025
916836b
[FEAT] [ROCm] [Embedding] Add encoder-only model support into ROCm Fl…
tjtanaa Mar 12, 2025
f5d3acd
[BugFix][V1] Fix parallel sampling finishing/aborts (#14512)
njhill Mar 12, 2025
53be4a8
[V1] Allow sliding window + prefix caching (#13069)
WoosukKwon Mar 12, 2025
ce20124
[release] Add force remove for TPU logs (#14697)
khluu Mar 12, 2025
165290d
[bugfix] fixup warning message for plugged schedulers for v1 (#14700)
joerunde Mar 13, 2025
ab426ec
Add ray[data] as tpu dependency (#14691)
richardsliu Mar 13, 2025
a94a699
[ROCm][FP8] Fix for adjustments needed only for fnuz (#14689)
gshtras Mar 13, 2025
128bf75
[BugFix][TritonMLA] Process weights after model loading for GGUF (#14…
tywuAMD Mar 13, 2025
1bd32bc
[Config][Disaggregated] Add timeout configuration for the torch.store…
hasB4K Mar 13, 2025
1bc3b73
[V1][TPU] Add assertion on multi-step-scheduler (#14707)
lsy323 Mar 13, 2025
36d1ccb
[Quant] BartModel SupportsQuant (#14699)
kylesayrs Mar 13, 2025
5d043c1
[Quant] Bamba SupportsQuant (#14698)
kylesayrs Mar 13, 2025
55211b0
[Bugfix] Fix chunked prefill for GGUF (#14666)
SzymonOzog Mar 13, 2025
bd44b81
[CI/Build] Delete ultravox LoRA test (#14730)
jeejeelee Mar 13, 2025
a73122d
[Bugfix] fix benchmark moe (#14653)
jeejeelee Mar 13, 2025
3824039
[VLM] Support pan-and-scan for Gemma3 multi-modal processor (#14672)
DarkLight1337 Mar 13, 2025
b1cc4df
[VLM] Support loading InternVideo2.5 models as original InternVLChatM…
Isotr0py Mar 13, 2025
f53a058
[Bugfix] Fix prompt format of GLM4V (#14539)
DarkLight1337 Mar 13, 2025
01b3fd0
[V1][Minor] Minor enhancements on scheduler (#14732)
WoosukKwon Mar 13, 2025
8e9ffd3
[Misc] Clean up processor tests (#14771)
DarkLight1337 Mar 13, 2025
8a4a2ef
[V1][Core] using cached vocab_size for Structured Outputs (#14630)
aarnphm Mar 13, 2025
02fcaa3
[V1] Detokenizer: Respect Stop Tokens + not include_stop_str_in_outpu…
afeldman-nm Mar 13, 2025
d47807b
[Attention] Remove slow setattr in MLA (#14769)
LucasWilkinson Mar 13, 2025
3fb17d2
[Doc] Fix typo in documentation (#14783)
yasu52 Mar 14, 2025
60c872d
[Doc] Fix small typo in Transformers fallback (#14791)
heheda12345 Mar 14, 2025
7888e1d
[V1] TPU - Enable prefix caching by default (#14773)
alexm-redhat Mar 14, 2025
2a602b0
forward fix PR 14245, restore build on ROCm 6.2 (#14709)
jeffdaily Mar 14, 2025
ad19c8a
[V1] Move OOM check into sampler run (#14728)
ywang96 Mar 14, 2025
32ef498
[V1] Temporarily disable FlashInfer Rejection Sampler (#14788)
WoosukKwon Mar 14, 2025
0b1cfa6
[Kernel] LoRA - Enable CUDAGraphs for V1 (#14626)
varun-sundar-rabindranath Mar 14, 2025
fb4c7f8
[Kernel] [V1] Further optimizations to ROCm (Triton) Backend to bette…
tdoublep Mar 14, 2025
95d680b
[Bugfix][IPEX] Add `VLLM_CPU_MOE_PREPACK` to allow disabling MoE prep…
gau-nernst Mar 14, 2025
f1f632d
[ci] Reduce number of tests in fastcheck (#14782)
khluu Mar 14, 2025
4059adc
[Misc][Minor] Simplify `SamplingParams.__post_init__()` (#14772)
njhill Mar 14, 2025
d3d4956
[Neuron] flatten test parameterization for neuron attention kernels (…
liangfu Mar 14, 2025
a6e0d09
[Feature] Add visionarena offline support for benchmark_throughput (#…
JenZhao Mar 14, 2025
0c2af17
[CI] Fix missing example model id in processor test (#14787)
ywang96 Mar 14, 2025
9532c49
[Attention] MLA get rid of materialization (#14770)
LucasWilkinson Mar 14, 2025
27b50f1
[Bugfix][Kernel][CPU] Fix num_tokens in CPU rotary embedding kernel (…
gau-nernst Mar 14, 2025
09269b3
[BugFix]Fix performance serving benchmark when enable profiling (#14737)
Potabk Mar 14, 2025
601bd32
[Misc] Clean up type annotation for `SupportsMultiModal` (#14794)
DarkLight1337 Mar 14, 2025
54cc46f
[Bugfix] Fix small typo in the example of Streaming delimiter (#14793)
bravo325806 Mar 14, 2025
989ecd2
[Misc] Gemma3ForConditionalGeneration supports LoRA (#14797)
jeejeelee Mar 14, 2025
c77620d
[V1][Minor] Minor code cleanup for scheduling metrics (#14800)
WoosukKwon Mar 14, 2025
40253ba
[Bugfix][W8A8] fixed cutlass block fp8 binding (#14796)
DefTruth Mar 14, 2025
ab93f13
[VLM] Various cleanup and fixes (#14806)
DarkLight1337 Mar 14, 2025
fd8e055
[BugFix]: properly catch templating error when preprocess input (#13976)
gcalmettes Mar 14, 2025
613c5bb
[Bugfix] Fix Aria test loading (#14823)
DarkLight1337 Mar 14, 2025
1140991
[V1] Fix vocab size calculation for structured output (#14826)
russellb Mar 14, 2025
0b0d642
[Frontend] Fix log message to use http vs https (#14774)
russellb Mar 14, 2025
9d2b4a7
[V1][Metrics] Updated list of deprecated metrics in v0.8 (#14695)
markmc Mar 14, 2025
73deea2
[Frontend] track server_load (#13950)
daniel-salib Mar 14, 2025
977a167
[Bugfix][Kernel]: Fix AllSpark kernel compilation errors and enable f…
wyajieha Mar 14, 2025
7097b4c
[release] Remove log cleanup commands from TPU job (#14838)
khluu Mar 14, 2025
270a5da
Re-enable the AMD Entrypoints Test (#14711)
Alexei-V-Ivanov-AMD Mar 14, 2025
fe66b34
[Model] Mamba2 Prefill Performance Tweaks: Fixing Flurry of Unnecessa…
cyang49 Mar 14, 2025
46f9889
[V1] Fix model parameterization for structured output tests (#14833)
russellb Mar 14, 2025
14f301b
Update to torch==2.6.0 (#12721)
mgoin Mar 14, 2025
4067778
[CI] Add TPU v1 test (#14834)
richardsliu Mar 14, 2025
233ffce
[Build/CI] Move ninja to common deps (#14835)
russellb Mar 14, 2025
bbd94a1
[Build/CI] Upgrade aiohttp to incldue CVE fix (#14840)
russellb Mar 14, 2025
54a8804
[Doc] More neutral K8s deployment guide (#14084)
terrytangyuan Mar 14, 2025
dd344e0
[Bugfix] Fix torch_xla in V0 which can't handle None seed introduced …
yarongmu-google Mar 15, 2025
9f37422
[Neuron][CI] update docker run command (#14829)
liangfu Mar 15, 2025
acaea3b
[Bugfix][V1] Fix flashinfer sampling (#14815)
DefTruth Mar 15, 2025
ccf02fc
Revert "[Model] Mamba2 Prefill Performance Tweaks: Fixing Flurry of U…
tlrmchlsmth Mar 15, 2025
776dcec
Disable outlines cache by default (#14837)
russellb Mar 15, 2025
97ac781
[Misc] Remove misleading message in gemma2 and gemma3 (#14850)
Isotr0py Mar 15, 2025
8c0d15d
[Misc][Easy] Annotate unused vars in the csrc files (#14798)
houseroad Mar 15, 2025
d4d93db
[V1] V1 Enablement Oracle (#13726)
robertgshaw2-redhat Mar 15, 2025
877e352
[Docs] Add new East Coast vLLM Meetup slides to README and meetups.md…
simon-mo Mar 15, 2025
a2ae496
[CPU] Support FP8 KV cache (#14741)
bigPYJ1151 Mar 15, 2025
5952d8a
[Attention] Get rid of mla cache alignment (#14842)
LucasWilkinson Mar 15, 2025
e0fdfa1
[CI/Build] Delete LoRA bias test (#14849)
jeejeelee Mar 15, 2025
4c7629c
[V1][Structured Output] calculate vocab_size eagerly (#14851)
aarnphm Mar 15, 2025
aaacf17
[Doc] V1 user guide (#13991)
JenZhao Mar 15, 2025
ee3778d
[Build/CI] Upgrade jinja2 to get 3 moderate CVE fixes (#14839)
russellb Mar 15, 2025
9ed6ee9
[Bugfix] EAGLE output norm bug (#14464)
luyuzhe111 Mar 15, 2025
3556a41
[VLM] Limit multimodal input cache by memory (#14805)
DarkLight1337 Mar 15, 2025
f58aea0
[CI][Intel GPU] refine intel GPU ci docker build (#14860)
jikunshang Mar 15, 2025
74bc397
[Core] Expose API endpoint `/is_sleeping` (#14312)
waltforme Mar 15, 2025
61c6a5a
[VLM] Merged multi-modal processor for Pixtral (#12211)
Flechman Mar 15, 2025
3453b96
[Misc][Doc] Minor benchmark README update (#14874)
ywang96 Mar 16, 2025
def232e
[VLM] Clean up Phi-4-MM ViT implementation (#14812)
Isotr0py Mar 16, 2025
b30c75d
[V1] Remove V0 fallback for mistral-tokenizer (#14873)
ywang96 Mar 16, 2025
71c1e07
[Kernel] Add more tuned configs (#14877)
simon-mo Mar 16, 2025
b82662d
[BugFix] Fix torch distributed stateless PG backend init (#14870)
njhill Mar 16, 2025
d1ad2a5
[V1] [Spec Decode] Fix ngram tests (#14878)
LiuXiaoxuanPKU Mar 16, 2025
d30aa7e
[Bugfix] Limit profiling run sequence length by max_model_len (#14785)
kylesayrs Mar 16, 2025
e53b135
[Bugfix] Explicitly disable Phi-4-multimodal in V1 (#14889)
DarkLight1337 Mar 16, 2025
f6137ad
Revert "[Bugfix] Limit profiling run sequence length by max_model_len…
DarkLight1337 Mar 16, 2025
fc1f677
[BugFix][V1] Fix overhead related to bad_words sampling when not in u…
njhill Mar 16, 2025
31060b2
[V1][BugFix] Detect interleaved sliding window attention (#14896)
WoosukKwon Mar 16, 2025
b9b5bdf
[Misc] Catching Ray Compiled Graph PP test failures for V1 (#14847)
ruisearch42 Mar 16, 2025
90df7f2
[Doc] Add guidance for using `ccache` with `pip install -e .` in doc …
vadiklyutiy Mar 16, 2025
aecc780
[V1] Enable Entrypoints Tests (#14903)
robertgshaw2-redhat Mar 17, 2025
bb3aedd
[CI] Nightly Tests (#14898)
robertgshaw2-redhat Mar 17, 2025
8a5a9b7
[CI/Build] Update defaults for test reproducibility (#14893)
DarkLight1337 Mar 17, 2025
faa0275
[V1] Optimize the overhead of rewinding (#14905)
WoosukKwon Mar 17, 2025
7f6c5ee
[V1][Minor] Add __repr__ to ConstantList (#14907)
WoosukKwon Mar 17, 2025
1e799b7
[BugFix] Fix MLA + V1 + TP==1 causing reinitialization of cuda contex…
LucasWilkinson Mar 17, 2025
a73e183
[Misc] Replace os environ to monkeypatch in test suite (#14516)
t-sibiraj Mar 17, 2025
583a977
[Benchmark] Do not save detailed info to json by default (#14879)
simon-mo Mar 17, 2025
8d6cf89
[V1] [Spec Decode] Support random sampling for spec decode (#13933)
LiuXiaoxuanPKU Mar 17, 2025
b539222
[V1] Remove input cache client (#14864)
DarkLight1337 Mar 17, 2025
9b87a57
[Misc][XPU] Use None as device capacity for XPU (#14932)
yma11 Mar 17, 2025
dd3b865
[Doc] Add vLLM Beijing meetup slide (#14938)
heheda12345 Mar 17, 2025
0a74bfc
setup.py: drop assumption about local `main` branch (#14692)
russellb Mar 17, 2025
cd0cd85
[MISC] More AMD unused var clean up (#14926)
houseroad Mar 17, 2025
69698f2
fix minor miscalled method (#14327)
kushanam Mar 17, 2025
b4ad56c
[V1][TPU] Apply the ragged paged attention kernel fix and remove the …
vanbasten23 Mar 17, 2025
868a8c5
[Bugfix] Fix Ultravox on V1 (#14929)
DarkLight1337 Mar 17, 2025
6eaf1e5
[Misc] Add `--seed` option to offline multi-modal examples (#14934)
DarkLight1337 Mar 17, 2025
2bb0e1a
[Bugfix][ROCm] running new process using spawn method for rocm in tes…
vllmellm Mar 17, 2025
166a168
[Doc] Fix misleading log during multi-modal profiling (#14955)
DarkLight1337 Mar 17, 2025
d20b0c1
Add patch merger (#14957)
patrickvonplaten Mar 17, 2025
89fca67
[V1] Default MLA to V1 (#14921)
simon-mo Mar 17, 2025
e1eb45d
[Bugfix] Fix precommit - line too long in pixtral.py (#14960)
tlrmchlsmth Mar 17, 2025
aaaec52
[Bugfix][Model] Mixtral: use unused head_dim config argument (#14961)
qtrrb Mar 17, 2025
c0efdd6
[Fix][Structured Output] using vocab_size to construct matcher (#14868)
aarnphm Mar 17, 2025
37e3806
[Bugfix] Make Gemma3 MM V0 only for now (#14971)
ywang96 Mar 17, 2025
5340b0e
[Bugfix] Fix interface for Olmo2 on V1 (#14976)
ywang96 Mar 17, 2025
b89fb2a
[CI/Build] Use `AutoModelForImageTextToText` to load VLMs in tests (#…
DarkLight1337 Mar 17, 2025
e41e160
[V1] Guard Against Main Thread Usage (#14972)
robertgshaw2-redhat Mar 17, 2025
18551e8
[V1] TPU - Fix CI/CD runner (#14974)
alexm-redhat Mar 17, 2025
5eeabc2
[Bugfix] Fix bnb quantization for models with both HF-format and Mist…
tristanleclercq Mar 17, 2025
53a0cf8
[Neuron] trim attention kernel tests to fit trn1.2x instance (#14988)
liangfu Mar 18, 2025
d169575
[Doc][V1] Fix V1 APC doc (#14920)
shen-shanshan Mar 18, 2025
400d483
[Kernels] LoRA - Retire SGMV and BGMV Kernels (#14685)
varun-sundar-rabindranath Mar 18, 2025
f863ffc
[Mistral-Small 3.1] Update docs and tests (#14977)
patrickvonplaten Mar 18, 2025
db7c8ca
[Misc] Embedding model support LoRA (#14935)
jeejeelee Mar 18, 2025
4149191
[Bugfix] torchrun compatibility (#14899)
hiyouga Mar 18, 2025
dd73202
[Bugfix][Frontend] Fix validation of `logprobs` in `ChatCompletionReq…
schoennenbeck Mar 18, 2025
64fc219
[Misc][Docs] fix the comments of KV_T and CACHE_T in CALL_RESHAPE_AND…
yangsijia-serena Mar 18, 2025
ab656f2
[Bugfix] Loosen type check to avoid errors in V1 (#15021)
DarkLight1337 Mar 18, 2025
3b45714
[Bugfix] Register serializers for V0 MQ Engine (#15009)
simon-mo Mar 18, 2025
af35d3a
[TPU][V1][Bugfix] Fix chunked prefill with padding (#15037)
NickLucche Mar 18, 2025
8b793f7
MI325 configs, fused_moe_kernel bugfix (#14987)
ekuznetsov139 Mar 18, 2025
452e8fd
[MODEL] Add support for Zamba2 models (#13185)
yury-tokpanov Mar 18, 2025
179a619
[Bugfix] Fix broken CPU quantization due to triton import (#15038)
Isotr0py Mar 18, 2025
46c759c
[Bugfix] Fix LoRA extra vocab size (#15047)
jeejeelee Mar 18, 2025
3a1e648
[V1] Refactor Structured Output for multiple backends (#14694)
russellb Mar 18, 2025
99abb8b
[V1][Spec Decode] Optimize Rejection Sampler with Triton Kernels (#14…
WoosukKwon Mar 18, 2025
72a8639
[V1] TPU - CI/CD use smaller model (#15054)
alexm-redhat Mar 18, 2025
027827c
fix long dtype in topk sampling (#15049)
chujiezheng Mar 18, 2025
228b768
[Doc] Minor v1_user_guide update (#15064)
JenZhao Mar 18, 2025
4f065f1
[Misc][V1] Skip device checking if not available (#15061)
comaniac Mar 19, 2025
437f916
[Model] Pixtral: Remove layer instantiation duplication (#15053)
juliendenize Mar 19, 2025
8b3e94a
[Model] Remove duplicated message check in Mistral chat completion re…
b8zhong Mar 19, 2025
f690372
[Core] Update dtype detection and defaults (#14858)
DarkLight1337 Mar 19, 2025
05ccd0a
[V1] Ensure using int64 for sampled token ids (#15065)
WoosukKwon Mar 19, 2025
61f4121
[Bugfix] Re-enable Gemma3 for V1 (#14980)
DarkLight1337 Mar 19, 2025
68cf160
[CI][Intel GPU] update XPU dockerfile and CI script (#15109)
jikunshang Mar 19, 2025
dafb4e5
[V1][Bugfix] Fix oracle for device checking (#15104)
ywang96 Mar 19, 2025
1fe0fd1
[Misc] Avoid unnecessary HF `do_rescale` warning when passing dummy d…
DarkLight1337 Mar 19, 2025
3d44643
[Bugfix] Fix size calculation of processing cache (#15114)
DarkLight1337 Mar 19, 2025
073d1ed
[Doc] Update tip info on using latest transformers when creating a cu…
MarcCote Mar 19, 2025
6c5a319
[Misc][Benchmark] Add support for different `tokenizer_mode` (#15040)
aarnphm Mar 19, 2025
8363cd0
[Bugfix] Adjust mllama to regional compilation (#15112)
jkaniecki Mar 19, 2025
a4d8366
[Misc] Update the "the first vLLM China Meetup" slides link to point …
imkero Mar 19, 2025
374ee28
[Frontend] Remove custom_cache_manager (#13791)
fulvius31 Mar 19, 2025
61c7a1b
[V1] Minor V1 async engine test refactor (#15075)
andoorve Mar 19, 2025
26dd972
[FEAT]Support reset prefix cache by specified device (#15003)
maobaolong Mar 19, 2025
8310e0b
simple bugfix: Update stats.py (#15139)
WrRan Mar 19, 2025
b0e96aa
[V1][TPU] Change kv cache shape. (#15145)
vanbasten23 Mar 19, 2025
22d33ba
[FrontEnd][Perf] `merge_async_iterators` fast-path for single-prompt …
njhill Mar 19, 2025
0fe5609
[Docs] Annouce Ollama and Singapore Meetups (#15161)
simon-mo Mar 19, 2025
cfbca8a
[V1] TPU - Tensor parallel MP support (#15059)
alexm-redhat Mar 20, 2025
c47aafa
[BugFix] Lazily import XgrammarBackend to avoid early cuda init (#15171)
njhill Mar 20, 2025
4cb1c05
[Doc] Clarify run vllm only on one node in distributed inference (#15…
ruisearch42 Mar 20, 2025
70e500c
Fix broken tests (#14713)
jovsa Mar 20, 2025
ffa443a
[Bugfix] Fix embedding assignment for InternVL-based models (#15086)
DarkLight1337 Mar 20, 2025
40828ce
fix "Total generated tokens:" is 0 if using --backend tgi and --endpo…
sywangyi Mar 20, 2025
d8c6d7d
[V1][TPU] Support V1 Sampler for ragged attention (#14227)
NickLucche Mar 20, 2025
b88be22
[Benchmark] Allow oversample request in benchmark dataset (#15170)
JenZhao Mar 20, 2025
1f16b7f
[Core][V0] Add guidance backend for structured output (#14589)
russellb Mar 20, 2025
34868b1
[Doc] Update Mistral Small 3.1/Pixtral example (#15184)
ywang96 Mar 20, 2025
ae65f3e
[Misc]fixed disable these http request logs (#14754)
chaunceyjiang Mar 20, 2025
a597a57
[Attention] Flash Attention 3 - fp8 (#14570)
mickaelseznec Mar 20, 2025
2f726b2
[Doc] Update README.md (#15187)
DarkLight1337 Mar 20, 2025
a8652f4
Enable CUDA graph support for llama 3.2 vision (#14917)
mritterfigma Mar 20, 2025
bfe2fe0
typo: Update config.py (#15189)
WrRan Mar 20, 2025
742369d
[Frontend][Bugfix] support prefill decode disaggregation on deepseek …
billishyahao Mar 20, 2025
3d45e3d
[release] Tag vllm-cpu with latest upon new version released (#15193)
khluu Mar 20, 2025
c607a26
Fixing Imprecise Type Annotations (#15192)
WrRan Mar 20, 2025
e3f813c
[macOS] Ugrade pytorch to 2.6.0 (#15129)
linktohack Mar 20, 2025
27261e4
[Bugfix] Multi-video inference on LLaVA-Onevision (#15082)
DarkLight1337 Mar 20, 2025
69ae238
Add user forum to README (#15220)
hmellor Mar 20, 2025
a8f12a6
Fix env vars for running Ray distributed backend on GKE (#15166)
richardsliu Mar 20, 2025
5a0905b
Replace `misc` issues with link to forum (#15226)
hmellor Mar 20, 2025
086b568
[ci] feat: make the test_torchrun_example run with tp=2, external_dp=…
vermouth1992 Mar 20, 2025
d8e82bc
[Bugfix] fix V1 Engine crash while handling requests with duplicate r…
JasonJ2021 Mar 20, 2025
2b22290
[V1] Add flag to disable cascade attention (#15243)
WoosukKwon Mar 20, 2025
06dd082
Enforce that TP > 1 is not supported for Mamba2 if Quantization is En…
fabianlim Mar 21, 2025
0c6f502
[V1] Scheduler Refactoring [1/N] - Add Scheduler Interface (#15250)
WoosukKwon Mar 21, 2025
0cfe7d3
[CI/Build] LoRA : make add_lora_test safer (#15181)
varun-sundar-rabindranath Mar 21, 2025
d3ccbd6
Fix CUDA kernel index data type in vllm/csrc/quantization/fused_kerne…
houseroad Mar 21, 2025
10f55fe
[Misc] Clean up the BitsAndBytes arguments (#15140)
jeejeelee Mar 21, 2025
2e0b4cf
[ROCM] Upgrade torch to 2.6 (#15244)
SageMoore Mar 21, 2025
1e50834
[Bugfix] Fix incorrect qwen2.5-vl attention mask pre-computation (#15…
Isotr0py Mar 21, 2025
6edbfa9
Mention `extra_body` as a way top pass vLLM only parameters using the…
hmellor Mar 21, 2025
4719505
[V1][TPU] Speed up top-k on TPU by using torch.topk (#15242)
hyeygit Mar 21, 2025
0032903
[Bugfix] detect alibi and revert to FA2 (#15231)
tjohnson31415 Mar 21, 2025
296f927
[Model] RE: Mamba2 Prefill Performance Tweaks: Fixing Flurry of Unnec…
cyang49 Mar 21, 2025
11b986b
[Docs] Trim the latest news in README (#15261)
WoosukKwon Mar 21, 2025
5df2da5
[Misc] Better RayExecutor and multiprocessing compatibility (#14705)
comaniac Mar 21, 2025
e588ac2
Add an example for reproducibility (#15262)
WoosukKwon Mar 21, 2025
b15fd2b
[Hardware][TPU] Add check for no additional graph compilation during …
lsy323 Mar 21, 2025
f8a08cb
[V1] Enable Triton(ROCm) Attention backend for Nvidia GPUs (#14071)
Isotr0py Mar 21, 2025
7297941
[Doc] Update LWS docs (#15163)
Edwinhr716 Mar 21, 2025
da6ea29
[V1] Avoid redundant input processing in n>1 case (#14985)
njhill Mar 21, 2025
0fa3970
[Feature] specify model in config.yaml (#14855)
wayzeng Mar 21, 2025
a989ca2
[Bugfix] Add int8 torch dtype for KVCache (#15260)
shen-shanshan Mar 21, 2025
47c7126
[Misc] Add attention mask pre-computation optimization back to Qwen2.…
Isotr0py Mar 21, 2025
84e00ad
[Bugfix] Fix incorrect resolving order for transformers fallback (#15…
Isotr0py Mar 21, 2025
91ca929
[V1] Fix wrong import path of get_flash_attn_version (#15280)
lhtin Mar 21, 2025
8afcd0f
[Bugfix] Fix broken kernel test due to missing rename for v1 Triton b…
Isotr0py Mar 21, 2025
61e8c18
[Misc] Add cProfile helpers (#15074)
russellb Mar 21, 2025
93a00d7
[v1] Refactor KVCacheConfig (#14079)
heheda12345 Mar 21, 2025
c21b99b
[Bugfix][VLM] fix llava processor (#15285)
MengqingCao Mar 21, 2025
baec0d4
Revert "[Feature] specify model in config.yaml (#14855)" (#15293)
DarkLight1337 Mar 21, 2025
cfbb8c9
[TPU][V1] MHA Pallas backend (#15288)
NickLucche Mar 21, 2025
790b797
[Build/CI] Fix env var typo (#15305)
russellb Mar 21, 2025
4c69e22
[Misc] Increase RayDistributedExecutor RAY_CGRAPH_get_timeout (#15301)
ruisearch42 Mar 22, 2025
df14302
[Bugfix][V0] Multi-sequence logprobs streaming edge case (#15259)
andylolu2 Mar 22, 2025
ec870fb
[FEAT] [ROCm]: Add AITER RMS Norm (Layer Norm) Feature (#14959)
tjtanaa Mar 22, 2025
1c2bec0
[Doc] add load_format items in docs (#14804)
wwl2755 Mar 22, 2025
2fa0e13
[Bugfix] Fix torch.compile raise FileNotFoundError (#15278)
jeejeelee Mar 22, 2025
8a8b30e
[Bugfix] LoRA V0 - Fix case where `max_num_seqs` is between cudagraph…
varun-sundar-rabindranath Mar 22, 2025
2f4bd35
[Model] Support Tele-FLM Model (#15023)
atone Mar 22, 2025
e96feff
vllm support for swissai model
AllenHaoHuang Mar 22, 2025
40 changes: 26 additions & 14 deletions .buildkite/check-wheel-size.py
@@ -1,36 +1,48 @@
 # SPDX-License-Identifier: Apache-2.0

 import os
+import sys
 import zipfile

-MAX_SIZE_MB = 250
+# Read the VLLM_MAX_SIZE_MB environment variable, defaulting to 400 MiB
+# Note that we have 400 MiB quota, please use it wisely.
+# See https://github.com/pypi/support/issues/3792 .
+# Please also sync the value with the one in Dockerfile.
+VLLM_MAX_SIZE_MB = int(os.environ.get('VLLM_MAX_SIZE_MB', 400))


 def print_top_10_largest_files(zip_file):
     """Print the top 10 largest files in the given zip file."""
     with zipfile.ZipFile(zip_file, 'r') as z:
         file_sizes = [(f, z.getinfo(f).file_size) for f in z.namelist()]
         file_sizes.sort(key=lambda x: x[1], reverse=True)
         for f, size in file_sizes[:10]:
-            print(f"{f}: {size/(1024*1024)} MBs uncompressed.")
+            print(f"{f}: {size / (1024 * 1024):.2f} MBs uncompressed.")


 def check_wheel_size(directory):
     """Check the size of .whl files in the given directory."""
     for root, _, files in os.walk(directory):
-        for f in files:
-            if f.endswith(".whl"):
-                wheel_path = os.path.join(root, f)
-                wheel_size = os.path.getsize(wheel_path)
-                wheel_size_mb = wheel_size / (1024 * 1024)
-                if wheel_size_mb > MAX_SIZE_MB:
-                    print(
-                        f"Wheel {wheel_path} is too large ({wheel_size_mb} MB) "
-                        f"compare to the allowed size ({MAX_SIZE_MB} MB).")
+        for file_name in files:
+            if file_name.endswith(".whl"):
+                wheel_path = os.path.join(root, file_name)
+                wheel_size_mb = os.path.getsize(wheel_path) / (1024 * 1024)
+                if wheel_size_mb > VLLM_MAX_SIZE_MB:
+                    print(f"Not allowed: Wheel {wheel_path} is larger "
+                          f"({wheel_size_mb:.2f} MB) than the limit "
+                          f"({VLLM_MAX_SIZE_MB} MB).")
                     print_top_10_largest_files(wheel_path)
                     return 1
                 else:
                     print(f"Wheel {wheel_path} is within the allowed size "
-                          f"({wheel_size_mb} MB).")
+                          f"({wheel_size_mb:.2f} MB).")
     return 0


 if __name__ == "__main__":
-    import sys
-    sys.exit(check_wheel_size(sys.argv[1]))
+    if len(sys.argv) < 2:
+        print("Usage: python check-wheel-size.py <directory>")
+        sys.exit(1)
+
+    directory = sys.argv[1]
+    sys.exit(check_wheel_size(directory))
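The rewritten check reads its limit from `VLLM_MAX_SIZE_MB` instead of a hard-coded constant, so CI can adjust the quota without touching the script. A minimal sketch of that env-var-override pattern (the helper names here are illustrative, not the script's actual API):

```python
import os


def get_limit_mb(env=os.environ):
    # Fall back to 400 when VLLM_MAX_SIZE_MB is unset; int() makes a
    # malformed value fail loudly rather than silently pass oversized wheels.
    return int(env.get("VLLM_MAX_SIZE_MB", 400))


def is_within_limit(size_bytes, limit_mb):
    # Compare in MB, mirroring the wheel-size check above.
    return size_bytes / (1024 * 1024) <= limit_mb
```

For example, `get_limit_mb({"VLLM_MAX_SIZE_MB": "250"})` restores the old 250 MB limit for a single CI run.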
26 changes: 26 additions & 0 deletions .buildkite/generate_index.py
@@ -0,0 +1,26 @@
# SPDX-License-Identifier: Apache-2.0

import argparse
import os

template = """<!DOCTYPE html>
<html>
<body>
<h1>Links for vLLM</h1/>
<a href="../{wheel_html_escaped}">{wheel}</a><br/>
</body>
</html>
"""

parser = argparse.ArgumentParser()
parser.add_argument("--wheel", help="The wheel path.", required=True)
args = parser.parse_args()

filename = os.path.basename(args.wheel)

with open("index.html", "w") as f:
print(f"Generated index.html for {args.wheel}")
# cloudfront requires escaping the '+' character
f.write(
template.format(wheel=filename,
wheel_html_escaped=filename.replace("+", "%2B")))
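The `+` escaping matters because wheel filenames often carry local version tags like `1.2.3+cu124`, and CloudFront interprets a literal `+` in a URL path, so only the href is escaped while the visible link text keeps the real filename. A sketch of just that step (the helper name is hypothetical):

```python
def escape_wheel_href(filename: str) -> str:
    # CloudFront requires '+' (common in local version tags such as
    # '+cu124') to be percent-encoded in the link target.
    return filename.replace("+", "%2B")
```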
@@ -9,3 +9,4 @@ tasks:
 value: 0.664
 limit: 1000
 num_fewshot: 5
+trust_remote_code: True
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Meta-Llama-3-8B-Instruct-W8-Channel-A8-Dynamic-Asym-Per-Token-Test -b "auto" -l 250 -f 5 -t 1
model_name: "nm-testing/Meta-Llama-3-8B-Instruct-W8-Channel-A8-Dynamic-Asym-Per-Token-Test"
tasks:
- name: "gsm8k"
metrics:
- name: "exact_match,strict-match"
value: 0.764
- name: "exact_match,flexible-extract"
value: 0.764
limit: 250
num_fewshot: 5
11 changes: 11 additions & 0 deletions .buildkite/lm-eval-harness/configs/Meta-Llama-3-8B-QQQ.yaml
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m HandH1998/QQQ-Llama-3-8b-g128 -b 32 -l 1000 -f 5 -t 1
model_name: "HandH1998/QQQ-Llama-3-8b-g128"
tasks:
- name: "gsm8k"
metrics:
- name: "exact_match,strict-match"
value: 0.419
- name: "exact_match,flexible-extract"
value: 0.416
limit: 1000
num_fewshot: 5
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m neuralmagic/Llama-3.2-1B-Instruct-quantized.w8a8 -b "auto" -l 1000 -f 5 -t 1
model_name: "neuralmagic/Llama-3.2-1B-Instruct-quantized.w8a8"
tasks:
- name: "gsm8k"
metrics:
- name: "exact_match,strict-match"
value: 0.356
- name: "exact_match,flexible-extract"
value: 0.358
limit: 1000
num_fewshot: 5
11 changes: 11 additions & 0 deletions .buildkite/lm-eval-harness/configs/Minitron-4B-Base-FP8.yaml
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m mgoin/Minitron-4B-Base-FP8 -b auto -l 1000 -f 5 -t 1
model_name: "mgoin/Minitron-4B-Base-FP8"
tasks:
- name: "gsm8k"
metrics:
- name: "exact_match,strict-match"
value: 0.231
- name: "exact_match,flexible-extract"
value: 0.22
limit: 1000
num_fewshot: 5
11 changes: 0 additions & 11 deletions .buildkite/lm-eval-harness/configs/Minitron-4B-Base.yaml

This file was deleted.

@@ -0,0 +1,11 @@
# bash ./run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/SparseLlama-3.1-8B-gsm8k-pruned.2of4-chnl_wts_per_tok_dyn_act_fp8-BitM -b "auto" -t 2
model_name: "nm-testing/SparseLlama-3.1-8B-gsm8k-pruned.2of4-chnl_wts_per_tok_dyn_act_fp8-BitM"
tasks:
- name: "gsm8k"
metrics:
- name: "exact_match,strict-match"
value: 0.6353
- name: "exact_match,flexible-extract"
value: 0.637
limit: null
num_fewshot: null
7 changes: 4 additions & 3 deletions .buildkite/lm-eval-harness/configs/models-small.txt
@@ -1,9 +1,10 @@
 Meta-Llama-3-8B-Instruct.yaml
-Meta-Llama-3-8B-Instruct-FP8.yaml
+Meta-Llama-3-8B-Instruct-FP8-compressed-tensors.yaml
-Meta-Llama-3-8B-Instruct-INT8-compressed-tensors.yaml
+Meta-Llama-3.2-1B-Instruct-INT8-compressed-tensors.yaml
 Meta-Llama-3-8B-Instruct-INT8-compressed-tensors-asym.yaml
 Meta-Llama-3-8B-Instruct-nonuniform-compressed-tensors.yaml
 Meta-Llama-3-8B-Instruct-Channelwise-compressed-tensors.yaml
-Minitron-4B-Base.yaml
+Minitron-4B-Base-FP8.yaml
 Qwen2-1.5B-Instruct-INT8-compressed-tensors.yaml
 Qwen2-1.5B-Instruct-FP8W8.yaml
+Meta-Llama-3-8B-QQQ.yaml
8 changes: 4 additions & 4 deletions .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh
@@ -2,7 +2,7 @@
 # We can use this script to compute baseline accuracy on GSM for transformers.
 #
 # Make sure you have lm-eval-harness installed:
-#   pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git@9516087b81a61d0e220b22cc1b75be76de23bc10
+#   pip install lm-eval==0.4.4

 usage() {
     echo``
…
@@ -41,6 +41,6 @@ while getopts "m:b:l:f:" OPT; do
 done

 lm_eval --model hf \
-    --model_args pretrained=$MODEL,parallelize=True \
-    --tasks gsm8k --num_fewshot $FEWSHOT --limit $LIMIT \
-    --batch_size $BATCH_SIZE
+    --model_args "pretrained=$MODEL,parallelize=True" \
+    --tasks gsm8k --num_fewshot "$FEWSHOT" --limit "$LIMIT" \
+    --batch_size "$BATCH_SIZE"
8 changes: 4 additions & 4 deletions .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh
@@ -3,7 +3,7 @@
 # We use this for fp8, which HF does not support.
 #
 # Make sure you have lm-eval-harness installed:
-#   pip install lm-eval==0.4.3
+#   pip install lm-eval==0.4.4

 usage() {
     echo``
…
@@ -46,6 +46,6 @@ while getopts "m:b:l:f:t:" OPT; do
 done

 lm_eval --model vllm \
-    --model_args pretrained=$MODEL,tensor_parallel_size=$TP_SIZE,distributed_executor_backend="ray",trust_remote_code=true,max_model_len=4096 \
-    --tasks gsm8k --num_fewshot $FEWSHOT --limit $LIMIT \
-    --batch_size $BATCH_SIZE
+    --model_args "pretrained=$MODEL,tensor_parallel_size=$TP_SIZE,distributed_executor_backend=ray,trust_remote_code=true,max_model_len=4096" \
+    --tasks gsm8k --num_fewshot "$FEWSHOT" --limit "$LIMIT" \
+    --batch_size "$BATCH_SIZE"
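The quoting changes in both baseline scripts guard against word splitting: an unquoted `$MODEL` or `$LIMIT` would be broken into multiple arguments if the value ever contained whitespace. A minimal, self-contained demonstration (the function is illustrative, not from the scripts):

```shell
# Reports how many positional arguments it received.
count_args() { echo "$#"; }

value="two words"
unquoted=$(count_args $value)    # word-split into 2 arguments
quoted=$(count_args "$value")    # passed through as 1 argument
echo "$unquoted $quoted"
```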
2 changes: 1 addition & 1 deletion .buildkite/lm-eval-harness/run-tests.sh
@@ -30,7 +30,7 @@ while getopts "c:t:" OPT; do
 done

 # Parse list of configs.
-IFS=$'\n' read -d '' -r -a MODEL_CONFIGS < $CONFIG
+IFS=$'\n' read -d '' -r -a MODEL_CONFIGS < "$CONFIG"

 for MODEL_CONFIG in "${MODEL_CONFIGS[@]}"
 do
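`read -d '' -r -a` is a bash-only idiom for slurping a newline-separated config list into an array, and the added quotes around `$CONFIG` protect paths containing spaces. For reference, a portable `while read` loop performs the same traversal (the temp file and variable names are illustrative):

```shell
tmp=$(mktemp)
printf 'a.yaml\nb.yaml\nc.yaml\n' > "$tmp"

count=0
while IFS= read -r cfg; do
    # Each iteration sees exactly one config filename.
    count=$((count + 1))
    last="$cfg"
done < "$tmp"

rm -f "$tmp"
echo "$count $last"
```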
20 changes: 17 additions & 3 deletions .buildkite/lm-eval-harness/test_lm_eval_correctness.py
@@ -1,3 +1,4 @@
+# SPDX-License-Identifier: Apache-2.0
 """
 LM eval harness on model to compare vs HF baseline computed offline.
 Configs are found in configs/$MODEL.yaml
@@ -12,9 +13,10 @@

 import lm_eval
 import numpy
+import pytest
 import yaml

-RTOL = 0.02
+RTOL = 0.05
 TEST_DATA_FILE = os.environ.get(
     "LM_EVAL_TEST_DATA_FILE",
     ".buildkite/lm-eval-harness/configs/Meta-Llama-3-8B-Instruct.yaml")
@@ -23,9 +25,12 @@


 def launch_lm_eval(eval_config):
+    trust_remote_code = eval_config.get('trust_remote_code', False)
+
     model_args = f"pretrained={eval_config['model_name']}," \
                  f"tensor_parallel_size={TP_SIZE}," \
-                 f"add_bos_token=true"
+                 f"add_bos_token=true," \
+                 f"trust_remote_code={trust_remote_code}"

     results = lm_eval.simple_evaluate(
         model="vllm",
@@ -42,14 +47,23 @@ def test_lm_eval_correctness():
     eval_config = yaml.safe_load(
         Path(TEST_DATA_FILE).read_text(encoding="utf-8"))

+    if eval_config[
+            "model_name"] == "nm-testing/Meta-Llama-3-70B-Instruct-FBGEMM-nonuniform":  #noqa: E501
+        pytest.skip("FBGEMM is currently failing on main.")
+
     # Launch eval requests.
     results = launch_lm_eval(eval_config)

     # Confirm scores match ground truth.
+    success = True
     for task in eval_config["tasks"]:
         for metric in task["metrics"]:
             ground_truth = metric["value"]
             measured_value = results["results"][task["name"]][metric["name"]]
             print(f'{task["name"]} | {metric["name"]}: '
                   f'ground_truth={ground_truth} | measured={measured_value}')
-            assert numpy.isclose(ground_truth, measured_value, rtol=RTOL)
+            success = success and numpy.isclose(
+                ground_truth, measured_value, rtol=RTOL)
+
+    # Assert at the end, print all scores even on failure for debugging.
+    assert success
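The reworked test no longer aborts on the first mismatched metric; it accumulates a boolean and asserts once at the end, so every score is printed even when one fails. A condensed sketch of that collect-then-assert pattern (synthetic data; `numpy.isclose` is the real comparison the test uses):

```python
import numpy

RTOL = 0.05  # relative tolerance, loosened from 0.02 in this diff


def all_scores_close(expected, measured, rtol=RTOL):
    """Compare every metric, printing each one, and only fail at the end."""
    success = True
    for name, truth in expected.items():
        ok = bool(numpy.isclose(truth, measured[name], rtol=rtol))
        print(f"{name}: ground_truth={truth} | measured={measured[name]} | ok={ok}")
        success = success and ok
    return success
```

Note that `numpy.isclose` applies `rtol` relative to its second argument, so a measured 0.75 against a ground truth of 0.764 passes at 5% tolerance while 0.60 does not.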