Merged
3246 commits
32ef498
[V1] Temporarily disable FlashInfer Rejection Sampler (#14788)
WoosukKwon Mar 14, 2025
0b1cfa6
[Kernel] LoRA - Enable CUDAGraphs for V1 (#14626)
varun-sundar-rabindranath Mar 14, 2025
fb4c7f8
[Kernel] [V1] Further optimizations to ROCm (Triton) Backend to bette…
tdoublep Mar 14, 2025
95d680b
[Bugfix][IPEX] Add `VLLM_CPU_MOE_PREPACK` to allow disabling MoE prep…
gau-nernst Mar 14, 2025
f1f632d
[ci] Reduce number of tests in fastcheck (#14782)
khluu Mar 14, 2025
4059adc
[Misc][Minor] Simplify `SamplingParams.__post_init__()` (#14772)
njhill Mar 14, 2025
d3d4956
[Neuron] flatten test parameterization for neuron attention kernels (…
liangfu Mar 14, 2025
a6e0d09
[Feature] Add visionarena offline support for benchmark_throughput (#…
JenZhao Mar 14, 2025
0c2af17
[CI] Fix missing example model id in processor test (#14787)
ywang96 Mar 14, 2025
9532c49
[Attention] MLA get rid of materialization (#14770)
LucasWilkinson Mar 14, 2025
27b50f1
[Bugfix][Kernel][CPU] Fix num_tokens in CPU rotary embedding kernel (…
gau-nernst Mar 14, 2025
09269b3
[BugFix]Fix performance serving benchmark when enable profiling (#14737)
Potabk Mar 14, 2025
601bd32
[Misc] Clean up type annotation for `SupportsMultiModal` (#14794)
DarkLight1337 Mar 14, 2025
54cc46f
[Bugfix] Fix small typo in the example of Streaming delimiter (#14793)
bravo325806 Mar 14, 2025
989ecd2
[Misc] Gemma3ForConditionalGeneration supports LoRA (#14797)
jeejeelee Mar 14, 2025
c77620d
[V1][Minor] Minor code cleanup for scheduling metrics (#14800)
WoosukKwon Mar 14, 2025
40253ba
[Bugfix][W8A8] fixed cutlass block fp8 binding (#14796)
DefTruth Mar 14, 2025
ab93f13
[VLM] Various cleanup and fixes (#14806)
DarkLight1337 Mar 14, 2025
fd8e055
[BugFix]: properly catch templating error when preprocess input (#13976)
gcalmettes Mar 14, 2025
613c5bb
[Bugfix] Fix Aria test loading (#14823)
DarkLight1337 Mar 14, 2025
1140991
[V1] Fix vocab size calculation for structured output (#14826)
russellb Mar 14, 2025
0b0d642
[Frontend] Fix log message to use http vs https (#14774)
russellb Mar 14, 2025
9d2b4a7
[V1][Metrics] Updated list of deprecated metrics in v0.8 (#14695)
markmc Mar 14, 2025
73deea2
[Frontend] track server_load (#13950)
daniel-salib Mar 14, 2025
977a167
[Bugfix][Kernel]: Fix AllSpark kernel compilation errors and enable f…
wyajieha Mar 14, 2025
7097b4c
[release] Remove log cleanup commands from TPU job (#14838)
khluu Mar 14, 2025
270a5da
Re-enable the AMD Entrypoints Test (#14711)
Alexei-V-Ivanov-AMD Mar 14, 2025
fe66b34
[Model] Mamba2 Prefill Performance Tweaks: Fixing Flurry of Unnecessa…
cyang49 Mar 14, 2025
46f9889
[V1] Fix model parameterization for structured output tests (#14833)
russellb Mar 14, 2025
14f301b
Update to torch==2.6.0 (#12721)
mgoin Mar 14, 2025
4067778
[CI] Add TPU v1 test (#14834)
richardsliu Mar 14, 2025
233ffce
[Build/CI] Move ninja to common deps (#14835)
russellb Mar 14, 2025
bbd94a1
[Build/CI] Upgrade aiohttp to include CVE fix (#14840)
russellb Mar 14, 2025
54a8804
[Doc] More neutral K8s deployment guide (#14084)
terrytangyuan Mar 14, 2025
dd344e0
[Bugfix] Fix torch_xla in V0 which can't handle None seed introduced …
yarongmu-google Mar 15, 2025
9f37422
[Neuron][CI] update docker run command (#14829)
liangfu Mar 15, 2025
acaea3b
[Bugfix][V1] Fix flashinfer sampling (#14815)
DefTruth Mar 15, 2025
ccf02fc
Revert "[Model] Mamba2 Prefill Performance Tweaks: Fixing Flurry of U…
tlrmchlsmth Mar 15, 2025
776dcec
Disable outlines cache by default (#14837)
russellb Mar 15, 2025
97ac781
[Misc] Remove misleading message in gemma2 and gemma3 (#14850)
Isotr0py Mar 15, 2025
8c0d15d
[Misc][Easy] Annotate unused vars in the csrc files (#14798)
houseroad Mar 15, 2025
d4d93db
[V1] V1 Enablement Oracle (#13726)
robertgshaw2-redhat Mar 15, 2025
877e352
[Docs] Add new East Coast vLLM Meetup slides to README and meetups.md…
simon-mo Mar 15, 2025
a2ae496
[CPU] Support FP8 KV cache (#14741)
bigPYJ1151 Mar 15, 2025
5952d8a
[Attention] Get rid of mla cache alignment (#14842)
LucasWilkinson Mar 15, 2025
e0fdfa1
[CI/Build] Delete LoRA bias test (#14849)
jeejeelee Mar 15, 2025
4c7629c
[V1][Structured Output] calculate vocab_size eagerly (#14851)
aarnphm Mar 15, 2025
aaacf17
[Doc] V1 user guide (#13991)
JenZhao Mar 15, 2025
ee3778d
[Build/CI] Upgrade jinja2 to get 3 moderate CVE fixes (#14839)
russellb Mar 15, 2025
9ed6ee9
[Bugfix] EAGLE output norm bug (#14464)
luyuzhe111 Mar 15, 2025
3556a41
[VLM] Limit multimodal input cache by memory (#14805)
DarkLight1337 Mar 15, 2025
f58aea0
[CI][Intel GPU] refine intel GPU ci docker build (#14860)
jikunshang Mar 15, 2025
74bc397
[Core] Expose API endpoint `/is_sleeping` (#14312)
waltforme Mar 15, 2025
61c6a5a
[VLM] Merged multi-modal processor for Pixtral (#12211)
Flechman Mar 15, 2025
3453b96
[Misc][Doc] Minor benchmark README update (#14874)
ywang96 Mar 16, 2025
def232e
[VLM] Clean up Phi-4-MM ViT implementation (#14812)
Isotr0py Mar 16, 2025
b30c75d
[V1] Remove V0 fallback for mistral-tokenizer (#14873)
ywang96 Mar 16, 2025
71c1e07
[Kernel] Add more tuned configs (#14877)
simon-mo Mar 16, 2025
b82662d
[BugFix] Fix torch distributed stateless PG backend init (#14870)
njhill Mar 16, 2025
d1ad2a5
[V1] [Spec Decode] Fix ngram tests (#14878)
LiuXiaoxuanPKU Mar 16, 2025
d30aa7e
[Bugfix] Limit profiling run sequence length by max_model_len (#14785)
kylesayrs Mar 16, 2025
e53b135
[Bugfix] Explicitly disable Phi-4-multimodal in V1 (#14889)
DarkLight1337 Mar 16, 2025
f6137ad
Revert "[Bugfix] Limit profiling run sequence length by max_model_len…
DarkLight1337 Mar 16, 2025
fc1f677
[BugFix][V1] Fix overhead related to bad_words sampling when not in u…
njhill Mar 16, 2025
31060b2
[V1][BugFix] Detect interleaved sliding window attention (#14896)
WoosukKwon Mar 16, 2025
b9b5bdf
[Misc] Catching Ray Compiled Graph PP test failures for V1 (#14847)
ruisearch42 Mar 16, 2025
90df7f2
[Doc] Add guidance for using `ccache` with `pip install -e .` in doc …
vadiklyutiy Mar 16, 2025
aecc780
[V1] Enable Entrypoints Tests (#14903)
robertgshaw2-redhat Mar 17, 2025
bb3aedd
[CI] Nightly Tests (#14898)
robertgshaw2-redhat Mar 17, 2025
8a5a9b7
[CI/Build] Update defaults for test reproducibility (#14893)
DarkLight1337 Mar 17, 2025
faa0275
[V1] Optimize the overhead of rewinding (#14905)
WoosukKwon Mar 17, 2025
7f6c5ee
[V1][Minor] Add __repr__ to ConstantList (#14907)
WoosukKwon Mar 17, 2025
1e799b7
[BugFix] Fix MLA + V1 + TP==1 causing reinitialization of cuda contex…
LucasWilkinson Mar 17, 2025
a73e183
[Misc] Replace os environ to monkeypatch in test suite (#14516)
t-sibiraj Mar 17, 2025
583a977
[Benchmark] Do not save detailed info to json by default (#14879)
simon-mo Mar 17, 2025
8d6cf89
[V1] [Spec Decode] Support random sampling for spec decode (#13933)
LiuXiaoxuanPKU Mar 17, 2025
b539222
[V1] Remove input cache client (#14864)
DarkLight1337 Mar 17, 2025
9b87a57
[Misc][XPU] Use None as device capacity for XPU (#14932)
yma11 Mar 17, 2025
dd3b865
[Doc] Add vLLM Beijing meetup slide (#14938)
heheda12345 Mar 17, 2025
0a74bfc
setup.py: drop assumption about local `main` branch (#14692)
russellb Mar 17, 2025
cd0cd85
[MISC] More AMD unused var clean up (#14926)
houseroad Mar 17, 2025
69698f2
fix minor miscalled method (#14327)
kushanam Mar 17, 2025
b4ad56c
[V1][TPU] Apply the ragged paged attention kernel fix and remove the …
vanbasten23 Mar 17, 2025
868a8c5
[Bugfix] Fix Ultravox on V1 (#14929)
DarkLight1337 Mar 17, 2025
6eaf1e5
[Misc] Add `--seed` option to offline multi-modal examples (#14934)
DarkLight1337 Mar 17, 2025
2bb0e1a
[Bugfix][ROCm] running new process using spawn method for rocm in tes…
vllmellm Mar 17, 2025
166a168
[Doc] Fix misleading log during multi-modal profiling (#14955)
DarkLight1337 Mar 17, 2025
d20b0c1
Add patch merger (#14957)
patrickvonplaten Mar 17, 2025
89fca67
[V1] Default MLA to V1 (#14921)
simon-mo Mar 17, 2025
e1eb45d
[Bugfix] Fix precommit - line too long in pixtral.py (#14960)
tlrmchlsmth Mar 17, 2025
aaaec52
[Bugfix][Model] Mixtral: use unused head_dim config argument (#14961)
qtrrb Mar 17, 2025
c0efdd6
[Fix][Structured Output] using vocab_size to construct matcher (#14868)
aarnphm Mar 17, 2025
37e3806
[Bugfix] Make Gemma3 MM V0 only for now (#14971)
ywang96 Mar 17, 2025
5340b0e
[Bugfix] Fix interface for Olmo2 on V1 (#14976)
ywang96 Mar 17, 2025
b89fb2a
[CI/Build] Use `AutoModelForImageTextToText` to load VLMs in tests (#…
DarkLight1337 Mar 17, 2025
e41e160
[V1] Guard Against Main Thread Usage (#14972)
robertgshaw2-redhat Mar 17, 2025
18551e8
[V1] TPU - Fix CI/CD runner (#14974)
alexm-redhat Mar 17, 2025
5eeabc2
[Bugfix] Fix bnb quantization for models with both HF-format and Mist…
tristanleclercq Mar 17, 2025
53a0cf8
[Neuron] trim attention kernel tests to fit trn1.2x instance (#14988)
liangfu Mar 18, 2025
d169575
[Doc][V1] Fix V1 APC doc (#14920)
shen-shanshan Mar 18, 2025
400d483
[Kernels] LoRA - Retire SGMV and BGMV Kernels (#14685)
varun-sundar-rabindranath Mar 18, 2025
f863ffc
[Mistral-Small 3.1] Update docs and tests (#14977)
patrickvonplaten Mar 18, 2025
db7c8ca
[Misc] Embedding model support LoRA (#14935)
jeejeelee Mar 18, 2025
4149191
[Bugfix] torchrun compatibility (#14899)
hiyouga Mar 18, 2025
dd73202
[Bugfix][Frontend] Fix validation of `logprobs` in `ChatCompletionReq…
schoennenbeck Mar 18, 2025
64fc219
[Misc][Docs] fix the comments of KV_T and CACHE_T in CALL_RESHAPE_AND…
yangsijia-serena Mar 18, 2025
ab656f2
[Bugfix] Loosen type check to avoid errors in V1 (#15021)
DarkLight1337 Mar 18, 2025
3b45714
[Bugfix] Register serializers for V0 MQ Engine (#15009)
simon-mo Mar 18, 2025
af35d3a
[TPU][V1][Bugfix] Fix chunked prefill with padding (#15037)
NickLucche Mar 18, 2025
8b793f7
MI325 configs, fused_moe_kernel bugfix (#14987)
ekuznetsov139 Mar 18, 2025
452e8fd
[MODEL] Add support for Zamba2 models (#13185)
yury-tokpanov Mar 18, 2025
179a619
[Bugfix] Fix broken CPU quantization due to triton import (#15038)
Isotr0py Mar 18, 2025
46c759c
[Bugfix] Fix LoRA extra vocab size (#15047)
jeejeelee Mar 18, 2025
3a1e648
[V1] Refactor Structured Output for multiple backends (#14694)
russellb Mar 18, 2025
99abb8b
[V1][Spec Decode] Optimize Rejection Sampler with Triton Kernels (#14…
WoosukKwon Mar 18, 2025
72a8639
[V1] TPU - CI/CD use smaller model (#15054)
alexm-redhat Mar 18, 2025
027827c
fix long dtype in topk sampling (#15049)
chujiezheng Mar 18, 2025
228b768
[Doc] Minor v1_user_guide update (#15064)
JenZhao Mar 18, 2025
4f065f1
[Misc][V1] Skip device checking if not available (#15061)
comaniac Mar 19, 2025
437f916
[Model] Pixtral: Remove layer instantiation duplication (#15053)
juliendenize Mar 19, 2025
8b3e94a
[Model] Remove duplicated message check in Mistral chat completion re…
b8zhong Mar 19, 2025
f690372
[Core] Update dtype detection and defaults (#14858)
DarkLight1337 Mar 19, 2025
05ccd0a
[V1] Ensure using int64 for sampled token ids (#15065)
WoosukKwon Mar 19, 2025
61f4121
[Bugfix] Re-enable Gemma3 for V1 (#14980)
DarkLight1337 Mar 19, 2025
68cf160
[CI][Intel GPU] update XPU dockerfile and CI script (#15109)
jikunshang Mar 19, 2025
dafb4e5
[V1][Bugfix] Fix oracle for device checking (#15104)
ywang96 Mar 19, 2025
1fe0fd1
[Misc] Avoid unnecessary HF `do_rescale` warning when passing dummy d…
DarkLight1337 Mar 19, 2025
3d44643
[Bugfix] Fix size calculation of processing cache (#15114)
DarkLight1337 Mar 19, 2025
073d1ed
[Doc] Update tip info on using latest transformers when creating a cu…
MarcCote Mar 19, 2025
6c5a319
[Misc][Benchmark] Add support for different `tokenizer_mode` (#15040)
aarnphm Mar 19, 2025
8363cd0
[Bugfix] Adjust mllama to regional compilation (#15112)
jkaniecki Mar 19, 2025
a4d8366
[Misc] Update the "the first vLLM China Meetup" slides link to point …
imkero Mar 19, 2025
374ee28
[Frontend] Remove custom_cache_manager (#13791)
fulvius31 Mar 19, 2025
61c7a1b
[V1] Minor V1 async engine test refactor (#15075)
andoorve Mar 19, 2025
26dd972
[FEAT]Support reset prefix cache by specified device (#15003)
maobaolong Mar 19, 2025
8310e0b
simple bugfix: Update stats.py (#15139)
WrRan Mar 19, 2025
b0e96aa
[V1][TPU] Change kv cache shape. (#15145)
vanbasten23 Mar 19, 2025
22d33ba
[FrontEnd][Perf] `merge_async_iterators` fast-path for single-prompt …
njhill Mar 19, 2025
0fe5609
[Docs] Announce Ollama and Singapore Meetups (#15161)
simon-mo Mar 19, 2025
cfbca8a
[V1] TPU - Tensor parallel MP support (#15059)
alexm-redhat Mar 20, 2025
c47aafa
[BugFix] Lazily import XgrammarBackend to avoid early cuda init (#15171)
njhill Mar 20, 2025
4cb1c05
[Doc] Clarify run vllm only on one node in distributed inference (#15…
ruisearch42 Mar 20, 2025
70e500c
Fix broken tests (#14713)
jovsa Mar 20, 2025
ffa443a
[Bugfix] Fix embedding assignment for InternVL-based models (#15086)
DarkLight1337 Mar 20, 2025
40828ce
fix "Total generated tokens:" is 0 if using --backend tgi and --endpo…
sywangyi Mar 20, 2025
d8c6d7d
[V1][TPU] Support V1 Sampler for ragged attention (#14227)
NickLucche Mar 20, 2025
b88be22
[Benchmark] Allow oversample request in benchmark dataset (#15170)
JenZhao Mar 20, 2025
1f16b7f
[Core][V0] Add guidance backend for structured output (#14589)
russellb Mar 20, 2025
34868b1
[Doc] Update Mistral Small 3.1/Pixtral example (#15184)
ywang96 Mar 20, 2025
ae65f3e
[Misc]fixed disable these http request logs (#14754)
chaunceyjiang Mar 20, 2025
a597a57
[Attention] Flash Attention 3 - fp8 (#14570)
mickaelseznec Mar 20, 2025
2f726b2
[Doc] Update README.md (#15187)
DarkLight1337 Mar 20, 2025
a8652f4
Enable CUDA graph support for llama 3.2 vision (#14917)
mritterfigma Mar 20, 2025
bfe2fe0
typo: Update config.py (#15189)
WrRan Mar 20, 2025
742369d
[Frontend][Bugfix] support prefill decode disaggregation on deepseek …
billishyahao Mar 20, 2025
3d45e3d
[release] Tag vllm-cpu with latest upon new version released (#15193)
khluu Mar 20, 2025
c607a26
Fixing Imprecise Type Annotations (#15192)
WrRan Mar 20, 2025
e3f813c
[macOS] Upgrade pytorch to 2.6.0 (#15129)
linktohack Mar 20, 2025
27261e4
[Bugfix] Multi-video inference on LLaVA-Onevision (#15082)
DarkLight1337 Mar 20, 2025
69ae238
Add user forum to README (#15220)
hmellor Mar 20, 2025
a8f12a6
Fix env vars for running Ray distributed backend on GKE (#15166)
richardsliu Mar 20, 2025
5a0905b
Replace `misc` issues with link to forum (#15226)
hmellor Mar 20, 2025
086b568
[ci] feat: make the test_torchrun_example run with tp=2, external_dp=…
vermouth1992 Mar 20, 2025
d8e82bc
[Bugfix] fix V1 Engine crash while handling requests with duplicate r…
JasonJ2021 Mar 20, 2025
2b22290
[V1] Add flag to disable cascade attention (#15243)
WoosukKwon Mar 20, 2025
06dd082
Enforce that TP > 1 is not supported for Mamba2 if Quantization is En…
fabianlim Mar 21, 2025
0c6f502
[V1] Scheduler Refactoring [1/N] - Add Scheduler Interface (#15250)
WoosukKwon Mar 21, 2025
0cfe7d3
[CI/Build] LoRA : make add_lora_test safer (#15181)
varun-sundar-rabindranath Mar 21, 2025
d3ccbd6
Fix CUDA kernel index data type in vllm/csrc/quantization/fused_kerne…
houseroad Mar 21, 2025
10f55fe
[Misc] Clean up the BitsAndBytes arguments (#15140)
jeejeelee Mar 21, 2025
2e0b4cf
[ROCM] Upgrade torch to 2.6 (#15244)
SageMoore Mar 21, 2025
1e50834
[Bugfix] Fix incorrect qwen2.5-vl attention mask pre-computation (#15…
Isotr0py Mar 21, 2025
6edbfa9
Mention `extra_body` as a way to pass vLLM only parameters using the…
hmellor Mar 21, 2025
4719505
[V1][TPU] Speed up top-k on TPU by using torch.topk (#15242)
hyeygit Mar 21, 2025
0032903
[Bugfix] detect alibi and revert to FA2 (#15231)
tjohnson31415 Mar 21, 2025
296f927
[Model] RE: Mamba2 Prefill Performance Tweaks: Fixing Flurry of Unnec…
cyang49 Mar 21, 2025
11b986b
[Docs] Trim the latest news in README (#15261)
WoosukKwon Mar 21, 2025
5df2da5
[Misc] Better RayExecutor and multiprocessing compatibility (#14705)
comaniac Mar 21, 2025
e588ac2
Add an example for reproducibility (#15262)
WoosukKwon Mar 21, 2025
b15fd2b
[Hardware][TPU] Add check for no additional graph compilation during …
lsy323 Mar 21, 2025
f8a08cb
[V1] Enable Triton(ROCm) Attention backend for Nvidia GPUs (#14071)
Isotr0py Mar 21, 2025
7297941
[Doc] Update LWS docs (#15163)
Edwinhr716 Mar 21, 2025
da6ea29
[V1] Avoid redundant input processing in n>1 case (#14985)
njhill Mar 21, 2025
0fa3970
[Feature] specify model in config.yaml (#14855)
wayzeng Mar 21, 2025
a989ca2
[Bugfix] Add int8 torch dtype for KVCache (#15260)
shen-shanshan Mar 21, 2025
47c7126
[Misc] Add attention mask pre-computation optimization back to Qwen2.…
Isotr0py Mar 21, 2025
84e00ad
[Bugfix] Fix incorrect resolving order for transformers fallback (#15…
Isotr0py Mar 21, 2025
91ca929
[V1] Fix wrong import path of get_flash_attn_version (#15280)
lhtin Mar 21, 2025
8afcd0f
[Bugfix] Fix broken kernel test due to missing rename for v1 Triton b…
Isotr0py Mar 21, 2025
61e8c18
[Misc] Add cProfile helpers (#15074)
russellb Mar 21, 2025
93a00d7
[v1] Refactor KVCacheConfig (#14079)
heheda12345 Mar 21, 2025
c21b99b
[Bugfix][VLM] fix llava processor (#15285)
MengqingCao Mar 21, 2025
baec0d4
Revert "[Feature] specify model in config.yaml (#14855)" (#15293)
DarkLight1337 Mar 21, 2025
cfbb8c9
[TPU][V1] MHA Pallas backend (#15288)
NickLucche Mar 21, 2025
790b797
[Build/CI] Fix env var typo (#15305)
russellb Mar 21, 2025
4c69e22
[Misc] Increase RayDistributedExecutor RAY_CGRAPH_get_timeout (#15301)
ruisearch42 Mar 22, 2025
df14302
[Bugfix][V0] Multi-sequence logprobs streaming edge case (#15259)
andylolu2 Mar 22, 2025
ec870fb
[FEAT] [ROCm]: Add AITER RMS Norm (Layer Norm) Feature (#14959)
tjtanaa Mar 22, 2025
1c2bec0
[Doc] add load_format items in docs (#14804)
wwl2755 Mar 22, 2025
2fa0e13
[Bugfix] Fix torch.compile raise FileNotFoundError (#15278)
jeejeelee Mar 22, 2025
8a8b30e
[Bugfix] LoRA V0 - Fix case where `max_num_seqs` is between cudagraph…
varun-sundar-rabindranath Mar 22, 2025
2f4bd35
[Model] Support Tele-FLM Model (#15023)
atone Mar 22, 2025
eb63ea1
[V1] Add `disable-any-whitespace` option support for xgrammar (#15316)
russellb Mar 22, 2025
dd861b9
[BugFix][Typing] Fix Imprecise Type Annotations (#15208)
WrRan Mar 22, 2025
b877031
Remove openvino support in favor of external plugin (#15339)
russellb Mar 22, 2025
a827aa8
[doc] Add back previous news (#15331)
heheda12345 Mar 23, 2025
0661cfe
Fix v1 supported oracle for worker-cls and worker-extension-cls (#15324)
hijkzzz Mar 23, 2025
50c9636
[V1][Usage] Refactor speculative decoding configuration and tests (#1…
ShangmingCai Mar 23, 2025
09b6a95
[ci/build] update torch nightly version for GH200 (#15135)
youkaichao Mar 23, 2025
f68cce8
[ci/build] fix broken tests in LLM.collective_rpc (#15350)
youkaichao Mar 23, 2025
f90d34b
[Misc] Add tuned R1 w8a8 and MoE configs for NVIDIA L20 (#15322)
DefTruth Mar 23, 2025
6ebaf9a
[Bugfix] consider related env vars for torch.compiled cache hash (#14…
DefTruth Mar 23, 2025
b9bd76c
[V1][Spec Decode] Respect prompt_lookup_max (#15348)
WoosukKwon Mar 23, 2025
bc8ed3c
[V1][Spec Decode] Use better defaults for N-gram (#15358)
WoosukKwon Mar 23, 2025
d6cd59f
[Frontend] Support tool calling and reasoning parser (#14511)
WangErXiao Mar 23, 2025
9c5c81b
[Misc][Doc] Add note regarding loading `generation_config` by default…
ywang96 Mar 23, 2025
dccf535
[V1] Enable V1 Fp8 cache for FA3 in the oracle (#15191)
LucasWilkinson Mar 23, 2025
f622dbc
[Fix] [torch.compile] Improve UUID system for custom passes (#15249)
ProExpertProg Mar 24, 2025
d20e261
Fix non-contiguous input passed to Marlin kernel (#15319)
Qubitium Mar 24, 2025
3892e58
[Misc] Upgrade BNB version (#15183)
jeejeelee Mar 24, 2025
5797fb9
[Misc] Remove ignore_reinit_error for ray.init() (#15373)
ruisearch42 Mar 24, 2025
948ab03
[Bugfix][V1] Avoid importing PreTrainedModel (#15366)
HollowMan6 Mar 24, 2025
cc8accf
[Misc] Update guided decoding logs to debug (#15310)
sfbemerk Mar 24, 2025
7ffcccf
Revert "[CI/Build] Use uv python for docker rather than ppa:deadsnake…
simon-mo Mar 24, 2025
6b3cc75
[Kernel] allow non-contiguous input for marlin kernel (#14658)
jinzhen-lin Mar 24, 2025
038de04
Fix zmq IPv6 URL format error (#15341)
russellb Mar 24, 2025
cbcdf2c
[Bugfix] Fix chat template loading (#15143)
DarkLight1337 Mar 24, 2025
9606d57
[distributed] fix dp group (#15355)
youkaichao Mar 24, 2025
761702f
[Core] Integrate `fastsafetensors` loader for loading model weights (…
manish-sethi Mar 24, 2025
8abe69b
[Core] Don't force uppercase for VLLM_LOGGING_LEVEL (#15306)
russellb Mar 24, 2025
0893567
[V1][Minor] fix comments (#15392)
Chen-0210 Mar 24, 2025
9cc6451
[MISC] Refine no available block debug msg (#15076)
yiliu30 Mar 24, 2025
3aee657
[V1] Aggregate chunked prompt logprobs in model runner (#14875)
njhill Mar 24, 2025
5eeadc2
[Hardware][Gaudi][Feature] Enable Dynamic MoE for Mixtral (#12303)
zhenwei-intel Mar 24, 2025
3eb08ed
[DOC] Add Kubernetes deployment guide with CPUs (#14865)
terrytangyuan Mar 24, 2025
6dd55af
[Doc] Update docs on handling OOM (#15357)
DarkLight1337 Mar 24, 2025
9d72daf
[V1][Perf] Simpler request output queues (#15156)
njhill Mar 24, 2025
623e2ed
[BugFix][V1] Quick fix for min_tokens with multiple EOS (#15407)
njhill Mar 24, 2025
23fdab0
[Hardware][TPU] Skip failed compilation test (#15421)
lsy323 Mar 24, 2025
8279201
[Build] Cython compilation support fix (#14296)
gshtras Mar 24, 2025
f533b58
[ROCm][Kernel] MoE weights padding (#14454)
gshtras Mar 24, 2025
ebcebee
[V1][Spec Decode] Enable spec decode for top-p & top-k sampling (#15063)
WoosukKwon Mar 25, 2025
911c8eb
[Minor][Spec Decode] Remove compiled_softmax (#15416)
WoosukKwon Mar 25, 2025
97cfa65
Add pipeline parallel support to `TransformersModel` (#12832)
hmellor Mar 25, 2025
6db9457
[Misc] Remove LoRA log (#15388)
jeejeelee Mar 25, 2025
b5269db
Revert "Fix non-contiguous input passed to Marlin kernel (#15319)" (#…
tlrmchlsmth Mar 25, 2025
10b34e3
[Bugfix] Fixed the issue of not being able to input video and image s…
chaunceyjiang Mar 25, 2025
a09ad90
[V1] guidance backend for structured output + `auto` fallback mode (#…
russellb Mar 25, 2025
25f560a
[V1][Spec Decode] Update target_logits in place for rejection samplin…
WoosukKwon Mar 25, 2025
304e7f4
vllm support for swissai model
AllenHaoHuang Mar 22, 2025
40 changes: 26 additions & 14 deletions .buildkite/check-wheel-size.py
Original file line number Diff line number Diff line change
@@ -1,36 +1,48 @@
# SPDX-License-Identifier: Apache-2.0

import os
import sys
import zipfile

MAX_SIZE_MB = 250
# Read the VLLM_MAX_SIZE_MB environment variable, defaulting to 400 MiB
# Note that we have 400 MiB quota, please use it wisely.
# See https://github.com/pypi/support/issues/3792 .
# Please also sync the value with the one in Dockerfile.
VLLM_MAX_SIZE_MB = int(os.environ.get('VLLM_MAX_SIZE_MB', 400))


def print_top_10_largest_files(zip_file):
"""Print the top 10 largest files in the given zip file."""
with zipfile.ZipFile(zip_file, 'r') as z:
file_sizes = [(f, z.getinfo(f).file_size) for f in z.namelist()]
file_sizes.sort(key=lambda x: x[1], reverse=True)
for f, size in file_sizes[:10]:
print(f"{f}: {size/(1024*1024)} MBs uncompressed.")
print(f"{f}: {size / (1024 * 1024):.2f} MBs uncompressed.")


def check_wheel_size(directory):
"""Check the size of .whl files in the given directory."""
for root, _, files in os.walk(directory):
for f in files:
if f.endswith(".whl"):
wheel_path = os.path.join(root, f)
wheel_size = os.path.getsize(wheel_path)
wheel_size_mb = wheel_size / (1024 * 1024)
if wheel_size_mb > MAX_SIZE_MB:
print(
f"Wheel {wheel_path} is too large ({wheel_size_mb} MB) "
f"compare to the allowed size ({MAX_SIZE_MB} MB).")
for file_name in files:
if file_name.endswith(".whl"):
wheel_path = os.path.join(root, file_name)
wheel_size_mb = os.path.getsize(wheel_path) / (1024 * 1024)
if wheel_size_mb > VLLM_MAX_SIZE_MB:
print(f"Not allowed: Wheel {wheel_path} is larger "
f"({wheel_size_mb:.2f} MB) than the limit "
f"({VLLM_MAX_SIZE_MB} MB).")
print_top_10_largest_files(wheel_path)
return 1
else:
print(f"Wheel {wheel_path} is within the allowed size "
f"({wheel_size_mb} MB).")
f"({wheel_size_mb:.2f} MB).")
return 0


if __name__ == "__main__":
import sys
sys.exit(check_wheel_size(sys.argv[1]))
if len(sys.argv) < 2:
print("Usage: python check-wheel-size.py <directory>")
sys.exit(1)

directory = sys.argv[1]
sys.exit(check_wheel_size(directory))
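The updated script replaces the hardcoded 250 MB cap with an environment-driven limit defaulting to 400 MB. A minimal sketch of that lookup-and-compare pattern (a plain dict stands in for `os.environ` so the example is deterministic, and the wheel size is hypothetical):

```python
# Sketch of the env-driven size limit used in check-wheel-size.py.
# `env` stands in for os.environ; set e.g. {"VLLM_MAX_SIZE_MB": "300"}
# to tighten the limit without editing the script.
env: dict[str, str] = {}
limit_mb = int(env.get("VLLM_MAX_SIZE_MB", 400))

wheel_size_mb = 389.5  # hypothetical wheel size in MB
exit_code = 0 if wheel_size_mb <= limit_mb else 1
print(f"limit={limit_mb} MB, wheel={wheel_size_mb:.2f} MB, exit={exit_code}")
```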
26 changes: 26 additions & 0 deletions .buildkite/generate_index.py
@@ -0,0 +1,26 @@
# SPDX-License-Identifier: Apache-2.0

import argparse
import os

template = """<!DOCTYPE html>
<html>
<body>
<h1>Links for vLLM</h1>
<a href="../{wheel_html_escaped}">{wheel}</a><br/>
</body>
</html>
"""

parser = argparse.ArgumentParser()
parser.add_argument("--wheel", help="The wheel path.", required=True)
args = parser.parse_args()

filename = os.path.basename(args.wheel)

with open("index.html", "w") as f:
print(f"Generated index.html for {args.wheel}")
# cloudfront requires escaping the '+' character
f.write(
template.format(wheel=filename,
wheel_html_escaped=filename.replace("+", "%2B")))
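As the comment in the new script notes, CloudFront requires the `+` in local-version wheel names to be percent-encoded in the `href`, while the visible link text keeps the original name. A quick sketch with a hypothetical wheel filename:

```python
# '+' in the wheel filename must become %2B in the href for CloudFront,
# while the human-readable link text keeps the raw filename.
filename = "vllm-0.8.0+cu124-cp312-cp312-manylinux1_x86_64.whl"  # hypothetical
escaped = filename.replace("+", "%2B")
link = f'<a href="../{escaped}">{filename}</a><br/>'
print(link)
```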
@@ -9,3 +9,4 @@ tasks:
value: 0.664
limit: 1000
num_fewshot: 5
trust_remote_code: True
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Meta-Llama-3-8B-Instruct-W8-Channel-A8-Dynamic-Asym-Per-Token-Test -b "auto" -l 250 -f 5 -t 1
model_name: "nm-testing/Meta-Llama-3-8B-Instruct-W8-Channel-A8-Dynamic-Asym-Per-Token-Test"
tasks:
- name: "gsm8k"
metrics:
- name: "exact_match,strict-match"
value: 0.764
- name: "exact_match,flexible-extract"
value: 0.764
limit: 250
num_fewshot: 5
11 changes: 11 additions & 0 deletions .buildkite/lm-eval-harness/configs/Meta-Llama-3-8B-QQQ.yaml
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m HandH1998/QQQ-Llama-3-8b-g128 -b 32 -l 1000 -f 5 -t 1
model_name: "HandH1998/QQQ-Llama-3-8b-g128"
tasks:
- name: "gsm8k"
metrics:
- name: "exact_match,strict-match"
value: 0.419
- name: "exact_match,flexible-extract"
value: 0.416
limit: 1000
num_fewshot: 5
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m neuralmagic/Llama-3.2-1B-Instruct-quantized.w8a8 -b "auto" -l 1000 -f 5 -t 1
model_name: "neuralmagic/Llama-3.2-1B-Instruct-quantized.w8a8"
tasks:
- name: "gsm8k"
metrics:
- name: "exact_match,strict-match"
value: 0.356
- name: "exact_match,flexible-extract"
value: 0.358
limit: 1000
num_fewshot: 5
11 changes: 11 additions & 0 deletions .buildkite/lm-eval-harness/configs/Minitron-4B-Base-FP8.yaml
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m mgoin/Minitron-4B-Base-FP8 -b auto -l 1000 -f 5 -t 1
model_name: "mgoin/Minitron-4B-Base-FP8"
tasks:
- name: "gsm8k"
metrics:
- name: "exact_match,strict-match"
value: 0.231
- name: "exact_match,flexible-extract"
value: 0.22
limit: 1000
num_fewshot: 5
11 changes: 0 additions & 11 deletions .buildkite/lm-eval-harness/configs/Minitron-4B-Base.yaml

This file was deleted.

@@ -0,0 +1,11 @@
# bash ./run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/SparseLlama-3.1-8B-gsm8k-pruned.2of4-chnl_wts_per_tok_dyn_act_fp8-BitM -b "auto" -t 2
model_name: "nm-testing/SparseLlama-3.1-8B-gsm8k-pruned.2of4-chnl_wts_per_tok_dyn_act_fp8-BitM"
tasks:
- name: "gsm8k"
metrics:
- name: "exact_match,strict-match"
value: 0.6353
- name: "exact_match,flexible-extract"
value: 0.637
limit: null
num_fewshot: null
7 changes: 4 additions & 3 deletions .buildkite/lm-eval-harness/configs/models-small.txt
@@ -1,9 +1,10 @@
Meta-Llama-3-8B-Instruct.yaml
Meta-Llama-3-8B-Instruct-FP8.yaml
Meta-Llama-3-8B-Instruct-FP8-compressed-tensors.yaml
Meta-Llama-3-8B-Instruct-INT8-compressed-tensors.yaml
Meta-Llama-3.2-1B-Instruct-INT8-compressed-tensors.yaml
Meta-Llama-3-8B-Instruct-INT8-compressed-tensors-asym.yaml
Meta-Llama-3-8B-Instruct-nonuniform-compressed-tensors.yaml
Meta-Llama-3-8B-Instruct-Channelwise-compressed-tensors.yaml
Minitron-4B-Base.yaml
Minitron-4B-Base-FP8.yaml
Qwen2-1.5B-Instruct-INT8-compressed-tensors.yaml
Qwen2-1.5B-Instruct-FP8W8.yaml
Meta-Llama-3-8B-QQQ.yaml
8 changes: 4 additions & 4 deletions .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh
@@ -2,7 +2,7 @@
# We can use this script to compute baseline accuracy on GSM for transformers.
#
# Make sure you have lm-eval-harness installed:
# pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git@9516087b81a61d0e220b22cc1b75be76de23bc10
# pip install lm-eval==0.4.4

usage() {
echo``
@@ -41,6 +41,6 @@ while getopts "m:b:l:f:" OPT; do
done

lm_eval --model hf \
--model_args pretrained=$MODEL,parallelize=True \
--tasks gsm8k --num_fewshot $FEWSHOT --limit $LIMIT \
--batch_size $BATCH_SIZE
--model_args "pretrained=$MODEL,parallelize=True" \
--tasks gsm8k --num_fewshot "$FEWSHOT" --limit "$LIMIT" \
--batch_size "$BATCH_SIZE"
8 changes: 4 additions & 4 deletions .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh
@@ -3,7 +3,7 @@
# We use this for fp8, which HF does not support.
#
# Make sure you have lm-eval-harness installed:
# pip install lm-eval==0.4.3
# pip install lm-eval==0.4.4

usage() {
echo``
@@ -46,6 +46,6 @@ while getopts "m:b:l:f:t:" OPT; do
done

lm_eval --model vllm \
--model_args pretrained=$MODEL,tensor_parallel_size=$TP_SIZE,distributed_executor_backend="ray",trust_remote_code=true,max_model_len=4096 \
--tasks gsm8k --num_fewshot $FEWSHOT --limit $LIMIT \
--batch_size $BATCH_SIZE
--model_args "pretrained=$MODEL,tensor_parallel_size=$TP_SIZE,distributed_executor_backend=ray,trust_remote_code=true,max_model_len=4096" \
--tasks gsm8k --num_fewshot "$FEWSHOT" --limit "$LIMIT" \
--batch_size "$BATCH_SIZE"
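Both baseline scripts now quote their variable expansions (`"$FEWSHOT"`, `"$BATCH_SIZE"`, the whole `--model_args` string). The motivation is shell word splitting: an unquoted expansion containing whitespace becomes several arguments. A small Python sketch of the difference, with shell behavior approximated by `str.split` and a hypothetical argument value:

```python
# Approximates what the shell does with an unquoted vs. quoted expansion.
value = "pretrained=meta-llama/Meta-Llama-3-8B-Instruct, parallelize=True"  # hypothetical

unquoted = value.split()  # unquoted $VAR: split on whitespace -> multiple argv entries
quoted = [value]          # "$VAR": preserved as a single argv entry
print(unquoted, quoted, sep="\n")
```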
2 changes: 1 addition & 1 deletion .buildkite/lm-eval-harness/run-tests.sh
@@ -30,7 +30,7 @@ while getopts "c:t:" OPT; do
done

# Parse list of configs.
IFS=$'\n' read -d '' -r -a MODEL_CONFIGS < $CONFIG
IFS=$'\n' read -d '' -r -a MODEL_CONFIGS < "$CONFIG"

for MODEL_CONFIG in "${MODEL_CONFIGS[@]}"
do
20 changes: 17 additions & 3 deletions .buildkite/lm-eval-harness/test_lm_eval_correctness.py
@@ -1,3 +1,4 @@
# SPDX-License-Identifier: Apache-2.0
"""
LM eval harness on model to compare vs HF baseline computed offline.
Configs are found in configs/$MODEL.yaml
@@ -12,9 +13,10 @@

import lm_eval
import numpy
import pytest
import yaml

RTOL = 0.02
RTOL = 0.05
TEST_DATA_FILE = os.environ.get(
"LM_EVAL_TEST_DATA_FILE",
".buildkite/lm-eval-harness/configs/Meta-Llama-3-8B-Instruct.yaml")
@@ -23,9 +25,12 @@


def launch_lm_eval(eval_config):
trust_remote_code = eval_config.get('trust_remote_code', False)

model_args = f"pretrained={eval_config['model_name']}," \
f"tensor_parallel_size={TP_SIZE}," \
f"add_bos_token=true"
f"add_bos_token=true," \
f"trust_remote_code={trust_remote_code}"

results = lm_eval.simple_evaluate(
model="vllm",
@@ -42,14 +47,23 @@ def test_lm_eval_correctness():
eval_config = yaml.safe_load(
Path(TEST_DATA_FILE).read_text(encoding="utf-8"))

if eval_config[
"model_name"] == "nm-testing/Meta-Llama-3-70B-Instruct-FBGEMM-nonuniform": #noqa: E501
pytest.skip("FBGEMM is currently failing on main.")

# Launch eval requests.
results = launch_lm_eval(eval_config)

# Confirm scores match ground truth.
success = True
for task in eval_config["tasks"]:
for metric in task["metrics"]:
ground_truth = metric["value"]
measured_value = results["results"][task["name"]][metric["name"]]
print(f'{task["name"]} | {metric["name"]}: '
f'ground_truth={ground_truth} | measured={measured_value}')
assert numpy.isclose(ground_truth, measured_value, rtol=RTOL)
success = success and numpy.isclose(
ground_truth, measured_value, rtol=RTOL)

# Assert at the end, print all scores even on failure for debugging.
assert success
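The reworked test accumulates per-metric comparisons and asserts only at the end, so every score is printed even when one metric fails. A sketch of that collect-then-assert pattern with hypothetical scores, using stdlib `math.isclose` in place of `numpy.isclose` (close enough for illustration, though the two handle relative tolerance slightly differently):

```python
import math

RTOL = 0.05
# Hypothetical ground-truth and measured scores for one task.
expected = {"gsm8k": {"exact_match,strict-match": 0.764}}
measured = {"gsm8k": {"exact_match,strict-match": 0.750}}

success = True
for task, metrics in expected.items():
    for name, ground_truth in metrics.items():
        value = measured[task][name]
        print(f"{task} | {name}: ground_truth={ground_truth} | measured={value}")
        # Keep iterating so every metric is printed before the final assert.
        success = success and math.isclose(ground_truth, value, rel_tol=RTOL)

assert success  # fail once, at the end, with all scores already logged
```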