Closed
820 commits
6de3e13
Add logging for torch nightly version (#17669)
yangw-dev May 7, 2025
18dd5e0
[Model] Mamba2 causal conv1d Refactor to Split Prefill and Decode Req…
cyang49 May 7, 2025
a17cef7
Removed unused marlin cuda code (#17684)
mgoin May 7, 2025
e50a1f1
[TPU] Add kernel test for moe_pallas (#17496)
mgoin May 7, 2025
950b711
Replace lm-eval bash script with pytest and use enforce_eager for fas…
mgoin May 7, 2025
8d84d83
[BugFix][Spec Decode] Fix hidden size mismatch between target and eag…
WoosukKwon May 7, 2025
822de7f
[Misc] Split model loader (#17712)
jeejeelee May 7, 2025
c3e9d50
[Misc] Use `apply_rotary_emb` from vllm_flash_attn for Qwen2-VL visio…
Isotr0py May 7, 2025
1a45a61
[Kernel] GGUF MoeVec kernel (#16780)
SzymonOzog May 7, 2025
f80ae5b
[Kernel] Use fused rmsnorm for some models like qwen3 series (#17735)
Eviannn May 7, 2025
ba7703e
[Misc] Remove qlora_adapter_name_or_path (#17699)
jeejeelee May 7, 2025
043e4c4
Add NeuronxDistributedInference support, Speculative Decoding, Dynami…
aws-satyajith May 7, 2025
8a15c26
[Frontend] Add missing chat templates for various MLLMs (#17758)
DarkLight1337 May 7, 2025
324a311
Fix test_memory_usage_no_spec (#17754)
sarckk May 7, 2025
98c89e1
Make key optional for rotary embedding (#17566)
sarckk May 7, 2025
7377dd0
[doc] update the issue link (#17782)
reidliu41 May 7, 2025
32aa74c
[ROCm][FP8][Kernel] FP8 quantization fused into Custom Paged Attentio…
gshtras May 7, 2025
1a6af14
Only depend on importlib-metadata for Python < 3.10 (#17776)
tiran May 7, 2025
be8ff88
[Bugfix] Fix Video IO error for short video (#17791)
Isotr0py May 7, 2025
646a31e
Fix and simplify `deprecated=True` CLI `kwarg` (#17781)
hmellor May 7, 2025
f98e307
[Bugfix] Fix missing lora name mapping for lora without prefix (#17793)
Isotr0py May 7, 2025
db593aa
[Quantization] Quark MXFP4 format loading (#16943)
BowenBao May 7, 2025
c20ef40
[Hardware][TPU][V1] Multi-LoRA implementation for the V1 TPU backend …
Akshat-Tripathi May 7, 2025
ed5272c
[BugFix] Avoid secondary missing `MultiprocExecutor.workers` error (#…
njhill May 7, 2025
d43f914
[Core][Feature] Input metadata dump on crash (#13407)
wallashss May 7, 2025
a8238bb
[Chore][Doc] uses model id determined from OpenAI client (#17815)
aarnphm May 8, 2025
66ab3b1
Don't call the venv `vllm` (#17810)
hmellor May 8, 2025
3d13ca0
[BugFix] Fix `--disable-log-stats` in V1 server mode (#17600)
njhill May 8, 2025
7ea2adb
[Core] Support full cuda graph in v1 (#16072)
chanh May 8, 2025
b2da14a
Improve exception reporting in MP engine (#17800)
vmarkovtsev May 8, 2025
c747d84
[Installation] OpenTelemetry version update (#17771)
Xarbirus May 8, 2025
998eea4
Only log non-default CLI args for online serving (#17803)
hmellor May 8, 2025
6930a41
[V1] Add VLLM_ALLOW_INSECURE_SERIALIZATION env var (#17490)
russellb May 8, 2025
5a499e7
[Kernel][Hardware][AMD] Bf16 mfma opt for ROCm skinny GEMMs (#17071)
amd-hhashemi May 8, 2025
e515668
[Hardware][Power] Enable compressed tensor W8A8 INT8 quantization for…
Akashcodes732 May 8, 2025
843b222
[Hardware][Intel-Gaudi] Support Automatic Prefix Caching on HPU (#17648)
adobrzyn May 8, 2025
96722aa
[Frontend] Chat template fallbacks for multimodal models (#17805)
DarkLight1337 May 8, 2025
597051e
[Qwen3]add qwen3-235b-bf16 fused moe config on A100 (#17715)
Ximingwang-09 May 8, 2025
39956ef
[Bugfix] Fix bad words for Mistral models (#17753)
qionghuang6 May 8, 2025
0a9bbaa
[Misc] support model prefix & add deepseek vl2 tiny fused moe config …
xsank May 8, 2025
ca04b97
[Bugfix] Fix tool call template validation for Mistral models (#17644)
RIckYuan999 May 8, 2025
a463555
[TPU] Fix the test_sampler (#17820)
bythew3i May 8, 2025
bb239a7
[Bugfix] Fix quark fp8 format loading on AMD GPUs (#12612)
fxmarty-amd May 8, 2025
a1e19b6
[Doc] Fix a typo in the file name (#17836)
DarkLight1337 May 8, 2025
f50dcb7
[Easy] Eliminate c10::optional usage in vllm/csrc (#17819)
houseroad May 8, 2025
53d0cb7
[Misc] add chatbox integration (#17828)
reidliu41 May 8, 2025
e4ca6e3
Fix transient dependency error in docs build (#17848)
hmellor May 8, 2025
015815f
[Bugfix] `use_fast` failing to be propagated to Qwen2-VL image proces…
DarkLight1337 May 8, 2025
a944f8e
[Misc] Delete LoRA-related redundancy code (#17841)
jeejeelee May 8, 2025
ec54d73
[CI] Fix test_collective_rpc (#17858)
russellb May 8, 2025
226a427
[V1] Improve VLLM_ALLOW_INSECURE_SERIALIZATION logging (#17860)
russellb May 8, 2025
a83a0f9
[Test] Attempt all TPU V1 tests, even if some of them fail. (#17334)
yarongmu-google May 8, 2025
8342e3a
[CI] Prune down lm-eval small tests (#17012)
mgoin May 8, 2025
4f605a6
Fix noisy warning for uncalibrated q_scale/p_scale (#17414)
mgoin May 8, 2025
376786f
Add cutlass support for blackwell fp8 blockwise gemm (#14383)
wenscarl May 8, 2025
3c9396a
[FEAT][ROCm]: Support AITER MLA on V1 Engine (#17523)
vllmellm May 9, 2025
760e3ec
[V1][Structured Output] Update llguidance (`>= 0.7.11`) to avoid Attr…
shen-shanshan May 9, 2025
5e6f939
[Attention] MLA move rotary embedding to cuda-graph region (#17668)
LucasWilkinson May 9, 2025
d310e6d
[BUGFIX]: return fast when request requires prompt logprobs (#17251)
andyxning May 9, 2025
3d1e387
[Docs] Add Slides from NYC Meetup (#17879)
simon-mo May 9, 2025
89a0315
[Doc] Update several links in reasoning_outputs.md (#17846)
windsonsea May 9, 2025
ff8c400
[Doc] remove visible token in doc (#17884)
yma11 May 9, 2025
217db4b
[Bugfix][ROCm] Fix AITER MLA V1 (#17880)
vllmellm May 9, 2025
6e4a93e
[Bugfix][CPU] Fix broken AVX2 CPU TP support (#17252)
Isotr0py May 9, 2025
5b2dcbf
Fix Whisper crash caused by invalid `max_num_batched_tokens` conf…
inkcherry May 9, 2025
c6798ba
Change `top_k` to be disabled with `0` (still accept `-1` for now) (#…
hmellor May 9, 2025
ec61ea2
[Misc] add dify integration (#17895)
reidliu41 May 9, 2025
9f64e93
[BugFix][AMD] Compatible patch for latest AITER(05/07/2025) (#17864)
qli88 May 9, 2025
200da9a
[v1] Move block management logic from KVCacheManager to SpecializedMa…
heheda12345 May 9, 2025
6e5595c
[CI/Build] Automatically retry flaky tests (#17856)
DarkLight1337 May 9, 2025
85b72cb
Revert "[BugFix][AMD] Compatible patch for latest AITER(05/07/2025)" …
mgoin May 9, 2025
c44c384
[Misc] Add references in ray_serve_deepseek example (#17907)
ruisearch42 May 9, 2025
5c4c08f
[Misc] Auto fallback to float16 for pre-Ampere GPUs when detected bfl…
Isotr0py May 9, 2025
22481fb
Update CT WNA16MarlinMoE integration (#16666)
mgoin May 9, 2025
7d4aeda
Handle error when `str` passed to `/v1/audio/transcriptions` (#17909)
hmellor May 9, 2025
ea2236b
Add option to use torch._inductor.standalone_compile (#17057)
zou3519 May 9, 2025
7e35711
[V1][Spec Decoding] Include bonus tokens in mean acceptance length (#…
markmc May 9, 2025
4b2ed79
Improve configs - the rest! (#17562)
hmellor May 9, 2025
3b602cd
AMD conditional all test execution // new test groups (#17556)
Alexei-V-Ivanov-AMD May 9, 2025
0c0fdae
[Hardware/NVIDIA/Kernel] Enable nvidia/DeepSeek-R1-FP4 Model (#16362)
pavanimajety May 9, 2025
7042cc9
[V1][Spec Decoding] Log accumulated metrics after system goes idle (#…
markmc May 10, 2025
246e3e0
fix broken test vllm:test_kernels - test_attention_selector.py::test_…
tracelogfb May 10, 2025
fc4441a
Add missing content type headers to /ping and /health (#17036) (#17786)
edrevo May 10, 2025
6831189
Don't default construct `ModelConfig` when default constructing `Vllm…
hmellor May 10, 2025
4c31218
[Misc] remove --model from vllm serve usage (#17944)
reidliu41 May 10, 2025
950751a
[v1] Pass BlockTable and KVCacheSpec to AttentionMetadataBuilders (#1…
heheda12345 May 10, 2025
ca66a16
[v1] Rename specialized_manager.py to single_type_kv_cache_manager.py…
heheda12345 May 10, 2025
d74e5f3
[Kernel] fp4 marlin kernel (#17687)
jinzhen-lin May 11, 2025
90d0a74
[Bugfix] Add revision to `transformers.Auto*.from_pretrained` process…
xinli-sw May 11, 2025
9112155
[Perf] Use small max_num_batched_tokens for A100 (#17885)
KuntaiDu May 11, 2025
eea22a5
fix amd triton mla path (#17871)
842974287 May 11, 2025
8132365
[Bugfix]: v1 engine - consider lora adapters in allowed_token_ids (#1…
bbrowning May 11, 2025
d1110f5
[doc] update lora doc (#17936)
reidliu41 May 11, 2025
9cea90e
[Frontend] Add /classify endpoint (#17032)
frieda-huang May 11, 2025
cd3edfc
[Misc] Add compressed-tensors NVFP4A16 emulation support (#17914)
dsikka May 11, 2025
06c0922
[FP8][ROCm][Attention] Enable FP8 KV cache on ROCm for V1 (#17870)
gshtras May 11, 2025
e4b8713
[New Model]: nomic-embed-text-v2-moe (#17785)
noooop May 11, 2025
009b3d5
[Misc] not show --model in vllm serve --help (#16691)
reidliu41 May 11, 2025
a810b5b
[BugFix] [ROCm]: Bugfix and handle addition case of input for `rocm_a…
tjtanaa May 11, 2025
7de18d5
[BUG] [ROCm] [MLA] Fix variable name bug due to change in variable na…
tjtanaa May 11, 2025
021c16c
[Model] Broadcast Ovis2 implementation to fit Ovis1.6 (#17861)
Isotr0py May 12, 2025
d45fe33
[misc] add instructions on how to install nvshmem/pplx/deepep (#17964)
youkaichao May 12, 2025
08bf784
[Bugfix] validate grammar and throw 400 error instead of crashing the…
Jason-CKY May 12, 2025
ada50aa
[bugfix] fix the wrong parser (#17958)
reidliu41 May 12, 2025
19a3c78
[Bugfix] Fix pydantic.errors.PydanticUserError (#17962)
Potabk May 12, 2025
4307830
[Bugfix][TPU] Use np array when updating cache slot_mapping (#17971)
lsy323 May 12, 2025
891b9d3
[Fix] Benchmark `"EngineClient" has no attribute "model_config"` (#17…
b8zhong May 12, 2025
3a5ea75
[Feature] Support DeepSeekV3 Function Call (#17784)
Xu-Wenqing May 12, 2025
9fbf2bf
Correcting test cases in buildkite job for IBM Power (#17675)
AaruniAggarwal May 12, 2025
7ea6cb2
[Misc] Improve modelscope import error (#17983)
jeejeelee May 12, 2025
05a4324
Initialize the delta tool call fields explicitly (#17340)
maxdebayser May 12, 2025
d191102
[P/D] NIXL Integration (#17751)
robertgshaw2-redhat May 12, 2025
98ea356
[Lora][Frontend]Add default local directory LoRA resolver plugin. (#1…
jberkhahn May 12, 2025
72a3f6b
Construct `KVTransferConfig` properly from Python instead of using JS…
hmellor May 12, 2025
b9fd0d7
[CI/Build] Fix TPU V1 Test mixed use of & and && across tests (#17968)
CAROLZXYZXY May 12, 2025
289199f
[Core] Use platform-agnostic device control for DP engine core (#17245)
jianzs May 12, 2025
e9c730c
Enabling "Weight Loading Multiple GPU Test - Large Models" (#18020)
Alexei-V-Ivanov-AMD May 12, 2025
302f3ac
[v1][KVCacheManager] Change prefix caching metric from counting block…
heheda12345 May 12, 2025
195adb4
[Chore] Remove unused method (#18024)
robertgshaw2-redhat May 12, 2025
2b0db9b
Enable standard language model for torch nightly (#18004)
yangw-dev May 12, 2025
ebab1ac
[CI] Make JSON output tests less likely to fail (#17859)
russellb May 12, 2025
dc99053
[V1][Spec Decode] Eagle unit tests (#17350)
wwl2755 May 12, 2025
f065de4
Fix FBGEMM integration (#18002)
mgoin May 12, 2025
acee8f4
[Model] Support MiMo-7B inference with MTP (#17433)
bwshen-mi May 12, 2025
9d7ea9d
Update some more deprecated type hinting (#17998)
hmellor May 12, 2025
307939f
Use NVFP4 Marlin for CompressedTensorsW4A16Fp4 (#18000)
mgoin May 13, 2025
d67085c
Remove noisy warnings from `SchedulerConfig` (#17995)
hmellor May 13, 2025
f6518b2
[ROCm] Skip tests for quantizations incompatible with ROCm (#17905)
hissu-hyvarinen May 13, 2025
60f7624
Implements dual-chunk-flash-attn backend for dual chunk attention wit…
sighingnow May 13, 2025
c06af9a
[Misc] Slight spelling modification (#18039)
jeejeelee May 13, 2025
d8487ef
[ROCm]: Fix build from source failure with gcc14 and ROCm 6.3 (#13779)
arjunkathuria May 13, 2025
1df491c
[Bugfix] Fixes for new marlin moe usage (#18017)
mgoin May 13, 2025
61e0a50
[Bugfix] Avoid repeatedly creating dummy data during engine startup (…
DarkLight1337 May 13, 2025
dc1a821
[Feature][V1] Support `tool_choice: required` when using Xgrammar as…
chaunceyjiang May 13, 2025
4854572
cleanup invalid prints (#18050)
calvin0327 May 13, 2025
ee5be83
[BugFix] Fix 4-GPU RLHF tests (#18007)
njhill May 13, 2025
e57e4d6
Fix Broken macro for cutlass moe (#18049)
drisspg May 13, 2025
f0d610a
[v1][KVCacheManager] Avoid full cache hit by controlling max_length (…
heheda12345 May 13, 2025
8dd0671
[Bugfix][V1] Only get input embeddings w/ multi-modal models if first…
jinhuang12 May 13, 2025
2ff297d
[BugFix] Set default random seed to 0 for V1 (#17929)
WoosukKwon May 13, 2025
ea6ae8c
[Bugfix] Fix marlin moe fallback logic for llama4 (#18042)
mgoin May 13, 2025
23b3134
[Benchmarks] Refactor run_structured_output_benchmarks.sh (#17722)
russellb May 13, 2025
98fcba1
Convert `.buildkite` to `ruff format` (#17656)
hmellor May 13, 2025
cb528d0
[Fix] check to make sure processor has chat templates (#18047)
aarnphm May 13, 2025
906f059
[doc] add download/list/delete HF model CLI usage (#17940)
reidliu41 May 13, 2025
6223dd8
Update deprecated type hinting in `model_executor/layers` (#18056)
hmellor May 13, 2025
ff334ca
Update deprecated type hinting in `vllm/profiler` (#18057)
hmellor May 13, 2025
8c946ce
Update deprecated type hinting in `vllm/transformers_utils` (#18058)
hmellor May 13, 2025
9944011
[CI] Set token permissions for reminder comment CI job (#17728)
russellb May 13, 2025
79a1d25
[CI] Add workflow permissions for helm CI job (#17727)
russellb May 13, 2025
54e467e
[CI] Add token permissions for add-ready-label CI job (#17730)
russellb May 13, 2025
00b14e0
[CI] set token permissions for pre-commit CI job (#17729)
russellb May 13, 2025
b922c2e
[Bugfix] Fix entrypoints metrics tests (#18063)
DarkLight1337 May 13, 2025
009d9e7
Convert `benchmarks` to `ruff format` (#18068)
hmellor May 13, 2025
fc407a1
Give auto-merge label workflow permission to add labels to issues (#1…
hmellor May 13, 2025
19324d6
Update deprecated type hinting in `vllm/compilation` (#18072)
hmellor May 13, 2025
0b217da
Update deprecated type hinting in `vllm/adapter_commons` (#18073)
hmellor May 13, 2025
55aa7af
[V1] DP scale-out (2/N): Decouple engine process management and comms…
njhill May 13, 2025
0189a65
[Docs] Expand security doc with firewall info (#18081)
russellb May 13, 2025
40de1ef
[FEAT] [ROCm]: Add AITER Block-Scaled GEMM Feature (#14968)
vllmellm May 14, 2025
f2ae883
[v1][KVCacheManager] pass num_new_computed_tokens to kv cache manager…
heheda12345 May 14, 2025
176a95c
[Fix] Support CUDAGraph capture for encoder-decoder on ROCm (#18104)
ProExpertProg May 14, 2025
65f0f74
[Hardware/NVIDIA/Modelopt] Fix modelopt forward method for v1 torch.c…
pavanimajety May 14, 2025
d5af47a
[P/D] Add some more debug logs to `NixlConnector` (#18102)
njhill May 14, 2025
6e27c6d
[Misc] Remove unused numpy tensor (#18084)
May 14, 2025
754b699
[Bug]: Fix S3 model/tokenizer path resolution (#18083)
gilljon May 14, 2025
6266c57
[core][distributed] add ep group and all2all interface (#18077)
youkaichao May 14, 2025
9a2a635
[Bugfix] Fix FP8 Marlin MoE and enable for compressed-tensors models …
mgoin May 14, 2025
12e6c0b
[Bugfix][V1] Fix FlashInfer V1 backend using the wrong VllmConfig (#1…
mgoin May 14, 2025
2d912fb
[FEAT] [ROCm] [V1]: Add AITER biased group topk for DeepSeekV3 (#17955)
vllmellm May 14, 2025
7b2f28d
[AMD][torch.compile] Enable silu+fp8_quant fusion for rocm (#18082)
charlifu May 14, 2025
4f8b373
[BugFix][AMD] Compatible patch for AITER lib after 04/20 (#17912)
qli88 May 14, 2025
3301131
Fix broken example: examples/offline_inference/profiling at scheduler…
Ecthlion May 14, 2025
6685890
[Fix] Move "model_config" as keyword args in chat_utils.py (#18098)
lk-chen May 14, 2025
d4154c3
[Bugfix] fix moe marlin `topk_weight` loading (#18080)
jinzhen-lin May 14, 2025
e7ef61c
[Bugfix][Example] make lmcache v0 work. (#18051)
majianpeng May 14, 2025
63ad622
[New Model]: support GTE NewModel (#17986)
noooop May 14, 2025
8f5dc41
[Bugfix] Fix entrypoints audio test failure (#18111)
DarkLight1337 May 14, 2025
63dc342
[Model] Add packed_modules_mapping for Qwen3-MOE (#18118)
jeejeelee May 14, 2025
82e7f9b
[Misc] replace does not exist model (#18119)
lengrongfu May 14, 2025
38fe728
[Bugfix] Fix QKVCrossParallelLinear::sync_weight_attrs for PyTorch co…
anko-intel May 14, 2025
612c2ed
[FEAT] [ROCm]: Add AITER CK 2 Stages MoE support (#17110)
tjtanaa May 14, 2025
259127f
[Bugfix] Fix LoRA test (#18123)
jeejeelee May 14, 2025
d62a076
[Model] GritLM supports other attention backends (#18109)
DarkLight1337 May 14, 2025
9ccc6de
[doc] add missing import (#18133)
reidliu41 May 14, 2025
9b5b39b
Update deprecated type hinting in `vllm/lora` (#18128)
hmellor May 14, 2025
dc372b9
Update deprecated type hinting in `vllm/device_allocator` and `vllm/d…
hmellor May 14, 2025
c8ea982
Update deprecated type hinting in `platform`, `plugins`, `triton_util…
hmellor May 14, 2025
d066e52
[Bugfix] Fix chat utils tests (#18139)
DarkLight1337 May 14, 2025
59dd311
[KVConnector] Keep KVTransferParams as a dict (#18033)
njhill May 14, 2025
964472b
[Doc] Update prefix cache metrics to counting tokens (#18138)
heheda12345 May 14, 2025
418d2f8
[V1][Spec Decode] Share input embedding of target model with EAGLE dr…
ekagra-ranjan May 14, 2025
f9c069c
Modularize fused experts and integrate PPLX kernels (#15956)
bnellnm May 14, 2025
8568650
[CI] Disable Failing Tests (#18165)
robertgshaw2-redhat May 14, 2025
749f792
[Frontend] decrease import time of vllm.multimodal (#18031)
davidxia May 14, 2025
d93c976
[Kernel] Have rotary embeddings support tensors (#18046)
LucasWilkinson May 14, 2025
2fc9075
[V1] Structured Outputs + Thinking compatibility (#16577)
aarnphm May 14, 2025
7974736
Add support for loading torchao models with `AOPerModuleConfig` (#17826)
jerryzh168 May 14, 2025
78aa341
[CI] Fix race condition in test_kv_cache_events test (#18169)
russellb May 14, 2025
2142035
[V1] Support multiple kv connectors (#17564)
mgoin May 14, 2025
09f106a
Upload vllm index for the rc builds (#18173)
atalman May 14, 2025
f25e0d1
[Bugfix]: make most of `test_openai_schema.py` pass (#17664)
davidxia May 15, 2025
e60f550
[v1] Support multiple KV cache groups in GPU model runner (#17945)
heheda12345 May 15, 2025
65334ef
[V1][Metrics] Remove unused code (#18158)
markmc May 15, 2025
afe3236
[Chore] astral's ty (#18116)
aarnphm May 15, 2025
2dff093
[Misc] add lobe-chat support (#18177)
reidliu41 May 15, 2025
83f74c6
[Fix][ROCm] Enforce eager for all encoder-decoder models on ROCm (#18…
ProExpertProg May 15, 2025
26d0419
Update deprecated type hinting in `models` (#18132)
hmellor May 15, 2025
e6b8e65
[Bugfix] Fix fp8 tests for triton_unified_attention for Triton 3.3 (#…
tdoublep May 15, 2025
4f07a64
Support custom implementations of VideoLoader backends. (#18091)
huachenheli May 15, 2025
420caf7
[UT] Add ut for none hash (#17892)
andyxning May 15, 2025
dd2a945
[Model] Allow the use of sliding window in Qwen2 (#17772)
inkcherry May 15, 2025
70f8b96
[Bugfix] Fix FusedMoEPrepareAndFinalize for cuda-disalike backends (#…
MengqingCao May 15, 2025
de71fec
[CI] don't skip fixed `test_kv_cache_events()` (#18183)
davidxia May 15, 2025
a8f5aec
[V1] Update zmq socket creation in nixl connector (#18148)
russellb May 15, 2025
a9944aa
fix: typos (#18151)
omahs May 15, 2025
07ad271
Update deprecated type hinting in `model_loader` (#18130)
hmellor May 15, 2025
451da4b
add tools into TokenizeChatRequest (#18187)
hustxiayang May 15, 2025
01c2233
[Kernel] [V1] Fix performance regression for triton unified attention…
tdoublep May 15, 2025
566ec04
Adding "Basic Models Test" and "Multi-Modal Models Test (Extended) 3"…
Alexei-V-Ivanov-AMD May 15, 2025
51ff154
Improve examples rendering in docs and GitHub (#18203)
hmellor May 15, 2025
2aa5470
[Frontend] Fix chat template content format detection (#18190)
schoennenbeck May 15, 2025
fadb8d5
[Bugfix]Change the exception thrown by call_hf_processor from Runtime…
Abatom May 15, 2025
9254052
[Bugfix] [ROCm]: Remove assertion logic when using AITER fused moe in…
tjtanaa May 15, 2025
e3f3aee
[Misc] Avoid cuda graph log when sizes still match (#18202)
NickLucche May 15, 2025
0b34593
Adding "AMD: Tensorizer Test" to amdproduction. (#18216)
Alexei-V-Ivanov-AMD May 15, 2025
8795eb9
[Bugfix] Fix test_eagle test (#18223)
luccafong May 15, 2025
c7852a6
[Build] Allow shipping PTX on a per-file basis (#18155)
LucasWilkinson May 15, 2025
4e1c6a0
[Bugfix] fix rotary embedding test for _get_padded_tensor_shape (#18229)
LucasWilkinson May 16, 2025
ee659e3
[Bugfix][ROCm] Use `chunked_prefill_paged_decode` as fallback for V1 …
kliuae May 16, 2025
f4937a5
[Model] vLLM v1 supports Medusa (#17956)
skylee-01 May 16, 2025
b18201f
Allow users to pass arbitrary JSON keys from CLI (#18208)
hmellor May 16, 2025
6b31c84
Throw better error for when running into k8s service discovery issue …
wseaton May 16, 2025
3d2779c
[Feature] Support Pipeline Parallelism in torchrun SPMD offline inferen…
luccafong May 16, 2025
5c04bb8
[doc] fix multimodal example script (#18089)
davidxia May 16, 2025
67da572
[PERF] Speed up Qwen2.5-VL model by speed up rotary position embeddin…
vadiklyutiy May 16, 2025
5418176
[Misc] Add Ray Prometheus logger to V1 (#17925)
eicherseiji May 16, 2025
390ec88
[Misc] Consolidate Audio tests into multimodal common generation test…
Isotr0py May 16, 2025
e23564c
use ceil_div in cutlass block scaling shape check (#17918)
IwakuraRein May 16, 2025
a5f8c11
[Fix] Fix typo in `resolve_hf_chat_template` (#18259)
fxmarty-amd May 16, 2025
87d8714
[Model] Use autoweightloader for dbrx (#18251)
learner0810 May 16, 2025
d3d91b6
[Misc][MacOS] fix bfloat16 error (#18249)
reidliu41 May 16, 2025
1db4f47
[BugFix] Fix multi async save in MultiConnector (#18246)
njhill May 16, 2025
0ceaebf
[BugFix] Fix ordering of KVConnector finished send/rcv sets (#18211)
njhill May 16, 2025
aef94c6
[CI] Assign reviewer to mergify with changes to Tensorizer files (#18…
sangstar May 16, 2025
7fdfa01
[Sampler] Adapt to FlashInfer 0.2.3 sampler API (#15777)
abmfy May 16, 2025
e73b7df
[Bugfix] fix `an illegal memory access was encountered` of marlin ker…
jinzhen-lin May 16, 2025
fabe89b
[Spec Decode] Don't fall back to V0 when spec decoding is enabled (#1…
WoosukKwon May 16, 2025
fd195b1
[V1][P/D] Local attention optimization for NIXL (#18170)
mgoin May 17, 2025
c1f89fe
metric hack
ekagra-ranjan May 17, 2025
20 changes: 12 additions & 8 deletions .buildkite/check-wheel-size.py
@@ -8,12 +8,12 @@
# Note that we have 400 MiB quota, please use it wisely.
# See https://github.com/pypi/support/issues/3792 .
# Please also sync the value with the one in Dockerfile.
VLLM_MAX_SIZE_MB = int(os.environ.get('VLLM_MAX_SIZE_MB', 400))
VLLM_MAX_SIZE_MB = int(os.environ.get("VLLM_MAX_SIZE_MB", 400))


def print_top_10_largest_files(zip_file):
"""Print the top 10 largest files in the given zip file."""
with zipfile.ZipFile(zip_file, 'r') as z:
with zipfile.ZipFile(zip_file, "r") as z:
file_sizes = [(f, z.getinfo(f).file_size) for f in z.namelist()]
file_sizes.sort(key=lambda x: x[1], reverse=True)
for f, size in file_sizes[:10]:
@@ -28,14 +28,18 @@ def check_wheel_size(directory):
wheel_path = os.path.join(root, file_name)
wheel_size_mb = os.path.getsize(wheel_path) / (1024 * 1024)
if wheel_size_mb > VLLM_MAX_SIZE_MB:
print(f"Not allowed: Wheel {wheel_path} is larger "
f"({wheel_size_mb:.2f} MB) than the limit "
f"({VLLM_MAX_SIZE_MB} MB).")
print(
f"Not allowed: Wheel {wheel_path} is larger "
f"({wheel_size_mb:.2f} MB) than the limit "
f"({VLLM_MAX_SIZE_MB} MB)."
)
print_top_10_largest_files(wheel_path)
return 1
else:
print(f"Wheel {wheel_path} is within the allowed size "
f"({wheel_size_mb:.2f} MB).")
print(
f"Wheel {wheel_path} is within the allowed size "
f"({wheel_size_mb:.2f} MB)."
)
return 0


@@ -45,4 +49,4 @@ def check_wheel_size(directory):
sys.exit(1)

directory = sys.argv[1]
sys.exit(check_wheel_size(directory))
sys.exit(check_wheel_size(directory))
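For context (not part of the diff): the reformatting above does not change behaviour, so the script is still driven the same way. A minimal sketch, assuming `dist/` as the wheel output directory:

```python
# Hedged sketch: invoking the wheel-size check; "dist/" is a placeholder path.
import os
import subprocess

# VLLM_MAX_SIZE_MB overrides the default limit of 400 read at module import.
env = dict(os.environ, VLLM_MAX_SIZE_MB="400")
result = subprocess.run(
    ["python", ".buildkite/check-wheel-size.py", "dist/"],
    env=env,
)
# The script prints the ten largest files inside an oversized wheel and exits
# with status 1, so CI can fail the step on a non-zero return code.
if result.returncode != 0:
    raise SystemExit("wheel exceeds VLLM_MAX_SIZE_MB")
```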
4 changes: 2 additions & 2 deletions .buildkite/generate_index.py
@@ -22,5 +22,5 @@
print(f"Generated index.html for {args.wheel}")
# cloudfront requires escaping the '+' character
f.write(
template.format(wheel=filename,
wheel_html_escaped=filename.replace("+", "%2B")))
template.format(wheel=filename, wheel_html_escaped=filename.replace("+", "%2B"))
)
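A tiny illustration (mine, not from the diff) of the escaping in the reformatted call: CloudFront requires the `+` in a wheel's local version tag to be percent-encoded in the link target, while the visible text keeps the original filename.

```python
# Illustrative only; the filename is a made-up example with a local version tag.
filename = "vllm-0.9.0+cu128-cp38-abi3-manylinux1_x86_64.whl"
print(filename.replace("+", "%2B"))
# vllm-0.9.0%2Bcu128-cp38-abi3-manylinux1_x86_64.whl
```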
@@ -1,3 +1,4 @@
# For vllm script, with -t option (tensor parallel size).
# bash ./run-lm-eval-gsm-vllm-baseline.sh -m deepseek-ai/DeepSeek-V2-Lite-Chat -b "auto" -l 1000 -f 5 -t 2
model_name: "deepseek-ai/DeepSeek-V2-Lite-Chat"
tasks:
@@ -1,3 +1,4 @@
# For hf script, without -t option (tensor parallel size).
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh -m nm-testing/Meta-Llama-3-70B-Instruct-FBGEMM-nonuniform -b auto -l 1000 -f 5
model_name: "nm-testing/Meta-Llama-3-70B-Instruct-FBGEMM-nonuniform"
tasks:
@@ -1,3 +1,4 @@
# For hf script, without -t option (tensor parallel size).
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh -m meta-llama/Meta-Llama-3-70B-Instruct -b 32 -l 250 -f 5
model_name: "meta-llama/Meta-Llama-3-70B-Instruct"
tasks:
@@ -1,3 +1,4 @@
# For vllm script, with -t option (tensor parallel size).
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Meta-Llama-3-8B-Instruct-W8A8-FP8-Channelwise-compressed-tensors -b auto -l 1000 -f 5 -t 1
model_name: "nm-testing/Meta-Llama-3-8B-Instruct-W8A8-FP8-Channelwise-compressed-tensors"
tasks:
@@ -1,3 +1,4 @@
# For vllm script, with -t option (tensor parallel size).
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Meta-Llama-3-8B-Instruct-FBGEMM-nonuniform -b auto -l 1000 -f 5 -t 1
model_name: "nm-testing/Meta-Llama-3-8B-Instruct-FBGEMM-nonuniform"
tasks:
@@ -1,3 +1,4 @@
# For vllm script, with -t option (tensor parallel size).
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Meta-Llama-3-8B-FP8-compressed-tensors-test -b 32 -l 1000 -f 5 -t 1
model_name: "nm-testing/Meta-Llama-3-8B-FP8-compressed-tensors-test"
tasks:
@@ -1,3 +1,4 @@
# For vllm script, with -t option (tensor parallel size).
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m neuralmagic/Meta-Llama-3-8B-Instruct-FP8 -b 32 -l 250 -f 5 -t 1
model_name: "neuralmagic/Meta-Llama-3-8B-Instruct-FP8"
tasks:
@@ -1,3 +1,4 @@
# For vllm script, with -t option (tensor parallel size).
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Meta-Llama-3-8B-Instruct-W8-Channel-A8-Dynamic-Asym-Per-Token-Test -b "auto" -l 250 -f 5 -t 1
model_name: "nm-testing/Meta-Llama-3-8B-Instruct-W8-Channel-A8-Dynamic-Asym-Per-Token-Test"
tasks:
@@ -1,3 +1,4 @@
# For vllm script, with -t option (tensor parallel size).
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Meta-Llama-3-8B-Instruct-W8-Channel-A8-Dynamic-Per-Token-Test -b "auto" -l 250 -f 5 -t 1
model_name: "nm-testing/Meta-Llama-3-8B-Instruct-W8-Channel-A8-Dynamic-Per-Token-Test"
tasks:
@@ -1,3 +1,4 @@
# For vllm script, with -t option (tensor parallel size).
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Meta-Llama-3-8B-Instruct-nonuniform-test -b auto -l 1000 -f 5 -t 1
model_name: "nm-testing/Meta-Llama-3-8B-Instruct-nonuniform-test"
tasks:
@@ -1,4 +1,5 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh -m meta-llama/Meta-Llama-3-8B-Instruct -b 32 -l 250 -f 5 -t 1
# For hf script, without -t option (tensor parallel size).
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh -m meta-llama/Meta-Llama-3-8B-Instruct -b 32 -l 250 -f 5
model_name: "meta-llama/Meta-Llama-3-8B-Instruct"
tasks:
- name: "gsm8k"
@@ -1,3 +1,4 @@
# For vllm script, with -t option (tensor parallel size).
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m HandH1998/QQQ-Llama-3-8b-g128 -b 32 -l 1000 -f 5 -t 1
model_name: "HandH1998/QQQ-Llama-3-8b-g128"
tasks:
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m RedHatAI/Llama-3.2-1B-Instruct-FP8 -b "auto" -l 1319 -f 5 -t 1
model_name: "RedHatAI/Llama-3.2-1B-Instruct-FP8"
tasks:
- name: "gsm8k"
metrics:
- name: "exact_match,strict-match"
value: 0.335
- name: "exact_match,flexible-extract"
value: 0.323
limit: 1319
num_fewshot: 5
@@ -1,3 +1,4 @@
# For vllm script, with -t option (tensor parallel size).
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m neuralmagic/Llama-3.2-1B-Instruct-quantized.w8a8 -b "auto" -l 1000 -f 5 -t 1
model_name: "neuralmagic/Llama-3.2-1B-Instruct-quantized.w8a8"
tasks:
@@ -1,3 +1,4 @@
# For vllm script, with -t option (tensor parallel size).
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m mgoin/Minitron-4B-Base-FP8 -b auto -l 1000 -f 5 -t 1
model_name: "mgoin/Minitron-4B-Base-FP8"
tasks:
@@ -1,3 +1,4 @@
# For vllm script, with -t option (tensor parallel size).
# bash ./run-lm-eval-gsm-vllm-baseline.sh -m neuralmagic/Mixtral-8x22B-Instruct-v0.1-FP8-dynamic -b "auto" -l 250 -f 5 -t 8
model_name: "neuralmagic/Mixtral-8x22B-Instruct-v0.1-FP8-dynamic"
tasks:
@@ -1,3 +1,4 @@
# For vllm script, with -t option (tensor parallel size).
# bash ./run-lm-eval-gsm-vllm-baseline.sh -m neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8 -b "auto" -l 250 -f 5 -t 4
model_name: "neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8"
tasks:
@@ -1,4 +1,5 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh -m neuralmagic/Mixtral-8x7B-Instruct-v0.1 -b 32 -l 250 -f 5 -t 4
# For hf script, without -t option (tensor parallel size).
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh -m neuralmagic/Mixtral-8x7B-Instruct-v0.1 -b 32 -l 250 -f 5
model_name: "mistralai/Mixtral-8x7B-Instruct-v0.1"
tasks:
- name: "gsm8k"
@@ -0,0 +1,12 @@
# For vllm script, with -t option (tensor parallel size).
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Qwen1.5-MoE-A2.7B-Chat-quantized.w4a16 -b auto -l 1319 -f 5 -t 1
model_name: "nm-testing/Qwen1.5-MoE-A2.7B-Chat-quantized.w4a16"
tasks:
- name: "gsm8k"
metrics:
- name: "exact_match,strict-match"
value: 0.30
- name: "exact_match,flexible-extract"
value: 0.465
limit: 1319
num_fewshot: 5
@@ -1,3 +1,4 @@
# For vllm script, with -t option (tensor parallel size).
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Qwen2-1.5B-Instruct-FP8W8 -b auto -l 1000 -f 5 -t 1
model_name: "nm-testing/Qwen2-1.5B-Instruct-FP8W8"
tasks:
@@ -1,3 +1,4 @@
# For vllm script, with -t option (tensor parallel size).
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m neuralmagic/Qwen2-1.5B-Instruct-quantized.w8a8 -b "auto" -l 1000 -f 5 -t 1
model_name: "neuralmagic/Qwen2-1.5B-Instruct-quantized.w8a8"
tasks:
@@ -1,3 +1,4 @@
# For vllm script, with -t option (tensor parallel size).
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Qwen2-1.5B-Instruct-W8A16-Channelwise -b "auto" -l 1000 -f 5 -t 1
model_name: "nm-testing/Qwen2-1.5B-Instruct-W8A16-Channelwise"
tasks:
@@ -1,3 +1,4 @@
# For vllm script, with -t option (tensor parallel size).
# bash ./run-lm-eval-gsm-vllm-baseline.sh -m Qwen/Qwen2-57B-A14B-Instruct -b "auto" -l 250 -f 5 -t 4
model_name: "Qwen/Qwen2-57B-A14B-Instruct"
tasks:
11 changes: 11 additions & 0 deletions .buildkite/lm-eval-harness/configs/Qwen2.5-1.5B-Instruct.yaml
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m Qwen/Qwen2.5-1.5B-Instruct -b auto -l 1319 -f 5 -t 1
model_name: "Qwen/Qwen2.5-1.5B-Instruct"
tasks:
- name: "gsm8k"
metrics:
- name: "exact_match,strict-match"
value: 0.54
- name: "exact_match,flexible-extract"
value: 0.59
limit: 1319
num_fewshot: 5
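These baselines are plain YAML; a short sketch (mine, not part of the PR) of what the file just added parses into, with field meanings inferred from the bash command in its comment (`-l` maps to `limit`, `-f` to `num_fewshot`):

```python
# Hedged sketch: reading the lm-eval baseline config added above.
import yaml

with open(".buildkite/lm-eval-harness/configs/Qwen2.5-1.5B-Instruct.yaml") as f:
    cfg = yaml.safe_load(f)

print(cfg["model_name"])   # Qwen/Qwen2.5-1.5B-Instruct
for task in cfg["tasks"]:  # one entry per lm-eval task, here just gsm8k
    for metric in task["metrics"]:
        # "value" is the expected score (0.54 strict-match, 0.59 flexible-extract)
        print(task["name"], metric["name"], metric["value"])
```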
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m RedHatAI/Qwen2.5-VL-3B-Instruct-FP8-Dynamic -b auto -l 1319 -f 5 -t 1
model_name: "RedHatAI/Qwen2.5-VL-3B-Instruct-FP8-Dynamic"
tasks:
- name: "gsm8k"
metrics:
- name: "exact_match,strict-match"
value: 0.47
- name: "exact_match,flexible-extract"
value: 0.64
limit: 1319
num_fewshot: 5
@@ -1,3 +1,4 @@
# For vllm script, with -t option (tensor parallel size).
# bash ./run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/SparseLlama-3.1-8B-gsm8k-pruned.2of4-chnl_wts_per_tok_dyn_act_fp8-BitM -b "auto" -t 2
model_name: "nm-testing/SparseLlama-3.1-8B-gsm8k-pruned.2of4-chnl_wts_per_tok_dyn_act_fp8-BitM"
tasks:
1 change: 1 addition & 0 deletions .buildkite/lm-eval-harness/configs/models-large.txt
@@ -3,3 +3,4 @@ Meta-Llama-3-70B-Instruct.yaml
Mixtral-8x7B-Instruct-v0.1.yaml
Qwen2-57B-A14-Instruct.yaml
DeepSeek-V2-Lite-Chat.yaml
Meta-Llama-3-8B-QQQ.yaml
10 changes: 3 additions & 7 deletions .buildkite/lm-eval-harness/configs/models-small.txt
@@ -1,10 +1,6 @@
Meta-Llama-3-8B-Instruct.yaml
Meta-Llama-3-8B-Instruct-FP8-compressed-tensors.yaml
Qwen2.5-1.5B-Instruct.yaml
Meta-Llama-3.2-1B-Instruct-INT8-compressed-tensors.yaml
Meta-Llama-3-8B-Instruct-INT8-compressed-tensors-asym.yaml
Meta-Llama-3-8B-Instruct-nonuniform-compressed-tensors.yaml
Meta-Llama-3-8B-Instruct-Channelwise-compressed-tensors.yaml
Minitron-4B-Base-FP8.yaml
Qwen2-1.5B-Instruct-INT8-compressed-tensors.yaml
Qwen2-1.5B-Instruct-FP8W8.yaml
Meta-Llama-3-8B-QQQ.yaml
Qwen2.5-VL-3B-Instruct-FP8-dynamic.yaml
Qwen1.5-MoE-W4A16-compressed-tensors.yaml
43 changes: 43 additions & 0 deletions .buildkite/lm-eval-harness/conftest.py
@@ -0,0 +1,43 @@
# SPDX-License-Identifier: Apache-2.0
from pathlib import Path

import pytest


def pytest_addoption(parser):
    parser.addoption(
        "--config-list-file",
        action="store",
        help="Path to the file listing model config YAMLs (one per line)",
    )
    parser.addoption(
        "--tp-size",
        action="store",
        default="1",
        help="Tensor parallel size to use for evaluation",
    )


@pytest.fixture(scope="session")
def config_list_file(pytestconfig, config_dir):
    rel_path = pytestconfig.getoption("--config-list-file")
    return config_dir / rel_path


@pytest.fixture(scope="session")
def tp_size(pytestconfig):
    return pytestconfig.getoption("--tp-size")


def pytest_generate_tests(metafunc):
    if "config_filename" in metafunc.fixturenames:
        rel_path = metafunc.config.getoption("--config-list-file")
        config_list_file = Path(rel_path).resolve()
        config_dir = config_list_file.parent
        with open(config_list_file, encoding="utf-8") as f:
            configs = [
                config_dir / line.strip()
                for line in f
                if line.strip() and not line.startswith("#")
            ]
        metafunc.parametrize("config_filename", configs)
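A hedged sketch of how a test could consume these fixtures; the actual test module added by this PR is not shown here, and `launch_lm_eval` plus the 0.05 tolerance are illustrative assumptions.

```python
# Hedged sketch only: the fixture/parametrize plumbing in conftest.py above is
# real, but the helper and tolerance below are illustrative, not from this PR.
# Assumed invocation:
#   pytest -s -x .buildkite/lm-eval-harness/ \
#       --config-list-file=configs/models-small.txt --tp-size=1
import numpy as np
import yaml


def launch_lm_eval(eval_config, tp_size):
    """Placeholder: run lm-eval against vLLM and return its results dict."""
    raise NotImplementedError


def test_lm_eval_correctness(config_filename, tp_size):
    eval_config = yaml.safe_load(config_filename.read_text(encoding="utf-8"))
    results = launch_lm_eval(eval_config, tp_size)
    for task in eval_config["tasks"]:
        for metric in task["metrics"]:
            ground_truth = metric["value"]
            measured = results["results"][task["name"]][metric["name"]]
            assert np.isclose(ground_truth, measured, rtol=0.05)
```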
59 changes: 0 additions & 59 deletions .buildkite/lm-eval-harness/run-tests.sh

This file was deleted.
