This repository was archived by the owner on Sep 4, 2025. It is now read-only.
Merged
238 commits
dfb1a15
[ci][frontend] deduplicate tests (#7101)
youkaichao Aug 5, 2024
789937a
[Doc] [SpecDecode] Update MLPSpeculator documentation (#7100)
tdoublep Aug 5, 2024
89b8db6
[Bugfix] Specify device when loading LoRA and embedding tensors (#7129)
jischein Aug 5, 2024
ef527be
[MISC] Use non-blocking transfer in prepare_input (#7172)
comaniac Aug 5, 2024
360bd67
[Core] Support loading GGUF model (#5191)
Isotr0py Aug 5, 2024
e3c664b
[Build] Add initial conditional testing spec (#6841)
simon-mo Aug 6, 2024
9118217
[LoRA] Relax LoRA condition (#7146)
jeejeelee Aug 6, 2024
1f26efb
[Model] Support SigLIP encoder and alternative decoders for LLaVA mod…
DarkLight1337 Aug 6, 2024
a3bbbfa
[BugFix] Fix DeepSeek remote code (#7178)
dsikka Aug 6, 2024
541c185
[ BugFix ] Fix ZMQ when `VLLM_PORT` is set (#7205)
robertgshaw2-redhat Aug 6, 2024
00afc78
[Bugfix] add gguf dependency (#7198)
kpapis Aug 6, 2024
5c60c8c
[SpecDecode] [Minor] Fix spec decode sampler tests (#7183)
LiuXiaoxuanPKU Aug 6, 2024
8d59dbb
[Kernel] Add per-tensor and per-token AZP epilogues (#5941)
ProExpertProg Aug 6, 2024
660470e
[Core] Optimize evictor-v2 performance (#7193)
xiaobochen123 Aug 6, 2024
fd95e02
[Core] Subclass ModelRunner to support cross-attention & encoder sequ…
afeldman-nm Aug 6, 2024
f9a5600
[Bugfix] Fix GPTQ and GPTQ Marlin CPU Offloading (#7225)
mgoin Aug 7, 2024
9a3f49a
[BugFix] Overhaul async request cancellation (#7111)
njhill Aug 7, 2024
2385c8f
[Doc] Mock new dependencies for documentation (#7245)
ywang96 Aug 7, 2024
7b26109
[BUGFIX]: top_k is expected to be an integer. (#7227)
Atllkks10 Aug 7, 2024
66d617e
[Frontend] Gracefully handle missing chat template and fix CI failure…
DarkLight1337 Aug 7, 2024
639159b
[distributed][misc] add specialized method for cuda platform (#7249)
youkaichao Aug 7, 2024
0f7052b
[Misc] Refactor linear layer weight loading; introduce `BasevLLMParam…
dsikka Aug 7, 2024
5649857
[ BugFix ] Move `zmq` frontend to IPC instead of TCP (#7222)
robertgshaw2-redhat Aug 7, 2024
ab0f5e2
Fixes typo in function name (#7275)
rafvasq Aug 7, 2024
b764547
[Bugfix] Fix input processor for InternVL2 model (#7164)
Isotr0py Aug 7, 2024
80cbe10
[OpenVINO] migrate to latest dependencies versions (#7251)
ilya-lavrenov Aug 7, 2024
0e12cd6
[Doc] add online speculative decoding example (#7243)
stas00 Aug 7, 2024
fde47d3
[BugFix] Fix frontend multiprocessing hang (#7217)
maxdebayser Aug 7, 2024
5223199
[Bugfix][FP8] Fix dynamic FP8 Marlin quantization (#7219)
mgoin Aug 7, 2024
469b3bc
[ci] Make building wheels per commit optional (#7278)
khluu Aug 7, 2024
311f743
[Bugfix] Fix gptq failure on T4s (#7264)
LucasWilkinson Aug 7, 2024
fc1493a
[FrontEnd] Make `merge_async_iterators` `is_cancelled` arg optional (…
njhill Aug 7, 2024
6d94420
[Doc] Update supported_hardware.rst (#7276)
mgoin Aug 7, 2024
e53dfd3
[Kernel] Fix Flashinfer Correctness (#7284)
LiuXiaoxuanPKU Aug 7, 2024
7467096
[Misc] Fix typos in scheduler.py (#7285)
ruisearch42 Aug 8, 2024
48abee9
[Frontend] remove max_num_batched_tokens limit for lora (#7288)
NiuBlibing Aug 8, 2024
6dffa4b
[Bugfix] Fix LoRA with PP (#7292)
andoorve Aug 8, 2024
757ac70
[Model] Rename MiniCPMVQwen2 to MiniCPMV2.6 (#7273)
jeejeelee Aug 8, 2024
5fb4a3f
[Bugfix][Kernel] Increased atol to fix failing tests (#7305)
ProExpertProg Aug 8, 2024
21b9c49
[Frontend] Kill the server on engine death (#6594)
joerunde Aug 8, 2024
782e53a
[Bugfix][fast] Fix the get_num_blocks_touched logic (#6849)
zachzzc Aug 8, 2024
e14fb22
[Doc] Put collect_env issue output in a <detail> block (#7310)
mgoin Aug 8, 2024
e904576
[CI/Build] Dockerfile.cpu improvements (#7298)
dtrifiro Aug 8, 2024
8334c39
[Bugfix] Fix new Llama3.1 GGUF model loading (#7269)
Isotr0py Aug 8, 2024
a049b10
[Misc] Temporarily resolve the error of BitAndBytes (#7308)
jeejeelee Aug 8, 2024
5923532
Add Skywork AI as Sponsor (#7314)
simon-mo Aug 8, 2024
0fa1490
[TPU] Add Load-time W8A16 quantization for TPU Backend (#7005)
lsy323 Aug 9, 2024
7eb4a51
[Core] Support serving encoder/decoder models (#7258)
DarkLight1337 Aug 9, 2024
73388c0
[TPU] Fix dockerfile.tpu (#7331)
WoosukKwon Aug 9, 2024
e02ac55
[Performance] Optimize e2e overheads: Reduce python allocations (#7162)
alexm-redhat Aug 9, 2024
99b4cf5
[Bugfix] Fix speculative decoding with MLPSpeculator with padded voca…
tjohnson31415 Aug 9, 2024
57b7be0
[Speculative decoding] [Multi-Step] decouple should_modify_greedy_pro…
SolitaryThinker Aug 9, 2024
b4e9528
[Core] Streamline stream termination in `AsyncLLMEngine` (#7336)
njhill Aug 9, 2024
07ab160
[Model][Jamba] Mamba cache single buffer (#6739)
mzusman Aug 9, 2024
67abdbb
[VLM][Doc] Add `stop_token_ids` to InternVL example (#7354)
Isotr0py Aug 9, 2024
fc7b8d1
[Performance] e2e overheads reduction: Small followup diff (#7364)
alexm-redhat Aug 9, 2024
74af2bb
[Bugfix] Fix reinit procedure in ModelInputForGPUBuilder (#7360)
alexm-redhat Aug 9, 2024
249b882
[Frontend] Support embeddings in the run_batch API (#7132)
pooyadavoodi Aug 9, 2024
70d268a
[Bugfix] Fix ITL recording in serving benchmark (#7372)
ywang96 Aug 9, 2024
933790c
[Core] Add span metrics for model_forward, scheduler and sampler time…
sfc-gh-mkeralapura Aug 9, 2024
5c6c54d
[Bugfix] Fix `PerTensorScaleParameter` weight loading for fused model…
dsikka Aug 9, 2024
999ef0b
[Misc] Add numpy implementation of `compute_slot_mapping` (#7377)
Yard1 Aug 9, 2024
baa2402
[Core] Fix edge case in chunked prefill + block manager v2 (#7380)
cadedaniel Aug 9, 2024
4c5d8e8
[Bugfix] Fix phi3v batch inference when images have different aspect …
Isotr0py Aug 10, 2024
90bab18
[TPU] Use mark_dynamic to reduce compilation time (#7340)
WoosukKwon Aug 11, 2024
4fb7b52
Updating LM Format Enforcer version to v0.10.6 (#7189)
noamgat Aug 11, 2024
c08e2b3
[core] [2/N] refactor worker_base input preparation for multi-step (#…
SolitaryThinker Aug 11, 2024
3860879
[CI/Build] build on empty device for better dev experience (#4773)
tomeras91 Aug 11, 2024
02b1988
[Doc] building vLLM with VLLM_TARGET_DEVICE=empty (#7403)
tomeras91 Aug 11, 2024
6c8e595
[misc] add commit id in collect env (#7405)
youkaichao Aug 11, 2024
f020a62
[Docs] Update readme (#7316)
simon-mo Aug 12, 2024
86ab567
[CI/Build] Minor refactoring for vLLM assets (#7407)
ywang96 Aug 12, 2024
ec2affa
[Kernel] Flashinfer correctness fix for v0.1.3 (#7319)
LiuXiaoxuanPKU Aug 12, 2024
e6e42e4
[Core][VLM] Support image embeddings as input (#6613)
ywang96 Aug 12, 2024
24154f8
[Frontend] Disallow passing `model` as both argument and option (#7347)
DarkLight1337 Aug 12, 2024
d2bc451
[CI/Build] bump Dockerfile.neuron image base, use public ECR (#6832)
dtrifiro Aug 12, 2024
cfba4de
[Bugfix] Fix logit soft cap in flash-attn backend (#7425)
WoosukKwon Aug 12, 2024
65950e8
[ci] Entrypoints run upon changes in vllm/ (#7423)
khluu Aug 12, 2024
9b3e2ed
[ci] Cancel fastcheck run when PR is marked ready (#7427)
khluu Aug 12, 2024
1137f34
[ci] Cancel fastcheck when PR is ready (#7433)
khluu Aug 12, 2024
6aa33cb
[Misc] Use scalar type to dispatch to different `gptq_marlin` kernels…
LucasWilkinson Aug 12, 2024
4ddc474
[Core] Consolidate `GB` constant and enable float GB arguments (#7416)
DarkLight1337 Aug 12, 2024
a046f86
[Core/Bugfix] Add FP8 K/V Scale and dtype conversion for prefix/prefi…
jon-chuang Aug 12, 2024
91294d5
[Bugfix] Handle PackageNotFoundError when checking for xpu version (#…
sasha0552 Aug 12, 2024
774cd1d
[CI/Build] bump minimum cmake version (#6999)
dtrifiro Aug 12, 2024
198d6a2
[Core] Shut down aDAG workers with clean async llm engine exit (#7224)
ruisearch42 Aug 13, 2024
9ba85bc
[mypy] Misc. typing improvements (#7417)
DarkLight1337 Aug 13, 2024
97a6be9
[Misc] improve logits processors logging message (#7435)
aw632 Aug 13, 2024
5469146
[ci] Remove fast check cancel workflow (#7455)
khluu Aug 13, 2024
7025b11
[Bugfix] Fix weight loading for Chameleon when TP>1 (#7410)
DarkLight1337 Aug 13, 2024
4d2dc50
[hardware] unify usage of is_tpu to current_platform.is_tpu() (#7102)
youkaichao Aug 13, 2024
d6e634f
[TPU] Suppress import custom_ops warning (#7458)
WoosukKwon Aug 13, 2024
e20233d
Revert "[Doc] Update supported_hardware.rst (#7276)" (#7467)
WoosukKwon Aug 13, 2024
00c3d68
[Frontend][Core] Add plumbing to support audio language models (#7446)
petersalas Aug 13, 2024
181abbc
[Misc] Update LM Eval Tolerance (#7473)
dsikka Aug 13, 2024
fb377d7
[Misc] Update `gptq_marlin` to use new vLLMParameters (#7281)
dsikka Aug 13, 2024
d3bdfd3
[Misc] Update Fused MoE weight loading (#7334)
dsikka Aug 13, 2024
b1e5afc
[Misc] Update `awq` and `awq_marlin` to use `vLLMParameters` (#7422)
dsikka Aug 13, 2024
c5c7768
Announce NVIDIA Meetup (#7483)
simon-mo Aug 13, 2024
33e5d7e
[frontend] spawn engine process from api server process (#7484)
youkaichao Aug 13, 2024
373538f
[Misc] `compressed-tensors` code reuse (#7277)
kylesayrs Aug 13, 2024
16422ea
[misc][plugin] add plugin system implementation (#7426)
youkaichao Aug 13, 2024
a08df83
[TPU] Support multi-host inference (#7457)
WoosukKwon Aug 13, 2024
59edd0f
[Bugfix][CI] Import ray under guard (#7486)
WoosukKwon Aug 14, 2024
9799280
[CI/Build]Reduce the time consumption for LoRA tests (#7396)
jeejeelee Aug 14, 2024
ea49e6a
[misc][ci] fix cpu test with plugins (#7489)
youkaichao Aug 14, 2024
dd164d7
[Bugfix][Docs] Update list of mock imports (#7493)
DarkLight1337 Aug 14, 2024
199adbb
[doc] update test script to include cudagraph (#7501)
youkaichao Aug 14, 2024
c134a46
Fix empty output when temp is too low (#2937)
CatherineSue Aug 14, 2024
d3d9cb6
[ci] fix model tests (#7507)
youkaichao Aug 14, 2024
67d115d
[Bugfix][Frontend] Disable embedding API for chat models (#7504)
QwertyJack Aug 14, 2024
70b746e
[Misc] Deprecation Warning when setting --engine-use-ray (#7424)
wallashss Aug 14, 2024
3f674a4
[VLM][Core] Support profiling with multiple multi-modal inputs per pr…
DarkLight1337 Aug 14, 2024
2ecf7b1
[core] [3/N] multi-step args and sequence.py (#7452)
SolitaryThinker Aug 14, 2024
951fdd6
[TPU] Set per-rank XLA cache (#7533)
WoosukKwon Aug 14, 2024
f55a9ae
[Misc] Revert `compressed-tensors` code reuse (#7521)
kylesayrs Aug 14, 2024
22b39e1
llama_index serving integration documentation (#6973)
pavanjava Aug 14, 2024
fc93e56
[Bugfix][TPU] Correct env variable for XLA cache path (#7544)
WoosukKwon Aug 15, 2024
9c1f78d
[Bugfix] update neuron for version > 0.5.0 (#7175)
omrishiv Aug 15, 2024
f4da5f7
[Misc] Update dockerfile for CPU to cover protobuf installation (#7182)
PHILO-HE Aug 15, 2024
21313e0
[Bugfix] Fix default weight loading for scalars (#7534)
mgoin Aug 15, 2024
9c8e2d1
[Bugfix][Harmless] Fix float16 dtype for model_is_embedding (#7566)
mgoin Aug 16, 2024
b67ae00
[Misc] Add quantization config support for speculative model. (#7343)
ShangmingCai Aug 16, 2024
f878c8f
[Feature]: Add OpenAI server prompt_logprobs support #6508 (#7453)
gnpinkert Aug 16, 2024
4cd7d47
[ci/test] rearrange tests and make adag test soft fail (#7572)
youkaichao Aug 16, 2024
3b19e39
Chat method for offline llm (#5049)
nunjunj Aug 16, 2024
e165528
[CI] Move quantization cpu offload tests out of fastcheck (#7574)
mgoin Aug 16, 2024
50b8d08
[Misc/Testing] Use `torch.testing.assert_close` (#7324)
jon-chuang Aug 16, 2024
54bd9a0
register custom op for flash attn and use from torch.ops (#7536)
youkaichao Aug 16, 2024
9587b05
[Core] Use uvloop with zmq-decoupled front-end (#7570)
njhill Aug 16, 2024
6fc5b0f
[CI] Fix crashes of performance benchmark (#7500)
KuntaiDu Aug 16, 2024
0e39a33
[Bugfix][Hardware][AMD][Frontend] add quantization param to embedding…
gongdao123 Aug 16, 2024
ec724a7
support tqdm in notebooks (#7510)
fzyzcjy Aug 16, 2024
e837b62
[Feature][Hardware][Amd] Add fp8 Linear Layer for Rocm (#7210)
charlifu Aug 16, 2024
7fc23be
[Kernel] W8A16 Int8 inside FusedMoE (#7415)
mzusman Aug 16, 2024
855866c
[Kernel] Add tuned triton configs for ExpertsInt8 (#7601)
mgoin Aug 16, 2024
f366f63
[spec decode] [4/N] Move update_flash_attn_metadata to attn backend (…
SolitaryThinker Aug 16, 2024
93478b6
[Core] Fix tracking of model forward time in case of PP>1 (#7440)
sfc-gh-mkeralapura Aug 16, 2024
b3f4e17
[Doc] Add docs for llmcompressor INT8 and FP8 checkpoints (#7444)
mgoin Aug 16, 2024
d4f0f17
[Doc] Update quantization supported hardware table (#7595)
mgoin Aug 16, 2024
9f69856
[Kernel] register punica functions as torch ops (#7591)
bnellnm Aug 16, 2024
7759ae9
[Kernel][Misc] dynamo support for ScalarType (#7594)
bnellnm Aug 16, 2024
37fd47e
[Kernel] fix types used in aqlm and ggml kernels to support dynamo (#…
bnellnm Aug 16, 2024
44f26a9
[Model] Align nemotron config with final HF state and fix lm-eval-sma…
mgoin Aug 16, 2024
e680349
[Bugfix] Fix custom_ar support check (#7617)
bnellnm Aug 17, 2024
6bd1955
.[Build/CI] Enabling passing AMD tests. (#7610)
Alexei-V-Ivanov-AMD Aug 17, 2024
bae888c
[Bugfix] Clear engine reference in AsyncEngineRPCServer (#7618)
ruisearch42 Aug 17, 2024
4706eb6
[aDAG] Unflake aDAG + PP tests (#7600)
rkooo567 Aug 17, 2024
7c0b7ea
[Bugfix] add >= 1.0 constraint for openai dependency (#7612)
metasyn Aug 17, 2024
eed020f
[misc] use nvml to get consistent device name (#7582)
youkaichao Aug 17, 2024
5bf45db
[ci][test] fix engine/logger test (#7621)
youkaichao Aug 17, 2024
d95cc0a
[core][misc] update libcudart finding (#7620)
youkaichao Aug 17, 2024
e73f76e
[Model] Pipeline parallel support for JAIS (#7603)
mrbesher Aug 17, 2024
832163b
[ci][test] allow longer wait time for api server (#7629)
youkaichao Aug 17, 2024
1ef13cf
[Misc]Fix BitAndBytes exception messages (#7626)
jeejeelee Aug 17, 2024
bbf55c4
[VLM] Refactor `MultiModalConfig` initialization and profiling (#7530)
ywang96 Aug 17, 2024
ce14335
[TPU] Skip creating empty tensor (#7630)
WoosukKwon Aug 17, 2024
0c2fa50
[TPU] Use mark_dynamic only for dummy run (#7634)
WoosukKwon Aug 18, 2024
ab7165f
[TPU] Optimize RoPE forward_native2 (#7636)
WoosukKwon Aug 18, 2024
e3b3182
[ Bugfix ] Fix Prometheus Metrics With `zeromq` Frontend (#7279)
robertgshaw2-redhat Aug 18, 2024
40e1360
[CI/Build] Add text-only test for Qwen models (#7475)
alex-jw-brooks Aug 18, 2024
200a2ff
[Misc] Refactor Llama3 RoPE initialization (#7637)
WoosukKwon Aug 19, 2024
ff7ec82
[Core] Optimize SPMD architecture with delta + serialization optimiza…
rkooo567 Aug 19, 2024
f710fb5
[Core] Use flashinfer sampling kernel when available (#7137)
peng1999 Aug 19, 2024
1a36287
[Bugfix] Fix xpu build (#7644)
jikunshang Aug 19, 2024
df845b2
[Misc] Remove Gemma RoPE (#7638)
WoosukKwon Aug 19, 2024
3ac50b4
[MISC] Add prefix cache hit rate to metrics (#7606)
comaniac Aug 19, 2024
dad961e
[Bugfix] fix lora_dtype value type in arg_utils.py - part 2 (#5428)
c3-ali Aug 19, 2024
47b65a5
[core] Multi Step Scheduling (#7000)
SolitaryThinker Aug 19, 2024
7601cb0
[Core] Support tensor parallelism for GGUF quantization (#7520)
Isotr0py Aug 19, 2024
da11523
[Bugfix] Don't disable existing loggers (#7664)
a-ys Aug 19, 2024
43735bf
[TPU] Remove redundant input tensor cloning (#7660)
WoosukKwon Aug 19, 2024
67e02fa
[Bugfix] use StoreBoolean instead of type=bool for --disable-logprobs…
tjohnson31415 Aug 20, 2024
e54ebc2
[doc] fix doc build error caused by msgspec (#7659)
youkaichao Aug 20, 2024
312f761
[Speculative Decoding] Fixing hidden states handling in batch expansi…
abhigoyal1997 Aug 20, 2024
0df7ec0
[ci] Install Buildkite test suite analysis (#7667)
khluu Aug 20, 2024
f4fc733
[Bugfix] support `tie_word_embeddings` for all models (#5724)
zijian-hu Aug 20, 2024
3d8a5f0
[CI] Organizing performance benchmark files (#7616)
KuntaiDu Aug 20, 2024
c4be16e
[misc] add nvidia related library in collect env (#7674)
youkaichao Aug 20, 2024
e6d811d
[XPU] fallback to native implementation for xpu custom op (#7670)
jianyizh Aug 20, 2024
ad28a74
[misc][cuda] add warning for pynvml user (#7675)
youkaichao Aug 20, 2024
b6f99a6
[Core] Refactor executor classes for easier inheritance (#7673)
jikunshang Aug 20, 2024
5288c06
[Kernel] (1/N) Machete - Hopper Optimized Mixed Precision Linear Kern…
LucasWilkinson Aug 20, 2024
398521a
[OpenVINO] Updated documentation (#7687)
ilya-lavrenov Aug 20, 2024
aae6927
[VLM][Model] Add test for InternViT vision encoder (#7409)
Isotr0py Aug 20, 2024
c42590f
[Hardware] [Intel GPU] refactor xpu worker/executor (#7686)
jikunshang Aug 20, 2024
2aa00d5
[CI/Build] Pin OpenTelemetry versions and make errors clearer (#7266)
ronensc Aug 20, 2024
c6af027
[Misc] Add jinja2 as an explicit build requirement (#7695)
LucasWilkinson Aug 20, 2024
3b68217
[Core] Add `AttentionState` abstraction (#7663)
Yard1 Aug 20, 2024
6e4658c
[Intel GPU] fix xpu not support punica kernel (which use torch.librar…
jikunshang Aug 20, 2024
9e51b6a
[ci][test] adjust max wait time for cpu offloading test (#7709)
youkaichao Aug 21, 2024
66a9e71
[Core] Pipe `worker_class_fn` argument in Executor (#7707)
Yard1 Aug 21, 2024
b74a125
[ci] try to log process using the port to debug the port usage (#7711)
youkaichao Aug 21, 2024
12e1c65
[Model] Add AWQ quantization support for InternVL2 model (#7187)
Isotr0py Aug 21, 2024
4506641
[Doc] Section for Multimodal Language Models (#7719)
ywang96 Aug 21, 2024
baaedfd
[mypy] Enable following imports for entrypoints (#7248)
DarkLight1337 Aug 21, 2024
dd3fa0e
[Bugfix] Mirror jinja2 in pyproject.toml (#7723)
sasha0552 Aug 21, 2024
c75363f
[BugFix] Avoid premature async generator exit and raise all exception…
njhill Aug 21, 2024
53328d7
[BUG] fix crash on flashinfer backend with cudagraph disabled, when a…
learninmou Aug 21, 2024
6925cdb
[Bugfix][Hardware][CPU] Fix `mm_limits` initialization for CPU backen…
Isotr0py Aug 21, 2024
9b73a2f
[Spec Decoding] Use target model max length as default for draft mode…
njhill Aug 21, 2024
d3c002e
[Bugfix] chat method add_generation_prompt param (#7734)
brian14708 Aug 21, 2024
f7e3b0c
[Bugfix][Frontend] Fix Issues Under High Load With `zeromq` Frontend …
robertgshaw2-redhat Aug 21, 2024
1b32e02
[Bugfix] Pass PYTHONPATH from setup.py to CMake (#7730)
sasha0552 Aug 21, 2024
91f4522
[multi-step] Raise error if not using async engine (#7703)
SolitaryThinker Aug 21, 2024
970dfdc
[Frontend] Improve Startup Failure UX (#7716)
robertgshaw2-redhat Aug 21, 2024
dd53c4b
[misc] Add Torch profiler support (#7451)
SolitaryThinker Aug 21, 2024
1ca0d4f
[Model] Add UltravoxModel and UltravoxConfig (#7615)
petersalas Aug 21, 2024
5844017
[ci] [multi-step] narrow multi-step test dependency paths (#7760)
SolitaryThinker Aug 21, 2024
8678a69
[Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7…
dsikka Aug 21, 2024
7eebe8c
[distributed][misc] error on same VLLM_HOST_IP setting (#7756)
youkaichao Aug 21, 2024
9984605
[AMD][CI/Build] Disambiguation of the function call for ROCm 6.2 head…
gshtras Aug 21, 2024
7937009
[Kernel] Replaced `blockReduce[...]` functions with `cub::BlockReduce…
ProExpertProg Aug 22, 2024
df1a211
[Model] Fix Phi-3.5-vision-instruct 'num_crops' issue (#7710)
zifeitong Aug 22, 2024
cde9183
[Bug][Frontend] Improve ZMQ client robustness (#7443)
joerunde Aug 22, 2024
aae74ef
Revert "[Kernel] Expand MoE weight loading + Add Fused Marlin MoE Ke…
mgoin Aug 22, 2024
eeee1c3
[TPU] Avoid initializing TPU runtime in is_tpu (#7763)
WoosukKwon Aug 22, 2024
8c6f694
[ci] refine dependency for distributed tests (#7776)
youkaichao Aug 22, 2024
b3856be
[Misc] Use torch.compile for GemmaRMSNorm (#7642)
WoosukKwon Aug 22, 2024
a3fce56
[Speculative Decoding] EAGLE Implementation with Top-1 proposer (#6830)
abhigoyal1997 Aug 22, 2024
4f419c0
Fix ShardedStateLoader for vllm fp8 quantization (#7708)
sfc-gh-zhwang Aug 22, 2024
55d63b1
[Bugfix] Don't build machete on cuda <12.0 (#7757)
LucasWilkinson Aug 22, 2024
955b519
[Misc] update fp8 to use `vLLMParameter` (#7437)
dsikka Aug 22, 2024
cc0eaf1
[Bugfix] spec decode handle None entries in topk args in create_seque…
tjohnson31415 Aug 22, 2024
d3b5b98
[Misc] Enhance prefix-caching benchmark tool (#6568)
Jeffwan Aug 22, 2024
57792ed
[Doc] Fix incorrect docs from #7615 (#7788)
petersalas Aug 22, 2024
15310b5
[Bugfix] Use LoadFormat values for `vllm serve --load-format` (#7784)
mgoin Aug 22, 2024
666ad0a
[ci] Cleanup & refactor Dockerfile to pass different Python versions …
khluu Aug 22, 2024
a152246
[Misc] fix typo in triton import warning (#7794)
lsy323 Aug 22, 2024
b903e1b
[Frontend] error suppression cleanup (#7786)
joerunde Aug 22, 2024
c01a6cb
[Ray backend] Better error when pg topology is bad. (#7584)
rkooo567 Aug 23, 2024
fc5ebbd
[Hardware][Intel GPU] refactor xpu_model_runner for tp (#7712)
jikunshang Aug 23, 2024
faeddb5
[misc] Add Torch profiler support for CPU-only devices (#7806)
DamonFool Aug 23, 2024
e25fee5
[BugFix] Fix server crash on empty prompt (#7746)
maxdebayser Aug 23, 2024
35ee2ad
[github][misc] promote asking llm first (#7809)
youkaichao Aug 23, 2024
f1df5db
[Misc] Update `marlin` to use vLLMParameters (#7803)
dsikka Aug 23, 2024
09c7792
Bump version to v0.5.5 (#7823)
simon-mo Aug 23, 2024
fcd968c
Merge commit '09c7792610ada9f88bbf87d32b472dd44bf23cc2' into sync_vllm
vaibhavjainwiz Aug 26, 2024
@@ -9,3 +9,4 @@ tasks:
     value: 0.664
 limit: 1000
 num_fewshot: 5
+trust_remote_code: True
4 changes: 2 additions & 2 deletions .buildkite/lm-eval-harness/configs/Meta-Llama-3-8B-QQQ.yaml
@@ -4,8 +4,8 @@ tasks:
 - name: "gsm8k"
   metrics:
   - name: "exact_match,strict-match"
-    value: 0.409
+    value: 0.419
   - name: "exact_match,flexible-extract"
-    value: 0.406
+    value: 0.416
 limit: 1000
 num_fewshot: 5
@@ -1,11 +1,11 @@
-# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nvidia/Minitron-4B-Base -b auto -l 1000 -f 5 -t 1
-model_name: "nvidia/Minitron-4B-Base"
+# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m mgoin/Minitron-4B-Base-FP8 -b auto -l 1000 -f 5 -t 1
+model_name: "mgoin/Minitron-4B-Base-FP8"
 tasks:
 - name: "gsm8k"
   metrics:
   - name: "exact_match,strict-match"
-    value: 0.252
+    value: 0.233
   - name: "exact_match,flexible-extract"
-    value: 0.252
+    value: 0.236
 limit: 1000
 num_fewshot: 5
2 changes: 1 addition & 1 deletion .buildkite/lm-eval-harness/configs/models-small.txt
@@ -4,7 +4,7 @@ Meta-Llama-3-8B-Instruct-FP8-compressed-tensors.yaml
 Meta-Llama-3-8B-Instruct-INT8-compressed-tensors.yaml
 Meta-Llama-3-8B-Instruct-nonuniform-compressed-tensors.yaml
 Meta-Llama-3-8B-Instruct-Channelwise-compressed-tensors.yaml
-Minitron-4B-Base.yaml
+Minitron-4B-Base-FP8.yaml
 Qwen2-1.5B-Instruct-INT8-compressed-tensors.yaml
 Qwen2-1.5B-Instruct-FP8W8.yaml
 Meta-Llama-3-8B-QQQ.yaml
7 changes: 5 additions & 2 deletions .buildkite/lm-eval-harness/test_lm_eval_correctness.py
@@ -14,7 +14,7 @@
 import numpy
 import yaml

-RTOL = 0.02
+RTOL = 0.05
 TEST_DATA_FILE = os.environ.get(
     "LM_EVAL_TEST_DATA_FILE",
     ".buildkite/lm-eval-harness/configs/Meta-Llama-3-8B-Instruct.yaml")
@@ -23,9 +23,12 @@


 def launch_lm_eval(eval_config):
+    trust_remote_code = eval_config.get('trust_remote_code', False)
+
     model_args = f"pretrained={eval_config['model_name']}," \
                  f"tensor_parallel_size={TP_SIZE}," \
-                 f"add_bos_token=true"
+                 f"add_bos_token=true," \
+                 f"trust_remote_code={trust_remote_code}"

     results = lm_eval.simple_evaluate(
         model="vllm",
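The two changes in this file (the optional `trust_remote_code` key and the looser `RTOL`) can be sketched in isolation. This is a minimal illustration, not the test itself: `build_model_args` and `within_rtol` are hypothetical helpers, the config and numbers are made up, and the comparison rule is an assumption (a NumPy-style relative tolerance, since the comparison code is not part of the visible hunks).

```python
def build_model_args(eval_config: dict, tp_size: int) -> str:
    """Build the comma-separated model_args string, mirroring the diff above."""
    # Configs without the key fall back to False, so older YAML files still work.
    trust_remote_code = eval_config.get("trust_remote_code", False)
    return (f"pretrained={eval_config['model_name']},"
            f"tensor_parallel_size={tp_size},"
            f"add_bos_token=true,"
            f"trust_remote_code={trust_remote_code}")


def within_rtol(expected: float, measured: float, rtol: float) -> bool:
    """Assumed check: relative tolerance in the style of numpy.isclose, atol = 0."""
    return abs(expected - measured) <= rtol * abs(measured)


# Hypothetical config and measurement, for illustration only.
config = {"model_name": "example-org/example-model", "trust_remote_code": True}
print(build_model_args(config, tp_size=1))

expected, measured = 0.419, 0.405
print(within_rtol(expected, measured, rtol=0.02))  # old tolerance
print(within_rtol(expected, measured, rtol=0.05))  # new tolerance
```

With these made-up numbers the measurement falls outside the 2% band but inside the 5% band, which is the kind of borderline result the looser tolerance is meant to absorb.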
9 changes: 5 additions & 4 deletions .buildkite/nightly-benchmarks/README.md
@@ -34,17 +34,18 @@ See [vLLM performance dashboard](https://perf.vllm.ai) for the latest performan

 Performance benchmark will be triggered when:
 - A PR being merged into vllm.
-- Every commit for those PRs with `perf-benchmarks` label.
+- Every commit for those PRs with `perf-benchmarks` label AND `ready` label.

 Nightly benchmark will be triggered when:
-- Every commit for those PRs with `nightly-benchmarks` label.
+- Every commit for those PRs with `perf-benchmarks` label and `nightly-benchmarks` label.




 ## Performance benchmark details

-See [descriptions.md](tests/descriptions.md) for detailed descriptions, and use `tests/latency-tests.json`, `tests/throughput-tests.json`, `tests/serving-tests.json` to configure the test cases.
+
+See [performance-benchmarks-descriptions.md](performance-benchmarks-descriptions.md) for detailed descriptions, and use `tests/latency-tests.json`, `tests/throughput-tests.json`, `tests/serving-tests.json` to configure the test cases.


 #### Latency test
@@ -68,7 +69,7 @@ Here is an example of one test inside `latency-tests.json`:

 In this example:
 - The `test_name` attributes is a unique identifier for the test. In `latency-tests.json`, it must start with `latency_`.
-- The `parameters` attribute control the command line arguments to be used for `benchmark_latency.py`. Note that please use underline `_` instead of the dash `-` when specifying the command line arguments, and `run-benchmarks-suite.sh` will convert the underline to dash when feeding the arguments to `benchmark_latency.py`. For example, the corresponding command line arguments for `benchmark_latency.py` will be `--model meta-llama/Meta-Llama-3-8B --tensor-parallel-size 1 --load-format dummy --num-iters-warmup 5 --num-iters 15`
+- The `parameters` attribute control the command line arguments to be used for `benchmark_latency.py`. Note that please use underline `_` instead of the dash `-` when specifying the command line arguments, and `run-performance-benchmarks.sh` will convert the underline to dash when feeding the arguments to `benchmark_latency.py`. For example, the corresponding command line arguments for `benchmark_latency.py` will be `--model meta-llama/Meta-Llama-3-8B --tensor-parallel-size 1 --load-format dummy --num-iters-warmup 5 --num-iters 15`

 Note that the performance numbers are highly sensitive to the value of the parameters. Please make sure the parameters are set correctly.

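The underscore-to-dash conversion described in the README text above can be sketched as follows. The real conversion is done by the shell script; this Python version only illustrates the rule. `params_to_cli_args` is a hypothetical helper, and the `parameters` dictionary is reconstructed from the command line quoted in the README rather than copied from `latency-tests.json`.

```python
def params_to_cli_args(parameters: dict) -> str:
    """Turn {"some_key": value} pairs into "--some-key value" arguments."""
    parts = []
    for key, value in parameters.items():
        # Only the key is rewritten; values such as model names are left as-is.
        flag = "--" + key.replace("_", "-")
        parts.append(f"{flag} {value}")
    return " ".join(parts)


# Reconstructed from the example command line in the README (an assumption).
parameters = {
    "model": "meta-llama/Meta-Llama-3-8B",
    "tensor_parallel_size": 1,
    "load_format": "dummy",
    "num_iters_warmup": 5,
    "num_iters": 15,
}

print(params_to_cli_args(parameters))
```

The sketch does not cover value-less flags or values that need shell quoting; how the actual script treats those cases is not shown in this diff.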
2 changes: 1 addition & 1 deletion .buildkite/nightly-benchmarks/benchmark-pipeline.yaml
@@ -21,7 +21,7 @@ steps:
 containers:
 - image: public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:$BUILDKITE_COMMIT
   command:
-  - bash .buildkite/nightly-benchmarks/run-benchmarks-suite.sh
+  - bash .buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh
   resources:
     limits:
       nvidia.com/gpu: 8
@@ -1,47 +1,42 @@

## Latency tests

This test suite aims to test vllm's end-to-end latency under a controlled setup.

- Input length: 32 tokens.
- Output length: 128 tokens.
- Batch size: fixed (8).
- Models: llama-3 8B, llama-3 70B, mixtral 8x7B.
- Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
- Evaluation metrics: end-to-end latency (mean, median, p99).
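As a rough illustration of the three latency metrics listed above, the sketch below computes mean, median and p99 from a list of per-iteration latencies. The sample values are made up, `summarize_latencies` is a hypothetical helper, and the p99 here uses the nearest-rank definition; the benchmark script may well use an interpolating percentile instead.

```python
import math
import statistics


def summarize_latencies(latencies_s: list) -> dict:
    """Return mean, median and nearest-rank p99 of end-to-end latencies."""
    ordered = sorted(latencies_s)
    # Nearest-rank percentile: smallest value with at least 99% of samples <= it.
    rank = max(1, math.ceil(0.99 * len(ordered)))
    return {
        "mean": statistics.fmean(ordered),
        "median": statistics.median(ordered),
        "p99": ordered[rank - 1],
    }


# Hypothetical latencies in seconds, with one slow outlier.
samples = [1.0, 2.0, 3.0, 4.0, 10.0]
print(summarize_latencies(samples))
```

With few samples the p99 is simply the slowest run, which is one reason the number of measured iterations matters when comparing results.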

### Latency benchmarking results

{latency_tests_markdown_table}

## Throughput tests

This test suite aims to test vllm's throughput.
## Throughput tests

- Input length: randomly sample 200 prompts from ShareGPT dataset (with fixed random seed).
- Output length: the corresponding output length of these 200 prompts.
- Batch size: dynamically determined by vllm to achieve maximum throughput.
- Models: llama-3 8B, llama-3 70B, mixtral 8x7B.
- Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
- Evaluation metrics: throughput.

### Throughput benchmarking results

{throughput_tests_markdown_table}

## Serving tests

This test suite aims to test vllm's real serving metrics.
## Serving tests

- Input length: randomly sample 200 prompts from ShareGPT dataset (with fixed random seed).
- Output length: the corresponding output length of these 200 prompts.
- Batch size: dynamically determined by vllm and the arrival pattern of the requests.
- **Average QPS (query per second)**: 1, 4, 16 and inf. QPS = inf means all requests come at once. For other QPS values, the arrival time of each query is determined using a random Poisson process (with fixed random seed).
- Models: llama-3 8B, llama-3 70B, mixtral 8x7B.
- Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
- We also added a speculative decoding test for llama-3 70B, under QPS 2
- Evaluation metrics: throughput, TTFT (time to the first token, with mean, median and p99), ITL (inter-token latency, with mean, median and p99).
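The arrival pattern described in the QPS bullet can be sketched with exponentially distributed gaps between requests, which is the standard way to simulate a Poisson process. This is an illustrative sketch only: `arrival_times` is a hypothetical helper and the seed value is arbitrary; the serving benchmark's own implementation is not part of this diff.

```python
import random


def arrival_times(num_requests: int, qps: float, seed: int = 0) -> list:
    """Return request arrival times in seconds for a target average QPS."""
    if qps == float("inf"):
        # QPS = inf: every request arrives at once.
        return [0.0] * num_requests
    rng = random.Random(seed)  # fixed seed keeps the schedule reproducible
    now = 0.0
    times = []
    for _ in range(num_requests):
        # Gaps in a Poisson process are exponential with rate = qps.
        now += rng.expovariate(qps)
        times.append(now)
    return times


schedule = arrival_times(num_requests=200, qps=4.0)
print(f"last request arrives after {schedule[-1]:.1f} s")
print(arrival_times(num_requests=3, qps=float("inf")))
```

Because the seed is fixed, two runs produce the same schedule, so differences in TTFT and ITL between runs are not caused by a different arrival pattern.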

### Serving benchmarking results

{serving_tests_markdown_table}


## json version of the benchmarking tables

This section contains the data of the markdown tables above in JSON format.
@@ -174,8 +174,8 @@ def results_to_json(latency, throughput, serving):
     # document the result
     with open(results_folder / "benchmark_results.md", "w") as f:

-        results = read_markdown(
-            "../.buildkite/nightly-benchmarks/tests/descriptions.md")
+        results = read_markdown("../.buildkite/nightly-benchmarks/" +
+                                "performance-benchmarks-descriptions.md")
         results = results.format(
             latency_tests_markdown_table=latency_md_table,
             throughput_tests_markdown_table=throughput_md_table,
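The way the descriptions file acts as a template can be sketched with plain `str.format`. The placeholder names below are the ones that appear in the descriptions file in this diff; the template text and the table strings are shortened, hypothetical stand-ins.

```python
# Shortened stand-in for the descriptions markdown file.
template = (
    "## Latency tests\n\n{latency_tests_markdown_table}\n\n"
    "## Throughput tests\n\n{throughput_tests_markdown_table}\n\n"
    "## Serving tests\n\n{serving_tests_markdown_table}\n"
)

# Hypothetical one-row tables; the real ones are generated from the result files.
report = template.format(
    latency_tests_markdown_table="| test | mean (ms) |\n|---|---|\n| latency_example | 123 |",
    throughput_tests_markdown_table="| test | tokens/s |\n|---|---|\n| throughput_example | 456 |",
    serving_tests_markdown_table="| test | ttft (ms) |\n|---|---|\n| serving_example | 78 |",
)

print(report)
```

One consequence of this design is that any literal `{` or `}` in the descriptions file has to be doubled (`{{`, `}}`), because `str.format` treats single braces as placeholder delimiters.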
@@ -37,9 +37,9 @@ check_hf_token() {
ensure_sharegpt_downloaded() {
local FILE=ShareGPT_V3_unfiltered_cleaned_split.json
if [ ! -f "$FILE" ]; then
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/$FILE
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/$FILE
else
echo "$FILE already exists."
echo "$FILE already exists."
fi
}

@@ -68,35 +68,38 @@ wait_for_server() {
done' && return 0 || return 1
}

kill_gpu_processes() {
# kill all processes on GPU.
pids=$(nvidia-smi --query-compute-apps=pid --format=csv,noheader)
if [ -z "$pids" ]; then
echo "No GPU processes found."
kill_processes_launched_by_current_bash() {
# Kill all python processes launched from current bash script
current_shell_pid=$$
processes=$(ps -eo pid,ppid,command | awk -v ppid="$current_shell_pid" -v proc="$1" '$2 == ppid && $3 ~ proc {print $1}')
if [ -n "$processes" ]; then
echo "Killing the following processes matching '$1':"
echo "$processes"
echo "$processes" | xargs kill -9
else
for pid in $pids; do
kill -9 "$pid"
echo "Killed process with PID: $pid"
done

echo "All GPU processes have been killed."
echo "No processes found matching '$1'."
fi
}
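The `ps | awk` filter in `kill_processes_launched_by_current_bash` selects processes whose parent PID equals the current shell and whose third column matches a pattern. The sketch below reproduces that selection on canned text so the rule is easy to inspect; `child_pids_matching` and the sample `ps` output are hypothetical, and nothing is killed.

```python
import re
from typing import List


def child_pids_matching(ps_output: str, parent_pid: int, pattern: str) -> List[int]:
    """Select PIDs whose PPID is parent_pid and whose 3rd field matches pattern."""
    pids = []
    for line in ps_output.splitlines():
        fields = line.split()
        if len(fields) < 3 or not fields[0].isdigit():
            continue  # skip the header row and malformed lines
        pid, ppid, first_word = fields[0], fields[1], fields[2]
        # Like awk's $3, only the first word of the command is tested.
        if ppid == str(parent_pid) and re.search(pattern, first_word):
            pids.append(int(pid))
    return pids


# Made-up output in the shape of `ps -eo pid,ppid,command`.
sample = """\
  PID  PPID COMMAND
  100    42 python3 -m vllm.entrypoints.openai.api_server
  101    42 sleep 60
  102     7 python3 benchmark_serving.py
"""

print(child_pids_matching(sample, parent_pid=42, pattern="python"))
```

Since only the first word of the command is matched, a pattern that appears solely in the arguments (for example a module name) would select nothing.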

kill_gpu_processes() {

# waiting for GPU processes to be fully killed
# loop while nvidia-smi returns any processes
while [ -n "$(nvidia-smi --query-compute-apps=pid --format=csv,noheader)" ]; do
ps -aux
lsof -t -i:8000 | xargs -r kill -9
pkill -f pt_main_thread
# this line doesn't work now
# ps aux | grep python | grep openai | awk '{print $2}' | xargs -r kill -9
pkill -f python3
pkill -f /usr/bin/python3


# wait until GPU memory usage smaller than 1GB
while [ $(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | head -n 1) -ge 1000 ]; do
sleep 1
echo "Waiting for GPU processes to be killed"
done

# remove vllm config file
rm -rf ~/.config/vllm

# Print the GPU memory usage
# so that we know if all GPU processes are killed.
gpu_memory_usage=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits -i 0)
# The memory usage should be 0 MB.
echo "GPU 0 Memory Usage: $gpu_memory_usage MB"
}
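The wait loop in `kill_gpu_processes`, which polls until the memory reported for the first GPU drops below 1000 MiB, can be sketched as a small polling helper. This is an illustration under assumptions: `wait_for_gpu_memory_below` is hypothetical, and the memory reader is injected so that the example runs on canned readings instead of calling `nvidia-smi`.

```python
import time
from typing import Callable


def wait_for_gpu_memory_below(read_used_mib: Callable[[], int],
                              threshold_mib: int = 1000,
                              poll_seconds: float = 1.0,
                              sleep: Callable[[float], None] = time.sleep) -> int:
    """Poll until used GPU memory is below threshold_mib; return the poll count."""
    polls = 0
    # Same condition as the shell loop: keep waiting while usage >= threshold.
    while read_used_mib() >= threshold_mib:
        sleep(poll_seconds)
        polls += 1
    return polls


# Canned readings standing in for successive nvidia-smi queries (made-up values).
readings = iter([5000, 2500, 800])
polls = wait_for_gpu_memory_below(lambda: next(readings), sleep=lambda _s: None)
print(f"memory dropped below the threshold after {polls} polls")
```

Like the shell loop, this sketch has no timeout, so it would wait indefinitely if some other process kept the memory allocated.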

upload_to_buildkite() {
@@ -114,7 +117,7 @@ upload_to_buildkite() {
fi

# Use the determined command to annotate and upload artifacts
$BUILDKITE_AGENT_COMMAND annotate --style "info" --context "$BUILDKITE_LABEL-benchmark-results" < $RESULTS_FOLDER/benchmark_results.md
$BUILDKITE_AGENT_COMMAND annotate --style "info" --context "$BUILDKITE_LABEL-benchmark-results" <$RESULTS_FOLDER/benchmark_results.md
$BUILDKITE_AGENT_COMMAND artifact upload "$RESULTS_FOLDER/*"
}

@@ -166,7 +169,7 @@ run_latency_tests() {
         latency_command: $latency,
         gpu_type: $gpu
       }')
-    echo "$jq_output" > "$RESULTS_FOLDER/$test_name.commands"
+    echo "$jq_output" >"$RESULTS_FOLDER/$test_name.commands"

     # run the benchmark
     eval "$latency_command"
@@ -176,7 +179,6 @@ run_latency_tests() {
   done
 }

-
 run_throughput_tests() {
   # run throughput tests using `benchmark_throughput.py`
   # $1: a json file specifying throughput test cases
@@ -224,7 +226,7 @@ run_throughput_tests() {
         throughput_command: $command,
         gpu_type: $gpu
       }')
-    echo "$jq_output" > "$RESULTS_FOLDER/$test_name.commands"
+    echo "$jq_output" >"$RESULTS_FOLDER/$test_name.commands"

     # run the benchmark
     eval "$throughput_command"
@@ -256,7 +258,6 @@ run_serving_tests() {
       continue
     fi

-
     # get client and server arguments
     server_params=$(echo "$params" | jq -r '.server_parameters')
     client_params=$(echo "$params" | jq -r '.client_parameters')
@@ -334,7 +335,7 @@ run_serving_tests() {
           client_command: $client,
           gpu_type: $gpu
         }')
-      echo "$jq_output" > "$RESULTS_FOLDER/${new_test_name}.commands"
+      echo "$jq_output" >"$RESULTS_FOLDER/${new_test_name}.commands"

     done

@@ -351,6 +352,7 @@ main() {
   # dependencies
   (which wget && which curl) || (apt-get update && apt-get install -y wget curl)
   (which jq) || (apt-get update && apt-get -y install jq)
+  (which lsof) || (apt-get update && apt-get install -y lsof)

   # get the current IP address, required by benchmark_serving.py
   export VLLM_HOST_IP=$(hostname -I | awk '{print $1}')
@@ -369,7 +371,6 @@ main() {
   run_latency_tests $QUICK_BENCHMARK_ROOT/tests/latency-tests.json
   run_throughput_tests $QUICK_BENCHMARK_ROOT/tests/throughput-tests.json

-
   # postprocess benchmarking results
   pip install tabulate pandas
   python3 $QUICK_BENCHMARK_ROOT/scripts/convert-results-json-to-markdown.py
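Review note: the dependency checks in `main()` repeat the same `which ... || apt-get install` pattern for each tool, and this PR adds one more for `lsof`. A possible consolidation is sketched below, not taken from this PR. `ensure_commands` and `apt_install` are hypothetical names, and the sketch assumes a Debian-based image where each package name matches its command name, as it does for `wget`, `curl`, `jq`, and `lsof`.

```shell
# Hypothetical helper (not part of this PR): run the installer once, and only
# for the commands that are actually missing.
# Usage: ensure_commands INSTALLER COMMAND [COMMAND...]
ensure_commands() {
  local installer="$1" missing="" cmd
  shift
  for cmd in "$@"; do
    # `command -v` is the POSIX way to test whether a command is available.
    command -v "$cmd" >/dev/null 2>&1 || missing="$missing $cmd"
  done
  if [ -z "$missing" ]; then
    return 0
  fi
  echo "Installing missing commands:$missing" >&2
  # Intentionally unquoted so that each name becomes a separate argument.
  "$installer" $missing
}

# Installer matching the apt-get usage in the diff.
apt_install() {
  apt-get update && apt-get install -y "$@"
}

# Usage:
#   ensure_commands apt_install wget curl jq lsof
```

Taking the installer as an argument means `apt-get update` runs at most once, and the check can be tried out with a stub installer on a machine without root access.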
4 changes: 2 additions & 2 deletions .buildkite/nightly-benchmarks/tests/latency-tests.json
@@ -2,7 +2,7 @@
     {
         "test_name": "latency_llama8B_tp1",
         "parameters": {
-            "model": "meta-llama/Meta-Llama-3-8B",
+            "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
             "tensor_parallel_size": 1,
             "load_format": "dummy",
             "num_iters_warmup": 5,
@@ -12,7 +12,7 @@
     {
         "test_name": "latency_llama70B_tp4",
         "parameters": {
-            "model": "meta-llama/Meta-Llama-3-70B-Instruct",
+            "model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
             "tensor_parallel_size": 4,
             "load_format": "dummy",
             "num-iters-warmup": 5,
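Review note: the two latency test cases spell the warmup key differently (`num_iters_warmup` and `num-iters-warmup`). The code that turns a `parameters` object into command-line flags is not part of this diff, so whether both spellings are accepted cannot be confirmed here. The sketch below only illustrates one way such a conversion could normalize keys. `params_to_args` is a hypothetical name, the underscore-to-hyphen rule is an assumption, and `jq` (which `main()` installs) is required.

```shell
# Hypothetical helper (not part of this PR): turn a flat JSON object into
# "--key value" flags, rewriting underscores in key names to hyphens.
params_to_args() {
  printf '%s' "$1" | jq -r '
    to_entries
    | map("--" + (.key | gsub("_"; "-")) + " " + (.value | tostring))
    | join(" ")'
}

# Example input mixing both key spellings seen in latency-tests.json.
params='{"model": "meta-llama/Meta-Llama-3.1-8B-Instruct", "tensor_parallel_size": 1, "load_format": "dummy", "num-iters-warmup": 5}'
params_to_args "$params"
```

Under this rule both spellings produce the same `--num-iters-warmup` flag. A key with an empty string value, such as `"disable_log_stats": ""` in the serving tests, would come out as a bare flag followed by a trailing space.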
12 changes: 6 additions & 6 deletions .buildkite/nightly-benchmarks/tests/serving-tests.json
@@ -3,15 +3,15 @@
         "test_name": "serving_llama8B_tp1_sharegpt",
         "qps_list": [1, 4, 16, "inf"],
         "server_parameters": {
-            "model": "meta-llama/Meta-Llama-3-8B",
+            "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
             "tensor_parallel_size": 1,
             "swap_space": 16,
             "disable_log_stats": "",
             "disable_log_requests": "",
             "load_format": "dummy"
         },
         "client_parameters": {
-            "model": "meta-llama/Meta-Llama-3-8B",
+            "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
             "backend": "vllm",
             "dataset_name": "sharegpt",
             "dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
@@ -22,15 +22,15 @@
         "test_name": "serving_llama70B_tp4_sharegpt",
         "qps_list": [1, 4, 16, "inf"],
         "server_parameters": {
-            "model": "meta-llama/Meta-Llama-3-70B-Instruct",
+            "model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
             "tensor_parallel_size": 4,
             "swap_space": 16,
             "disable_log_stats": "",
             "disable_log_requests": "",
             "load_format": "dummy"
         },
         "client_parameters": {
-            "model": "meta-llama/Meta-Llama-3-70B-Instruct",
+            "model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
             "backend": "vllm",
             "dataset_name": "sharegpt",
             "dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
@@ -60,7 +60,7 @@
         "test_name": "serving_llama70B_tp4_sharegpt_specdecode",
         "qps_list": [2],
         "server_parameters": {
-            "model": "meta-llama/Meta-Llama-3-70B-Instruct",
+            "model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
             "disable_log_requests": "",
             "tensor_parallel_size": 4,
             "swap_space": 16,
@@ -70,7 +70,7 @@
             "use_v2_block_manager": ""
         },
         "client_parameters": {
-            "model": "meta-llama/Meta-Llama-3-70B-Instruct",
+            "model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
             "backend": "vllm",
             "dataset_name": "sharegpt",
             "dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
4 changes: 2 additions & 2 deletions .buildkite/nightly-benchmarks/tests/throughput-tests.json
@@ -2,7 +2,7 @@
     {
         "test_name": "throughput_llama8B_tp1",
         "parameters": {
-            "model": "meta-llama/Meta-Llama-3-8B",
+            "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
             "tensor_parallel_size": 1,
             "load_format": "dummy",
             "dataset": "./ShareGPT_V3_unfiltered_cleaned_split.json",
@@ -13,7 +13,7 @@
     {
         "test_name": "throughput_llama70B_tp4",
         "parameters": {
-            "model": "meta-llama/Meta-Llama-3-70B-Instruct",
+            "model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
             "tensor_parallel_size": 4,
             "load_format": "dummy",
             "dataset": "./ShareGPT_V3_unfiltered_cleaned_split.json",