Changes from all commits (104 commits)
6559a9e
add Per-token-activation per-channel-weight on-the-fly quantization fp8
kliuae Jan 21, 2025
798c07e
add ptpc fp8 unittests
tjtanaa Jan 21, 2025
63f9657
remove is_navi check for now
tjtanaa Jan 22, 2025
6dc4604
update rocm gpu installation readme; remove navi check
tjtanaa Jan 28, 2025
30f0ecd
update PyTorch version to enable torch._scaled_mm rowwise
tjtanaa Jan 28, 2025
be57b24
[Misc] Rename `MultiModalInputsV2 -> MultiModalInputs` (#12244)
DarkLight1337 Jan 21, 2025
66d6dd2
[Misc]Add BNB quantization for PaliGemmaForConditionalGeneration (#1…
jeejeelee Jan 21, 2025
e9ddeda
[Misc] Remove redundant TypeVar from base model (#12248)
DarkLight1337 Jan 21, 2025
0572080
[Bugfix] Fix mm_limits access for merged multi-modal processor (#12252)
DarkLight1337 Jan 21, 2025
29b95c6
[torch.compile] transparent compilation with more logging (#12246)
youkaichao Jan 21, 2025
b559fa6
[V1][Bugfix] Fix data item ordering in mixed-modality inference (#12259)
ywang96 Jan 21, 2025
6cfb7ac
Remove pytorch comments for outlines + compressed-tensors (#12260)
tdoublep Jan 21, 2025
27530bb
[Platform] improve platforms getattr (#12264)
MengqingCao Jan 21, 2025
91b7860
[ci/build] update nightly torch for gh200 test (#12270)
youkaichao Jan 21, 2025
98b8414
[Bugfix] fix race condition that leads to wrong order of token return…
joennlae Jan 21, 2025
e4564cb
[Kernel] fix moe_align_block_size error condition (#12239)
jinzhen-lin Jan 21, 2025
36077d4
[v1][stats][1/n] Add RequestStatsUpdate and RequestStats types (#10907)
rickyyx Jan 21, 2025
049885f
[Bugfix] Multi-sequence broken (#11898)
andylolu2 Jan 21, 2025
0db6a75
[Misc] Remove experimental dep from tracing.py (#12007)
codefromthecrypt Jan 21, 2025
cbe2a73
[Misc] Set default backend to SDPA for get_vit_attn_backend (#12235)
wangxiyuan Jan 21, 2025
7980828
[Core] Free CPU pinned memory on environment cleanup (#10477)
janimo Jan 21, 2025
10611d8
[BUGFIX] When skip_tokenize_init and multistep are set, execution cra…
maleksan85 Jan 21, 2025
fb43dee
[Documentation][AMD] Add information about prebuilt ROCm vLLM docker …
hongxiayang Jan 21, 2025
4f2fc00
[VLM] Simplify post-processing of replacement info (#12269)
DarkLight1337 Jan 22, 2025
6bcfac0
[ci/lint] Add back default arg for pre-commit (#12279)
khluu Jan 22, 2025
4b713d3
[CI] add docker volume prune to neuron CI (#12291)
liangfu Jan 22, 2025
1356039
[Ci/Build] Fix mypy errors on main (#12296)
DarkLight1337 Jan 22, 2025
9149efa
[Benchmark] More accurate TPOT calc in `benchmark_serving.py` (#12288)
njhill Jan 22, 2025
63df778
[core] separate builder init and builder prepare for each batch (#12253)
youkaichao Jan 22, 2025
a3a6605
[Build] update requirements of no-device (#12299)
MengqingCao Jan 22, 2025
8a8edd5
[Core] Support fully transparent sleep mode (#11743)
youkaichao Jan 22, 2025
b5f00e2
[VLM] Avoid unnecessary tokenization (#12310)
DarkLight1337 Jan 22, 2025
627d6be
[Model][Bugfix]: correct Aria model output (#12309)
xffxff Jan 22, 2025
63586d6
[Bugfix][VLM] Fix mixed-modality inference backward compatibility for…
ywang96 Jan 22, 2025
854e740
[Doc] Add docs for prompt replacement (#12318)
DarkLight1337 Jan 22, 2025
442e38e
[Misc] Fix the error in the tip for the --lora-modules parameter (#12…
WangErXiao Jan 22, 2025
663f758
[Misc] Improve the readability of BNB error messages (#12320)
jeejeelee Jan 22, 2025
851e8a9
[Bugfix] Fix HPU multiprocessing executor (#12167)
kzawora-intel Jan 22, 2025
5b5dffb
[Core] Support `reset_prefix_cache` (#12284)
comaniac Jan 22, 2025
d57c673
[Frontend][V1] Online serving performance improvements (#12287)
njhill Jan 22, 2025
7b79dad
[AMD][Quantization] Add TritonScaledMMLinearKernel since int8 is brok…
rasmith Jan 23, 2025
d7f15ad
[Bugfix] Fixing AMD LoRA CI test. (#12329)
Alexei-V-Ivanov-AMD Jan 23, 2025
fdb07fa
[Docs] Update FP8 KV Cache documentation (#12238)
mgoin Jan 23, 2025
8dab4e9
[Docs] Document vulnerability disclosure process (#12326)
russellb Jan 23, 2025
7e5655a
[V1] Add `uncache_blocks` (#12333)
comaniac Jan 23, 2025
23ae785
[doc] explain common errors around torch.compile (#12340)
youkaichao Jan 23, 2025
cb968a3
[Hardware][Gaudi][BugFix] Fix dataclass error due to triton package u…
zhenwei-intel Jan 23, 2025
7da0408
[Bugfix] Fix k_proj's bias for whisper self attention (#12342)
Isotr0py Jan 23, 2025
7e40b3d
[Kernel] Flash Attention 3 Support (#12093)
LucasWilkinson Jan 23, 2025
d95bfc2
[Doc] Troubleshooting errors during model inspection (#12351)
DarkLight1337 Jan 23, 2025
a62058d
[V1] Simplify M-RoPE (#12352)
ywang96 Jan 23, 2025
73fbc9c
[Bugfix] Fix broken internvl2 inference with v1 (#12360)
Isotr0py Jan 23, 2025
f10e75d
[core] add wake_up doc and some sanity check (#12361)
youkaichao Jan 23, 2025
1f664ef
[torch.compile] decouple compile sizes and cudagraph sizes (#12243)
youkaichao Jan 23, 2025
18b678d
[FP8][Kernel] Dynamic kv cache scaling factors computation (#11906)
gshtras Jan 23, 2025
e1e96e2
[TPU] Update TPU CI to use torchxla nightly on 20250122 (#12334)
lsy323 Jan 23, 2025
00dbfa7
[Docs] Document Phi-4 support (#12362)
Isotr0py Jan 23, 2025
fa914be
[BugFix] Fix parameter names and `process_after_weight_loading` for W…
dsikka Jan 23, 2025
f97fcf4
[Misc] Fix OpenAI API Compatibility Issues in Benchmark Script (#12357)
jsato8094 Jan 23, 2025
0b37b55
[Docs] Add meetup slides (#12345)
WoosukKwon Jan 23, 2025
9e31cd9
[Docs] Update spec decode + structured output in compat matrix (#12373)
russellb Jan 24, 2025
7f5281c
[V1][Frontend] Coalesce bunched `RequestOutput`s (#12298)
njhill Jan 24, 2025
c0e786e
Set weights_only=True when using torch.load() (#12366)
russellb Jan 24, 2025
837673f
[Bugfix] Path join when building local path for S3 clone (#12353)
omer-dayan Jan 24, 2025
04a9ed3
Update compressed-tensors version (#12367)
dsikka Jan 24, 2025
9313039
[V1] Increase default batch size for H100/H200 (#12369)
WoosukKwon Jan 24, 2025
404466b
[perf] fix perf regression from #12253 (#12380)
youkaichao Jan 24, 2025
a93fa1c
[Misc] Use VisionArena Dataset for VLM Benchmarking (#12389)
ywang96 Jan 24, 2025
c6b9d47
[ci/build] fix wheel size check (#12396)
youkaichao Jan 24, 2025
2c8d8f8
[Hardware][Gaudi][Doc] Add missing step in setup instructions (#12382)
MohitIntel Jan 24, 2025
e4e455d
[ci/build] sync default value for wheel size (#12398)
youkaichao Jan 24, 2025
238f125
[Misc] Enable proxy support in benchmark script (#12356)
jsato8094 Jan 24, 2025
b168424
[Bugfix][Kernel] Fix CUDA 11.8 being broken by FA3 build (#12375)
LucasWilkinson Jan 24, 2025
7a37f5b
[Misc] Remove deprecated code (#12383)
DarkLight1337 Jan 24, 2025
949a71b
[Bugfix][Kernel] FA3 Fix - RuntimeError: This flash attention build o…
LucasWilkinson Jan 24, 2025
f725805
[Bugfix] Fix BLIP-2 processing (#12412)
DarkLight1337 Jan 25, 2025
5deb923
[ROCm][MoE] MI300 tuned configs Mixtral-8x(7B,22B) | fp16, fp8 (#12408)
divakar-amd Jan 25, 2025
e516889
[Misc] Add FA2 support to ViT MHA layer (#12355)
Isotr0py Jan 25, 2025
09ebc9c
[TPU][CI] Update torchxla version in requirement-tpu.txt (#12422)
lsy323 Jan 25, 2025
2bc60ba
[Misc][Bugfix] FA3 support to ViT MHA layer (#12435)
ywang96 Jan 26, 2025
4388fac
[V1][Perf] Reduce scheduling overhead in model runner after cuda sync…
youngkent Jan 26, 2025
b2d17f7
[V1][Bugfix] Fix assertion when mm hashing is turned off (#12439)
ywang96 Jan 26, 2025
2192644
[Misc] Revert FA on ViT #12355 and #12435 (#12445)
ywang96 Jan 26, 2025
48260e5
[Frontend] generation_config.json for maximum tokens(#12242)
mhendrey Jan 26, 2025
c43632c
[Bugfix] Disable w16a16 2of4 sparse CompressedTensors24 (#12417)
tlrmchlsmth Jan 26, 2025
9b79bce
[Bugfix/CI] Fix broken kernels/test_mha.py (#12450)
tlrmchlsmth Jan 26, 2025
5d6cbd0
[Bugfix][Kernel] Fix perf regression caused by PR #12405 (#12434)
LucasWilkinson Jan 26, 2025
b74bb57
[Build/CI] Fix libcuda.so linkage (#12424)
tlrmchlsmth Jan 26, 2025
fe8f6a9
[Frontend] Rerank API (Jina- and Cohere-compatible API) (#12376)
K-Mistele Jan 27, 2025
5087684
[DOC] Add link to vLLM blog (#12460)
terrytangyuan Jan 27, 2025
7b52511
[V1] Avoid list creation in input preparation (#12457)
WoosukKwon Jan 27, 2025
729cf0d
[Frontend] Support scores endpoint in run_batch (#12430)
pooyadavoodi Jan 27, 2025
4176918
[Bugfix] Fix Granite 3.0 MoE model loading (#12446)
DarkLight1337 Jan 27, 2025
7a6cded
[Bugfix] Fix missing seq_start_loc in xformers prefill metadata (#12464)
Isotr0py Jan 27, 2025
bd69c90
[V1][Minor] Minor optimizations for update_from_output (#12454)
WoosukKwon Jan 27, 2025
899cea0
[Bugfix] Fix gpt2 GGUF inference (#12467)
Isotr0py Jan 27, 2025
2fa4f8e
[Build] Only build 9.0a for scaled_mm and sparse kernels (#12339)
LucasWilkinson Jan 27, 2025
1253304
[V1][Metrics] Add initial Prometheus logger (#12416)
markmc Jan 27, 2025
0f2a9ce
[V1][CI/Test] Do basic test for top-p & top-k sampling (#12469)
WoosukKwon Jan 27, 2025
45844a3
[FlashInfer] Upgrade to 0.2.0 (#11194)
abmfy Jan 27, 2025
411e0d2
[Feature] [Spec decode]: Enable MLPSpeculator/Medusa and `prompt_logp…
NickLucche Jan 27, 2025
0ae8f3e
Update `pre-commit` hooks (#12475)
hmellor Jan 28, 2025
008891b
[Neuron][Kernel] NKI-based flash-attention kernel with paged KV cache…
liangfu Jan 28, 2025
79151e0
Fix bad path in prometheus example (#12481)
mgoin Jan 28, 2025
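The first commits in this series add per-token-activation, per-channel-weight ("PTPC") on-the-fly FP8 quantization and bump PyTorch so that `torch._scaled_mm` accepts rowwise scales. As a rough illustration of the idea only — not the PR's actual implementation; the function name, shapes, scaling recipe, and use of the private `torch._scaled_mm` API are all assumptions — a PTPC FP8 linear might look like:

```python
import torch

def ptpc_fp8_linear(x: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    """x: (num_tokens, hidden) activations; weight: (out_features, hidden)."""
    fp8 = torch.float8_e4m3fn          # e4m3fnuz would be the ROCm variant
    fp8_max = torch.finfo(fp8).max

    # Per-token scale: one scale per activation row, computed on the fly.
    x_scale = x.abs().amax(dim=-1, keepdim=True).float().clamp(min=1e-12) / fp8_max
    x_fp8 = (x / x_scale).clamp(-fp8_max, fp8_max).to(fp8)

    # Per-channel scale: one scale per output channel (weight row).
    w_scale = weight.abs().amax(dim=-1, keepdim=True).float().clamp(min=1e-12) / fp8_max
    w_fp8 = (weight / w_scale).clamp(-fp8_max, fp8_max).to(fp8)

    # Rowwise-scaled FP8 GEMM; needs a PyTorch build where torch._scaled_mm
    # accepts per-row/per-column scales (hence the PyTorch version bump above).
    return torch._scaled_mm(
        x_fp8,                # (num_tokens, hidden), row-major
        w_fp8.t(),            # (hidden, out_features), column-major view
        scale_a=x_scale,      # (num_tokens, 1) dequantization scales
        scale_b=w_scale.t(),  # (1, out_features) dequantization scales
        out_dtype=torch.bfloat16,
    )
```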
7 changes: 5 additions & 2 deletions .buildkite/check-wheel-size.py
@@ -2,8 +2,11 @@
import sys
import zipfile

# Read the VLLM_MAX_SIZE_MB environment variable, defaulting to 250 MB
VLLM_MAX_SIZE_MB = int(os.environ.get('VLLM_MAX_SIZE_MB', 250))
# Read the VLLM_MAX_SIZE_MB environment variable, defaulting to 300 MiB
# Note that we have 400 MiB quota, please use it wisely.
# See https://github.com/pypi/support/issues/3792 .
# Please also sync the value with the one in Dockerfile.
VLLM_MAX_SIZE_MB = int(os.environ.get('VLLM_MAX_SIZE_MB', 300))


def print_top_10_largest_files(zip_file):
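The new comment explains why the default limit moved to 300 MiB (a 400 MiB PyPI quota) and that the value must stay in sync with the Dockerfile. As a minimal sketch of how such an environment-driven size check can work — not the actual script in this diff:

```python
import os
import sys

# Default mirrors the value above; CI can override it via VLLM_MAX_SIZE_MB.
MAX_SIZE_MB = int(os.environ.get("VLLM_MAX_SIZE_MB", 300))

def check_wheel(path: str) -> int:
    """Return 0 if the wheel file is within the size limit, 1 otherwise."""
    size_mb = os.path.getsize(path) / (1024 * 1024)
    print(f"{path}: {size_mb:.1f} MiB (limit {MAX_SIZE_MB} MiB)")
    return 0 if size_mb <= MAX_SIZE_MB else 1

if __name__ == "__main__":
    sys.exit(check_wheel(sys.argv[1]))
```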
7 changes: 5 additions & 2 deletions .buildkite/run-neuron-test.sh
@@ -25,8 +25,11 @@ if [ -f /tmp/neuron-docker-build-timestamp ]; then
last_build=$(cat /tmp/neuron-docker-build-timestamp)
current_time=$(date +%s)
if [ $((current_time - last_build)) -gt 86400 ]; then
# Remove dangling images (those that are not tagged and not used by any container)
docker image prune -f
docker system prune -f
# Remove unused volumes / force the system prune for old images as well.
docker volume prune -f && docker system prune -f
# Remove huggingface model artifacts and compiler cache
rm -rf "${HF_MOUNT:?}/*"
rm -rf "${NEURON_COMPILE_CACHE_MOUNT:?}/*"
echo "$current_time" > /tmp/neuron-docker-build-timestamp
@@ -51,4 +54,4 @@ docker run --rm -it --device=/dev/neuron0 --device=/dev/neuron1 --network host \
-e "NEURON_COMPILE_CACHE_URL=${NEURON_COMPILE_CACHE_MOUNT}" \
--name "${container_name}" \
${image_name} \
/bin/bash -c "python3 /workspace/vllm/examples/offline_inference/neuron.py"
/bin/bash -c "python3 /workspace/vllm/examples/offline_inference/neuron.py && python3 -m pytest /workspace/vllm/tests/neuron/ -v --capture=tee-sys"
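For readers skimming the shell, the cleanup block above only runs when the previous Docker build is more than a day old. A rough Python equivalent of that time gate — paths and commands taken from the script, everything else an illustrative assumption:

```python
import os
import subprocess
import time

STAMP = "/tmp/neuron-docker-build-timestamp"

def maybe_cleanup() -> None:
    now = int(time.time())
    last = int(open(STAMP).read()) if os.path.exists(STAMP) else 0
    if now - last > 86400:  # only clean up if the last build is >24h old
        subprocess.run(["docker", "volume", "prune", "-f"], check=True)
        subprocess.run(["docker", "system", "prune", "-f"], check=True)
        with open(STAMP, "w") as f:
            f.write(str(now))
```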
13 changes: 12 additions & 1 deletion .buildkite/test-pipeline.yaml
@@ -76,7 +76,9 @@ steps:
- tests/basic_correctness/test_basic_correctness
- tests/basic_correctness/test_cpu_offload
- tests/basic_correctness/test_preemption
- tests/basic_correctness/test_cumem.py
commands:
- pytest -v -s basic_correctness/test_cumem.py
- pytest -v -s basic_correctness/test_basic_correctness.py
- pytest -v -s basic_correctness/test_cpu_offload.py
- VLLM_TEST_ENABLE_ARTIFICIAL_PREEMPT=1 pytest -v -s basic_correctness/test_preemption.py
@@ -181,7 +183,16 @@ steps:
- vllm/
- tests/v1
commands:
- VLLM_USE_V1=1 pytest -v -s v1
# split the test to avoid interference
- VLLM_USE_V1=1 pytest -v -s v1/core
- VLLM_USE_V1=1 pytest -v -s v1/engine
- VLLM_USE_V1=1 pytest -v -s v1/sample
- VLLM_USE_V1=1 pytest -v -s v1/worker
- VLLM_USE_V1=1 pytest -v -s v1/test_stats.py
- VLLM_USE_V1=1 pytest -v -s v1/test_utils.py
# TODO: accuracy does not match, whether setting
# VLLM_USE_FLASHINFER_SAMPLER or not on H100.
- VLLM_USE_V1=1 pytest -v -s v1/e2e

- label: Examples Test # 25min
working_dir: "/vllm-workspace/examples"
2 changes: 1 addition & 1 deletion .github/workflows/pre-commit.yml
@@ -16,4 +16,4 @@ jobs:
- run: echo "::add-matcher::.github/workflows/matchers/actionlint.json"
- uses: pre-commit/action@2c7b3805fd2a0fd8c1884dcaebf91fc102a13ecd # v3.0.1
with:
extra_args: --hook-stage manual
extra_args: --all-files --hook-stage manual
10 changes: 5 additions & 5 deletions .pre-commit-config.yaml
@@ -3,18 +3,18 @@ default_stages:
- manual # Run in CI
repos:
- repo: https://github.com/google/yapf
rev: v0.32.0
rev: v0.43.0
hooks:
- id: yapf
args: [--in-place, --verbose]
additional_dependencies: [toml] # TODO: Remove when yapf is upgraded
- repo: https://github.com/astral-sh/ruff-pre-commit
rev: v0.6.5
rev: v0.9.3
hooks:
- id: ruff
args: [--output-format, github]
- repo: https://github.com/codespell-project/codespell
rev: v2.3.0
rev: v2.4.0
hooks:
- id: codespell
exclude: 'benchmarks/sonnet.txt|(build|tests/(lora/data|models/fixtures|prompts))/.*'
@@ -23,7 +23,7 @@ repos:
hooks:
- id: isort
- repo: https://github.com/pre-commit/mirrors-clang-format
rev: v18.1.5
rev: v19.1.7
hooks:
- id: clang-format
exclude: 'csrc/(moe/topk_softmax_kernels.cu|quantization/gguf/(ggml-common.h|dequantize.cuh|vecdotq.cuh|mmq.cuh|mmvq.cuh))'
@@ -35,7 +35,7 @@ repos:
- id: pymarkdown
files: docs/.*
- repo: https://github.com/rhysd/actionlint
rev: v1.7.6
rev: v1.7.7
hooks:
- id: actionlint
- repo: local
82 changes: 53 additions & 29 deletions CMakeLists.txt
100644 → 100755
@@ -24,9 +24,6 @@ include(${CMAKE_CURRENT_LIST_DIR}/cmake/utils.cmake)
# Suppress potential warnings about unused manually-specified variables
set(ignoreMe "${VLLM_PYTHON_PATH}")

# Prevent installation of dependencies (cutlass) by default.
install(CODE "set(CMAKE_INSTALL_LOCAL_ONLY TRUE)" ALL_COMPONENTS)

#
# Supported python versions. These versions will be searched in order, the
# first match will be selected. These should be kept in sync with setup.py.
@@ -181,6 +178,31 @@ message(STATUS "FetchContent base directory: ${FETCHCONTENT_BASE_DIR}")
# Define other extension targets
#

#
# cumem_allocator extension
#

set(VLLM_CUMEM_EXT_SRC
"csrc/cumem_allocator.cpp")

set_gencode_flags_for_srcs(
SRCS "${VLLM_CUMEM_EXT_SRC}"
CUDA_ARCHS "${CUDA_ARCHS}")

if(VLLM_GPU_LANG STREQUAL "CUDA")
message(STATUS "Enabling cumem allocator extension.")
# link against cuda driver library
list(APPEND CUMEM_LIBS cuda)
define_gpu_extension_target(
cumem_allocator
DESTINATION vllm
LANGUAGE CXX
SOURCES ${VLLM_CUMEM_EXT_SRC}
LIBRARIES ${CUMEM_LIBS}
USE_SABI 3.8
WITH_SOABI)
endif()

#
# _C extension
#
@@ -253,7 +275,7 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
# Only build Marlin kernels if we are building for at least some compatible archs.
# Keep building Marlin for 9.0 as there are some group sizes and shapes that
# are not supported by Machete yet.
cuda_archs_loose_intersection(MARLIN_ARCHS "8.0;8.6;8.7;8.9;9.0" ${CUDA_ARCHS})
cuda_archs_loose_intersection(MARLIN_ARCHS "8.0;8.6;8.7;8.9;9.0" "${CUDA_ARCHS}")
if (MARLIN_ARCHS)
set(MARLIN_SRCS
"csrc/quantization/fp8/fp8_marlin.cu"
Expand All @@ -274,8 +296,8 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
endif()

# The cutlass_scaled_mm kernels for Hopper (c3x, i.e. CUTLASS 3.x) require
# CUDA 12.0 or later (and only work on Hopper, 9.0/9.0a for now).
cuda_archs_loose_intersection(SCALED_MM_3X_ARCHS "9.0;9.0a" "${CUDA_ARCHS}")
# CUDA 12.0 or later (and only work on Hopper, 9.0a for now).
cuda_archs_loose_intersection(SCALED_MM_3X_ARCHS "9.0a" "${CUDA_ARCHS}")
if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER 12.0 AND SCALED_MM_3X_ARCHS)
set(SRCS "csrc/quantization/cutlass_w8a8/scaled_mm_c3x.cu")
set_gencode_flags_for_srcs(
@@ -329,7 +351,7 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
# 2:4 Sparse Kernels

# The 2:4 sparse kernels cutlass_scaled_sparse_mm and cutlass_compressor
# require CUDA 12.2 or later (and only work on Hopper, 9.0/9.0a for now).
# require CUDA 12.2 or later (and only work on Hopper, 9.0a for now).
if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER 12.2 AND SCALED_MM_3X_ARCHS)
set(SRCS "csrc/sparse/cutlass/sparse_compressor_c3x.cu"
"csrc/sparse/cutlass/sparse_scaled_mm_c3x.cu")
Expand Down Expand Up @@ -424,6 +446,9 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
endif()

message(STATUS "Enabling C extension.")
if(VLLM_GPU_LANG STREQUAL "CUDA")
list(APPEND VLLM_C_LIBS cuda)
endif()
define_gpu_extension_target(
_C
DESTINATION vllm
@@ -432,6 +457,7 @@ define_gpu_extension_target(
COMPILE_FLAGS ${VLLM_GPU_FLAGS}
ARCHITECTURES ${VLLM_GPU_ARCHES}
INCLUDE_DIRECTORIES ${CUTLASS_INCLUDE_DIR};${CUTLASS_TOOLS_UTIL_INCLUDE_DIR}
LIBRARIES ${VLLM_C_LIBS}
USE_SABI 3
WITH_SOABI)

@@ -510,7 +536,7 @@ if(VLLM_GPU_LANG STREQUAL "HIP")
endif()

# vllm-flash-attn currently only supported on CUDA
if (NOT VLLM_TARGET_DEVICE STREQUAL "cuda")
if (NOT VLLM_GPU_LANG STREQUAL "CUDA")
return()
endif ()

@@ -533,7 +559,7 @@ endif()
# They should be identical but if they aren't, this is a massive footgun.
#
# The vllm-flash-attn install rules are nested under vllm to make sure the library gets installed in the correct place.
# To only install vllm-flash-attn, use --component vllm_flash_attn_c.
# To only install vllm-flash-attn, use --component _vllm_fa2_C (for FA2) or --component _vllm_fa3_C (for FA3).
# If no component is specified, vllm-flash-attn is still installed.

# If VLLM_FLASH_ATTN_SRC_DIR is set, vllm-flash-attn is installed from that directory instead of downloading.
@@ -545,43 +571,41 @@ if (DEFINED ENV{VLLM_FLASH_ATTN_SRC_DIR})
endif()

if(VLLM_FLASH_ATTN_SRC_DIR)
FetchContent_Declare(vllm-flash-attn SOURCE_DIR ${VLLM_FLASH_ATTN_SRC_DIR})
FetchContent_Declare(
vllm-flash-attn SOURCE_DIR
${VLLM_FLASH_ATTN_SRC_DIR}
BINARY_DIR ${CMAKE_BINARY_DIR}/vllm-flash-attn
)
else()
FetchContent_Declare(
vllm-flash-attn
GIT_REPOSITORY https://github.com/vllm-project/flash-attention.git
GIT_TAG 96266b1111111f3d11aabefaf3bacbab6a89d03c
GIT_TAG d4e09037abf588af1ec47d0e966b237ee376876c
GIT_PROGRESS TRUE
# Don't share the vllm-flash-attn build between build types
BINARY_DIR ${CMAKE_BINARY_DIR}/vllm-flash-attn
)
endif()

# Set the parent build flag so that the vllm-flash-attn library does not redo compile flag and arch initialization.
set(VLLM_PARENT_BUILD ON)

# Ensure the vllm/vllm_flash_attn directory exists before installation
install(CODE "file(MAKE_DIRECTORY \"\${CMAKE_INSTALL_PREFIX}/vllm/vllm_flash_attn\")" COMPONENT vllm_flash_attn_c)

# Make sure vllm-flash-attn install rules are nested under vllm/
install(CODE "set(CMAKE_INSTALL_LOCAL_ONLY FALSE)" COMPONENT vllm_flash_attn_c)
install(CODE "set(OLD_CMAKE_INSTALL_PREFIX \"\${CMAKE_INSTALL_PREFIX}\")" COMPONENT vllm_flash_attn_c)
install(CODE "set(CMAKE_INSTALL_PREFIX \"\${CMAKE_INSTALL_PREFIX}/vllm/\")" COMPONENT vllm_flash_attn_c)

# Fetch the vllm-flash-attn library
FetchContent_MakeAvailable(vllm-flash-attn)
message(STATUS "vllm-flash-attn is available at ${vllm-flash-attn_SOURCE_DIR}")

# Restore the install prefix
install(CODE "set(CMAKE_INSTALL_PREFIX \"\${OLD_CMAKE_INSTALL_PREFIX}\")" COMPONENT vllm_flash_attn_c)
install(CODE "set(CMAKE_INSTALL_LOCAL_ONLY TRUE)" COMPONENT vllm_flash_attn_c)
# Copy over the vllm-flash-attn python files (duplicated for fa2 and fa3, in
# case only one is built, in the case both are built redundant work is done)
install(
DIRECTORY ${vllm-flash-attn_SOURCE_DIR}/vllm_flash_attn/
DESTINATION vllm_flash_attn
COMPONENT _vllm_fa2_C
FILES_MATCHING PATTERN "*.py"
)

# Copy over the vllm-flash-attn python files
install(
DIRECTORY ${vllm-flash-attn_SOURCE_DIR}/vllm_flash_attn/
DESTINATION vllm/vllm_flash_attn
COMPONENT vllm_flash_attn_c
FILES_MATCHING PATTERN "*.py"
DIRECTORY ${vllm-flash-attn_SOURCE_DIR}/vllm_flash_attn/
DESTINATION vllm_flash_attn
COMPONENT _vllm_fa3_C
FILES_MATCHING PATTERN "*.py"
)

# Nothing after vllm-flash-attn, see comment about macros above
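A quick way to confirm the new cumem_allocator extension ended up in the build is to try importing it. The module path below (`vllm.cumem_allocator`) is inferred from the DESTINATION and target names in the CMake block above, so treat it as an assumption:

```python
import importlib

def has_cumem_allocator() -> bool:
    """Return True if the CUDA-driver-backed allocator extension is importable."""
    try:
        importlib.import_module("vllm.cumem_allocator")
        return True
    except ImportError:
        return False

if __name__ == "__main__":
    print("cumem_allocator extension available:", has_cumem_allocator())
```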
29 changes: 24 additions & 5 deletions Dockerfile
@@ -52,7 +52,7 @@ WORKDIR /workspace
# after this step
RUN --mount=type=cache,target=/root/.cache/pip \
if [ "$TARGETPLATFORM" = "linux/arm64" ]; then \
python3 -m pip install --index-url https://download.pytorch.org/whl/nightly/cu124 "torch==2.6.0.dev20241210+cu124" "torchvision==0.22.0.dev20241215"; \
python3 -m pip install --index-url https://download.pytorch.org/whl/nightly/cu126 "torch==2.7.0.dev20250121+cu126" "torchvision==0.22.0.dev20250121"; \
fi

COPY requirements-common.txt requirements-common.txt
@@ -126,8 +126,8 @@ RUN --mount=type=cache,target=/root/.cache/ccache \

# Check the size of the wheel if RUN_WHEEL_CHECK is true
COPY .buildkite/check-wheel-size.py check-wheel-size.py
# Default max size of the wheel is 250MB
ARG VLLM_MAX_SIZE_MB=250
# sync the default value with .buildkite/check-wheel-size.py
ARG VLLM_MAX_SIZE_MB=300
ENV VLLM_MAX_SIZE_MB=$VLLM_MAX_SIZE_MB
ARG RUN_WHEEL_CHECK=true
RUN if [ "$RUN_WHEEL_CHECK" = "true" ]; then \
@@ -149,7 +149,8 @@ RUN --mount=type=cache,target=/root/.cache/pip \

#################### vLLM installation IMAGE ####################
# image with vLLM installed
FROM nvidia/cuda:${CUDA_VERSION}-base-ubuntu22.04 AS vllm-base
# TODO: Restore to base image after FlashInfer AOT wheel fixed
FROM nvidia/cuda:${CUDA_VERSION}-devel-ubuntu22.04 AS vllm-base
ARG CUDA_VERSION=12.4.1
ARG PYTHON_VERSION=3.12
WORKDIR /vllm-workspace
@@ -194,12 +195,30 @@ RUN --mount=type=bind,from=build,src=/workspace/dist,target=/vllm-workspace/dist
--mount=type=cache,target=/root/.cache/pip \
python3 -m pip install dist/*.whl --verbose

# How to build this FlashInfer wheel:
# $ export FLASHINFER_ENABLE_AOT=1
# $ # Note we remove 7.0 from the arch list compared to the list below, since FlashInfer only supports sm75+
# $ export TORCH_CUDA_ARCH_LIST='7.5 8.0 8.6 8.9 9.0+PTX'
# $ git clone https://github.com/flashinfer-ai/flashinfer.git --recursive
# $ cd flashinfer
# $ git checkout 524304395bd1d8cd7d07db083859523fcaa246a4
# $ python3 setup.py bdist_wheel --dist-dir=dist --verbose

RUN --mount=type=cache,target=/root/.cache/pip \
. /etc/environment && \
if [ "$TARGETPLATFORM" != "linux/arm64" ]; then \
python3 -m pip install https://github.com/flashinfer-ai/flashinfer/releases/download/v0.1.6/flashinfer-0.1.6+cu121torch2.4-cp${PYTHON_VERSION_STR}-cp${PYTHON_VERSION_STR}-linux_x86_64.whl; \
python3 -m pip install https://wheels.vllm.ai/flashinfer/524304395bd1d8cd7d07db083859523fcaa246a4/flashinfer_python-0.2.0.post1-cp${PYTHON_VERSION_STR}-cp${PYTHON_VERSION_STR}-linux_x86_64.whl; \
fi
COPY examples examples

# Although we build Flashinfer with AOT mode, there's still
# some issues w.r.t. JIT compilation. Therefore we need to
# install build dependencies for JIT compilation.
# TODO: Remove this once FlashInfer AOT wheel is fixed
COPY requirements-build.txt requirements-build.txt
RUN --mount=type=cache,target=/root/.cache/pip \
python3 -m pip install -r requirements-build.txt

#################### vLLM installation IMAGE ####################

#################### TEST IMAGE ####################
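Since FlashInfer is now installed from a pinned, AOT-built wheel, a simple sanity check inside the image is to query the installed distribution version. The distribution name below is inferred from the wheel filename (flashinfer_python-0.2.0.post1-…), so it is an assumption rather than something this Dockerfile guarantees:

```python
from importlib import metadata

try:
    print("flashinfer-python:", metadata.version("flashinfer-python"))
except metadata.PackageNotFoundError:
    print("flashinfer-python is not installed in this environment")
```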
3 changes: 2 additions & 1 deletion Dockerfile.rocm
@@ -72,7 +72,8 @@ COPY --from=build_vllm ${COMMON_WORKDIR}/vllm /vllm-workspace
RUN cd /vllm-workspace \
&& rm -rf vllm \
&& python3 -m pip install -e tests/vllm_test_utils \
&& python3 -m pip install lm-eval[api]==0.4.4
&& python3 -m pip install lm-eval[api]==0.4.4 \
&& python3 -m pip install pytest-shard

# -----------------------
# Final vLLM image
2 changes: 1 addition & 1 deletion Dockerfile.rocm_base
@@ -6,7 +6,7 @@ ARG RCCL_BRANCH="648a58d"
ARG RCCL_REPO="https://github.com/ROCm/rccl"
ARG TRITON_BRANCH="e5be006"
ARG TRITON_REPO="https://github.com/triton-lang/triton.git"
ARG PYTORCH_BRANCH="8d4926e"
ARG PYTORCH_BRANCH="3a585126"
ARG PYTORCH_VISION_BRANCH="v0.19.1"
ARG PYTORCH_REPO="https://github.com/pytorch/pytorch.git"
ARG PYTORCH_VISION_REPO="https://github.com/pytorch/vision.git"
2 changes: 1 addition & 1 deletion Dockerfile.tpu
@@ -1,4 +1,4 @@
ARG NIGHTLY_DATE="20241017"
ARG NIGHTLY_DATE="20250124"
ARG BASE_IMAGE="us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:nightly_3.10_tpuvm_$NIGHTLY_DATE"

FROM $BASE_IMAGE
5 changes: 1 addition & 4 deletions README.md
@@ -15,11 +15,8 @@ Easy, fast, and cheap LLM serving for everyone

---

The first vLLM meetup in 2025 is happening on January 22nd, Wednesday, with Google Cloud in San Francisco! We will talk about vLLM's performant V1 architecture, Q1 roadmap, Google Cloud's innovation around vLLM: networking, Cloud Run, Vertex, and TPU! [Register Now](https://lu.ma/zep56hui)

---

*Latest News* 🔥
- [2025/01] We hosted [the eighth vLLM meetup](https://lu.ma/zep56hui) with Google Cloud! Please find the meetup slides from vLLM team [here](https://docs.google.com/presentation/d/1epVkt4Zu8Jz_S5OhEHPc798emsYh2BwYfRuDDVEF7u4/edit?usp=sharing).
- [2024/12] vLLM joins [pytorch ecosystem](https://pytorch.org/blog/vllm-joins-pytorch)! Easy, Fast, and Cheap LLM Serving for Everyone!
- [2024/11] We hosted [the seventh vLLM meetup](https://lu.ma/h0qvrajz) with Snowflake! Please find the meetup slides from vLLM team [here](https://docs.google.com/presentation/d/1e3CxQBV3JsfGp30SwyvS3eM_tW-ghOhJ9PAJGK6KR54/edit?usp=sharing), and Snowflake team [here](https://docs.google.com/presentation/d/1qF3RkDAbOULwz9WK5TOltt2fE9t6uIc_hVNLFAaQX6A/edit?usp=sharing).
- [2024/10] We have just created a developer slack ([slack.vllm.ai](https://slack.vllm.ai)) focusing on coordinating contributions and discussing features. Please feel free to join us there!