
Commit a0646da

Address upstream PR code review comments (vllm-project#133)
* formatting fixes
* Upstream CR update
1 parent 03e3ce3 commit a0646da


4 files changed: +45 -29 lines changed


docs/source/getting_started/gaudi-installation.rst

Lines changed: 39 additions & 23 deletions
@@ -1,8 +1,7 @@
-vLLM with Intel® Gaudi® 2 AI Accelerators
+vLLM with Intel® Gaudi® AI Accelerators
 =========================================

-This README provides instructions on running vLLM with Intel Gaudi
-devices.
+This README provides instructions on running vLLM with Intel Gaudi devices.

 Requirements and Installation
 =============================
@@ -13,17 +12,13 @@ to set up the environment. To achieve the best performance, please
 follow the methods outlined in the `Optimizing Training Platform
 Guide <https://docs.habana.ai/en/latest/PyTorch/Model_Optimization_PyTorch/Optimization_in_Training_Platform.html>`__.

-.. note::
-   In this release (1.16.0), we are only targeting functionality
-   and accuracy. Performance will be improved in next releases.
-
 Requirements
 ------------

 - OS: Ubuntu 22.04 LTS
 - Python: 3.10
-- Intel Gaudi 2 accelerator
-- Intel Gaudi software version 1.16.0
+- Intel Gaudi accelerator
+- Intel Gaudi software version 1.16.0 or newer

 To verify that the Intel Gaudi software was correctly installed, run:

@@ -49,20 +44,30 @@ Use the following commands to run a Docker image:

 .. code:: console

-   $ docker pull vault.habana.ai/gaudi-docker/1.16.0/ubuntu22.04/habanalabs/pytorch-installer-2.2.0:latest
-   $ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.16.0/ubuntu22.04/habanalabs/pytorch-installer-2.2.0:latest
+   $ docker pull vault.habana.ai/gaudi-docker/1.16.2/ubuntu22.04/habanalabs/pytorch-installer-2.2.2:latest
+   $ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.16.2/ubuntu22.04/habanalabs/pytorch-installer-2.2.2:latest

-Build and Install vLLM-fork
+Build and Install vLLM
 ---------------------------

-To build and install vLLM-fork from source, run:
+To build and install vLLM from source, run:
+
+.. code:: console
+
+   $ git clone https://github.com/vllm-project/vllm.git
+   $ cd vllm
+   $ python setup.py develop
+
+
+Currently, the latest features and performance optimizations are developed in Gaudi's `vLLM-fork <https://github.com/HabanaAI/vllm-fork>`__ and we periodically upstream them to vLLM main repo. To install latest `HabanaAI/vLLM-fork <https://github.com/HabanaAI/vllm-fork>`__, run the following:

 .. code:: console

    $ git clone https://github.com/HabanaAI/vllm-fork.git
    $ cd vllm-fork
-   # git checkout v0.4.2-Gaudi-1.16.0
-   $ pip install -e . # This may take 5-10 minutes.
+   $ git checkout habana_main
+   $ python setup.py develop
+

 Supported Features
 ==================
@@ -72,13 +77,12 @@ Supported Features
 - Online inference via `OpenAI-Compatible
   Server <https://docs.vllm.ai/en/latest/getting_started/quickstart.html#openai-compatible-server>`__
 - HPU autodetection - no need to manually select device within vLLM
-- Paged KV cache with algorithms enabled for Intel Gaudi 2 accelerators
+- Paged KV cache with algorithms enabled for Intel Gaudi accelerators
 - Custom Intel Gaudi implementations of Paged Attention, KV cache ops,
   prefill attention, Root Mean Square Layer Normalization, Rotary
   Positional Encoding
 - Tensor parallelism support for multi-card inference
-- Inference with `HPU
-  Graphs <https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Inference_Using_HPU_Graphs.html>`__
+- Inference with `HPU Graphs <https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Inference_Using_HPU_Graphs.html>`__
   for accelerating low-batch latency and throughput

 Unsupported Features
@@ -94,20 +98,32 @@ Supported Configurations
 ========================

 The following configurations have been validated to be function with
-Gaudi devices. Configurations that are not listed may or may not work.
+Gaudi2 devices. Configurations that are not listed may or may not work.

 - `meta-llama/Llama-2-7b <https://huggingface.co/meta-llama/Llama-2-7b>`__
   on single HPU, or with tensor parallelism on 2x and 8x HPU, BF16
   datatype with random or greedy sampling
 - `meta-llama/Llama-2-7b-chat-hf <https://huggingface.co/meta-llama/Llama-2-7b-chat-hf>`__
   on single HPU, or with tensor parallelism on 2x and 8x HPU, BF16
   datatype with random or greedy sampling
+- `meta-llama/Meta-Llama-3-8B <https://huggingface.co/meta-llama/Meta-Llama-3-8B>`__
+  on single HPU, or with tensor parallelism on 2x and 8x HPU, BF16
+  datatype with random or greedy sampling
+- `meta-llama/Meta-Llama-3-8B-Instruct <https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct>`__
+  on single HPU, or with tensor parallelism on 2x and 8x HPU, BF16
+  datatype with random or greedy sampling
 - `meta-llama/Llama-2-70b <https://huggingface.co/meta-llama/Llama-2-70b>`__
-  with tensor parallelism on 8x HPU, BF16 datatype with random or
-  greedy sampling
+  with tensor parallelism on 8x HPU, BF16 datatype with random or greedy sampling
 - `meta-llama/Llama-2-70b-chat-hf <https://huggingface.co/meta-llama/Llama-2-70b-chat-hf>`__
-  with tensor parallelism 8x HPU, BF16 datatype with random or greedy
-  sampling
+  with tensor parallelism on 8x HPU, BF16 datatype with random or greedy sampling
+- `meta-llama/Meta-Llama-3-70B <https://huggingface.co/meta-llama/Meta-Llama-3-70B>`__
+  with tensor parallelism on 8x HPU, BF16 datatype with random or greedy sampling
+- `meta-llama/Meta-Llama-3-70B-Instruct <https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct>`__
+  with tensor parallelism on 8x HPU, BF16 datatype with random or greedy sampling
+- `mistralai/Mistral-7B-Instruct-v0.3 <https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3>`__
+  on single HPU or with tensor parallelism on 2x HPU, BF16 datatype with random or greedy sampling
+- `mistralai/Mixtral-8x7B-Instruct-v0.1 <https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1>`__
+  with tensor parallelism on 2x HPU, BF16 datatype with random or greedy sampling

 Performance Tips
 ================
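
Taken together, the documentation changes above describe an end-to-end flow: install vLLM (from source or from HabanaAI/vllm-fork), then run one of the validated models in BF16, optionally with tensor parallelism. As a rough illustration of that flow, here is a minimal offline-inference sketch in Python. It is not part of this commit: it assumes the installation succeeded, relies on the HPU autodetection noted under Supported Features (so no device is selected explicitly), and the model, dtype, and tensor_parallel_size values are only examples taken from the validated-configuration list.

from vllm import LLM, SamplingParams

# Illustrative only: offline inference after installing vLLM on Gaudi.
# HPU autodetection means no explicit device selection is needed here.
llm = LLM(
    model="meta-llama/Llama-2-7b-chat-hf",  # one of the validated models
    dtype="bfloat16",                       # BF16, as in the validated configs
    tensor_parallel_size=1,                 # 2 or 8 for the multi-card configs
)

sampling = SamplingParams(temperature=0.8, max_tokens=64)
for output in llm.generate(["What is an HPU?"], sampling):
    print(output.outputs[0].text)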

vllm/model_executor/layers/logits_processor.py

Lines changed: 2 additions & 0 deletions
@@ -93,6 +93,8 @@ def _prune_hidden_states(
     hidden_states: torch.Tensor,
     sampling_metadata: SamplingMetadata,
 ) -> torch.Tensor:
+    # NOTE(kzawora): This is needed for Gaudi - in some scenarios (warmup,
+    # profile_run) we might not have selected_token_indices, so we skip pruning.
     if sampling_metadata.selected_token_indices is not None:
         return hidden_states.index_select(
             0, sampling_metadata.selected_token_indices)
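
For context, the guard added above can be illustrated with a small standalone sketch (simplified names, not the actual vLLM function): pruning keeps only the hidden states that will actually be sampled, and is skipped when no indices are available, which is what happens during warmup and profile_run on Gaudi.

from typing import Optional

import torch


def prune_hidden_states(hidden_states: torch.Tensor,
                        selected_token_indices: Optional[torch.Tensor]
                        ) -> torch.Tensor:
    # Warmup/profiling runs may not provide indices; pass the tensor through.
    if selected_token_indices is None:
        return hidden_states
    # Keep only the rows (token positions) that will be sampled.
    return hidden_states.index_select(0, selected_token_indices)


hidden = torch.randn(6, 8)                              # (num_tokens, hidden_size)
last_tokens = torch.tensor([2, 5])                      # last position of two sequences
print(prune_hidden_states(hidden, last_tokens).shape)   # torch.Size([2, 8])
print(prune_hidden_states(hidden, None).shape)          # torch.Size([6, 8])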

vllm/model_executor/layers/vocab_parallel_embedding.py

Lines changed: 3 additions & 2 deletions
@@ -329,9 +329,10 @@ def weight_loader(self, param: Parameter, loaded_weight: torch.Tensor):
         # Copy the data.
         loaded_weight = loaded_weight.narrow(output_dim, start_idx, shard_size)

-        # FIXME(kzawora): Weight copy with slicing bugs out on Gaudi here, so
-        # we're using a workaround. Remove this when fixed in HPU PT bridge.
         if is_hpu():
+            # FIXME(kzawora): Weight copy with slicing bugs out on Gaudi here,
+            # so we're using a workaround. Remove this when fixed in
+            # HPU PT bridge.
             padded_weight = torch.cat([
                 loaded_weight,
                 torch.zeros(param.shape[0] - loaded_weight.shape[0],
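
The hunk above is truncated mid-expression, so here is a hypothetical standalone sketch of the workaround it keeps under is_hpu(): rather than assigning the loaded shard into a slice of the parameter (the sliced copy the FIXME says misbehaves in the HPU PyTorch bridge), the shard is zero-padded up to the parameter's full first dimension and copied in one piece. The helper name and shapes below are illustrative, not vLLM API.

import torch


def copy_shard_with_padding(param_data: torch.Tensor,
                            loaded_weight: torch.Tensor) -> None:
    # Pad the shard with zero rows so it matches the parameter's full shape,
    # then copy the whole tensor at once instead of writing into a slice.
    pad_rows = param_data.shape[0] - loaded_weight.shape[0]
    padded_weight = torch.cat([
        loaded_weight,
        torch.zeros(pad_rows,
                    *loaded_weight.shape[1:],
                    dtype=loaded_weight.dtype,
                    device=loaded_weight.device),
    ])
    param_data.copy_(padded_weight)


# Example: a 10x4 parameter receives an 8x4 shard; the last two rows stay zero.
param = torch.empty(10, 4)
shard = torch.ones(8, 4)
copy_shard_with_padding(param, shard)
print(param[8:].abs().sum())  # tensor(0.)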

vllm/worker/cache_engine.py

Lines changed: 1 addition & 4 deletions
@@ -6,12 +6,9 @@
 from vllm.attention import get_attn_backend
 from vllm.config import CacheConfig, DeviceConfig, ModelConfig, ParallelConfig
 from vllm.logger import init_logger
-from vllm.utils import (STR_DTYPE_TO_TORCH_DTYPE, get_dtype_size, is_hpu,
+from vllm.utils import (STR_DTYPE_TO_TORCH_DTYPE, get_dtype_size,
                         is_pin_memory_available)

-if is_hpu():
-    pass
-
 logger = init_logger(__name__)
