-vLLM with Intel® Gaudi® 2 AI Accelerators
+vLLM with Intel® Gaudi® AI Accelerators
 =========================================
 
-This README provides instructions on running vLLM with Intel Gaudi
-devices.
+This README provides instructions on running vLLM with Intel Gaudi devices.
 
 Requirements and Installation
 =============================
@@ -13,17 +12,13 @@ to set up the environment. To achieve the best performance, please
 follow the methods outlined in the `Optimizing Training Platform
 Guide <https://docs.habana.ai/en/latest/PyTorch/Model_Optimization_PyTorch/Optimization_in_Training_Platform.html>`__.
 
-.. note::
-   In this release (1.16.0), we are only targeting functionality
-   and accuracy. Performance will be improved in next releases.
-
 Requirements
 ------------
 
 - OS: Ubuntu 22.04 LTS
 - Python: 3.10
-- Intel Gaudi 2 accelerator
-- Intel Gaudi software version 1.16.0
+- Intel Gaudi accelerator
+- Intel Gaudi software version 1.16.0 or newer
 
 To verify that the Intel Gaudi software was correctly installed, run:
 
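The verification commands themselves are unchanged by this commit and elided from the diff. For reference, a minimal sketch of such a check on a Gaudi host, assuming the standard habanalabs tooling (``hl-smi``) is installed and on the PATH:

.. code:: console

   $ hl-smi                   # each Gaudi accelerator should be listed
   $ pip list | grep habana   # habana-torch packages should appear
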
@@ -49,20 +44,30 @@ Use the following commands to run a Docker image:
 
 .. code:: console
 
-   $ docker pull vault.habana.ai/gaudi-docker/1.16.0/ubuntu22.04/habanalabs/pytorch-installer-2.2.0:latest
-   $ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.16.0/ubuntu22.04/habanalabs/pytorch-installer-2.2.0:latest
+   $ docker pull vault.habana.ai/gaudi-docker/1.16.2/ubuntu22.04/habanalabs/pytorch-installer-2.2.2:latest
+   $ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.16.2/ubuntu22.04/habanalabs/pytorch-installer-2.2.2:latest
 
-Build and Install vLLM-fork
+Build and Install vLLM
 ---------------------------
 
-To build and install vLLM-fork from source, run:
+To build and install vLLM from source, run:
+
+.. code:: console
+
+   $ git clone https://github.com/vllm-project/vllm.git
+   $ cd vllm
+   $ python setup.py develop
+
+
+Currently, the latest features and performance optimizations are developed in Gaudi's `vLLM-fork <https://github.com/HabanaAI/vllm-fork>`__ and are periodically upstreamed to the main vLLM repository. To install the latest `HabanaAI/vLLM-fork <https://github.com/HabanaAI/vllm-fork>`__, run the following:
 
 .. code:: console
 
    $ git clone https://github.com/HabanaAI/vllm-fork.git
    $ cd vllm-fork
-   # git checkout v0.4.2-Gaudi-1.16.0
-   $ pip install -e . # This may take 5-10 minutes.
+   $ git checkout habana_main
+   $ python setup.py develop
+
 
 Supported Features
 ==================
@@ -72,13 +77,12 @@ Supported Features
 - Online inference via `OpenAI-Compatible
   Server <https://docs.vllm.ai/en/latest/getting_started/quickstart.html#openai-compatible-server>`__
 - HPU autodetection - no need to manually select device within vLLM
-- Paged KV cache with algorithms enabled for Intel Gaudi 2 accelerators
+- Paged KV cache with algorithms enabled for Intel Gaudi accelerators
 - Custom Intel Gaudi implementations of Paged Attention, KV cache ops,
   prefill attention, Root Mean Square Layer Normalization, Rotary
   Positional Encoding
 - Tensor parallelism support for multi-card inference
-- Inference with `HPU
-  Graphs <https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Inference_Using_HPU_Graphs.html>`__
+- Inference with `HPU Graphs <https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Inference_Using_HPU_Graphs.html>`__
   for accelerating low-batch latency and throughput
 
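As a quick illustration of the online-serving and HPU-autodetection features listed above, a hedged sketch of starting the OpenAI-compatible server on a single Gaudi device follows; the model name is only an example, and no device flag is passed because the HPU backend is detected automatically:

.. code:: console

   $ python -m vllm.entrypoints.openai.api_server \
       --model meta-llama/Llama-2-7b-chat-hf \
       --dtype bfloat16
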
 Unsupported Features
@@ -94,20 +98,32 @@ Supported Configurations
 ========================
 
 The following configurations have been validated to function with
-Gaudi devices. Configurations that are not listed may or may not work.
+Gaudi2 devices. Configurations that are not listed may or may not work.
 
 - `meta-llama/Llama-2-7b <https://huggingface.co/meta-llama/Llama-2-7b>`__
   on single HPU, or with tensor parallelism on 2x and 8x HPU, BF16
   datatype with random or greedy sampling
 - `meta-llama/Llama-2-7b-chat-hf <https://huggingface.co/meta-llama/Llama-2-7b-chat-hf>`__
   on single HPU, or with tensor parallelism on 2x and 8x HPU, BF16
   datatype with random or greedy sampling
+- `meta-llama/Meta-Llama-3-8B <https://huggingface.co/meta-llama/Meta-Llama-3-8B>`__
+  on single HPU, or with tensor parallelism on 2x and 8x HPU, BF16
+  datatype with random or greedy sampling
+- `meta-llama/Meta-Llama-3-8B-Instruct <https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct>`__
+  on single HPU, or with tensor parallelism on 2x and 8x HPU, BF16
+  datatype with random or greedy sampling
 - `meta-llama/Llama-2-70b <https://huggingface.co/meta-llama/Llama-2-70b>`__
-  with tensor parallelism on 8x HPU, BF16 datatype with random or
-  greedy sampling
+  with tensor parallelism on 8x HPU, BF16 datatype with random or greedy sampling
 - `meta-llama/Llama-2-70b-chat-hf <https://huggingface.co/meta-llama/Llama-2-70b-chat-hf>`__
-  with tensor parallelism 8x HPU, BF16 datatype with random or greedy
-  sampling
+  with tensor parallelism on 8x HPU, BF16 datatype with random or greedy sampling
+- `meta-llama/Meta-Llama-3-70B <https://huggingface.co/meta-llama/Meta-Llama-3-70B>`__
+  with tensor parallelism on 8x HPU, BF16 datatype with random or greedy sampling
+- `meta-llama/Meta-Llama-3-70B-Instruct <https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct>`__
+  with tensor parallelism on 8x HPU, BF16 datatype with random or greedy sampling
+- `mistralai/Mistral-7B-Instruct-v0.3 <https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3>`__
+  on single HPU or with tensor parallelism on 2x HPU, BF16 datatype with random or greedy sampling
+- `mistralai/Mixtral-8x7B-Instruct-v0.1 <https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1>`__
+  with tensor parallelism on 2x HPU, BF16 datatype with random or greedy sampling
 
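As a hedged sketch of how one of the multi-card configurations above could be brought up (the model name, parallel degree, and dtype mirror one of the listed entries; flags may differ slightly between upstream vLLM and the HabanaAI fork):

.. code:: console

   $ python -m vllm.entrypoints.openai.api_server \
       --model meta-llama/Meta-Llama-3-70B-Instruct \
       --tensor-parallel-size 8 \
       --dtype bfloat16
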
 Performance Tips
 ================