[CPU] Add tutorial docs for SGL on CPU #8000
Merged
26 commits (all by ZailiWang):

- `057172c` Add BKC doc for running on CPU
- `ade6b39` table format fix
- `2b10c42` Merge branch 'sgl-project:main' into main
- `9aa64df` updates and add DS-R1 example
- `231161e` Merge branch 'sgl-project:main' into main
- `9056a90` Append supported models
- `f1b0454` Merge branch 'sgl-project:main' into main
- `2e17d27` Update docs/references/cpu.md
- `6d71293` Update docs/references/cpu.md
- `c542465` Update docs/references/cpu.md
- `b27717e` Update docs/start/install.md
- `77463e2` Update docs/references/cpu.md
- `2ada4d6` correct SGLANG_CPU_OMP_THREADS_BIND usage
- `dd99fd2` update expression
- `35c736a` update support table
- `7d8616b` Add command list for bare metal env setup
- `23a466a` Merge branch 'main' into main
- `0077248` update
- `af5ffc6` Merge branch 'main' of https://github.com/ZailiWang/sglang
- `8ffcd03` Merge branch 'main' into main
- `289a442` Improve expressions per comments
- `2da60e6` Merge branch 'main' of https://github.com/ZailiWang/sglang
- `95967e1` Merge branch 'main' into main
- `3fc3cc6` improve expressions
- `400f653` Merge branch 'main' into main
- `fe68e8a` Merge branch 'sgl-project:main' into main
`docs/references/cpu.md` (new file):

# SGLang on CPU

This document describes how to set up the [SGLang](https://github.com/sgl-project/sglang) environment and run LLM inference on CPU servers. SGLang is specifically optimized for CPUs equipped with Intel® Advanced Matrix Extensions (Intel® AMX), i.e. 4th Gen and newer Intel® Xeon® Scalable Processors.

## Optimized Model List

A number of popular LLMs are optimized to run efficiently on CPU, including widely used open-source models such as the Llama and Qwen series, as well as the DeepSeek-R1 reasoning model.

| Model Name | BF16 | w8a8_int8 | FP8 |
|:---:|:---:|:---:|:---:|
| DeepSeek-R1 | | [meituan/DeepSeek-R1-Channel-INT8](https://huggingface.co/meituan/DeepSeek-R1-Channel-INT8) | [deepseek-ai/DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1) |
| Llama-3.2-3B | [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) | [RedHatAI/Llama-3.2-3B-Instruct-quantized.w8a8](https://huggingface.co/RedHatAI/Llama-3.2-3B-Instruct-quantized.w8a8) | |
| Llama-3.1-8B | [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) | [RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8](https://huggingface.co/RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8) | |
| QwQ-32B | | [RedHatAI/QwQ-32B-quantized.w8a8](https://huggingface.co/RedHatAI/QwQ-32B-quantized.w8a8) | |
| DeepSeek-R1-Distill-Llama-70B | | [RedHatAI/DeepSeek-R1-Distill-Llama-70B-quantized.w8a8](https://huggingface.co/RedHatAI/DeepSeek-R1-Distill-Llama-70B-quantized.w8a8) | |
| Qwen3-235B | | | [Qwen/Qwen3-235B-A22B-FP8](https://huggingface.co/Qwen/Qwen3-235B-A22B-FP8) |

**Note:** The model identifiers listed in the table above have been verified on 6th Gen Intel® Xeon® P-core platforms.
## Installation

### Install Using Docker

It is recommended to use Docker for setting up the SGLang environment. A [Dockerfile](https://github.com/sgl-project/sglang/blob/main/docker/Dockerfile.xeon) is provided to facilitate the installation. Replace `<secret>` below with your [HuggingFace access token](https://huggingface.co/docs/hub/en/security-tokens).

```bash
# Clone the SGLang repository
git clone https://github.com/sgl-project/sglang.git
cd sglang/docker

# Build the docker image
docker build -t sglang-cpu:main -f Dockerfile.xeon .

# Initiate a docker container
docker run \
    -it \
    --privileged \
    --ipc=host \
    --network=host \
    -v /dev/shm:/dev/shm \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 30000:30000 \
    -e "HF_TOKEN=<secret>" \
    sglang-cpu:main /bin/bash
```
### Install From Source

If you prefer to install SGLang in a bare-metal environment, follow the commands below. Note that the environment variable `SGLANG_USE_CPU_ENGINE=1` is required to run the SGLang service with the CPU engine.

```bash
# Create and activate a conda environment
conda create -n sgl-cpu python=3.12 -y
conda activate sgl-cpu

# Optional: set the PyTorch CPU wheel index as the primary pip channel
# to avoid installing the CUDA build
pip config set global.index-url https://download.pytorch.org/whl/cpu
pip config set global.extra-index-url https://pypi.org/simple

# Check whether the conda-related environment variables have been set
env | grep -i conda
# The following settings are required if they have not been set properly
export CONDA_EXE=$(which conda)
export CONDA_ROOT=${CONDA_EXE}/../..
export CONDA_PREFIX=${CONDA_ROOT}/envs/sgl-cpu
export PATH=${PATH}:${CONDA_ROOT}/bin:${CONDA_ROOT}/condabin

# Clone the SGLang code
git clone https://github.com/sgl-project/sglang.git
cd sglang
git checkout <YOUR-DESIRED-VERSION>

# Install SGLang dependencies and build the main package
pip install --upgrade pip setuptools
conda install -y libsqlite==3.48.0 gperftools tbb libnuma numactl
pip install intel-openmp
pip install -e "python[all_cpu]"

# Build the CPU backend kernels
cd sgl-kernel
cp pyproject_cpu.toml pyproject.toml
pip install -v .

# Other required environment variables.
# It is recommended to set these in ~/.bashrc so they do not need to be
# re-exported in every new terminal.
export SGLANG_USE_CPU_ENGINE=1
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libiomp5.so:${CONDA_PREFIX}/lib/libtcmalloc.so:${CONDA_PREFIX}/lib/libtbbmalloc.so.2
```
## Launch of the Serving Engine

Example command to launch SGLang serving:

```bash
python -m sglang.launch_server \
    --model <MODEL_ID_OR_PATH> \
    --trust-remote-code \
    --disable-overlap-schedule \
    --device cpu \
    --host 0.0.0.0 \
    --tp 6
```

Notes:
||
| 1. For running W8A8 quantized models, please add the flag `--quantization w8a8_int8`. | ||
|
|
||
| 2. The flag `--tp 6` specifies that tensor parallelism will be applied using 6 ranks (TP6). | ||
| The number of TP specified is how many TP ranks will be used during the execution. | ||
| In a CPU platform, a TP rank means a sub-NUMA cluster (SNC). | ||
| Usually we can get the SNC information (How many available) from Operation System. | ||
| User can specify TP to be no more than the total available SNCs in current system. | ||
|
|
||
| If the specified TP rank number differs from the total SNC count, | ||
| the system will automatically utilize the first `n` SNCs. | ||
| Note that `n` cannot exceed the total SNC number, doing so will result in an error. | ||
|
|
||
| To specify the cores to be used, we need to explicitly set the environment variable `SGLANG_CPU_OMP_THREADS_BIND`. | ||
| For example, if we want to run the SGLang service using the first 40 cores of each SNC on a Xeon® 6980P server, | ||
| which has 43-43-42 cores on the 3 SNCs of a socket, we should set: | ||
|
|
||
| ```bash | ||
| export SGLANG_CPU_OMP_THREADS_BIND="0-39|43-82|86-125|128-167|171-210|214-253" | ||
ZailiWang marked this conversation as resolved.
Show resolved
Hide resolved
|
||
| ``` | ||
|
|
||
| 3. A warmup step is automatically triggered when the service is started. | ||
| The server is ready when you see the log `The server is fired up and ready to roll!`. | ||
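To make the core-binding arithmetic in note 2 concrete, here is a small sketch (an illustrative helper, not part of SGLang) that derives the `SGLANG_CPU_OMP_THREADS_BIND` string from the per-SNC core counts, assuming cores are numbered consecutively SNC by SNC as on the 6980P example above:

```python
def omp_threads_bind(snc_core_counts, cores_per_rank):
    """Build the SGLANG_CPU_OMP_THREADS_BIND value: one core range per
    TP rank, using the first `cores_per_rank` cores of each SNC."""
    ranges, start = [], 0
    for count in snc_core_counts:
        assert cores_per_rank <= count, "a rank cannot use more cores than its SNC has"
        ranges.append(f"{start}-{start + cores_per_rank - 1}")
        start += count  # the next SNC starts after all cores of this one
    return "|".join(ranges)

# Two sockets, each with 43-43-42 cores on its three SNCs (Xeon 6980P)
snc_sizes = [43, 43, 42] * 2
print(omp_threads_bind(snc_sizes, 40))
# -> 0-39|43-82|86-125|128-167|171-210|214-253
```

This reproduces the binding string shown in note 2; adapt `snc_sizes` to the topology reported by your own system.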
## Benchmarking with Requests

You can benchmark the performance via the `bench_serving` script. Run the command in another terminal:

```bash
python -m sglang.bench_serving \
    --dataset-name random \
    --random-input-len 1024 \
    --random-output-len 1024 \
    --num-prompts 1 \
    --request-rate inf \
    --random-range-ratio 1.0
```

Detailed explanations of the parameters are available via:

```bash
python -m sglang.bench_serving -h
```

Additionally, requests can be formed with the [OpenAI Completions API](https://docs.sglang.ai/backend/openai_api_completions.html) and sent via the command line (e.g. using `curl`) or via your own script.
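As a sketch of the "your own script" route, the following builds an OpenAI-style completions request using only the standard library. The host, port, model name, and payload values are illustrative assumptions; adjust them to your deployment:

```python
import json
from urllib import request

# Assumed endpoint: the server launched above listens on port 30000 by default.
url = "http://localhost:30000/v1/completions"
payload = {
    "model": "default",  # placeholder model name; match your deployment
    "prompt": "The capital of France is",
    "max_tokens": 32,
    "temperature": 0,
}
req = request.Request(
    url,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Send the request once the server is up:
# with request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["text"])
```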
## Example: Running DeepSeek-R1

An example command to launch the service for W8A8 DeepSeek-R1 on a Xeon® 6980P server:

```bash
python -m sglang.launch_server \
    --model meituan/DeepSeek-R1-Channel-INT8 \
    --trust-remote-code \
    --disable-overlap-schedule \
    --device cpu \
    --quantization w8a8_int8 \
    --host 0.0.0.0 \
    --mem-fraction-static 0.8 \
    --max-total-tokens 65536 \
    --tp 6
```

Similarly, an example command to launch the service for FP8 DeepSeek-R1:

```bash
python -m sglang.launch_server \
    --model deepseek-ai/DeepSeek-R1 \
    --trust-remote-code \
    --disable-overlap-schedule \
    --device cpu \
    --host 0.0.0.0 \
    --mem-fraction-static 0.8 \
    --max-total-tokens 65536 \
    --tp 6
```

Then you can test with the `bench_serving` command, or construct your own requests following [the benchmarking example](#benchmarking-with-requests).
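Since the warmup for a model this large can take a while, a small readiness-poll sketch can gate your benchmark script. It uses only the standard library; the `/health` endpoint path and port are assumptions about your deployment, so adjust as needed:

```python
import time
from urllib import request, error

def wait_until_ready(url, timeout_s=600, interval_s=5):
    """Poll `url` until it answers with HTTP 200, or give up after timeout_s."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with request.urlopen(url, timeout=interval_s) as resp:
                if resp.status == 200:
                    return True
        except (error.URLError, OSError):
            pass  # server not up yet; retry after a short pause
        time.sleep(interval_s)
    return False

# Assumed health endpoint of the server launched above:
# if wait_until_ready("http://localhost:30000/health"):
#     print("server ready, start benchmarking")
```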
A second changed file (the "Hardware Supports" index) adds `cpu.md` to the list:

```
amd.md
nvidia_jetson.md
cpu.md
```