Merged
26 commits
057172c
Add BKC doc for running on CPU
ZailiWang Jul 10, 2025
ade6b39
table format fix
ZailiWang Jul 10, 2025
2b10c42
Merge branch 'sgl-project:main' into main
ZailiWang Jul 10, 2025
9aa64df
updates and add DS-R1 example
ZailiWang Jul 11, 2025
231161e
Merge branch 'sgl-project:main' into main
ZailiWang Jul 11, 2025
9056a90
Append supported models
ZailiWang Jul 11, 2025
f1b0454
Merge branch 'sgl-project:main' into main
ZailiWang Jul 13, 2025
2e17d27
Update docs/references/cpu.md
ZailiWang Jul 13, 2025
6d71293
Update docs/references/cpu.md
ZailiWang Jul 13, 2025
c542465
Update docs/references/cpu.md
ZailiWang Jul 13, 2025
b27717e
Update docs/start/install.md
ZailiWang Jul 13, 2025
77463e2
Update docs/references/cpu.md
ZailiWang Jul 13, 2025
2ada4d6
correct SGLANG_CPU_OMP_THREADS_BIND usage
ZailiWang Jul 14, 2025
dd99fd2
update expression
ZailiWang Jul 14, 2025
35c736a
update support table
ZailiWang Jul 14, 2025
7d8616b
Add command list for bare metal env setup
ZailiWang Jul 16, 2025
23a466a
Merge branch 'main' into main
ZailiWang Jul 16, 2025
0077248
update
ZailiWang Jul 18, 2025
af5ffc6
Merge branch 'main' of https://github.com/ZailiWang/sglang
ZailiWang Jul 18, 2025
8ffcd03
Merge branch 'main' into main
ZailiWang Jul 18, 2025
289a442
Improve expressions per comments
ZailiWang Jul 21, 2025
2da60e6
Merge branch 'main' of https://github.com/ZailiWang/sglang
ZailiWang Jul 21, 2025
95967e1
Merge branch 'main' into main
ZailiWang Jul 21, 2025
3fc3cc6
improve expressions
ZailiWang Jul 21, 2025
400f653
Merge branch 'main' into main
ZailiWang Jul 21, 2025
fe68e8a
Merge branch 'sgl-project:main' into main
ZailiWang Jul 24, 2025
198 changes: 198 additions & 0 deletions docs/references/cpu.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,198 @@
# SGLang on CPU

This document describes how to set up the [SGLang](https://github.com/sgl-project/sglang) environment and run LLM inference on CPU servers.
Specifically, SGLang is well optimized on CPUs equipped with Intel® AMX® instructions,
i.e., 4th generation or newer Intel® Xeon® Scalable Processors.
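To confirm that a machine exposes AMX before installing, you can inspect the CPU flags. This is a minimal sketch for Linux; the flag names (`amx_tile`, `amx_int8`, `amx_bf16`) are what recent kernels report:

```shell
# List any AMX-related CPU flags reported by the kernel (Linux only)
amx_flags=$(grep -o -m1 'amx[_a-z0-9]*' /proc/cpuinfo 2>/dev/null | sort -u | tr '\n' ' ')
if [ -n "$amx_flags" ]; then
  echo "AMX flags found: $amx_flags"
else
  echo "No AMX flags found; SGLang will still run on CPU, but without AMX acceleration"
fi
```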

## Optimized Model List

A number of popular LLMs have been optimized to run efficiently on CPU,
including notable open-source models such as the Llama series, the Qwen series,
and the high-quality reasoning model DeepSeek-R1.

| Model Name | BF16 | w8a8_int8 | FP8 |
|:---:|:---:|:---:|:---:|
| DeepSeek-R1 | | [meituan/DeepSeek-R1-Channel-INT8](https://huggingface.co/meituan/DeepSeek-R1-Channel-INT8) | [deepseek-ai/DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1) |
| Llama-3.2-3B | [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) | [RedHatAI/Llama-3.2-3B-quantized.w8a8](https://huggingface.co/RedHatAI/Llama-3.2-3B-Instruct-quantized.w8a8) | |
| Llama-3.1-8B | [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) | [RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8](https://huggingface.co/RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8) | |
| QwQ-32B | | [RedHatAI/QwQ-32B-quantized.w8a8](https://huggingface.co/RedHatAI/QwQ-32B-quantized.w8a8) | |
| DeepSeek-Distilled-Llama | | [RedHatAI/DeepSeek-R1-Distill-Llama-70B-quantized.w8a8](https://huggingface.co/RedHatAI/DeepSeek-R1-Distill-Llama-70B-quantized.w8a8) | |
| Qwen3-235B | | | [Qwen/Qwen3-235B-A22B-FP8](https://huggingface.co/Qwen/Qwen3-235B-A22B-FP8) |

**Note:** The model identifiers listed in the table above have been verified on the Intel® Xeon® 6 P-core platform.


## Installation

### Install Using Docker

It is recommended to use Docker for setting up the SGLang environment.
A [Dockerfile](https://github.com/sgl-project/sglang/blob/main/docker/Dockerfile.xeon) is provided to facilitate the installation.
Replace `<secret>` below with your [HuggingFace access token](https://huggingface.co/docs/hub/en/security-tokens).

```bash
# Clone the SGLang repository
git clone https://github.com/sgl-project/sglang.git
cd sglang/docker

# Build the docker image
docker build -t sglang-cpu:main -f Dockerfile.xeon .

# Initiate a docker container
docker run \
-it \
--privileged \
--ipc=host \
--network=host \
-v /dev/shm:/dev/shm \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 30000:30000 \
-e "HF_TOKEN=<secret>" \
sglang-cpu:main /bin/bash
```

### Install From Source

If you prefer to install SGLang in a bare-metal environment,
the commands are similar to the procedure in the Dockerfile.
Note that the environment variable `SGLANG_USE_CPU_ENGINE=1`
is required to enable the SGLang service with the CPU engine.

```bash
# Create and activate a conda environment
conda create -n sgl-cpu python=3.12 -y
conda activate sgl-cpu

# Optional: Set PyTorch CPU as primary pip install channel to avoid installing CUDA version
pip config set global.index-url https://download.pytorch.org/whl/cpu
pip config set global.extra-index-url https://pypi.org/simple

# Check if some conda related environment variables have been set
env | grep -i conda
# The following environment variable settings are required
# if they have not been set properly
export CONDA_EXE=$(which conda)
export CONDA_ROOT=${CONDA_EXE}/../..
export CONDA_PREFIX=${CONDA_ROOT}/envs/sgl-cpu
export PATH=${PATH}:${CONDA_ROOT}/bin:${CONDA_ROOT}/condabin

# Clone the SGLang code
git clone https://github.com/sgl-project/sglang.git
cd sglang
git checkout <YOUR-DESIRED-VERSION>

# Install SGLang dependencies and build the SGLang main package

pip install --upgrade pip setuptools
conda install -y libsqlite==3.48.0 gperftools tbb libnuma numactl
pip install intel-openmp
pip install -e "python[all_cpu]"

# Build the CPU backend kernels
cd sgl-kernel
cp pyproject_cpu.toml pyproject.toml
pip install -v .

# Other required environment variables
# Recommend to set these in ~/.bashrc in order not to set every time in a new terminal
export SGLANG_USE_CPU_ENGINE=1
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libiomp5.so:${CONDA_PREFIX}/lib/libtcmalloc.so:${CONDA_PREFIX}/lib/libtbbmalloc.so.2
```
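As a quick sanity check after installation (assuming a conda environment with `CONDA_PREFIX` set, as above), you can verify that the libraries named in `LD_PRELOAD` actually exist:

```shell
# Check that each library referenced by LD_PRELOAD is present in the conda env
for lib in libiomp5.so libtcmalloc.so libtbbmalloc.so.2; do
  if [ -e "${CONDA_PREFIX:-/nonexistent}/lib/$lib" ]; then
    echo "found   $lib"
  else
    echo "missing $lib"
  fi
done
```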

## Launching the Serving Engine

An example command to launch SGLang serving:

```bash
python -m sglang.launch_server \
--model <MODEL_ID_OR_PATH> \
--trust-remote-code \
--disable-overlap-schedule \
--device cpu \
--host 0.0.0.0 \
--tp 6
```

Notes:

1. For running W8A8 quantized models, please add the flag `--quantization w8a8_int8`.

2. The flag `--tp 6` specifies that tensor parallelism will be applied with 6 ranks (TP6).
On a CPU platform, each TP rank maps to a sub-NUMA cluster (SNC),
and the number of available SNCs can be obtained from the operating system (e.g. via `lscpu` or `numactl -H`).
For instance, TP6 is suitable for a server with 2 sockets configured with SNC-3.

If the specified TP size is smaller than the total SNC count, the system will automatically
utilize the first `n` SNCs. Note that the TP size cannot exceed the total SNC count;
doing so will result in an error.

To specify the cores to be used, explicitly set the environment variable `SGLANG_CPU_OMP_THREADS_BIND`.
For example, to run the SGLang service using the first 40 cores of each SNC on a Xeon® 6980P server,
which has 43-43-42 cores on the 3 SNCs of each socket, set:


```bash
export SGLANG_CPU_OMP_THREADS_BIND="0-39|43-82|86-125|128-167|171-210|214-253"
```
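The binding string above can be derived mechanically from the per-SNC core counts. The helper below is an illustrative sketch (the function name `bind_string` is ours, not part of SGLang):

```python
# Derive an SGLANG_CPU_OMP_THREADS_BIND value that uses the first `use`
# cores of each SNC, given the core count of every SNC in core-ID order.
def bind_string(snc_sizes, use):
    ranges, start = [], 0
    for size in snc_sizes:
        ranges.append(f"{start}-{start + use - 1}")
        start += size  # the next SNC's core IDs begin after this SNC's cores
    return "|".join(ranges)

# Xeon 6980P: 2 sockets x 3 SNCs with 43/43/42 cores each, using 40 cores per SNC
print(bind_string([43, 43, 42, 43, 43, 42], 40))
# -> 0-39|43-82|86-125|128-167|171-210|214-253
```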

This binding matches the `--tp 6` setting passed to `launch_server`.

3. A warmup step is automatically triggered when the service is started.
The server is ready to handle incoming requests once the log `The server is fired up and ready to roll!` appears.

## Benchmarking with Requests

You can benchmark the performance with the `bench_serving` script.
Run the following command in another terminal.

```bash
python -m sglang.bench_serving \
--dataset-name random \
--random-input-len 1024 \
--random-output-len 1024 \
--num-prompts 1 \
--request-rate inf \
--random-range-ratio 1.0
```

Detailed explanations of the parameters are available via:

```bash
python -m sglang.bench_serving -h
```

Additionally, requests can be composed with the
[OpenAI Completions API](https://docs.sglang.ai/backend/openai_api_completions.html)
and sent from the command line (e.g. with `curl`) or from your own script.
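For instance, a minimal `curl` request against the completions endpoint might look like the following. This assumes the server from the previous section is listening on `localhost:30000`; the request body fields are illustrative:

```shell
# Query the OpenAI-compatible completions endpoint; print a fallback
# message instead of failing if no server is reachable
resp=$(curl -s http://localhost:30000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "default", "prompt": "What is SGLang?", "max_tokens": 32}' \
  || echo "server not reachable")
echo "$resp"
```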

## Example: Running DeepSeek-R1

An example command to launch the service for the W8A8 DeepSeek-R1 model on a Xeon® 6980P server within the container:

```bash
python -m sglang.launch_server \
--model meituan/DeepSeek-R1-Channel-INT8 \
--trust-remote-code \
--disable-overlap-schedule \
--device cpu \
--quantization w8a8_int8 \
--host 0.0.0.0 \
--mem-fraction-static 0.8 \
--max-total-tokens 65536 \
--tp 6
```

Similarly, an example command to launch the service for the FP8 DeepSeek-R1 model:

```bash
python -m sglang.launch_server \
--model deepseek-ai/DeepSeek-R1 \
--trust-remote-code \
--disable-overlap-schedule \
--device cpu \
--host 0.0.0.0 \
--mem-fraction-static 0.8 \
--max-total-tokens 65536 \
--tp 6
```

Then you can test with the `bench_serving` command, or construct your own requests,
following [the benchmarking example](#benchmarking-with-requests).
3 changes: 3 additions & 0 deletions docs/references/deepseek.md
@@ -14,6 +14,7 @@ To run DeepSeek V3/R1 models, the requirements are as follows:
| **Full precision FP8**<br>*(recommended)* | 8 x H200 |
| | 8 x MI300X |
| | 2 x 8 x H100/800/20 |
| | Xeon 6980P CPU |
| **Full precision BF16** | 2 x 8 x H200 |
| | 2 x 8 x MI300X |
| | 4 x 8 x H100/800/20 |
@@ -22,6 +23,7 @@ To run DeepSeek V3/R1 models, the requirements are as follows:
| | 8 x A100/A800 |
| **Quantized weights (int8)** | 16 x A100/800 |
| | 32 x L40S |
| | Xeon 6980P CPU |

<style>
.md-typeset__table {
@@ -61,6 +63,7 @@ Detailed commands for reference:
- [8 x A100 (AWQ)](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-8-a100a800-with-awq-quantization)
- [16 x A100 (int8)](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-16-a100a800-with-int8-quantization)
- [32 x L40S (int8)](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-32-l40s-with-int8-quantization)
- [Xeon 6980P CPU](https://docs.sglang.ai/references/cpu.html#example-running-deepseek-r1)

### Download Weights
If you encounter errors when starting the server, ensure the weights have finished downloading. It's recommended to download them beforehand or restart multiple times until all weights are downloaded. Please refer to [DeepSeek V3](https://huggingface.co/deepseek-ai/DeepSeek-V3-Base#61-inference-with-deepseek-infer-demo-example-only) official guide to download the weights.
1 change: 1 addition & 0 deletions docs/references/hardware.rst
@@ -5,3 +5,4 @@ Hardware Supports

amd.md
nvidia_jetson.md
cpu.md
6 changes: 6 additions & 0 deletions docs/start/install.md
@@ -52,6 +52,9 @@ cd ..
pip install -e "python[all_hip]"
```

Note: Please refer to [the CPU environment setup command list](../references/cpu.md#install-from-source)
to set up the SGLang environment for running models on CPU servers.

## Method 3: Using docker

The docker images are available on Docker Hub as [lmsysorg/sglang](https://hub.docker.com/r/lmsysorg/sglang/tags), built from [Dockerfile](https://github.com/sgl-project/sglang/tree/main/docker).
@@ -87,6 +90,9 @@ drun -p 30000:30000 \
drun v0.4.9.post2-rocm630 python3 -m sglang.bench_one_batch --batch-size 32 --input 1024 --output 128 --model amd/Meta-Llama-3.1-8B-Instruct-FP8-KV --tp 8 --quantization fp8
```

Note: Please refer to [the CPU installation guide using Docker](../references/cpu.md#install-using-docker)
to set up the SGLang environment for running models on CPU servers.

## Method 4: Using docker compose

<details>