From 057172cc051c4272f040b039b7f2720fd5da00de Mon Sep 17 00:00:00 2001
From: ZailiWang
Date: Thu, 10 Jul 2025 11:10:32 +0800
Subject: [PATCH 01/16] Add BKC doc for running on CPU

---
 docs/references/cpu.md | 109 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 109 insertions(+)
 create mode 100644 docs/references/cpu.md

diff --git a/docs/references/cpu.md b/docs/references/cpu.md
new file mode 100644
index 000000000000..8e48435786dd
--- /dev/null
+++ b/docs/references/cpu.md
@@ -0,0 +1,109 @@
+# SGLang on CPU

This document describes how to set up the [SGLang](https://github.com/sgl-project/sglang) environment and run LLM inference on CPU servers.
In particular, model serving is well optimized on CPUs equipped with Intel® AMX instructions,
i.e. 4th Gen or newer Intel® Xeon® Scalable Processors.

## Optimized Model List

A number of popular LLMs are optimized and run efficiently on CPU,
including the most notable open-source models such as the Llama and Qwen series,
and the high-quality reasoning model DeepSeek-R1.

| Model Name | BF16 | w8a8_int8 | w4a16 | FP8 |
|:---:|:---:|:---:|:---:|
| DeepSeek-R1 | | meituan/DeepSeek-R1-Channel-INT8 | | deepseek-ai/DeepSeek-R1 |
| Llama-3.2-3B | meta-llama/Llama-3.2-3B-Instruct | RedHatAI/Llama-3.2-3B-quantized.w8a8 | AMead10/Llama-3.2-3B-Instruct-AWQ | |

**Note:** In the above table, if the model ID is exhibited in the grid,
it means the model is verified.

## Installation

It is recommended to use Docker for setting up the SGLang environment.
A [Dockerfile](https://github.com/sgl-project/sglang/blob/main/docker/Dockerfile.xeon) is provided to facilitate the installation.
Replace `<secret>` below with your [HuggingFace access token](https://huggingface.co/docs/hub/en/security-tokens).
```bash
# Clone the SGLang repository
git clone https://github.com/sgl-project/sglang.git
cd sglang/docker

# Build the docker image
docker build -t sglang-cpu:main -f Dockerfile.xeon .

# Initiate a docker container
docker run \
    -it \
    --privileged \
    --ipc=host \
    -v /dev/shm:/dev/shm \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 30000:30000 \
    -e "HF_TOKEN=<secret>" \
    sglang-cpu:main /bin/bash
```

If you prefer to install SGLang in a bare-metal environment,
take the commands in the Dockerfile as a reference.

## Launching the Serving Engine

An example command to launch the LLM serving engine:

```bash
python -m sglang.launch_server \
    --model <model_id> \
    --trust-remote-code \
    --disable-overlap-schedule \
    --device cpu \
    --host 0.0.0.0 \
    --tp 6
```

Notes:

1. For running INT8 quantized models, please add the flag `--quantization w8a8_int8`.

2. The flag `--tp 6` applies tensor parallelism with 6 ranks (TP6).
In general, the number of TP ranks should match the total number of sub-NUMA clusters (SNCs) on the server
(e.g. TP6 should be applied on a 2-socket server configured with SNC-3).

   If the desired number of TP ranks differs from the total SNC count, the environment variable
   `SGLANG_CPU_OMP_THREADS_BIND` must be set explicitly. For example, to run TP3 on the first socket of a
   2-socket server with 120 cores per socket (6 SNCs in total), set

   ```bash
   export SGLANG_CPU_OMP_THREADS_BIND="0-39|40-79|80-119"
   ```

   and set `--tp 3` in the `launch_server` command.

3. A warmup step is triggered automatically when the service starts.
When `The server is fired up and ready to roll!` is printed,
the server is ready to handle incoming requests.

## Benchmarking with Requests

We can benchmark performance with the `bench_serving` script.
Run the following command in another terminal.
```bash
python -m sglang.bench_serving \
    --dataset-name random \
    --random-input-len 1024 \
    --random-output-len 1024 \
    --num-prompts 1 \
    --request-rate inf \
    --random-range-ratio 1.0
```

Detailed explanations of the parameters are available via:

```bash
python -m sglang.bench_serving -h
```

Additionally, requests can be constructed with the
[OpenAI Completions API](https://docs.sglang.ai/backend/openai_api_completions.html)
and sent from the command line (e.g. using `curl`) or from your own script.

From ade6b39bd4eb7032f3202d479a8931d99631b4af Mon Sep 17 00:00:00 2001
From: ZailiWang
Date: Thu, 10 Jul 2025 11:14:11 +0800
Subject: [PATCH 02/16] table format fix

---
 docs/references/cpu.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/references/cpu.md b/docs/references/cpu.md
index 8e48435786dd..9b68e46cf5bf 100644
--- a/docs/references/cpu.md
+++ b/docs/references/cpu.md
@@ -11,7 +11,7 @@ including the most notable open-source models like Llama series, Qwen series,
 and the phenomenal high-quality reasoning model DeepSeek-R1.
| Model Name | BF16 | w8a8_int8 | w4a16 | FP8 | -|:---:|:---:|:---:|:---:| +|:---:|:---:|:---:|:---:|:---:| | DeepSeek-R1 | | meituan/DeepSeek-R1-Channel-INT8 | | deepseek-ai/DeepSeek-R1 | | Llama-3.2-3B | meta-llama/Llama-3.2-3B-Instruct | RedHatAI/Llama-3.2-3B-quantized.w8a8 | AMead10/Llama-3.2-3B-Instruct-AWQ | | From 9aa64df16f8da066c489abc2553b0dd2263c5932 Mon Sep 17 00:00:00 2001 From: ZailiWang Date: Fri, 11 Jul 2025 12:08:23 +0800 Subject: [PATCH 03/16] updates and add DS-R1 example --- docs/references/cpu.md | 49 ++++++++++++++++++++++++++++++------ docs/references/deepseek.md | 3 +++ docs/references/hardware.rst | 1 + docs/start/install.md | 3 +++ 4 files changed, 49 insertions(+), 7 deletions(-) diff --git a/docs/references/cpu.md b/docs/references/cpu.md index 9b68e46cf5bf..f48f67232109 100644 --- a/docs/references/cpu.md +++ b/docs/references/cpu.md @@ -10,13 +10,13 @@ A list of popular LLMs are optimized and run efficiently on CPU, including the most notable open-source models like Llama series, Qwen series, and the phenomenal high-quality reasoning model DeepSeek-R1. -| Model Name | BF16 | w8a8_int8 | w4a16 | FP8 | -|:---:|:---:|:---:|:---:|:---:| -| DeepSeek-R1 | | meituan/DeepSeek-R1-Channel-INT8 | | deepseek-ai/DeepSeek-R1 | -| Llama-3.2-3B | meta-llama/Llama-3.2-3B-Instruct | RedHatAI/Llama-3.2-3B-quantized.w8a8 | AMead10/Llama-3.2-3B-Instruct-AWQ | | +| Model Name | BF16 | w8a8_int8 | FP8 | +|:---:|:---:|:---:|:---:| +| DeepSeek-R1 | | [meituan/DeepSeek-R1-Channel-INT8](https://huggingface.co/meituan/DeepSeek-R1-Channel-INT8) | [deepseek-ai/DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1) | +| Llama-3.2-3B | [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) | [RedHatAI/Llama-3.2-3B-quantized.w8a8](https://huggingface.co/RedHatAI/Llama-3.2-3B-Instruct-quantized.w8a8) | | -**Note:** In the above table, if the model ID is exhibited in the grid, -it means the model is verified. 
+**Note:** In the above table, if the model identifier is listed in the grid,
+it means the model is verified.

## Installation

@@ -37,6 +37,7 @@ docker run \
 -it \
 --privileged \
 --ipc=host \
+ --network=host \
 -v /dev/shm:/dev/shm \
 -v ~/.cache/huggingface:/root/.cache/huggingface \
 -p 30000:30000 \
 -e "HF_TOKEN=" \
 sglang-cpu:main /bin/bash
@@ -63,7 +64,7 @@
 Notes:
-1. For running INT8 quantized models, please add the flag `--quantization w8a8_int8`.
+1. For running W8A8 quantized models, please add the flag `--quantization w8a8_int8`.
 2. The flag `--tp 6` indicates that we will apply tensor parallel with 6 ranks (TP6).
 In general the TP rank number should be in line with the total number of sub-numa clusters (SNCs) on the server
@@ -107,3 +108,37 @@ python -m sglang.bench_serving -h
```

Additionally, the requests can be formed with
[OpenAI Completions API](https://docs.sglang.ai/backend/openai_api_completions.html)
and sent via the command line (e.g. using `curl`) or via your own script.
+
+## Example: Running DeepSeek-R1
+
+An example command to launch the W8A8 model service in the container on a Xeon® 6980P server:
+
+```bash
+python -m sglang.launch_server \
+    --model meituan/DeepSeek-R1-Channel-INT8 \
+    --trust-remote-code \
+    --disable-overlap-schedule \
+    --device cpu \
+    --quantization w8a8_int8 \
+    --host 0.0.0.0 \
+    --mem-fraction-static 0.8 \
+    --max-total-tokens 65536 \
+    --tp 6
+```
+
+Similarly, an example command to launch the FP8 model service:
+
+```bash
+python -m sglang.launch_server \
+    --model deepseek-ai/DeepSeek-R1 \
+    --trust-remote-code \
+    --disable-overlap-schedule \
+    --device cpu \
+    --host 0.0.0.0 \
+    --mem-fraction-static 0.8 \
+    --max-total-tokens 65536 \
+    --tp 6
+```
+
+Then we can test with the `bench_serving` command, or construct our own requests
+following [the benchmarking example](#benchmarking-with-requests).
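As a worked illustration of the `SGLANG_CPU_OMP_THREADS_BIND` setting discussed in the notes above, the binding string is simply one contiguous core range per SNC, joined by `|`. The helper below is a sketch for generating it, not part of SGLang; the 120-cores-per-socket, SNC-3 layout is the example assumed in the doc:

```python
def omp_threads_bind(cores_per_socket: int, sncs_per_socket: int, sockets: int = 1) -> str:
    """Build a SGLANG_CPU_OMP_THREADS_BIND value: one contiguous 'start-end'
    core range per sub-NUMA cluster (SNC), joined by '|', so each TP rank
    pins its OpenMP threads to exactly one SNC."""
    cores_per_snc = cores_per_socket // sncs_per_socket
    ranges = []
    for socket in range(sockets):
        first_core = socket * cores_per_socket
        for snc in range(sncs_per_socket):
            start = first_core + snc * cores_per_snc
            ranges.append(f"{start}-{start + cores_per_snc - 1}")
    return "|".join(ranges)

# TP3 on the first socket of a 2 x 120-core server with SNC-3:
print(omp_threads_bind(cores_per_socket=120, sncs_per_socket=3))  # 0-39|40-79|80-119
```

Passing `sockets=2` would extend the string to all 6 SNCs for a full TP6 run.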
diff --git a/docs/references/deepseek.md b/docs/references/deepseek.md index 7b464b7c2f69..002b7fbb1265 100644 --- a/docs/references/deepseek.md +++ b/docs/references/deepseek.md @@ -14,6 +14,7 @@ To run DeepSeek V3/R1 models, the requirements are as follows: | **Full precision FP8**
*(recommended)* | 8 x H200 | | | 8 x MI300X | | | 2 x 8 x H100/800/20 | +| | Xeon 6980P CPU | | **Full precision BF16** | 2 x 8 x H200 | | | 2 x 8 x MI300X | | | 4 x 8 x H100/800/20 | @@ -22,6 +23,7 @@ To run DeepSeek V3/R1 models, the requirements are as follows: | | 8 x A100/A800 | | **Quantized weights (int8)** | 16 x A100/800 | | | 32 x L40S | +| | Xeon 6980P CPU |
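Once any of the deployments above is serving, requests follow the standard OpenAI Completions shape. A minimal sketch of constructing such a request body is shown below; the model name, prompt, and the `localhost:30000` endpoint are illustrative assumptions, not fixed values:

```python
import json

# Assumed endpoint of a locally launched SGLang server (default port 30000).
url = "http://localhost:30000/v1/completions"

payload = {
    "model": "deepseek-ai/DeepSeek-R1",  # should match the model passed to launch_server
    "prompt": "Explain tensor parallelism in one sentence.",
    "max_tokens": 128,
    "temperature": 0.0,
}

# Serialized JSON body, usable e.g. as:
#   curl <url> -H "Content-Type: application/json" -d '<body>'
body = json.dumps(payload)
print(body)
```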