From 057172cc051c4272f040b039b7f2720fd5da00de Mon Sep 17 00:00:00 2001
From: ZailiWang
Date: Thu, 10 Jul 2025 11:10:32 +0800
Subject: [PATCH 01/16] Add BKC doc for running on CPU

---
 docs/references/cpu.md | 109 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 109 insertions(+)
 create mode 100644 docs/references/cpu.md

diff --git a/docs/references/cpu.md b/docs/references/cpu.md
new file mode 100644
index 000000000000..8e48435786dd
--- /dev/null
+++ b/docs/references/cpu.md
@@ -0,0 +1,109 @@
+# SGLang on CPU

This document describes how to set up the [SGLang](https://github.com/sgl-project/sglang) environment and run LLM inference on CPU servers.
In particular, model serving is well optimized on CPUs equipped with Intel® AMX instructions,
i.e. 4th Gen or newer Intel® Xeon® Scalable Processors.

## Optimized Model List

A number of popular LLMs are optimized and run efficiently on CPU,
including the most notable open-source models such as the Llama and Qwen series,
and the high-quality reasoning model DeepSeek-R1.

| Model Name | BF16 | w8a8_int8 | w4a16 | FP8 |
|:---:|:---:|:---:|:---:|
| DeepSeek-R1 | | meituan/DeepSeek-R1-Channel-INT8 | | deepseek-ai/DeepSeek-R1 |
| Llama-3.2-3B | meta-llama/Llama-3.2-3B-Instruct | RedHatAI/Llama-3.2-3B-quantized.w8a8 | AMead10/Llama-3.2-3B-Instruct-AWQ | |

**Note:** In the above table, if the model ID is exhibited in the grid,
it means the model is verified.

## Installation

It is recommended to use Docker for setting up the SGLang environment.
A [Dockerfile](https://github.com/sgl-project/sglang/blob/main/docker/Dockerfile.xeon) is provided to facilitate the installation.
Replace `<secret>` below with your [HuggingFace access token](https://huggingface.co/docs/hub/en/security-tokens).
```bash
# Clone the SGLang repository
git clone https://github.com/sgl-project/sglang.git
cd sglang/docker

# Build the docker image
docker build -t sglang-cpu:main -f Dockerfile.xeon .

# Initiate a docker container
docker run \
    -it \
    --privileged \
    --ipc=host \
    -v /dev/shm:/dev/shm \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 30000:30000 \
    -e "HF_TOKEN=<secret>" \
    sglang-cpu:main /bin/bash
```

If you prefer to install SGLang in a bare-metal environment,
take the commands in the Dockerfile as a reference.

## Launching the Serving Engine

An example command to launch the LLM serving engine:

```bash
python -m sglang.launch_server \
    --model <model_id> \
    --trust-remote-code \
    --disable-overlap-schedule \
    --device cpu \
    --host 0.0.0.0 \
    --tp 6
```

Notes:

1. For running INT8 quantized models, please add the flag `--quantization w8a8_int8`.

2. The flag `--tp 6` applies tensor parallelism with 6 ranks (TP6).
In general, the number of TP ranks should match the total number of sub-NUMA clusters (SNCs) on the server
(e.g. TP6 should be applied on a 2-socket server configured with SNC-3).

   If the desired number of TP ranks differs from the total SNC count, the environment variable
   `SGLANG_CPU_OMP_THREADS_BIND` must be set explicitly. For example, to run TP3 on the first socket of a
   2-socket server with 120 cores per socket (6 SNCs in total), set

   ```bash
   export SGLANG_CPU_OMP_THREADS_BIND="0-39|40-79|80-119"
   ```

   and set `--tp 3` in the `launch_server` command.

3. A warmup step is triggered automatically when the service starts.
When `The server is fired up and ready to roll!` is printed,
the server is ready to handle incoming requests.

## Benchmarking with Requests

We can benchmark performance with the `bench_serving` script.
Run the following command in another terminal.
```bash
python -m sglang.bench_serving \
    --dataset-name random \
    --random-input-len 1024 \
    --random-output-len 1024 \
    --num-prompts 1 \
    --request-rate inf \
    --random-range-ratio 1.0
```

Detailed explanations of the parameters are available via:

```bash
python -m sglang.bench_serving -h
```

Additionally, requests can be constructed with the
[OpenAI Completions API](https://docs.sglang.ai/backend/openai_api_completions.html)
and sent from the command line (e.g. using `curl`) or from your own script.

From ade6b39bd4eb7032f3202d479a8931d99631b4af Mon Sep 17 00:00:00 2001
From: ZailiWang
Date: Thu, 10 Jul 2025 11:14:11 +0800
Subject: [PATCH 02/16] table format fix

---
 docs/references/cpu.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/references/cpu.md b/docs/references/cpu.md
index 8e48435786dd..9b68e46cf5bf 100644
--- a/docs/references/cpu.md
+++ b/docs/references/cpu.md
@@ -11,7 +11,7 @@ including the most notable open-source models like Llama series, Qwen series,
 and the phenomenal high-quality reasoning model DeepSeek-R1.
| Model Name | BF16 | w8a8_int8 | w4a16 | FP8 | -|:---:|:---:|:---:|:---:| +|:---:|:---:|:---:|:---:|:---:| | DeepSeek-R1 | | meituan/DeepSeek-R1-Channel-INT8 | | deepseek-ai/DeepSeek-R1 | | Llama-3.2-3B | meta-llama/Llama-3.2-3B-Instruct | RedHatAI/Llama-3.2-3B-quantized.w8a8 | AMead10/Llama-3.2-3B-Instruct-AWQ | | From 9aa64df16f8da066c489abc2553b0dd2263c5932 Mon Sep 17 00:00:00 2001 From: ZailiWang Date: Fri, 11 Jul 2025 12:08:23 +0800 Subject: [PATCH 03/16] updates and add DS-R1 example --- docs/references/cpu.md | 49 ++++++++++++++++++++++++++++++------ docs/references/deepseek.md | 3 +++ docs/references/hardware.rst | 1 + docs/start/install.md | 3 +++ 4 files changed, 49 insertions(+), 7 deletions(-) diff --git a/docs/references/cpu.md b/docs/references/cpu.md index 9b68e46cf5bf..f48f67232109 100644 --- a/docs/references/cpu.md +++ b/docs/references/cpu.md @@ -10,13 +10,13 @@ A list of popular LLMs are optimized and run efficiently on CPU, including the most notable open-source models like Llama series, Qwen series, and the phenomenal high-quality reasoning model DeepSeek-R1. -| Model Name | BF16 | w8a8_int8 | w4a16 | FP8 | -|:---:|:---:|:---:|:---:|:---:| -| DeepSeek-R1 | | meituan/DeepSeek-R1-Channel-INT8 | | deepseek-ai/DeepSeek-R1 | -| Llama-3.2-3B | meta-llama/Llama-3.2-3B-Instruct | RedHatAI/Llama-3.2-3B-quantized.w8a8 | AMead10/Llama-3.2-3B-Instruct-AWQ | | +| Model Name | BF16 | w8a8_int8 | FP8 | +|:---:|:---:|:---:|:---:| +| DeepSeek-R1 | | [meituan/DeepSeek-R1-Channel-INT8](https://huggingface.co/meituan/DeepSeek-R1-Channel-INT8) | [deepseek-ai/DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1) | +| Llama-3.2-3B | [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) | [RedHatAI/Llama-3.2-3B-quantized.w8a8](https://huggingface.co/RedHatAI/Llama-3.2-3B-Instruct-quantized.w8a8) | | -**Note:** In the above table, if the model ID is exhibited in the grid, -it means the model is verified. 
+**Note:** In the above table, if the model identifier is listed in the grid,
+it means the model is verified.

## Installation

@@ -37,6 +37,7 @@ docker run \
 -it \
 --privileged \
 --ipc=host \
+ --network=host \
 -v /dev/shm:/dev/shm \
 -v ~/.cache/huggingface:/root/.cache/huggingface \
 -p 30000:30000 \
 -e "HF_TOKEN=" \
 sglang-cpu:main /bin/bash
@@ -63,7 +64,7 @@
 Notes:
-1. For running INT8 quantized models, please add the flag `--quantization w8a8_int8`.
+1. For running W8A8 quantized models, please add the flag `--quantization w8a8_int8`.
 2. The flag `--tp 6` indicates that we will apply tensor parallel with 6 ranks (TP6).
 In general the TP rank number should be in line with the total number of sub-numa clusters (SNCs) on the server
@@ -107,3 +108,37 @@ python -m sglang.bench_serving -h
```

Additionally, the requests can be formed with
[OpenAI Completions API](https://docs.sglang.ai/backend/openai_api_completions.html)
and sent via the command line (e.g. using `curl`) or via your own script.
+
+## Example: Running DeepSeek-R1
+
+An example command to launch the W8A8 model service in the container on a Xeon® 6980P server:
+
+```bash
+python -m sglang.launch_server \
+    --model meituan/DeepSeek-R1-Channel-INT8 \
+    --trust-remote-code \
+    --disable-overlap-schedule \
+    --device cpu \
+    --quantization w8a8_int8 \
+    --host 0.0.0.0 \
+    --mem-fraction-static 0.8 \
+    --max-total-tokens 65536 \
+    --tp 6
+```
+
+Similarly, an example command to launch the FP8 model service:
+
+```bash
+python -m sglang.launch_server \
+    --model deepseek-ai/DeepSeek-R1 \
+    --trust-remote-code \
+    --disable-overlap-schedule \
+    --device cpu \
+    --host 0.0.0.0 \
+    --mem-fraction-static 0.8 \
+    --max-total-tokens 65536 \
+    --tp 6
+```
+
+Then we can test with the `bench_serving` command, or construct our own requests
+following [the benchmarking example](#benchmarking-with-requests).
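As a worked illustration of the `SGLANG_CPU_OMP_THREADS_BIND` setting discussed in the notes above, the binding string is simply one contiguous core range per SNC, joined by `|`. The helper below is a sketch for generating it, not part of SGLang; the 120-cores-per-socket, SNC-3 layout is the example assumed in the doc:

```python
def omp_threads_bind(cores_per_socket: int, sncs_per_socket: int, sockets: int = 1) -> str:
    """Build a SGLANG_CPU_OMP_THREADS_BIND value: one contiguous 'start-end'
    core range per sub-NUMA cluster (SNC), joined by '|', so each TP rank
    pins its OpenMP threads to exactly one SNC."""
    cores_per_snc = cores_per_socket // sncs_per_socket
    ranges = []
    for socket in range(sockets):
        first_core = socket * cores_per_socket
        for snc in range(sncs_per_socket):
            start = first_core + snc * cores_per_snc
            ranges.append(f"{start}-{start + cores_per_snc - 1}")
    return "|".join(ranges)

# TP3 on the first socket of a 2 x 120-core server with SNC-3:
print(omp_threads_bind(cores_per_socket=120, sncs_per_socket=3))  # 0-39|40-79|80-119
```

Passing `sockets=2` would extend the string to all 6 SNCs for a full TP6 run.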
diff --git a/docs/references/deepseek.md b/docs/references/deepseek.md index 7b464b7c2f69..002b7fbb1265 100644 --- a/docs/references/deepseek.md +++ b/docs/references/deepseek.md @@ -14,6 +14,7 @@ To run DeepSeek V3/R1 models, the requirements are as follows: | **Full precision FP8**
*(recommended)* | 8 x H200 | | | 8 x MI300X | | | 2 x 8 x H100/800/20 | +| | Xeon 6980P CPU | | **Full precision BF16** | 2 x 8 x H200 | | | 2 x 8 x MI300X | | | 4 x 8 x H100/800/20 | @@ -22,6 +23,7 @@ To run DeepSeek V3/R1 models, the requirements are as follows: | | 8 x A100/A800 | | **Quantized weights (int8)** | 16 x A100/800 | | | 32 x L40S | +| | Xeon 6980P CPU |
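Once any of the deployments above is serving, requests follow the standard OpenAI Completions shape. A minimal sketch of constructing such a request body is shown below; the model name, prompt, and the `localhost:30000` endpoint are illustrative assumptions, not fixed values:

```python
import json

# Assumed endpoint of a locally launched SGLang server (default port 30000).
url = "http://localhost:30000/v1/completions"

payload = {
    "model": "deepseek-ai/DeepSeek-R1",  # should match the model passed to launch_server
    "prompt": "Explain tensor parallelism in one sentence.",
    "max_tokens": 128,
    "temperature": 0.0,
}

# Serialized JSON body, usable e.g. as:
#   curl <url> -H "Content-Type: application/json" -d '<body>'
body = json.dumps(payload)
print(body)
```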