Commit 0efa00d

maryamtahhan authored and devpatelio committed

[Docs] Fix grammar in CPU installation guide (vllm-project#28461)

Signed-off-by: Maryam Tahhan <[email protected]>

1 parent f8bce3e commit 0efa00d

File tree

1 file changed: +7 -7 lines changed

  • docs/getting_started/installation


docs/getting_started/installation/cpu.md

Lines changed: 7 additions & 7 deletions
@@ -93,7 +93,7 @@ Currently, there are no pre-built CPU wheels.
 
 ## Related runtime environment variables
 
-- `VLLM_CPU_KVCACHE_SPACE`: specify the KV Cache size (e.g, `VLLM_CPU_KVCACHE_SPACE=40` means 40 GiB space for KV cache), larger setting will allow vLLM running more requests in parallel. This parameter should be set based on the hardware configuration and memory management pattern of users. Default value is `0`.
+- `VLLM_CPU_KVCACHE_SPACE`: specify the KV Cache size (e.g, `VLLM_CPU_KVCACHE_SPACE=40` means 40 GiB space for KV cache), larger setting will allow vLLM to run more requests in parallel. This parameter should be set based on the hardware configuration and memory management pattern of users. Default value is `0`.
 - `VLLM_CPU_OMP_THREADS_BIND`: specify the CPU cores dedicated to the OpenMP threads, can be set as CPU id lists, `auto` (by default), or `nobind` (to disable binding to individual CPU cores and to inherit user-defined OpenMP variables). For example, `VLLM_CPU_OMP_THREADS_BIND=0-31` means there will be 32 OpenMP threads bound on 0-31 CPU cores. `VLLM_CPU_OMP_THREADS_BIND=0-31|32-63` means there will be 2 tensor parallel processes, 32 OpenMP threads of rank0 are bound on 0-31 CPU cores, and the OpenMP threads of rank1 are bound on 32-63 CPU cores. By setting to `auto`, the OpenMP threads of each rank are bound to the CPU cores in each NUMA node respectively. If set to `nobind`, the number of OpenMP threads is determined by the standard `OMP_NUM_THREADS` environment variable.
 - `VLLM_CPU_NUM_OF_RESERVED_CPU`: specify the number of CPU cores which are not dedicated to the OpenMP threads for each rank. The variable only takes effect when VLLM_CPU_OMP_THREADS_BIND is set to `auto`. Default value is `None`. If the value is not set and use `auto` thread binding, no CPU will be reserved for `world_size == 1`, 1 CPU per rank will be reserved for `world_size > 1`.
 - `CPU_VISIBLE_MEMORY_NODES`: specify visible NUMA memory nodes for vLLM CPU workers, similar to ```CUDA_VISIBLE_DEVICES```. The variable only takes effect when VLLM_CPU_OMP_THREADS_BIND is set to `auto`. The variable provides more control for the auto thread-binding feature, such as masking nodes and changing nodes binding sequence.
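As a quick illustration of the variables this hunk documents (the values below are arbitrary examples for a hypothetical machine, not recommendations):

```shell
# Hypothetical example values -- tune for your own hardware.
export VLLM_CPU_KVCACHE_SPACE=40          # 40 GiB reserved for the KV cache
export VLLM_CPU_OMP_THREADS_BIND=auto     # default: bind each rank's threads per NUMA node
export VLLM_CPU_NUM_OF_RESERVED_CPU=1     # only takes effect when binding is `auto`
echo "KV cache: ${VLLM_CPU_KVCACHE_SPACE} GiB, binding: ${VLLM_CPU_OMP_THREADS_BIND}"
```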
@@ -128,7 +128,7 @@ Note, it is recommended to manually reserve 1 CPU for vLLM front-end process whe
 
 ### How to decide `VLLM_CPU_OMP_THREADS_BIND`?
 
-- Default `auto` thread-binding is recommended for most cases. Ideally, each OpenMP thread will be bound to a dedicated physical core respectively, threads of each rank will be bound to a same NUMA node respectively, and 1 CPU per rank will be reserved for other vLLM components when `world_size > 1`. If have any performance problems or unexpected binding behaviours, please try to bind threads as following.
+- Default `auto` thread-binding is recommended for most cases. Ideally, each OpenMP thread will be bound to a dedicated physical core respectively, threads of each rank will be bound to the same NUMA node respectively, and 1 CPU per rank will be reserved for other vLLM components when `world_size > 1`. If you have any performance problems or unexpected binding behaviours, please try to bind threads as following.
 
 - On a hyper-threading enabled platform with 16 logical CPU cores / 8 physical CPU cores:
 
@@ -156,12 +156,12 @@ Note, it is recommended to manually reserve 1 CPU for vLLM front-end process whe
 14 0 0 6 6:6:6:0 yes 2401.0000 800.0000 800.000
 15 0 0 7 7:7:7:0 yes 2401.0000 800.0000 800.000
 
-# On this platform, it is recommend to only bind openMP threads on logical CPU cores 0-7 or 8-15
+# On this platform, it is recommended to only bind openMP threads on logical CPU cores 0-7 or 8-15
 $ export VLLM_CPU_OMP_THREADS_BIND=0-7
 $ python examples/offline_inference/basic/basic.py
 ```
 
-- When deploy vLLM CPU backend on a multi-socket machine with NUMA and enable tensor parallel or pipeline parallel, each NUMA node is treated as a TP/PP rank. So be aware to set CPU cores of a single rank on a same NUMA node to avoid cross NUMA node memory access.
+- When deploying vLLM CPU backend on a multi-socket machine with NUMA and enable tensor parallel or pipeline parallel, each NUMA node is treated as a TP/PP rank. So be aware to set CPU cores of a single rank on the same NUMA node to avoid cross NUMA node memory access.
 
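The multi-socket note in this hunk can be sketched as follows; the core ranges assume a hypothetical 2-socket box with 32 cores per NUMA node:

```shell
# Hypothetical: rank0 on node0 cores 0-31, rank1 on node1 cores 32-63.
export VLLM_CPU_OMP_THREADS_BIND="0-31|32-63"
# Each `|`-separated range maps to one TP/PP rank.
ranks=$(printf '%s\n' "$VLLM_CPU_OMP_THREADS_BIND" | awk -F'|' '{print NF}')
echo "ranks implied by binding: $ranks"
```

Keeping each range inside one NUMA node is what avoids the cross-node memory traffic the note warns about.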
 ### How to decide `VLLM_CPU_KVCACHE_SPACE`?
 
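A back-of-envelope way to size this value; the model shape below (32 layers, 8 KV heads, head dim 128, bf16) is an assumed example, not taken from the docs:

```shell
# bytes per cached token = 2 (K and V) * layers * kv_heads * head_dim * bytes/element
layers=32; kv_heads=8; head_dim=128; elem_bytes=2   # assumed bf16 model shape
per_token=$((2 * layers * kv_heads * head_dim * elem_bytes))
# tokens that fit in VLLM_CPU_KVCACHE_SPACE=40 (GiB)
tokens=$((40 * 1024 * 1024 * 1024 / per_token))
echo "bytes/token=$per_token  cached tokens=$tokens"
```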
@@ -171,7 +171,7 @@ This value is 4GB by default. Larger space can support more concurrent requests,
 
 First of all, please make sure the thread-binding and KV cache space are properly set and take effect. You can check the thread-binding by running a vLLM benchmark and observing CPU cores usage via `htop`.
 
-Inference batch size is an important parameter for the performance. Larger batch usually provides higher throughput, smaller batch provides lower latency. Tuning max batch size starts from default value to balance throughput and latency is an effective way to improve vLLM CPU performance on specific platforms. There are two important related parameters in vLLM:
+Inference batch size is an important parameter for the performance. A larger batch usually provides higher throughput, a smaller batch provides lower latency. Tuning the max batch size starting from the default value to balance throughput and latency is an effective way to improve vLLM CPU performance on specific platforms. There are two important related parameters in vLLM:
 
 - `--max-num-batched-tokens`, defines the limit of token numbers in a single batch, has more impacts on the first token performance. The default value is set as:
     - Offline Inference: `4096 * world_size`
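The offline-inference default mentioned in this hunk can be reproduced arithmetically; `world_size` here is an assumed example value:

```shell
# Default --max-num-batched-tokens for offline inference: 4096 * world_size.
world_size=2   # assumed: 2 TP ranks, e.g. one per NUMA node
max_num_batched_tokens=$((4096 * world_size))
echo "default max-num-batched-tokens: $max_num_batched_tokens"
```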
@@ -192,8 +192,8 @@ vLLM CPU supports data parallel (DP), tensor parallel (TP) and pipeline parallel
 ### (x86 only) What is the purpose of `VLLM_CPU_MOE_PREPACK` and `VLLM_CPU_SGL_KERNEL`?
 
 - Both of them require `amx` CPU flag.
-- `VLLM_CPU_MOE_PREPACK` can provides better performance for MoE models
-- `VLLM_CPU_SGL_KERNEL` can provides better performance for MoE models and small-batch scenarios.
+- `VLLM_CPU_MOE_PREPACK` can provide better performance for MoE models
+- `VLLM_CPU_SGL_KERNEL` can provide better performance for MoE models and small-batch scenarios.
 
 ### Why do I see `get_mempolicy: Operation not permitted` when running in Docker?
 
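One way to check for the `amx` requirement from this hunk on Linux (reads `/proc/cpuinfo`; exact flag names vary by kernel, e.g. `amx_tile`, `amx_bf16`):

```shell
# Report whether any AMX-related flag is advertised by the CPU.
if grep -q -m1 'amx' /proc/cpuinfo 2>/dev/null; then
  amx_state=present
else
  amx_state=absent
fi
echo "AMX: $amx_state"
```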