
Commit c875893

b8zhong authored and Akshat-Tripathi committed
[Docs] Add pipeline_parallel_size to optimization docs (vllm-project#14059)
Signed-off-by: Brayden Zhong <[email protected]>
1 parent 27aacf9 commit c875893

File tree

1 file changed (+1, −0 lines)


docs/source/performance/optimization.md

Lines changed: 1 addition & 0 deletions
@@ -18,6 +18,7 @@ If you frequently encounter preemptions from the vLLM engine, consider the follo
 - Increase `gpu_memory_utilization`. The vLLM pre-allocates GPU cache by using gpu_memory_utilization% of memory. By increasing this utilization, you can provide more KV cache space.
 - Decrease `max_num_seqs` or `max_num_batched_tokens`. This can reduce the number of concurrent requests in a batch, thereby requiring less KV cache space.
 - Increase `tensor_parallel_size`. This approach shards model weights, so each GPU has more memory available for KV cache.
+- Increase `pipeline_parallel_size`. This approach distributes model layers across GPUs, reducing the memory needed for model weights on each GPU, which indirectly leaves more memory available for KV cache.
 
 You can also monitor the number of preemption requests through Prometheus metrics exposed by the vLLM. Additionally, you can log the cumulative number of preemption requests by setting disable_log_stats=False.
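For context, here is a minimal sketch of how the knobs mentioned in this diff might be set when constructing an offline vLLM engine in Python. The model name and numeric values are illustrative assumptions only, not recommendations from this commit, and pipeline parallelism naturally requires enough GPUs for the chosen degree.

```python
# Minimal sketch (assumes vLLM is installed and multiple GPUs are available).
# The model name and numeric values below are illustrative assumptions only.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # hypothetical example model
    tensor_parallel_size=2,        # shard model weights across 2 GPUs
    pipeline_parallel_size=2,      # split model layers across 2 pipeline stages (the option this commit documents)
    gpu_memory_utilization=0.95,   # let the engine use a larger share of GPU memory for KV cache
    max_num_seqs=128,              # cap concurrent sequences in a batch
    max_num_batched_tokens=4096,   # cap tokens processed per batch
    disable_log_stats=False,       # keep stats logging on so preemption counts are reported
)

outputs = llm.generate(["Hello, world!"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```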
