6 changes: 3 additions & 3 deletions OpenAI/GPT-OSS.md
@@ -37,7 +37,7 @@ docker run --gpus all \

GPT-OSS works on Ampere devices by default, using the `TRITON_ATTN` attention backend and Marlin MXFP4 MoE:

-* `--async-scheduling` can be enabled for higher performance. Currently it is not compatible with structured output.
+* `--async-scheduling` can be enabled for higher performance. Note: vLLM >= 0.11.1 has improved async scheduling stability and provides compatibility with structured output.

```
# openai/gpt-oss-20b should run on a single A100
@@ -53,7 +53,7 @@ vllm serve openai/gpt-oss-120b --tensor-parallel-size 4 --async-scheduling

GPT-OSS works on Hopper devices by default, using the FlashAttention3 backend and Marlin MXFP4 MoE:

-* `--async-scheduling` can be enabled for higher performance. Currently it is not compatible with structured output.
+* `--async-scheduling` can be enabled for higher performance. Note: vLLM >= 0.11.1 has improved async scheduling stability and provides compatibility with structured output.
* We recommend TP=2 for H100 and H200 as the best performance tradeoff point.
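For example, a minimal launch sketch following this recommendation (the tensor-parallel size matches the TP=2 suggestion; the hardware, two H100 or H200 GPUs, is an assumption):

```
# sketch: openai/gpt-oss-120b on 2x H100/H200 (assumed hardware)
vllm serve openai/gpt-oss-120b --tensor-parallel-size 2 --async-scheduling
```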

```
@@ -341,7 +341,7 @@ You can specify the IP address and the port that you would like to run the serve
Below are the config flags that we do not recommend changing or tuning:

- `compilation-config`: Configuration for vLLM compilation stage. We recommend setting it to `'{"pass_config":{"enable_fi_allreduce_fusion":true,"enable_noop":true}}'` to enable all the necessary fusions for the best performance on Blackwell architecture. However, this feature is not supported on Hopper architecture yet.
-- `async-scheduling`: Enable asynchronous scheduling to reduce the host overheads between decoding steps. We recommend always adding this flag for best performance.
+- `async-scheduling`: Enable asynchronous scheduling to reduce the host overheads between decoding steps. We recommend always adding this flag for best performance. Note: vLLM >= 0.11.1 has improved async scheduling stability and provides compatibility with structured output.
- `no-enable-prefix-caching`: Disable prefix caching. We recommend always adding this flag when running with a synthetic dataset for consistent performance measurement.
- `max-cudagraph-capture-size`: Specify the max size for CUDA graphs. We recommend setting this to 2048 to leverage the benefit of CUDA graphs while not using too much GPU memory.
- `stream-interval`: The interval between output token streaming responses. We recommend setting this to `20` to maximize the throughput.
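Taken together, a hedged sketch of a launch that pins these settings to the recommended values might look like the following. It assumes a Blackwell GPU (required for the compilation config), the `openai/gpt-oss-120b` model used earlier in this guide, and that each flag name above maps one-to-one onto a `--`-prefixed option of the serving command:

```
# illustrative sketch only: the fixed settings recommended above,
# assuming a direct --flag-name mapping of the list and a Blackwell GPU
vllm serve openai/gpt-oss-120b \
  --compilation-config '{"pass_config":{"enable_fi_allreduce_fusion":true,"enable_noop":true}}' \
  --async-scheduling \
  --no-enable-prefix-caching \
  --max-cudagraph-capture-size 2048 \
  --stream-interval 20
```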
2 changes: 1 addition & 1 deletion Qwen/Qwen3-VL.md
@@ -94,7 +94,7 @@ vllm serve Qwen/Qwen3-VL-235B-A22B-Instruct \
- It's highly recommended to specify `--limit-mm-per-prompt.video 0` if your inference server will only process image inputs, since enabling video inputs reserves additional memory for long video embeddings. Alternatively, you can skip memory profiling for multimodal inputs by passing `--skip-mm-profiling` and lowering `--gpu-memory-utilization` accordingly at your own risk.
- To avoid undesirable CPU contention, it's recommended to limit the number of threads allocated to preprocessing by setting the environment variable `OMP_NUM_THREADS=1`. This is particularly useful and shows a significant throughput improvement when deploying multiple vLLM instances on the same host.
- You can set `--max-model-len` to preserve memory. By default the model's context length is 262K, but `--max-model-len 128000` is good for most scenarios.
-- Specifying `--async-scheduling` improves the overall system performance by overlapping scheduling overhead with the decoding process. However, this experimental feature **currently does not support speculative decoding, sampling with penalties, structured outputs or other scenarios** where decoding outcome depends on the output from a previous step.
+- Specifying `--async-scheduling` improves the overall system performance by overlapping scheduling overhead with the decoding process. **Note: With vLLM >= 0.11.1, compatibility has been improved for structured output and sampling with penalties, but it may still be incompatible with speculative decoding (features merged but not yet released).** Check the latest releases for continued improvements.
- Specifying `--mm-encoder-tp-mode data` deploys the vision encoder in a data-parallel fashion for better performance. This is because the vision encoder is very small, so tensor parallelism brings little gain while incurring significant communication overhead. Enabling this feature does consume additional memory and may require adjusting `--gpu-memory-utilization`.
- If your workload involves mostly **unique** multimodal inputs, it is recommended to pass `--mm-processor-cache-gb 0` to avoid caching overhead. Otherwise, specifying `--mm-processor-cache-type shm` enables an experimental feature that uses host shared memory to cache preprocessed input images and/or videos, which shows better performance at high TP settings.
- vLLM supports Expert Parallelism (EP) via `--enable-expert-parallel`, which allows experts in MoE models to be deployed on separate GPUs for better throughput. Check out [Expert Parallelism Deployment](https://docs.vllm.ai/en/latest/serving/expert_parallel_deployment.html) for more details.
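As a combined illustration of the tips above, a hedged sketch of an image-only deployment follows. The tensor-parallel size is an assumption that depends on your hardware, and disabling the processor cache assumes mostly-unique multimodal inputs; every other flag value is taken from the list above:

```
# illustrative sketch: image-only Qwen3-VL serving combining the tips above
# (assumes an 8-GPU node and mostly-unique multimodal inputs)
OMP_NUM_THREADS=1 vllm serve Qwen/Qwen3-VL-235B-A22B-Instruct \
  --tensor-parallel-size 8 \
  --max-model-len 128000 \
  --limit-mm-per-prompt.video 0 \
  --mm-encoder-tp-mode data \
  --mm-processor-cache-gb 0 \
  --async-scheduling
```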