
Conversation

@jmkuebler

Similar to #91, this PR optimizes the FP8 tile sizes for headdim 128. This applies, for example, to Qwen3-Coder-30B-A3B-Instruct, where it reduces the context-length-dependent inter-token latency by ~1.39x relative to BF16.
With the current mainline config, FP8 decoding is no faster than BF16 for Qwen3-Coder-30B-A3B-Instruct.
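For context, here is a minimal sketch of what a headdim-based forward tile-size dispatch typically looks like in FlashAttention-3-style kernels. The function name, signature, and every tile value below are illustrative assumptions, not the numbers from this PR; see the actual diff for those.

```cpp
// Hypothetical sketch of a per-headdim tile-size dispatch. All names and
// numbers here are illustrative assumptions, not taken from this PR.
#include <utility>

// Returns {kBlockM, kBlockN}: the per-CTA tile of query rows and KV columns.
constexpr std::pair<int, int> tile_size_fwd(int headdim, bool is_fp8) {
    if (headdim <= 64) {
        return {192, 128};
    }
    if (headdim <= 128) {
        // FP8 halves the element width, leaving shared-memory headroom for a
        // wider KV tile; a larger kBlockN keeps the tensor cores better fed
        // during decode. Retuning this branch is analogous to what this PR
        // does for headdim 128.
        return {128, is_fp8 ? 256 : 176};  // illustrative values only
    }
    return {128, is_fp8 ? 128 : 112};      // headdim up to 256
}
```

The design point is that tile sizes tuned for BF16 can leave an FP8 kernel under-occupied, which is why FP8 decoding showed no speedup under the mainline config.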

@jmkuebler (Author)

@LucasWilkinson could you please take a look?

@LucasWilkinson (Collaborator) left a comment

LGTM

@LucasWilkinson merged commit 07602ad into vllm-project:main on Sep 29, 2025 (1 check passed).