[Perf] Optimize the Qwen2.5-Omni Model thinker-to-talker-proj with nn.Linear#825
Conversation
Force-pushed from 308583d to 45242dd
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 308583dda7
```python
self.thinker_to_talker_proj = nn.Linear(
    self.config.embedding_size,
    self.config.hidden_size,
    bias=True,
    gather_output=True,
    skip_bias_add=False,
    quant_config=quant_config,
)
```
Keep tensor-parallel/quantized linear for projector
Replacing ColumnParallelLinear with nn.Linear drops both tensor-parallel sharding and quant_config handling. In runs with tensor_parallel_size > 1 or quantized checkpoints, this projector will now instantiate full‑precision weights on every rank (no partitioning/quantized kernels), which can inflate memory or prevent quantized weights from loading correctly. Consider keeping the parallel/quantized linear wrapper or providing an equivalent vLLM-aware linear here.
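To make the memory point above concrete, here is a toy back-of-envelope sketch (the sizes and the helper below are illustrative assumptions, not the real Qwen2.5-Omni config): a column-parallel layer stores only `1/tp_size` of the weight per rank, while a plain `nn.Linear` replicates the full weight on every rank.

```python
# Hypothetical back-of-envelope sketch; sizes are assumptions for
# illustration, not the real Qwen2.5-Omni config.
def proj_weight_bytes(embedding_size, hidden_size, bytes_per_param=2,
                      tp_size=1, sharded=True):
    params = embedding_size * hidden_size
    if sharded:
        # column-parallel: each rank stores only its slice of the output columns
        params //= tp_size
    return params * bytes_per_param

emb, hid = 3584, 3584  # hypothetical sizes
print(proj_weight_bytes(emb, hid, tp_size=4, sharded=True))   # 6422528 bytes per rank
print(proj_weight_bytes(emb, hid, tp_size=4, sharded=False))  # 25690112 bytes on every rank
```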
Meanwhile, Qwen2.5-Omni-7B is a small model that fits on almost any single NPU/GPU.
Any accuracy difference?
With pure text prompts as input, no difference has been observed so far. More importantly, this matches what the corresponding part of Transformers uses.
Force-pushed from 45242dd to 5898db4
Swapped out the custom ColumnParallelLinear layer for a standard nn.Linear in Qwen2_5OmniTalkerForConditionalGeneration. Updated the forward pass to match the new layer's output signature, simplifying the projection step. Signed-off-by: John Liu BUAA <[email protected]>
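For context, the output-signature change the commit message refers to can be sketched with stubs (the `Fake*` classes below are stand-ins, not vLLM code): vLLM's parallel linear layers return an `(output, output_bias)` tuple, while `torch.nn.Linear` returns the tensor directly, so the caller no longer unpacks a tuple.

```python
# Hypothetical stubs standing in for the real layers, to show the
# call-site difference only.
class FakeParallelLinear:
    def __call__(self, x):
        return x * 2, None  # vLLM-style: (output, output_bias)

class FakeLinear:
    def __call__(self, x):
        return x * 2        # nn.Linear-style: plain tensor output

# Before: the projection output was unpacked from a tuple.
proj = FakeParallelLinear()
out_before, _ = proj(3)

# After: the result is used directly, simplifying the projection step.
proj = FakeLinear()
out_after = proj(3)

assert out_before == out_after == 6
```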
Force-pushed from 5898db4 to 112fa95
….Linear (vllm-project#825) Signed-off-by: John Liu BUAA <[email protected]> Signed-off-by: Chen Yang <[email protected]>
@kechengliu97 Do you have any idea where the performance improvements come from?
Our device cannot transfer the profiling swimlane to the public Internet, but in the profiles we found some communication cost attributable to layers like ColumnParallelLinear or RowParallelLinear, which may cause the higher latency.
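As a toy illustration of that communication cost (pure Python, no real collectives; the function names here are made up): with `gather_output=True`, a column-parallel linear computes a partial output per rank and then concatenates the shards, which maps to an all-gather at runtime. A plain `nn.Linear` skips that gather entirely.

```python
# Toy column-parallel matmul demo; "ranks" are simulated in a loop,
# and the final concatenation stands in for the all-gather.
def matmul(x, w):
    # x: [n], w: [n][m] -> [m]
    return [sum(x[i] * w[i][j] for i in range(len(x)))
            for j in range(len(w[0]))]

def column_parallel_forward(x, w, tp_size):
    m = len(w[0])
    shard = m // tp_size
    partials = []
    for rank in range(tp_size):
        # each "rank" holds only its slice of the weight's output columns
        w_shard = [row[rank * shard:(rank + 1) * shard] for row in w]
        partials.append(matmul(x, w_shard))
    # gather_output=True: concatenate shards across ranks (the communication step)
    return [v for p in partials for v in p]

x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]
assert column_parallel_forward(x, w, tp_size=2) == matmul(x, w)  # [11.0, 14.0, 17.0, 20.0]
```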
….Linear (vllm-project#825) Signed-off-by: John Liu BUAA <[email protected]>
Summary of Changes
In this submission, we replaced the `thinker_to_talker_proj` from the original `ColumnParallelLinear` layer with `nn.Linear`. This change resulted in a significant performance improvement, particularly in the `forward` pass of the `Qwen2_5OmniTalkerForConditionalGeneration` model, where the overall execution time was reduced by over 200 microseconds.
Performance Improvements
- `thinker_to_talker_proj` latency: 283 μs 390 ns to 223 μs 840 ns, a decrease of approximately 21%.
- Overall `forward` time: reduced by over 200 μs.
Configuration Comparison & Reference
In our comparison with the Transformer prototype, we observed that the corresponding linear layers also use `nn.Linear` in that model. Based on this observation, the updated implementation aligns with mainstream architectures, ensuring higher compatibility and optimizability.
Precision Verification
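For reference, the quoted ~21% latency figure checks out arithmetically:

```python
# Verify the reported thinker_to_talker_proj speedup: 283.390 us -> 223.840 us.
before_us = 283.390
after_us = 223.840
decrease_pct = (before_us - after_us) / before_us * 100
print(f"{decrease_pct:.1f}% lower latency")  # ~21.0%
```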
Using the same prompt, "What is the origin of the United States?", we obtained identical outputs. Both the text and the wav file generated were almost identical, proving that the accuracy has not been affected by this change.
The output text (wav file read the following paragraph):
Conclusion
By replacing `ColumnParallelLinear` with `nn.Linear`, we not only improved performance but also ensured consistency with the Transformer architecture. This optimization will result in lower latency and higher throughput for various inference tasks, with no loss in precision.