[Perf] Optimize Qwen2/2.5-VL ViT tensor generating performance #14684
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; instead, only a small and essential subset of CI tests runs to quickly catch errors. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Force-pushed from 18aa10d to 8ceb5c9
Nice work!
Signed-off-by: imkero <[email protected]>
Force-pushed from 97dea30 to 9159e8c
Signed-off-by: imkero <[email protected]>
I tried all the approaches I know of (numba only, torch + numba, triton) for this PR to find the most efficient one, so I would like to bring the most efficient solutions I found for both the CPU and CUDA backends here. It is also quite simple: just a dispatch function in front of the two implementations.
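A minimal, self-contained sketch of that dispatch idea, using a trivial stand-in workload (per-item patch counts) rather than the PR's actual kernels; the names `_impl_numba`, `_impl_torch`, and `patch_counts` are hypothetical:

```python
import numpy as np
import torch
from numba import njit

@njit(cache=True)
def _impl_numba(grid_thw: np.ndarray) -> np.ndarray:
    # Plain loop: numba compiles this to native code, so per-call
    # overhead on CPU stays tiny even for small inputs.
    out = np.empty(grid_thw.shape[0], dtype=np.int64)
    for i in range(grid_thw.shape[0]):
        out[i] = grid_thw[i, 0] * grid_thw[i, 1] * grid_thw[i, 2]
    return out

def _impl_torch(grid_thw: torch.Tensor) -> torch.Tensor:
    # Vectorized torch ops: preferable once the result lives on an accelerator.
    return grid_thw.prod(dim=-1)

def patch_counts(grid_thw: torch.Tensor, device: torch.device) -> torch.Tensor:
    # The dispatch function in front of both implementations;
    # grid_thw itself stays on CPU (see the PR description below).
    if device.type == "cpu":
        return torch.from_numpy(_impl_numba(grid_thw.numpy()))
    return _impl_torch(grid_thw).to(device, non_blocking=True)
```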
This pull request has merge conflicts that must be resolved before it can be merged.
This PR optimizes the tensor generation performance of the Qwen2/2.5-VL ViT (including `rot_pos_ids`, `window_indices`, `cu_seqlens`, and `seqlens`) by introducing optimized numba / torch implementations.

**What this PR does**

- Keep `image_grid_thw` and `video_grid_thw` on CPU all the time
- Introduce `numba` as a common dependency (up to now it is only required by the CUDA / ROCm builds)
- Optimize `rot_pos_ids` generation by rewriting it as 2 different implementations: a numba version (for the CPU backend) and a torch version (for other backends, e.g. CUDA); see the sketch after this list
- Optimize `window_indices` generation by rewriting the implementation with numba + torch together
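To make the `rot_pos_ids` bullet concrete, here is a hedged sketch of the two-implementation shape described above, following the HF Qwen2-VL position-id layout (spatial-merge block order). It is illustrative only: the function names are hypothetical, and the torch version shown is the straightforward per-image form, not necessarily the PR's fully optimized variant.

```python
import numpy as np
import torch
from numba import njit

@njit(cache=True)
def rot_pos_ids_numba(grid_thw: np.ndarray, merge: int) -> np.ndarray:
    """(h_idx, w_idx) per patch, in spatial-merge block order (CPU path)."""
    total = 0
    for i in range(grid_thw.shape[0]):
        total += grid_thw[i, 0] * grid_thw[i, 1] * grid_thw[i, 2]
    out = np.empty((total, 2), dtype=np.int64)
    idx = 0
    for i in range(grid_thw.shape[0]):
        t, h, w = grid_thw[i, 0], grid_thw[i, 1], grid_thw[i, 2]
        for _ in range(t):
            for bh in range(h // merge):          # merge-block rows
                for bw in range(w // merge):      # merge-block cols
                    for mh in range(merge):       # rows inside a block
                        for mw in range(merge):   # cols inside a block
                            out[idx, 0] = bh * merge + mh
                            out[idx, 1] = bw * merge + mw
                            idx += 1
    return out

def rot_pos_ids_torch(grid_thw: torch.Tensor, merge: int) -> torch.Tensor:
    """Same layout, built with tensor ops (GPU-friendly path)."""
    pos_ids = []
    for t, h, w in grid_thw.tolist():
        hpos = torch.arange(h).unsqueeze(1).expand(h, w)
        wpos = torch.arange(w).unsqueeze(0).expand(h, w)
        ids = torch.stack([hpos, wpos], dim=-1)
        ids = ids.reshape(h // merge, merge, w // merge, merge, 2)
        ids = ids.permute(0, 2, 1, 3, 4).reshape(-1, 2)
        pos_ids.append(ids.repeat(t, 1))
    return torch.cat(pos_ids)
```

Both produce the same `(sum(t*h*w), 2)` result; e.g. for `grid_thw = [[1, 4, 4]]` with `merge = 2`, the first four rows are `(0, 0), (0, 1), (1, 0), (1, 1)`.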
**Benchmark and profiling**

**Profiling result**

- main branch, Qwen2.5-VL ViT: `[[1, 36, 36]]`, `[[10, 36, 36]]` (profiler trace screenshots)
- this PR, Qwen2.5-VL ViT: `[[1, 36, 36]]`, `[[10, 36, 36]]` (profiler trace screenshots)

**Piecewise benchmark**
**Generating `rot_pos_ids` for Qwen2/2.5-VL (GPU)** (currently not used)

Benchmark cases: `[[1, 8, 8]]`, `[[1, 36, 36]]`, `[[10, 36, 36]]`, `[[10, 36, 36]] * 10`, `[[10, 36 ± 2, 36 ± 2]] * 10` (different frame sizes handled separately)
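For context, a piecewise measurement like this can be taken with `torch.utils.benchmark`; a minimal sketch with a toy stand-in workload (`build_pos_ids` is hypothetical, mirroring the `[[10, 36, 36]]` case above):

```python
import torch
from torch.utils import benchmark

def build_pos_ids(grid_thw: torch.Tensor) -> torch.Tensor:
    # Toy stand-in: (h_idx, w_idx) pairs for every patch of every frame.
    t, h, w = grid_thw[0].tolist()
    hpos = torch.arange(h).repeat_interleave(w)
    wpos = torch.arange(w).repeat(h)
    return torch.stack([hpos, wpos], dim=-1).repeat(t, 1)

grid = torch.tensor([[10, 36, 36]])
timer = benchmark.Timer(
    stmt="build_pos_ids(grid)",
    globals={"build_pos_ids": build_pos_ids, "grid": grid},
)
print(timer.blocked_autorange())  # median wall time per call
```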
NOTE:

- The torch impl pays a fixed per-call overhead (e.g. `torch.empty`); maybe we can auto-tune by using the numba impl for smaller batch sizes?
- I also tried `torch.compile` on it and wrote a similar triton JIT kernel; they seem not faster than the optimized torch impl in this PR (see the sketch below).
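For reference, applying `torch.compile` to such a function is a one-liner; a sketch with a toy stand-in function (whether compilation pays off depends on per-call overhead, as the note above found):

```python
import torch

def make_pos_pairs(h: int, w: int) -> torch.Tensor:
    # Toy stand-in for the torch impl of rot_pos_ids.
    hpos = torch.arange(h).unsqueeze(1).expand(h, w)
    wpos = torch.arange(w).unsqueeze(0).expand(h, w)
    return torch.stack([hpos, wpos], dim=-1).reshape(-1, 2)

# dynamic=True avoids recompiling for every new (h, w) pair.
compiled = torch.compile(make_pos_pairs, dynamic=True)
out = compiled(36, 36)  # first call compiles; later calls hit the cache
```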
**Generating `rot_pos_ids` for Qwen2/2.5-VL (CPU)**

Benchmark cases: `[[1, 8, 8]]`, `[[1, 36, 36]]`, `[[10, 36, 36]]`, `[[10, 36, 36]] * 10`, `[[10, 36 ± 2, 36 ± 2]] * 10` (different frame sizes handled separately)
**Generating `window_indices` and so on for Qwen2.5-VL (GPU)**

Benchmark cases: `[[1, 8, 8]]`, `[[1, 36, 36]]`, `[[10, 36, 36]]`, `[[10, 36, 36]] * 10`, `[[10, 36 ± 2, 36 ± 2]] * 10` (different frame sizes handled separately)
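As a reminder of what the `seqlens` / `cu_seqlens` part of this step computes, here is a sketch mirroring the upstream Qwen2-VL pattern, where each frame of h × w patches forms one attention sequence:

```python
import torch
import torch.nn.functional as F

grid_thw = torch.tensor([[1, 8, 8], [10, 36, 36]])  # one image + one 10-frame video

# Per-frame lengths: h * w patches per frame, repeated t times per item.
seqlens = torch.repeat_interleave(grid_thw[:, 1] * grid_thw[:, 2], grid_thw[:, 0])

# Prefix sums with a leading 0, as varlen attention kernels expect.
cu_seqlens = F.pad(seqlens.cumsum(dim=0), (1, 0), value=0)
print(seqlens.shape, cu_seqlens[:3])  # torch.Size([11]) tensor([   0,   64, 1360])
```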
**ViT e2e benchmark**

- Qwen2-VL ViT (GPU): `[[1, 36, 36]]`, `[[10, 36, 36]]`
- Qwen2.5-VL ViT (GPU): `[[1, 36, 36]]`, `[[10, 36, 36]]`