[Perf]: Optimize qwen2-vl to reduce cudaMemcpyAsync #14377
Conversation
Isotr0py left a comment
Thanks for this optimization! Can you please also update qwen2.5-vl?
Force-pushed from ae09649 to 1fbb69c
Head branch was pushed to by a user without write access
Signed-off-by: cynthieye <[email protected]>
@cynthieye Thank you for making this PR! Can you update this branch with our main branch? I think the CI error should have been fixed on main a while ago.
ywang96 left a comment
Left a few comments; otherwise LGTM!
```python
x: torch.Tensor,
cu_seqlens: torch.Tensor,
rotary_pos_emb: torch.Tensor,
max_seqlen: int = None,
```
Shouldn't max_seqlen also be Optional[int]?
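The typing fix being requested can be sketched as follows. This is a hypothetical minimal signature, not the PR's actual code; only the `Optional[int]` annotation change comes from the review:

```python
from typing import Optional

# A parameter whose default is None should be annotated Optional[int],
# not plain int (hypothetical signature for illustration).
def attn_forward(
    max_seqlen: Optional[int] = None,   # was: max_seqlen: int = None
    seqlens: Optional[list[int]] = None,
) -> str:
    if max_seqlen is not None:
        return f"flash-attn path, max_seqlen={max_seqlen}"
    if seqlens is not None:
        return f"xformers path, seqlens={seqlens}"
    return "no length hints"
```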
```python
max_seqlen: int,
seqlens: list[int],
```
Please modify the typing accordingly
```python
max_seqlen: int = None,
seqlens: Optional[list[int]] = None,
```
ditto
```python
max_seqlen: int,
seqlens: list[int],
```
ditto
```python
max_seqlen: int,
seqlens: list[int],
```
I think it's probably a good idea to add a short piece of documentation here to indicate that max_seqlen is only used for FA and seqlens is only used for xformers.
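The requested documentation might look something like this. It is a sketch, not the PR's actual code; the function name and return value are hypothetical, and only the two backend notes come from the review:

```python
from typing import Optional

def forward(
    max_seqlen: Optional[int] = None,
    seqlens: Optional[list[int]] = None,
) -> tuple:
    """Hypothetical sketch of the suggested parameter docs.

    Args:
        max_seqlen: Maximum sequence length; only used by the
            Flash Attention backend.
        seqlens: Per-sequence lengths; only used by the xformers
            backend.
    """
    return max_seqlen, seqlens
```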
qwen2-vl logic optimization: during each forward pass, the xformers branch of Qwen2VisionTransformer calls the tensor tolist method multiple times (the flash-attn branch calls tensor item multiple times), forcing the GPU tensor to be copied to the CPU and triggering cudaMemcpyAsync, which adds latency. Since the inputs and outputs are identical across these calls, the conversion is now executed once and the remaining call sites reuse the first result. After this optimization, QPS in our online environment improves by 15% on the xformers branch and by 7% on the flash-attn branch.
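The pattern described above can be sketched as follows. This is an illustrative model, not the PR's code: the class and function names are hypothetical, and a small counter stands in for the device-to-host copy that a real `.tolist()`/`.item()` on a CUDA tensor would trigger via cudaMemcpyAsync:

```python
class FakeGpuTensor:
    """Stand-in for a GPU tensor that counts device-to-host copies."""

    def __init__(self, values):
        self._values = list(values)
        self.copies = 0  # simulated cudaMemcpyAsync count

    def tolist(self):
        self.copies += 1  # each .tolist() forces a D2H copy
        return list(self._values)


def run_blocks_naive(cu_seqlens, num_blocks):
    # Before: every transformer block re-converts the GPU tensor,
    # so each block pays one D2H copy.
    for _ in range(num_blocks):
        seqlens = cu_seqlens.tolist()
    return seqlens


def run_blocks_cached(cu_seqlens, num_blocks):
    # After: convert once up front and reuse the Python list
    # in every block, so only one D2H copy is paid.
    seqlens = cu_seqlens.tolist()
    for _ in range(num_blocks):
        _ = seqlens
    return seqlens
```

With 32 blocks, the naive path issues 32 simulated copies while the cached path issues one, which mirrors the QPS gains reported above.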