Add joint tensor and KV cache support to USP method #586
Conversation
feifeibear
left a comment
LGTM
/gemini review
Code Review
This pull request is a great step towards unifying attention mechanisms by adding joint tensor and KV-cache support to the USP method. The goal of deprecating the yunchang path to improve torch.compile compatibility is well-motivated. The changes in xfuser/model_executor/layers/usp.py are substantial and well-supported by new unit tests that verify equivalence with the old implementation. The modifications in attention_processor.py and transformer_flux.py to adopt the new USP interface are consistent and correct. Overall, the changes are well-executed. I have one suggestion to improve the structure of the new tests for better maintainability.
jcaraban
left a comment
LGTM
What?
Adds support for KV-cache and joint tensors to USP method
Why?
Currently, some models, like Hunyuanvideo, have diverging code paths depending on the input parameters. The two paths (Yunchang / USP) have different implementations for comms. The Yunchang path uses certain features from torch.distributed that are not compatible with torch.compile. Since the USP method can be fully compiled, this PR aims to make USP support the features the Yunchang path does, allowing us to use only USP. This PR is the first step towards deprecating yunchang in the long term, as discussed in #579, but it does not aim to fully remove it in the short term.
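To illustrate the joint-tensor idea mentioned above: in joint attention, the text-branch K/V is kept whole on every rank while the image K/V is sharded along the sequence dimension, and each rank concatenates its local shard with the joint tensor before attention. The sketch below is only a toy illustration of that concatenation; the function name, the `joint_strategy` parameter, and its `"front"`/`"rear"` values are assumptions modeled on yunchang's interface, not the actual xfuser API.

```python
# Toy sketch (hypothetical names): the image K/V is sharded per rank,
# while the joint (text) K/V is replicated on all ranks. Each rank
# builds its local attention input by concatenating the two, with the
# joint tokens placed at the front or the rear of the sequence.

def local_kv(image_kv_shard, joint_kv, joint_strategy="rear"):
    # joint_strategy decides where the replicated joint tokens go
    # relative to this rank's slice of the image tokens.
    if joint_strategy == "front":
        return joint_kv + image_kv_shard
    return image_kv_shard + joint_kv

shard = ["img0", "img1"]   # this rank's slice of the image tokens
joint = ["txt0", "txt1"]   # text tokens, identical on every rank
print(local_kv(shard, joint))  # ['img0', 'img1', 'txt0', 'txt1']
```

Because the joint tensor is identical on every rank, no extra communication is needed for it; only the image tokens participate in the Ulysses/Ring exchanges.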
How?
Ported Yunchang features directly to USP method. This includes the joint tensors as well as KV cache for pipeline parallelism. Also changed Hunyuanvideo and Flux to already use only USP rather than Yunchang path.
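The KV-cache behaviour ported here can be summarized as: under pipeline parallelism, each step's freshly computed K/V segment is appended to a cache so attention covers the full sequence seen so far. The class and method names below are illustrative placeholders, not the actual xfuser implementation.

```python
# Hypothetical sketch of KV-cache accumulation for pipeline parallelism:
# each update appends the new key/value segment and returns the full
# history for the attention call. Plain lists stand in for tensors.

class KVCache:
    def __init__(self):
        self.k = []
        self.v = []

    def update(self, new_k, new_v):
        # Append the freshly computed segment, return the full history.
        self.k.extend(new_k)
        self.v.extend(new_v)
        return list(self.k), list(self.v)

cache = KVCache()
k, v = cache.update([1, 2], [10, 20])   # first chunk
k, v = cache.update([3], [30])          # next chunk appended
print(k)  # [1, 2, 3]
```

In the real tensor setting the append would be a concatenation along the sequence dimension; the control flow is the same.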
Tests
Output:
Hunyuanvideo:
Tested both Ring/Ulysses.
Hunyuanvideo already uses USP by default if the input prompt has a specific shape. The command and output below exercise the other code path, which previously used yunchang / hybrid_seq_parallel_attn.
Run command:
(Video attachment: hunyuan_test_usp.mp4)
Flux:
Flux uses standard USP already by default. Only in the case of pipeline parallelism did it use yunchang:
Run command:
Perf
Hunyuanvideo
The swap from yunchang code path to USP code path improves the performance, as now we can use torch.compile for the attention call as well. Here we have timed three subsequent runs and reported the average:
Before: 193.2 s
After: 188.3 s
This now matches the perf of the original USP path.
Automatic tests
Added two new unit tests that compare the outputs of the Yunchang and USP paths for equivalence.
Other
This doesn't change the standard USP method behaviour, so other models using USP won't be affected. Some models still use Yunchang, so they would need to be changed in future PRs.
For ease of comparison, here's the original USP method: