[Diffusion]: Diffusion Ulysses-Sequence-Parallelism support #189

hsliuustc0106 merged 85 commits into vllm-project:main
Conversation
Nice work. Is the parallel speedup ratio normal compared to diffusers' Ulysses SP?
The DOC check has failed; please resolve all the warnings locally.

Local documentation build:

```shell
pip install -e ".[docs]"
mkdocs build
mkdocs serve
```

First, make sure there are no warnings in the logging messages.
Review thread on `vllm_omni/diffusion/models/qwen_image/qwen_image_transformer.py` (outdated; resolved).
```python
vllm_config.parallel_config.data_parallel_size = self.od_config.parallel_config.data_parallel_size

with set_current_omni_diffusion_config(self.od_config):
    with set_current_vllm_config(vllm_config):
        ...
```
Why do we still need vllm_config since we have our own init function?
Because VllmConfig is set by set_current_vllm_config, while set_current_omni_diffusion_config only sets OmniDiffusionConfig. Any suggestions?
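For readers following the thread: `set_current_vllm_config` and `set_current_omni_diffusion_config` each install one config object as ambient state, which is why both context managers are nested here. A minimal sketch of that idiom (names and storage are hypothetical, not the actual vLLM implementation):

```python
from contextlib import contextmanager

_CURRENT_CONFIG = None  # module-level "current config" slot


@contextmanager
def set_current_config(cfg):
    """Install cfg as the ambient config for the duration of the block."""
    global _CURRENT_CONFIG
    prev = _CURRENT_CONFIG
    _CURRENT_CONFIG = cfg
    try:
        yield cfg
    finally:
        _CURRENT_CONFIG = prev  # restore on exit, even on error


def get_current_config():
    """Read whatever config is currently installed (None outside any block)."""
    return _CURRENT_CONFIG
```

Because each setter manages only its own slot, code that depends on both the vLLM config and the OmniDiffusion config has to enter both context managers.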
Unfortunately, diffusers' Ulysses SP on qwen-image has an error; see huggingface/diffusers#12568. Still a work in progress.
@gcanlin PTAL
Nice work! It would be better if NPUWorker could receive the same changes, but it's fine to modify only the GPUWorker. I'm thinking of refactoring the common logic between NPU and GPU into a base worker abstraction in a follow-up PR, which should help reduce duplication in future updates :)
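The base-worker refactor suggested above could look roughly like this. This is only a sketch of the idea; the class and method names are hypothetical and not taken from the codebase:

```python
from abc import ABC, abstractmethod


class BaseDiffusionWorker(ABC):
    """Hypothetical base class holding logic shared by GPU and NPU workers."""

    def init_device(self):
        self._set_device()        # backend-specific: CUDA vs. NPU
        self._init_distributed()  # shared: process groups, parallel state

    @abstractmethod
    def _set_device(self):
        """Each backend picks its own device string/handle."""

    def _init_distributed(self):
        # Shared distributed initialization (Ulysses groups, etc.)
        # would live here instead of being duplicated per backend.
        pass


class GPUWorker(BaseDiffusionWorker):
    def _set_device(self):
        self.device = "cuda"


class NPUWorker(BaseDiffusionWorker):
    def _set_device(self):
        self.device = "npu"
```

Only `_set_device` differs per backend; everything touching parallel state stays in one place.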
Signed-off-by: Didan Deng <[email protected]>
This PR allows users to enable Ulysses attention for diffusion models, e.g., qwen-image and qwen-image-edit. Currently it has only been tested with SDPA attention on H800 GPUs.
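For context, Ulysses sequence parallelism shards the sequence across ranks before attention and uses an all-to-all so that each rank attends over the full sequence for a subset of heads. A communication-free numpy sketch of that reshard (the function name is hypothetical; the real implementation performs a distributed all-to-all):

```python
import numpy as np


def ulysses_reshard(local_chunks, world_size, num_heads):
    """Simulate the pre-attention all-to-all of Ulysses SP.

    local_chunks[r] has shape (seq_len // P, num_heads, head_dim):
    rank r's sequence chunk with ALL heads. The returned list gives
    each rank the FULL sequence but only num_heads // P heads.
    """
    heads_per_rank = num_heads // world_size
    resharded = []
    for r in range(world_size):
        h0 = r * heads_per_rank
        # collect rank r's head slice from every rank's sequence chunk
        full_seq = np.concatenate(
            [chunk[:, h0:h0 + heads_per_rank, :] for chunk in local_chunks],
            axis=0,
        )
        resharded.append(full_seq)
    return resharded
```

After each rank runs attention over its heads on the full sequence, a mirror all-to-all restores the original sequence sharding with all heads.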
Purpose
To support various parallelism inference algorithms, this PR introduces:

- `DiffusionParallelConfig` in `vllm_omni/diffusion/data.py`: configuration for diffusion model distributed execution.
- `vllm_omni/diffusion/distributed`: handles the communication groups of different parallel configurations.
- `tests/diffusion/attention/test_ulysses_sequence_parallel.py`: UT for Ulysses attention and multi-layer Ulysses attention.

This PR also edits:

- `vllm_omni/diffusion/attention/layer.py`: allow `Attention` to accept Ulysses attention kwargs and support Ulysses attention in the forward function;
- `vllm_omni/diffusion/models/qwen_image/qwen_image_transformer.py`: chunk hidden_states and the image position embedding;
- `vllm_omni/diffusion/worker/gpu_worker.py`: replace vLLM's `init_distributed_environment` and `initialize_model_parallel` with vllm_omni's equivalents.

Test Plan
UTs:

- `pytest tests/diffusion/attention/test_ulysses_sequence_parallel.py`
- `pytest tests/diffusion/distributed/test_comm.py`
- `tests/e2e/offline_inference/test_sequence_parallel.py`

T2I inference:
`python examples/offline_inference/text_to_image/text_to_image.py --ulysses_degree 2`

Test Result
Fast UTs: all passed.
T2I inference
I tried to test Ulysses attention with diffusers' `ContextParallelConfig(ulysses_degree=2)` on qwen-image, but got an error; refer to huggingface/diffusers#12568. The diffusers community is working on solving it.

To measure the parallelism methods, we ran benchmarks with the Qwen/Qwen-Image model, generating 2048x2048 images (a long-sequence input) with 50 inference steps. The hardware devices are NVIDIA H800 GPUs.
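The `sdpa` backend used in these benchmarks computes standard scaled dot-product attention, softmax(QKᵀ/√d)·V per head. A minimal single-head numpy sketch of that computation (an illustration, not the PyTorch kernel):

```python
import numpy as np


def sdpa(q, k, v):
    """Scaled dot-product attention for one head; q, k, v are (S, D)."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                 # (S, S) similarity logits
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v                            # (S, D) weighted values
```

Under Ulysses SP, each rank runs exactly this computation for its subset of heads over the full (gathered) sequence.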
`sdpa` is the attention backend.

Discussion
- Should we move `vllm_omni/diffusion/distributed` to `vllm_omni/distributed` in this PR?

Essential Elements of an Effective PR Description Checklist
- Update `supported_models.md` and `examples` for a new model.

BEFORE SUBMITTING, PLEASE READ https://github.com/vllm-project/vllm-omni/blob/main/CONTRIBUTING.md (anything written below this line will be removed by GitHub Actions)