[Model] Add Wan2.2 text-to-video support #202

Merged
hsliuustc0106 merged 18 commits into vllm-project:main from linyueqian:feat/wan2.2 on Dec 11, 2025

Conversation

@linyueqian (Contributor) commented Dec 4, 2025

Purpose

Add support for Wan2.2 text-to-video generation.

Test Plan

python examples/offline_inference/wan22/text_to_video.py \
    --prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage." \
    --negative_prompt "色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走" \
    --height 720 \
    --width 1280 \
    --num_frames 32 \
    --guidance_scale 4.0 \
    --guidance_scale_high 3.0 \
    --num_inference_steps 40 \
    --fps 16 \
    --output t2v_out.mp4

(English gloss of the negative prompt: garish tones, overexposed, static, blurry details, subtitles, style, artwork, painting, still image, overall gray, worst quality, low quality, JPEG compression artifacts, ugly, mutilated, extra fingers, poorly drawn hands, poorly drawn face, deformed, disfigured, malformed limbs, fused fingers, motionless frame, cluttered background, three legs, crowded background, walking backwards.)

Test Result

t2v_out.mp4



@SamitHuang (Collaborator):

Nice work. Can you try increasing --num_inference_steps from 10 to something like 100 and check whether the video quality becomes normal?

@hsliuustc0106 (Collaborator) left a comment:

After this, I suggest you try to link this with fastwan, proposed in the fastvideo project. Let's see how it can accelerate our inference and provide a solution for coordinating with fastvideo.

@linyueqian (Contributor, Author):

> Nice work. Can you try increasing --num_inference_steps from 10 to something like 100 and check whether the video quality becomes normal?

@SamitHuang I tried with 40 steps and it took about five minutes to generate.

wan22_output_50.mp4

@hsliuustc0106 (Collaborator):

> @SamitHuang I tried with 40 steps and it took about five minutes to generate.

We may need some acceleration methods to speed up generation.

@hsliuustc0106 (Collaborator):

Please add this model to supported_models.md.

import numpy as np
from diffusers.utils import export_to_video

# export_to_video expects a list of frames; unpack a (T, H, W, C) array
if isinstance(video_array, np.ndarray) and video_array.ndim == 4:
    video_array = list(video_array)

export_to_video(video_array, str(output_path), fps=16)
Collaborator:

fps can be 24 too; it's better to make it configurable via argparse.
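A minimal sketch of the requested change, assuming the example script parses its flags with argparse (--fps matches the flag in the final test command above; video_array, output_path, and export_to_video are as in the snippet under review):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--fps", type=int, default=16,
                    help="Frames per second for the exported video (e.g. 16 or 24)")
args = parser.parse_args()

# ... generate video_array and output_path as in the snippet above, then:
export_to_video(video_array, str(output_path), fps=args.fps)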

@linyueqian (Contributor, Author):

Got it, I changed it accordingly.

@SamitHuang (Collaborator) commented Dec 8, 2025

> @SamitHuang I tried with 40 steps and it took about five minutes to generate.
>
> wan22_output_50.mp4

I think you can update the test method and result with this new video, where num_frames seems increased. BTW, the diffusers example applies a negative_prompt; we should apply it in the test as well to verify CFG works.
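For context on the CFG point: classifier-free guidance contrasts the prompt-conditioned prediction with a negative-prompt (or unconditional) prediction, so without a negative prompt the guidance path is not really exercised. A generic sketch of the combination step (illustrative only, not this repo's code):

import torch

def cfg_combine(noise_cond: torch.Tensor, noise_uncond: torch.Tensor,
                guidance_scale: float) -> torch.Tensor:
    # Push the conditioned prediction away from the negative-prompt one;
    # guidance_scale = 1.0 disables the effect.
    return noise_uncond + guidance_scale * (noise_cond - noise_uncond)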

@hsliuustc0106 (Collaborator):

This PR only supports t2v, right?

@linyueqian (Contributor, Author):

> This PR only supports t2v, right?

Yes.

@linyueqian (Contributor, Author):

> I think you can update the test method and result with this new video, where num_frames seems increased. BTW, the diffusers example applies a negative_prompt; we should apply it in the test as well to verify CFG works.

I have updated the test result in the first comment.

@hufangjian:

When will the diffusion models support TP, CFG, USP, and distVAE?

@hsliuustc0106 (Collaborator):

> When will the diffusion models support TP, CFG, USP, and distVAE?

TP/USP should be ready by the end of this month; the others are left to Q1.

@hsliuustc0106 (Collaborator):

Please add tests; refer to the qwen-image tests.

@linyueqian (Contributor, Author):

> Please add tests; refer to the qwen-image tests.

Got it. I just added a test_video_diffusion_model.py file in a similar fashion.
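For readers following along, a bare-bones shape of such an offline smoke test (purely illustrative; run_t2v_pipeline is a hypothetical stand-in for whatever vllm-omni's actual generation API is, and the qwen-image tests define the real pattern):

import numpy as np
import pytest

def run_t2v_pipeline(prompt: str, num_frames: int, height: int, width: int) -> np.ndarray:
    # Hypothetical helper: in the real test this would invoke the Wan2.2
    # pipeline the same way the qwen-image tests invoke theirs.
    raise NotImplementedError

@pytest.mark.skip(reason="sketch only; see test_video_diffusion_model.py for the real test")
def test_wan22_t2v_smoke():
    frames = run_t2v_pipeline("a cat boxing on stage", num_frames=8, height=480, width=832)
    # A smoke test mainly checks shape/dtype sanity, not visual quality.
    assert frames.ndim == 4  # (T, H, W, C)
    assert frames.shape[0] == 8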

@hsliuustc0106 (Collaborator):

I think we can get this PR merged now; later we need to open a new issue for a few TODO jobs:

  • tests should be changed accordingly; @congw729 please provide instructions for offline tests
  • support image2video & txt-image2video jobs
  • refactor examples/offline/video_generation/ so it can be used for other video generation models

@hsliuustc0106 merged commit 4128d63 into vllm-project:main on Dec 11, 2025; 4 checks passed.
@congw729 (Contributor):

> tests should be changed accordingly; @congw729 please provide instructions for offline tests

Got it.

LawJarp-A pushed a commit to LawJarp-A/vllm-omni that referenced this pull request on Dec 12, 2025.
@linyueqian deleted the feat/wan2.2 branch on December 16, 2025.
faaany pushed a commit to faaany/vllm-omni that referenced this pull request on Dec 19, 2025.
princepride pushed a commit to princepride/vllm-omni that referenced this pull request on Jan 10, 2026.
@david6666666 mentioned this pull request on Jan 16, 2026.
@pengchengneo:

Excellent work. May I ask a question: when implementing the text2video model, there should be some error accumulation due to precision conversion during the forward pass. How do you evaluate the model to ensure that the accuracy of the implementation stays consistent with the paper? For example, text models can be run on datasets like GSM8K to observe scores; how do we perform this kind of evaluation for video models? Thank you. @linyueqian @hsliuustc0106

Labels: new model

7 participants