forked from vllm-project/vllm-omni
[Model] Add AniSora T2V and I2V pipeline support #1
Closed
27 commits:
- b792fa7: [Model] Scaffold Index AniSora pipelines and registry (WIP)
- e0fbd94: feat: Implement AniSora T2V and I2V pipelines with examples and tests
- c60dc64: fix: Resolve all linting errors in AniSora pipelines
- 994c9e9: remove: Delete documentation markdown files
- 7e1bd64: docs: Add comprehensive PR validation and testing notebook for Colab
- 5f70d3a: docs: Add comprehensive PR readiness and deployment guidelines
- b6e5c28: docs: Add deployment status summary
- dde1d6f: docs: Add quick start guide with direct answers
- 79c552f: fix: Apply pre-commit formatting (ruff format, trailing whitespace)
- 7bc3014: Add proper exports to anisora __init__.py following vLLM-Omni convent…
- d1655f1: feat: Remove obsolete documentation and add new scripts for AniSora p…
- ef08f9e: Support HTTP/HTTPS image URLs in I2V and T2V scripts
- 7501f0e: feat: Support HTTP/HTTPS image URLs in I2V and T2V scripts
- b048fce: Override model_class_name to use AniSoraImageToVideoPipeline instead …
- 1b6b641: Increase stage_init_timeout to 1200s for model download and initializ…
- b55c193: Add detailed phase logging to track progress through generation pipeline
- c33fe27: Fix: use init_timeout instead of stage_init_timeout parameter
- 09b44da: Remove init_timeout parameter - use default 300s
- 911a272: [Model] Add Index-AniSora I2V support
- 69e9137: feat: Add AniSora V2/V3 (14B) support with hybrid Wan loading
- 1dc3dcc: fix: Handle AniSora transformer config mismatch for V2 loading
- 465bd49: fix: Simplify transformer loading - always use base config + weights
- d4af658: fix: Add key name conversion for AniSora->diffusers format
- 29c1d8b: fix: Complete key name conversion for AniSora V2 -> diffusers
- 422d5ea: fix: Move all components to device during initialization
- d03b142: docs: Add AniSora V1/V2 examples to image-to-video README
- 81f0eab: chore: Remove demo media files from repo
examples/offline_inference/image_to_video/anisora_image_to_video.py (149 additions, 0 deletions)
```python
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project

"""
AniSora Image-to-Video generation example.

Usage:
    python anisora_image_to_video.py --model /path/to/anisora-diffusers \
        --image input.jpg --prompt "A cat playing with yarn"
"""

import argparse
from pathlib import Path

import numpy as np
import PIL.Image
import torch

from vllm_omni.entrypoints.omni import Omni
from vllm_omni.outputs import OmniRequestOutput
from vllm_omni.utils.platform_utils import detect_device_type, is_npu


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Generate a video from an image with AniSora I2V.")
    parser.add_argument("--model", required=True, help="AniSora Diffusers I2V model ID or local path.")
    parser.add_argument("--image", required=True, help="Path to input image.")
    parser.add_argument("--prompt", default="", help="Text prompt describing the desired motion.")
    parser.add_argument("--negative_prompt", default="", help="Negative prompt.")
    parser.add_argument("--seed", type=int, default=42, help="Random seed.")
    parser.add_argument("--guidance_scale", type=float, default=5.0, help="CFG scale.")
    parser.add_argument("--height", type=int, default=None, help="Video height (auto-calculated if not set).")
    parser.add_argument("--width", type=int, default=None, help="Video width (auto-calculated if not set).")
    parser.add_argument("--num_frames", type=int, default=81, help="Number of frames.")
    parser.add_argument("--num_inference_steps", type=int, default=50, help="Sampling steps.")
    parser.add_argument("--flow_shift", type=float, default=5.0, help="Scheduler flow_shift.")
    parser.add_argument("--output", type=str, default="anisora_i2v.mp4", help="Path to save the video (mp4).")
    parser.add_argument("--fps", type=int, default=16, help="Frames per second for the output video.")
    return parser.parse_args()


def calculate_dimensions(image: PIL.Image.Image, max_area: int = 480 * 832) -> tuple[int, int]:
    aspect_ratio = image.height / image.width
    mod_value = 16

    height = round(np.sqrt(max_area * aspect_ratio)) // mod_value * mod_value
    width = round(np.sqrt(max_area / aspect_ratio)) // mod_value * mod_value

    return height, width


def main():
    args = parse_args()
    device = detect_device_type()
    generator = torch.Generator(device=device).manual_seed(args.seed)

    # Load input image
    image = PIL.Image.open(args.image).convert("RGB")

    # Calculate dimensions if not provided
    height = args.height
    width = args.width
    if height is None or width is None:
        calc_height, calc_width = calculate_dimensions(image, max_area=480 * 832)
        height = height or calc_height
        width = width or calc_width

    # Resize image to target dimensions
    image = image.resize((width, height), PIL.Image.Resampling.LANCZOS)

    # Enable VAE memory optimizations on NPU
    vae_use_slicing = is_npu()
    vae_use_tiling = is_npu()

    omni = Omni(
        model=args.model,
        vae_use_slicing=vae_use_slicing,
        vae_use_tiling=vae_use_tiling,
        flow_shift=args.flow_shift,
    )

    frames = omni.generate(
        args.prompt,
        negative_prompt=args.negative_prompt,
        pil_image=image,
        height=height,
        width=width,
        generator=generator,
        guidance_scale=args.guidance_scale,
        num_inference_steps=args.num_inference_steps,
        num_frames=args.num_frames,
    )

    # Extract video frames from OmniRequestOutput
    if isinstance(frames, list) and len(frames) > 0:
        first_item = frames[0]

        if hasattr(first_item, "final_output_type"):
            if first_item.final_output_type != "image":
                raise ValueError(
                    f"Unexpected output type '{first_item.final_output_type}', expected 'image' for video generation."
                )

        # Pipeline mode: extract from nested request_output
        if hasattr(first_item, "is_pipeline_output") and first_item.is_pipeline_output:
            if isinstance(first_item.request_output, list) and len(first_item.request_output) > 0:
                inner_output = first_item.request_output[0]
                if isinstance(inner_output, OmniRequestOutput) and hasattr(inner_output, "images"):
                    frames = inner_output.images[0] if inner_output.images else None
                    if frames is None:
                        raise ValueError("No video frames found in output.")
        # Diffusion mode: use direct images field
        elif hasattr(first_item, "images") and first_item.images:
            frames = first_item.images
        else:
            raise ValueError("No video frames found in OmniRequestOutput.")

    output_path = Path(args.output)
    output_path.parent.mkdir(parents=True, exist_ok=True)

    try:
        from diffusers.utils import export_to_video
    except ImportError:
        raise ImportError("diffusers is required for export_to_video.")

    if isinstance(frames, torch.Tensor):
        video_tensor = frames.detach().cpu()
        if video_tensor.dim() == 5:
            # [B, C, F, H, W] or [B, F, H, W, C]
            if video_tensor.shape[1] in (3, 4):
                video_tensor = video_tensor[0].permute(1, 2, 3, 0)
            else:
                video_tensor = video_tensor[0]
        elif video_tensor.dim() == 4 and video_tensor.shape[0] in (3, 4):
            video_tensor = video_tensor.permute(1, 2, 3, 0)
        # If float, assume [-1, 1] and normalize to [0, 1]
        if video_tensor.is_floating_point():
            video_tensor = video_tensor.clamp(-1, 1) * 0.5 + 0.5
        video_array = video_tensor.float().numpy()
    else:
        video_array = frames
        if hasattr(video_array, "shape") and video_array.ndim == 5:
            video_array = video_array[0]

    # Convert 4D array (frames, H, W, C) to a list of frames for export_to_video
    if isinstance(video_array, np.ndarray) and video_array.ndim == 4:
        video_array = list(video_array)

    export_to_video(video_array, str(output_path), fps=args.fps)
    print(f"Saved generated video to {output_path}")


if __name__ == "__main__":
    main()
```
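The `calculate_dimensions` helper above picks the largest height/width pair that preserves the input image's aspect ratio, keeps the pixel count near `max_area`, and snaps both sides down to a multiple of 16 (`mod_value` in the script). A dependency-free sketch of the same rounding logic, using `math.sqrt` in place of `numpy` and taking raw dimensions instead of a PIL image:

```python
import math


def calculate_dimensions(img_height: int, img_width: int,
                         max_area: int = 480 * 832,
                         mod_value: int = 16) -> tuple[int, int]:
    """Aspect-ratio-preserving (height, width) with a pixel budget of
    roughly max_area, both sides floored to a multiple of mod_value."""
    aspect_ratio = img_height / img_width
    height = round(math.sqrt(max_area * aspect_ratio)) // mod_value * mod_value
    width = round(math.sqrt(max_area / aspect_ratio)) // mod_value * mod_value
    return height, width


# A 1080x1920 landscape frame maps to (464, 832): same 9:16-ish ratio,
# both sides divisible by 16, area under the 480*832 budget.
print(calculate_dimensions(1080, 1920))
```

Because the `round(...)` result is floored to the nearest multiple of 16, the output area can land slightly under (never meaningfully over) the requested budget.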
examples/offline_inference/text_to_video/anisora_text_to_video.py (131 additions, 0 deletions)
```python
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project

import argparse
from pathlib import Path

import numpy as np
import torch

from vllm_omni.entrypoints.omni import Omni
from vllm_omni.outputs import OmniRequestOutput
from vllm_omni.utils.platform_utils import detect_device_type, is_npu


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Generate a video with AniSora T2V.")
    parser.add_argument(
        "--model",
        required=True,
        help="AniSora Diffusers model ID or local path.",
    )
    parser.add_argument("--prompt", required=True, help="Text prompt.")
    parser.add_argument("--negative_prompt", default="", help="Negative prompt.")
    parser.add_argument("--seed", type=int, default=42, help="Random seed.")
    parser.add_argument("--guidance_scale", type=float, default=4.0, help="CFG scale (applied to low/high).")
    parser.add_argument("--guidance_scale_high", type=float, default=None, help="Optional separate CFG for high-noise.")
    parser.add_argument("--height", type=int, default=720, help="Video height.")
    parser.add_argument("--width", type=int, default=1280, help="Video width.")
    parser.add_argument("--num_frames", type=int, default=81, help="Number of frames.")
    parser.add_argument("--num_inference_steps", type=int, default=40, help="Sampling steps.")
    parser.add_argument("--boundary_ratio", type=float, default=0.875, help="Boundary split ratio for low/high DiT.")
    parser.add_argument(
        "--flow_shift", type=float, default=5.0, help="Scheduler flow_shift (5.0 for 720p, 12.0 for 480p)."
    )
    parser.add_argument("--output", type=str, default="anisora_t2v.mp4", help="Path to save the video (mp4).")
    parser.add_argument("--fps", type=int, default=24, help="Frames per second for the output video.")
    return parser.parse_args()


def main():
    args = parse_args()
    device = detect_device_type()
    generator = torch.Generator(device=device).manual_seed(args.seed)

    # Enable VAE memory optimizations on NPU
    vae_use_slicing = is_npu()
    vae_use_tiling = is_npu()

    omni = Omni(
        model=args.model,
        vae_use_slicing=vae_use_slicing,
        vae_use_tiling=vae_use_tiling,
        boundary_ratio=args.boundary_ratio,
        flow_shift=args.flow_shift,
    )

    frames = omni.generate(
        args.prompt,
        negative_prompt=args.negative_prompt,
        height=args.height,
        width=args.width,
        generator=generator,
        guidance_scale=args.guidance_scale,
        guidance_scale_2=args.guidance_scale_high,
        num_inference_steps=args.num_inference_steps,
        num_frames=args.num_frames,
    )

    # Extract video frames from OmniRequestOutput
    if isinstance(frames, list) and len(frames) > 0:
        first_item = frames[0]

        # Check if it's an OmniRequestOutput
        if hasattr(first_item, "final_output_type"):
            if first_item.final_output_type != "image":
                raise ValueError(
                    f"Unexpected output type '{first_item.final_output_type}', expected 'image' for video generation."
                )

        # Pipeline mode: extract from nested request_output
        if hasattr(first_item, "is_pipeline_output") and first_item.is_pipeline_output:
            if isinstance(first_item.request_output, list) and len(first_item.request_output) > 0:
                inner_output = first_item.request_output[0]
                if isinstance(inner_output, OmniRequestOutput) and hasattr(inner_output, "images"):
                    frames = inner_output.images[0] if inner_output.images else None
                    if frames is None:
                        raise ValueError("No video frames found in output.")
        # Diffusion mode: use direct images field
        elif hasattr(first_item, "images") and first_item.images:
            frames = first_item.images
        else:
            raise ValueError("No video frames found in OmniRequestOutput.")

    output_path = Path(args.output)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    try:
        from diffusers.utils import export_to_video
    except ImportError:
        raise ImportError("diffusers is required for export_to_video.")

    # frames may be np.ndarray (preferred) or torch.Tensor;
    # export_to_video expects a list of frames with values in [0, 1]
    if isinstance(frames, torch.Tensor):
        video_tensor = frames.detach().cpu()
        if video_tensor.dim() == 5:
            # [B, C, F, H, W] or [B, F, H, W, C]
            if video_tensor.shape[1] in (3, 4):
                video_tensor = video_tensor[0].permute(1, 2, 3, 0)
            else:
                video_tensor = video_tensor[0]
        elif video_tensor.dim() == 4 and video_tensor.shape[0] in (3, 4):
            video_tensor = video_tensor.permute(1, 2, 3, 0)
        # If float, assume [-1, 1] and normalize to [0, 1]
        if video_tensor.is_floating_point():
            video_tensor = video_tensor.clamp(-1, 1) * 0.5 + 0.5
        video_array = video_tensor.float().numpy()
    else:
        video_array = frames
        if hasattr(video_array, "shape") and video_array.ndim == 5:
            video_array = video_array[0]

    # Convert 4D array (frames, H, W, C) to a list of frames for export_to_video
    if isinstance(video_array, np.ndarray) and video_array.ndim == 4:
        video_array = list(video_array)

    export_to_video(video_array, str(output_path), fps=args.fps)
    print(f"Saved generated video to {output_path}")


if __name__ == "__main__":
    main()
```
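Both example scripts normalize floating-point frames the same way before export: diffusion models emit values roughly in [-1, 1], while `export_to_video` expects [0, 1], so each value is clamped and remapped via `clamp(-1, 1) * 0.5 + 0.5`. A pure-Python sketch of that per-value mapping:

```python
def to_unit_range(v: float) -> float:
    """Map a model output value from [-1, 1] to [0, 1], clamping outliers."""
    clamped = max(-1.0, min(1.0, v))
    return clamped * 0.5 + 0.5


# -1.0 -> 0.0, 0.0 -> 0.5, 1.0 -> 1.0; out-of-range values are clamped first.
print(to_unit_range(-1.0), to_unit_range(0.0), to_unit_range(2.5))
```

In the scripts this runs elementwise on the whole tensor; clamping first means stray values outside [-1, 1] saturate to black or white instead of wrapping or overflowing.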
Registry test file (path not shown in this capture; 19 additions, 0 deletions)
```python
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project

from vllm_omni.diffusion.registry import (
    PIPELINE_REGISTRY,
    DIFFUSION_PRE_PROCESS_MAP,
    DIFFUSION_POST_PROCESS_MAP,
)


def test_anisora_registry_entries_present():
    assert "AniSoraPipeline" in PIPELINE_REGISTRY
    assert "AniSoraImageToVideoPipeline" in PIPELINE_REGISTRY

    assert "AniSoraPipeline" in DIFFUSION_PRE_PROCESS_MAP
    assert "AniSoraPipeline" in DIFFUSION_POST_PROCESS_MAP

    assert "AniSoraImageToVideoPipeline" in DIFFUSION_PRE_PROCESS_MAP
    assert "AniSoraImageToVideoPipeline" in DIFFUSION_POST_PROCESS_MAP
```
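The test above checks that both AniSora pipeline names appear in three lookup tables. The actual `vllm_omni.diffusion.registry` implementation is not part of this diff; the following is a minimal dict-based sketch of the registration pattern the test implies, with hypothetical stand-ins for the pipeline class and pre/post-processing callables:

```python
# Hypothetical stand-ins; the real registries in vllm_omni.diffusion.registry
# map names to pipeline classes and pre/post-processing callables.
PIPELINE_REGISTRY: dict[str, object] = {}
DIFFUSION_PRE_PROCESS_MAP: dict[str, object] = {}
DIFFUSION_POST_PROCESS_MAP: dict[str, object] = {}


def register_pipeline(name: str, pipeline: object, pre, post) -> None:
    """Register one pipeline under all three lookup tables at once,
    so a name can never end up in only some of them."""
    PIPELINE_REGISTRY[name] = pipeline
    DIFFUSION_PRE_PROCESS_MAP[name] = pre
    DIFFUSION_POST_PROCESS_MAP[name] = post


for name in ("AniSoraPipeline", "AniSoraImageToVideoPipeline"):
    register_pipeline(name, pipeline=object(), pre=lambda x: x, post=lambda x: x)
```

Registering through a single helper keeps the three maps consistent, which is exactly the invariant the PR's test asserts.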
License header stub (2 additions, 0 deletions)

```python
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
```