# Add FastWan (DMD) distillation method #1695
Merged
Commits (6):

- e2d3822 add DMD (FastWan) distillation
- 49a74cb add FastWan (DMD) documentation
- 1e353d1 Update helpers/distillation/dmd/distiller.py
- 485daea Update helpers/distillation/dmd/distiller.py (bghira)
- 9dd24c5 Update helpers/distillation/dmd/distiller.py (bghira)
- 86d1b49 Update documentation/distillation/FASTWAN_DMD.md (bghira)
# DMD Distillation Quickstart (SimpleTuner)

In this example, we'll be training a **3-step student** using **DMD (Distribution Matching Distillation)** from a large flow-matching teacher model such as [Wan 2.1 T2V](https://huggingface.co/Wan-AI/Wan2.1-T2V-1.3B).

DMD features:

* **Generator (Student)**: learns to match the teacher's outputs in fewer steps
* **Fake Score Transformer**: discriminates between teacher and student outputs
* **Multi-step simulation**: optional train-inference consistency mode

---
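The three components above interact in an alternating update schedule. The sketch below is a framework-free illustration of that schedule only; the function names and scalar "latents" are hypothetical, not SimpleTuner's actual internals.

```python
# Illustrative sketch of the DMD update schedule described above.
# Scalars stand in for latent tensors; all names are hypothetical.

def dmd_training_step(step, generator, fake_score, teacher,
                      generator_update_interval=5):
    """One outer step: the fake score transformer trains every step,
    while the generator (student) only updates every
    `generator_update_interval` steps."""
    noise = 0.5  # placeholder for sampled latent noise
    student_sample = generator(noise)

    # The fake score transformer learns to model the student's outputs,
    # so it can discriminate them from the teacher's.
    fake_score_loss = (fake_score(student_sample) - student_sample) ** 2

    losses = {"fake_score_loss": fake_score_loss}
    if step % generator_update_interval == 0:
        # Distribution-matching direction: the gap between the teacher's
        # ("real") score and the fake score on the student's sample.
        dmd_grad = teacher(student_sample) - fake_score(student_sample)
        losses["dmd_loss"] = abs(dmd_grad)
    return losses
```

With a `generator_update_interval` of 5, the student receives one update for every five fake-score updates, which keeps the discriminator ahead of the student.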
## ✅ Hardware Requirements

⚠️ DMD is memory-intensive: the fake score transformer requires that a complete second copy of the base model be kept in memory.

If you do not have the needed VRAM, it's recommended to try the LCM or DCM distillation methods for the 14B Wan model instead of DMD.

Without sparse attention support, **a B200** may currently be required when distilling the 14B model.

Using LoRA student training can reduce the requirements substantially, but they remain quite hefty.

---
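As a rough back-of-envelope, you can count the weight copies alone. This is a deliberately crude sketch: it ignores activations, gradients, optimizer state, and the text encoder, and assumes roughly 1 byte per parameter under int8 quantization. The helper name and the 2% LoRA fraction are made up for illustration.

```python
def weight_copies_gb(params_billion, bytes_per_param=1.0, lora_student=True):
    """Approximate GB held just by model weights during DMD:
    teacher + fake score transformer (the full second copy) + student.
    A LoRA student adds only a small fraction of a full copy."""
    full_copy = params_billion * bytes_per_param  # ~1 GB per 1B params at int8
    student = 0.02 * full_copy if lora_student else full_copy
    return 2 * full_copy + student  # teacher + fake score + student
```

Even at int8, the 14B model already triples a full weight copy before any activations are counted, which is why LoRA students and the smaller 1.3B model are far more approachable.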
## 📦 Installation

```bash
git clone --branch=release https://github.com/bghira/SimpleTuner.git
cd SimpleTuner
python3.12 -m venv .venv && source .venv/bin/activate
pip install -U poetry pip
poetry config virtualenvs.create false
poetry install
```

> If you're on ROCm or Apple, you must instead use `poetry install -C install/variant`, where `variant` is `rocm` or `apple`.

---
## 📁 Configuration

Edit your `config/config.json`:

```json
{
  "aspect_bucket_rounding": 2,
  "attention_mechanism": "diffusers",
  "base_model_precision": "int8-quanto",
  "caption_dropout_probability": 0.1,
  "checkpointing_steps": 200,
  "checkpoints_total_limit": 3,
  "compress_disk_cache": true,
  "data_backend_config": "config/wan/multidatabackend.json",
  "delete_problematic_images": false,
  "disable_benchmark": false,
  "disable_bucket_pruning": true,
  "distillation_method": "dmd",
  "distillation_config": {
    "dmd_denoising_steps": "1000,757,522",
    "min_timestep_ratio": 0.02,
    "max_timestep_ratio": 0.98,
    "generator_update_interval": 5,
    "real_score_guidance_scale": 3.0,
    "simulate_generator_forward": false,
    "fake_score_lr": 1e-5,
    "fake_score_lr_scheduler": "cosine_with_min_lr",
    "min_lr_ratio": 0.5
  },
  "ema_update_interval": 5,
  "ema_validation": "ema_only",
  "flow_schedule_shift": 5,
  "grad_clip_method": "value",
  "gradient_accumulation_steps": 1,
  "gradient_checkpointing": true,
  "hub_model_id": "wan-disney-DMD-3step",
  "ignore_final_epochs": true,
  "learning_rate": 2e-5,
  "lora_alpha": 128,
  "lora_rank": 128,
  "lora_type": "standard",
  "lr_scheduler": "cosine_with_min_lr",
  "lr_warmup_steps": 100,
  "max_grad_norm": 1.0,
  "max_train_steps": 4000,
  "minimum_image_size": 0,
  "mixed_precision": "bf16",
  "model_family": "wan",
  "model_type": "lora",
  "num_train_epochs": 0,
  "optimizer": "adamw_bf16",
  "output_dir": "output/wan-dmd",
  "pretrained_model_name_or_path": "Wan-AI/Wan2.1-T2V-1.3B-Diffusers",
  "pretrained_t5_model_name_or_path": "Wan-AI/Wan2.1-T2V-1.3B-Diffusers",
  "push_checkpoints_to_hub": true,
  "push_to_hub": true,
  "quantize_via": "cpu",
  "report_to": "wandb",
  "resolution": 480,
  "resolution_type": "pixel_area",
  "resume_from_checkpoint": "latest",
  "seed": 1000,
  "text_encoder_1_precision": "int8-quanto",
  "tracker_project_name": "dmd-training",
  "tracker_run_name": "wan-DMD-3step",
  "train_batch_size": 1,
  "use_ema": true,
  "vae_batch_size": 1,
  "validation_guidance": 1.0,
  "validation_negative_prompt": "色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走",
  "validation_num_inference_steps": 3,
  "validation_num_video_frames": 121,
  "validation_prompt": "A black and white animated scene unfolds featuring a distressed upright cow with prominent horns and expressive eyes, suspended by its legs from a hook on a static background wall. A smaller Mickey Mouse-like character enters, standing near a wooden bench, initiating interaction between the two. The cow's posture changes as it leans, stretches, and falls, while the mouse watches with a concerned expression, its face a mixture of curiosity and worry, in a world devoid of color.",
  "validation_prompt_library": "config/wan/validation_prompts_dmd.json",
  "validation_resolution": "1280x704",
  "validation_seed": 42,
  "validation_steps": 200,
  "webhook_config": "config/wan/webhook.json"
}
```
### Key DMD Settings

* **`dmd_denoising_steps`**: the timesteps to distill to (default: `"1000,757,522"` for a 3-step student)
* **`generator_update_interval`**: update the generator every N steps (balances generator and fake score training)
* **`simulate_generator_forward`**: enable multi-step simulation (increases memory usage)
* **`fake_score_lr`**: separate learning rate for the fake score transformer

---
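A hedged sketch of how the first of these settings, plus the `min_timestep_ratio`/`max_timestep_ratio` pair from the config, could be interpreted. The parsing and clamping below are illustrative only; SimpleTuner's actual handling may differ.

```python
def parse_denoising_steps(spec: str) -> list[int]:
    """Turn a spec like "1000,757,522" into the student's timesteps."""
    return [int(s) for s in spec.split(",")]

def clamp_timestep(t: int, num_train_timesteps: int = 1000,
                   min_ratio: float = 0.02, max_ratio: float = 0.98) -> int:
    """Apply min_timestep_ratio / max_timestep_ratio: sampled timesteps
    are kept inside the central portion of the noise schedule."""
    lo = int(num_train_timesteps * min_ratio)
    hi = int(num_train_timesteps * max_ratio)
    return max(lo, min(hi, t))
```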
## 🎬 Dataset & Dataloader

For DMD to work well, you need **diverse, high-quality data**:

```json
{
  "dataset_type": "video",
  "cache_dir": "cache/wan-dmd",
  "resolution_type": "pixel_area",
  "crop": false,
  "num_frames": 121,
  "frame_interval": 1,
  "resolution": 480,
  "minimum_image_size": 0,
  "repeats": 0
}
```

> **Note**: The Disney dataset is **inadequate** for DMD. **Don't use it!** It's provided merely for illustrative purposes.
You need:

- High volume (10k+ videos minimum)
- Diverse content (different styles, motions, subjects)
- High quality (no compression artifacts)

These may be generated from the parent model.

---
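The three requirements can be captured as a quick sanity check. The thresholds mirror the prose above; the function itself is purely illustrative and not part of SimpleTuner.

```python
def check_dmd_dataset(num_videos, num_styles, has_compression_artifacts):
    """Return a list of issues with a candidate DMD dataset,
    one per unmet requirement; an empty list means it passes."""
    issues = []
    if num_videos < 10_000:
        issues.append("volume: need 10k+ videos minimum")
    if num_styles < 2:
        issues.append("diversity: need varied styles, motions, subjects")
    if has_compression_artifacts:
        issues.append("quality: remove clips with compression artifacts")
    return issues
```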
## 🚀 Training Tips

1. **Start without simulation**: set `"simulate_generator_forward": false` initially
2. **Monitor both losses**: watch `dmd_loss` and `fake_score_loss` in wandb
3. **Validate frequently**: DMD converges quickly, so validate often
4. **Manage memory**:
   - Use `gradient_checkpointing`
   - Lower `train_batch_size` to 1
   - Consider `base_model_precision: "int8-quanto"`

---
## 📌 DMD vs DCM

| Feature | DCM | DMD |
|---------|-----|-----|
| Memory usage | Lower | Higher (fake score model) |
| Training time | Longer | Shorter (4k steps typical) |
| Quality | Good | Excellent |
| Inference steps | 4-8+ | 3-8 |
| Stability | Stable | Requires tuning |

---
## 🧩 Troubleshooting

| Problem | Fix |
|---------|-----|
| **OOM errors** | Disable `simulate_generator_forward`, reduce batch size |
| **Fake score not learning** | Increase `fake_score_lr` or use a different scheduler |
| **Generator overfitting** | Increase `generator_update_interval` to 10 |
| **Poor 3-step quality** | Try `"1000,500"` for a 2-step student first |
| **Training unstable** | Lower learning rates, check data quality |

---
## 🔬 Advanced Options

For brave souls wanting to experiment:

```json
"distillation_config": {
  "dmd_denoising_steps": "1000,666,333",
  "simulate_generator_forward": true,
  "fake_score_use_ema": true,
  "adversarial_weight": 0.1,
  "shift": 7.0
}
```

> ⚠️ For resource-constrained projects, it's recommended to use the original FastVideo implementation of DMD, as it supports sequence-parallel and video-sparse attention (VSA) for far more efficient runtime usage.