Merged
Commits
63 commits
92db582
Wan 2.2 A14B
Oct 17, 2025
8e26f86
wip: wan 2.2 a14b + examples + CLIP img embed support
Oct 17, 2025
50f8870
group offload support + fixes for tests
Oct 18, 2025
613ec14
fix test errors
Oct 18, 2025
57e5031
add FSDP2 and group offload to the training wizard
Oct 18, 2025
c354e09
rename i2v-480p-14b-2.2-low|high -> i2v-14b-2.2-low|high
Oct 18, 2025
8a84a9e
fix test for data backend
Oct 18, 2025
24733c6
port concept of musubi-tuner wan_force_2_1_time_embedding
Oct 18, 2025
33355b9
fix for test
Oct 18, 2025
ddb87bf
fix qwen example config lr scheduler warmup
Oct 18, 2025
c6cd875
error handling improvement
Oct 19, 2025
e58f489
Merge branch 'bugfix/qwen-unpack' of https://github.com/bghira/Simple…
Oct 19, 2025
2089654
refactor qwen fix
Oct 19, 2025
67beeae
bring more inline with upstream
Oct 19, 2025
0a016ab
revert qwen code to pre-backports
Oct 19, 2025
c772cdf
fix TREAD impl
Oct 19, 2025
82b37b2
registry missing
Oct 19, 2025
b38a92f
fix search path to look in notebooks and workspace first
Oct 19, 2025
2e20210
search for /workspace and /notebooks first before suggesting xdg-home
Oct 19, 2025
031358c
multi-gpu fix attempt for accelerate kwargs
Oct 19, 2025
3b4ff55
fix some lingering bugs and test failures
Oct 19, 2025
d38c93a
better auto-stripping of non-trainer args
Oct 19, 2025
d0a329e
prefer nvidia-ml-py instead of abandoned pynvml
Oct 19, 2025
0f68a35
add dependency
Oct 19, 2025
5fb118a
attempt to resolve num_processes houdini act
Oct 19, 2025
2caa7e4
fix model path
Oct 19, 2025
359cc61
use tmp file to load video using tsr
Oct 19, 2025
5d1614a
update tsr dependency version
Oct 19, 2025
1aa508c
fix redeclaration
Oct 19, 2025
c38892c
relax video model detection
Oct 19, 2025
8dcc97d
refactor cond image embed abstraction
Oct 19, 2025
6e216e0
update wan example model path
Oct 19, 2025
9fdc26e
fix tread config handler
Oct 19, 2025
0813cba
add pipelinetype for img2video
Oct 20, 2025
fe89254
add missing i2v logic
Oct 20, 2025
dbfd3dd
update tsr
Oct 20, 2025
32efdb1
allow alt image embed provider to return whatever it needs to
Oct 20, 2025
5ccc226
make more robust data loading for videos in huggingface data backend
Oct 20, 2025
0b3586f
log specific path when loading
Oct 20, 2025
3434ace
better contract for image embedder producers and more direct retrieva…
Oct 20, 2025
c1bc36e
do not create /tmp entries where we cannot
Oct 20, 2025
625ceec
wip: flf2v 2.1, ti2v 2.2, i2v 2.1
Oct 20, 2025
eaba160
update paths
Oct 20, 2025
88fbbd5
relocate inputs to CPU if encoder is there (group offload)
Oct 20, 2025
199010c
Merge branch 'main' into feature/wan-2.2
bghira Oct 20, 2025
de7f9b5
add missing chroma value to legacy list of model families
Oct 20, 2025
d92d81d
do not try and apply group offload to text encoder(s)
Oct 20, 2025
e477935
do as transformers requests and use PreTrainedTokenizer
Oct 20, 2025
b9b47bd
fix for rocm error
Oct 20, 2025
bfa7a35
chroma: more fixes for tokeniser seq len and autotokenizer usage
Oct 20, 2025
2df5855
chroma: more fixes for attn masking with image tokens
Oct 20, 2025
e5b6dc5
chroma: fixes for text encoder inputs to pipeline
Oct 20, 2025
7c1b4cc
(#1780) check for user prompt library validity
Oct 20, 2025
c342f70
tighten check for which models need img conditioning embed
Oct 20, 2025
138151c
vae hook for transforming the vae or its samples before the encode, h…
Oct 20, 2025
383b306
round progress when displaying percentage
Oct 20, 2025
326faf3
update docs for wan 2.x broad compatibility
Oct 20, 2025
5e8e5a7
fix tests and adjust docs
Oct 20, 2025
a94a8bf
fix more qwen tests
Oct 20, 2025
90973ed
fix more qwen stuff
Oct 21, 2025
bd33a27
store test progress in mock
Oct 21, 2025
63e8d3c
fix more qwen stuff
Oct 21, 2025
ffa8481
rocm test failure fixes
Oct 21, 2025
2 changes: 2 additions & 0 deletions README.md
@@ -89,6 +89,7 @@ SimpleTuner provides comprehensive training support across multiple diffusion mo
- **Gradient checkpointing** - Configurable intervals for memory/speed optimization
- **Loss functions** - L2, Huber, Smooth L1 with scheduling support
- **SNR weighting** - Min-SNR gamma weighting for improved training dynamics
- **Group offloading** - Diffusers v0.33+ module-group CPU/disk staging with optional CUDA streams

### Model-Specific Features

@@ -99,6 +100,7 @@ SimpleTuner provides comprehensive training support across multiple diffusion mo
- **T5 masked training** - Enhanced fine details for Flux and compatible models
- **QKV fusion** - Memory and speed optimizations (Flux, Lumina2)
- **TREAD integration** - Selective token routing for Wan and Flux models
- **Wan 2.x I2V** - High/low stage presets plus a 2.1 time-embedding fallback (see Wan quickstart)
- **Classifier-free guidance** - Optional CFG reintroduction for distilled models

### Quickstart Guides
29 changes: 26 additions & 3 deletions documentation/DATALOADER.md
@@ -49,8 +49,8 @@ Here is the most basic example of a dataloader configuration file, as `multidata

### `dataset_type`

- **Values:** `image` | `video` | `text_embeds` | `image_embeds` | `conditioning`
- **Description:** `image` and `video` datasets contain your training data. `text_embeds` contain the outputs of the text encoder cache, and `image_embeds` contain the VAE outputs, if the model uses one. When a dataset is marked as `conditioning`, it is possible to pair it to your `image` dataset via [the conditioning_data option](#conditioning_data)
- **Values:** `image` | `video` | `text_embeds` | `image_embeds` | `conditioning_image_embeds` | `conditioning`
- **Description:** `image` and `video` datasets contain your training data. `text_embeds` contain the outputs of the text encoder cache, `image_embeds` contain the VAE latents (when a model uses one), and `conditioning_image_embeds` store cached conditioning image embeddings (such as CLIP vision features). When a dataset is marked as `conditioning`, it is possible to pair it to your `image` dataset via [the conditioning_data option](#conditioning_data)
- **Note:** Text and image embed datasets are defined differently than image datasets are. A text embed dataset stores ONLY the text embed objects. An image dataset stores the training data.
- **Note:** Don't combine images and video in a **single** dataset. Split them out.

@@ -69,6 +69,22 @@ Here is the most basic example of a dataloader configuration file, as `multidata
- **Only applies to `dataset_type=image`**
- If unset, the VAE outputs will be stored on the image backend. Otherwise, you may set this to the `id` of an `image_embeds` dataset, and the VAE outputs will be stored there instead. Allows associating the image_embed dataset to the image data.

### `conditioning_image_embeds`

- **Applies to `dataset_type=image` and `dataset_type=video`**
- When a model reports `requires_conditioning_image_embeds`, set this to the `id` of a `conditioning_image_embeds` dataset to store cached conditioning image embeddings (for example, CLIP vision features for Wan 2.2 I2V). If unset, SimpleTuner writes the cache to `cache/conditioning_image_embeds/<dataset_id>` by default, guaranteeing it no longer collides with the VAE cache.
- Models that need these embeds must expose an image encoder through their primary pipeline. If the model cannot supply the encoder, preprocessing will fail early instead of silently generating empty files.
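
For reference, a minimal pairing might look like the sketch below (the `id` values and cache path are illustrative, not required names); the full `multidatabackend.json` example later in this document shows the same pairing in context:

```json
[
  {
    "id": "my-video-data",
    "type": "local",
    "dataset_type": "video",
    "conditioning_image_embeds": "my-clip-embeds"
  },
  {
    "id": "my-clip-embeds",
    "type": "local",
    "dataset_type": "conditioning_image_embeds",
    "cache_dir": "cache/conditioning_image_embeds/my-video-data",
    "disabled": false
  }
]
```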

#### `cache_dir_conditioning_image_embeds`

- **Optional override for the conditioning image embed cache destination.**
- Set this when you want to pin the cache to a specific filesystem location or have a dedicated remote backend (`dataset_type=conditioning_image_embeds`). When omitted, the cache path described above is used automatically.

#### `conditioning_image_embed_batch_size`

- **Optional override for the batch size used while generating conditioning image embeds.**
- Defaults to the `conditioning_image_embed_batch_size` trainer argument or the VAE batch size when not explicitly provided.

### `type`

- **Values:** `aws` | `local` | `csv` | `huggingface`
@@ -430,7 +446,8 @@ In order, the lines behave as follows:
"probability": 1.0,
"repeats": 0,
"text_embeds": "alt-embed-cache",
"image_embeds": "vae-embeds-example"
"image_embeds": "vae-embeds-example",
"conditioning_image_embeds": "conditioning-embeds-example"
},
{
"id": "another-special-name-for-another-backend",
@@ -451,6 +468,12 @@ In order, the lines behave as follows:
"dataset_type": "image_embeds",
"disabled": false,
},
{
"id": "conditioning-embeds-example",
"type": "local",
"dataset_type": "conditioning_image_embeds",
"disabled": false
},
{
"id": "an example backend for text embeds.",
"dataset_type": "text_embeds",
34 changes: 34 additions & 0 deletions documentation/OPTIONS.md
@@ -52,6 +52,40 @@ Where `foo` is your config environment - or just use `config/config.json` if you

- **What**: Offloads text encoder weights to CPU while VAE caching is running.
- **Why**: This is useful for large models like HiDream and Wan 2.1, which can OOM when loading the VAE cache. This option does not impact training quality, but for very large text encoders or slow CPUs it can extend startup time substantially with many datasets, which is why it is disabled by default.
- **Tip**: Complements the group offloading feature below for especially memory-constrained systems.

### `--enable_group_offload`

- **What**: Enables diffusers' grouped module offloading so model blocks can be staged on CPU (or disk) between forward passes.
- **Why**: Dramatically reduces peak VRAM usage on large transformers (Flux, Wan, Auraflow, LTXVideo, Cosmos2Image) with minimal performance impact when used with CUDA streams.
- **Notes**:
- Mutually exclusive with `--enable_model_cpu_offload` — pick one strategy per run.
- Requires diffusers **v0.33.0** or newer.

### `--group_offload_type`

- **Choices**: `block_level` (default), `leaf_level`
- **What**: Controls how layers are grouped. `block_level` balances VRAM savings with throughput, while `leaf_level` maximises savings at the cost of more CPU transfers.

### `--group_offload_blocks_per_group`

- **What**: When using `block_level`, the number of transformer blocks to bundle into a single offload group.
- **Default**: `1`
- **Why**: Increasing this number reduces transfer frequency (faster) but keeps more parameters resident on the accelerator (uses more VRAM).

### `--group_offload_use_stream`

- **What**: Uses a dedicated CUDA stream to overlap host/device transfers with compute.
- **Default**: `False`
- **Notes**:
- Automatically falls back to CPU-style transfers on non-CUDA backends (Apple MPS, ROCm, CPU).
- Recommended when training on NVIDIA GPUs with spare copy engine capacity.

### `--group_offload_to_disk_path`

- **What**: Directory path used to spill grouped parameters to disk instead of RAM.
- **Why**: Useful for extremely tight CPU RAM budgets (e.g., a workstation with a large NVMe drive).
- **Tip**: Use a fast local SSD; network filesystems will significantly slow training.
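
Put together, a run on a memory-constrained NVIDIA system might look like the sketch below. This assumes the trainer flags are passed directly on the CLI; they can equally be placed in `TRAINER_EXTRA_ARGS` or your config, and the disk path is only an example:

```bash
# Sketch: grouped offloading flags combined into one invocation.
# Do not pair these with --enable_model_cpu_offload.
simpletuner train \
  --enable_group_offload \
  --group_offload_type block_level \
  --group_offload_blocks_per_group 1 \
  --group_offload_use_stream
  # optional: spill offloaded weights to disk instead of RAM
  # --group_offload_to_disk_path /fast-ssd/simpletuner-offload
```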

### `--pretrained_model_name_or_path`

4 changes: 3 additions & 1 deletion documentation/QUICKSTART.md
@@ -23,9 +23,11 @@ For the complete and most accurate feature matrix, please see the [main README.m
| [Lumina2](/documentation/quickstart/LUMINA2.md) | 2B | ✓ | ✓ | ✓ | optional (int8) | bf16 | ✓ | ✓ | |
| [Cosmos2](/documentation/quickstart/COSMOS2IMAGE.md) | 2B | ✓ | ✓ | ✓ | not recommended | bf16 | ✓ | ✓ | |
| [LTX Video](/documentation/quickstart/LTXVIDEO.md) | ~2.5B | ✓ | ✓ | ✓ | optional (int8, fp8) | bf16 | ✓ | ✓ | |
| [Wan 2.1](/documentation/quickstart/WAN.md) | 1.3B-14B | ✓ | ✓ | ✓* | optional (int8) | bf16 | ✓ | ✓ | |
| [Wan 2.x](/documentation/quickstart/WAN.md) | 1.3B-14B | ✓ | ✓ | ✓* | optional (int8) | bf16 | ✓ | ✓ | |
| [Qwen Image](/documentation/quickstart/QWEN_IMAGE.md) | 20B | ✓ | ✓ | ✓* | required (int8, nf4) | bf16 | ✓ (required) | ✓ | |

**Note:** The above table provides a simplified overview. For the complete and most accurate feature matrix with detailed specifications, please see the [main README.md](../README.md#model-architecture-support).

> ℹ️ The Wan quickstart covers 2.1 training plus the 2.2 high/low stage presets and the new time-embedding compatibility toggle.

> ⚠️ These tutorials are a work-in-progress. They contain full end-to-end instructions for a basic training session.
17 changes: 17 additions & 0 deletions documentation/quickstart/AURAFLOW.md
@@ -10,6 +10,23 @@ Auraflow v0.3 was released as a 6B parameter MMDiT that uses Pile T5 for its enc

This model is somewhat slow for inference, but trains at a decent speed.

### Memory offloading (optional)

Auraflow benefits greatly from the new grouped offloading path. Add the following to your training flags if you are limited to a single 24G (or smaller) GPU:

```bash
--enable_group_offload \
--group_offload_type block_level \
--group_offload_blocks_per_group 1 \
--group_offload_use_stream \
# optional: spill offloaded weights to disk instead of RAM
# --group_offload_to_disk_path /fast-ssd/simpletuner-offload
```

- Streams are automatically disabled on non-CUDA backends, so the command is safe to reuse on ROCm and MPS.
- Do not combine this with `--enable_model_cpu_offload`.
- Disk offloading trades throughput for lower host RAM pressure; keep it on a local SSD for best results.

### Prerequisites

Make sure that you have python installed; SimpleTuner does well with 3.10 through 3.12.
17 changes: 17 additions & 0 deletions documentation/quickstart/COSMOS2IMAGE.md
@@ -10,6 +10,23 @@ Cosmos2 Predict (Image) is a vision transformer-based model that uses flow match

A 24GB GPU is recommended as the minimum for comfortable training without extensive optimizations.

### Memory offloading (optional)

To squeeze Cosmos2 into smaller GPUs, enable grouped offloading:

```bash
--enable_group_offload \
--group_offload_type block_level \
--group_offload_blocks_per_group 1 \
--group_offload_use_stream \
# optional: spill offloaded weights to disk instead of RAM
# --group_offload_to_disk_path /fast-ssd/simpletuner-offload
```

- Streams are only honoured on CUDA; other devices fall back automatically.
- Do not combine this with `--enable_model_cpu_offload`.
- Disk staging is optional and helps when system RAM is the bottleneck.

### Prerequisites

Make sure that you have python installed; SimpleTuner does well with 3.10 through 3.12.
17 changes: 17 additions & 0 deletions documentation/quickstart/FLUX.md
@@ -26,6 +26,23 @@ Luckily, these are readily available through providers such as [LambdaLabs](http

**Unlike other models, Apple GPUs do not currently work for training Flux.**

### Memory offloading (optional)

Flux supports grouped module offloading via diffusers v0.33+. This dramatically reduces VRAM pressure when you are bottlenecked by the transformer weights. You can enable it by adding the following flags to `TRAINER_EXTRA_ARGS` (or the WebUI Hardware page):

```bash
--enable_group_offload \
--group_offload_type block_level \
--group_offload_blocks_per_group 1 \
--group_offload_use_stream \
# optional: spill offloaded weights to disk instead of RAM
# --group_offload_to_disk_path /fast-ssd/simpletuner-offload
```

- `--group_offload_use_stream` is only effective on CUDA devices; SimpleTuner automatically disables streams on ROCm, MPS and CPU backends.
- Do **not** combine this with `--enable_model_cpu_offload` — the two strategies are mutually exclusive.
- When using `--group_offload_to_disk_path`, prefer a fast local SSD/NVMe target.

## Prerequisites

Make sure that you have python installed; SimpleTuner does well with 3.10 through 3.12.
17 changes: 17 additions & 0 deletions documentation/quickstart/LTXVIDEO.md
@@ -14,6 +14,23 @@ You'll need:

Apple silicon systems work great with LTX so far, albeit at a lower resolution due to limits inside the MPS backend used by Pytorch.

### Memory offloading (optional)

If you are close to the VRAM limit, enable grouped offloading in your config:

```bash
--enable_group_offload \
--group_offload_type block_level \
--group_offload_blocks_per_group 1 \
--group_offload_use_stream \
# optional: spill offloaded weights to disk instead of RAM
# --group_offload_to_disk_path /fast-ssd/simpletuner-offload
```

- CUDA users benefit from `--group_offload_use_stream`; other backends ignore it automatically.
- Skip `--group_offload_to_disk_path` unless system RAM is <64 GB — disk staging is slower but keeps runs stable.
- Disable `--enable_model_cpu_offload` when using group offloading.

### Prerequisites

Make sure that you have python installed; SimpleTuner does well with 3.10 through 3.12.
63 changes: 63 additions & 0 deletions documentation/quickstart/WAN.md
@@ -29,10 +29,30 @@ Currently, image-to-video training is not supported for Wan, but T2V LoRA and Ly
- Resolution: 1280x720
-->

#### Image to Video (Wan 2.2)

Recent Wan 2.2 I2V checkpoints work with the same training flow:

- High stage: https://huggingface.co/Wan-AI/Wan2.2-I2V-14B-Diffusers/tree/main/high_noise_model
- Low stage: https://huggingface.co/Wan-AI/Wan2.2-I2V-14B-Diffusers/tree/main/low_noise_model

You can target the stage you want with the `model_flavour` and `wan_validation_load_other_stage` settings outlined later in this guide.

You'll need:
- **a realistic minimum** is 16GB, e.g. a single 3090 or V100 GPU
- **ideally** multiple 4090, A6000, L40S, or better

If you encounter shape mismatches in the time embedding layers when running Wan 2.2 checkpoints, enable the new
`wan_force_2_1_time_embedding` flag. This forces the transformer to fall back to Wan 2.1 style time embeddings and
resolves the compatibility issue.

#### Stage presets & validation

- `model_flavour=i2v-14b-2.2-high` targets the Wan 2.2 high-noise stage.
- `model_flavour=i2v-14b-2.2-low` targets the low-noise stage (same checkpoints, different subfolder).
- Toggle `wan_validation_load_other_stage=true` to load the opposite stage alongside the one you train for validation renders.
- Leave the flavour unset (or use `t2v-480p-1.3b-2.1`) for the standard Wan 2.1 text-to-video run.
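
As a concrete sketch (flags can go on the CLI as shown, or in `TRAINER_EXTRA_ARGS`), a Wan 2.2 high-stage I2V run could be launched like this:

```bash
# Sketch: train the Wan 2.2 I2V high-noise stage and pull in the low stage
# for validation renders.
simpletuner train \
  --model_flavour i2v-14b-2.2-high \
  --wan_validation_load_other_stage
  # add --wan_force_2_1_time_embedding only if the checkpoint reports a
  # time-embedding shape mismatch
```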

Apple silicon systems do not work well with Wan 2.1 so far; something like 10 minutes for a single training step can be expected.

### Prerequisites
@@ -112,6 +132,23 @@ simpletuner configure

> ⚠️ For users located in countries where Hugging Face Hub is not readily accessible, you should add `HF_ENDPOINT=https://hf-mirror.com` to your `~/.bashrc` or `~/.zshrc` depending on which `$SHELL` your system uses.

### Memory offloading (optional)

Wan is one of the heaviest models SimpleTuner supports. Enable grouped offloading if you are close to the VRAM ceiling:

```bash
--enable_group_offload \
--group_offload_type block_level \
--group_offload_blocks_per_group 1 \
--group_offload_use_stream \
# optional: spill offloaded weights to disk instead of RAM
# --group_offload_to_disk_path /fast-ssd/simpletuner-offload
```

- Only CUDA devices honour `--group_offload_use_stream`; ROCm/MPS fall back automatically.
- Leave disk staging commented out unless CPU memory is the bottleneck.
- `--enable_model_cpu_offload` is mutually exclusive with group offload.


If you prefer to manually configure:

@@ -432,6 +469,30 @@ Create a `--data_backend_config` (`config/multidatabackend.json`) document conta
]
```

- Wan 2.2 image-to-video runs create CLIP conditioning caches. In the **video** dataset entry, point at a dedicated backend and (optionally) override the cache path:

```json
{
"id": "disney-black-and-white",
"type": "local",
"dataset_type": "video",
"conditioning_image_embeds": "disney-conditioning",
"cache_dir_conditioning_image_embeds": "cache/conditioning_image_embeds/disney-black-and-white"
}
```

- Define the conditioning backend once and reuse it across datasets if needed (full object shown here for clarity):

```json
{
"id": "disney-conditioning",
"type": "local",
"dataset_type": "conditioning_image_embeds",
"cache_dir": "cache/conditioning_image_embeds/disney-conditioning",
"disabled": false
}
```

- In the `video` subsection, we have the following keys we can set:
- `num_frames` (optional, int) is how many frames of data we'll train on.
- At 15 fps, 75 frames is 5 seconds of video, standard output. This should be your target.
@@ -488,6 +549,8 @@ simpletuner train
simpletuner train
```

> ℹ️ Append `--model_flavour i2v-14b-2.2-high` (or `low`) and, if desired, `--wan_validation_load_other_stage` inside `TRAINER_EXTRA_ARGS` or your CLI invocation when you train Wan 2.2. Add `--wan_force_2_1_time_embedding` only when the checkpoint reports a time-embedding shape mismatch.

**Option 3 (Legacy method - still works):**
```bash
./train.sh
8 changes: 3 additions & 5 deletions setup.py
@@ -69,9 +69,7 @@ def build_rocm_wheel_url(package: str, version: str, rocm_version: str) -> str:
py_tag = f"cp{sys.version_info.major}{sys.version_info.minor}"
platform_tag = _rocm_platform_tag()
filename = f"{package}-{version}%2Brocm{rocm_version}-{py_tag}-{py_tag}-{platform_tag}.whl"
base_url = os.environ.get(
"SIMPLETUNER_ROCM_BASE_URL", f"https://download.pytorch.org/whl/rocm{rocm_version}"
)
base_url = os.environ.get("SIMPLETUNER_ROCM_BASE_URL", f"https://download.pytorch.org/whl/rocm{rocm_version}")
return f"{package} @ {base_url}/{filename}"


@@ -86,6 +84,7 @@ def get_cuda_dependencies():
"torchao>=0.12.0",
"nvidia-cudnn-cu12",
"nvidia-nccl-cu12",
"nvidia-ml-py>=12.555",
"lm-eval>=0.4.4",
]

@@ -183,7 +182,7 @@ def _collect_package_files(*directories: str):
"wandb>=0.21.0",
"requests>=2.32.4",
"pillow>=11.3.0",
"trainingsample>=0.2.1",
"trainingsample>=0.2.10",
"accelerate>=1.5.2",
"safetensors>=0.5.3",
"compel>=2.1.1",
@@ -218,7 +217,6 @@ def _collect_package_files(*directories: str):
"imageio[pyav]>=2.37.0",
"hf-xet>=1.1.5",
"peft-singlora>=0.2.0",
"trainingsample>=0.2.1",
"cryptography>=41.0.0",
]

Expand Down
15 changes: 9 additions & 6 deletions simpletuner/cli.py
Original file line number Diff line number Diff line change
Expand Up @@ -14,6 +14,8 @@
from pathlib import Path
from typing import List, Optional

from simpletuner.simpletuner_sdk.server.utils.paths import get_config_directory, get_template_directory


def find_config_file() -> Optional[str]:
"""Find config file in current directory or config/ subdirectory."""
@@ -609,6 +611,13 @@ def cmd_server(args) -> int:
os.environ["SIMPLETUNER_SSL_KEYFILE"] = ssl_config["keyfile"]
os.environ["SIMPLETUNER_SSL_CERTFILE"] = ssl_config["certfile"]

# Ensure template resolution points to packaged templates unless overridden
os.environ.setdefault("TEMPLATE_DIR", str(get_template_directory()))

# Ensure a configuration directory exists and record it for downstream services
config_dir = get_config_directory()
os.environ.setdefault("SIMPLETUNER_CONFIG_DIR", str(config_dir))

try:
import uvicorn

@@ -622,12 +631,6 @@ def cmd_server(args) -> int:
# Create app with specified mode
app = create_app(mode=server_mode, ssl_no_verify=ssl_no_verify)

# Create necessary directories
os.makedirs("static/css", exist_ok=True)
os.makedirs("static/js", exist_ok=True)
os.makedirs("templates", exist_ok=True)
os.makedirs("configs", exist_ok=True)

# Configure uvicorn SSL
uvicorn_config = {"app": app, "host": host, "port": port, "reload": reload, "log_level": "info"}
