The features implemented will eventually be shared between SD 2.1 and SDXL as much as possible.
* Legacy trainer does not implement precomputed embeds/latents
* Currently, the legacy trainer is somewhat neglected. For SD 2.1, use the last release before SDXL support was introduced.
## Tutorial
Please fully explore this README before embarking on [the tutorial](/TUTORIAL.md), as it contains vital information that you might need to know first.
## General design philosophy
* Just throw captioned images into a dir and the script does the rest.
* VAE (latents) outputs are precomputed before training and saved to storage, so that we do not need to invoke the VAE during the forward pass.
* Since SDXL has two text encoders, we precompute all of the captions into embeds and then store those as well.
* **Train on a 40G GPU** when using lower base resolutions. Sorry, but it's just not doable to train SDXL's full U-net on 24G, even with Adafactor.
* EMA (Exponential moving average) weight network as an optional way to reduce model over-cooking.
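The precompute-then-cache flow described in the bullets above can be sketched roughly as follows. This is a minimal illustration rather than this repository's actual code: `encode_caption` stands in for whatever pipeline runs the two text encoders, and plain `pickle` files stand in for the trainer's real on-disk format.

```python
import hashlib
import os
import pickle


def cache_path(cache_dir: str, caption: str) -> str:
    # Key each caption by a stable hash so repeated captions share one cache file.
    digest = hashlib.sha256(caption.encode("utf-8")).hexdigest()
    return os.path.join(cache_dir, f"{digest}.pkl")


def get_embed(cache_dir: str, caption: str, encode_caption):
    """Return the cached embed for a caption, computing and storing it on a miss."""
    path = cache_path(cache_dir, caption)
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    embed = encode_caption(caption)  # the expensive step: running the text encoders
    with open(path, "wb") as f:
        pickle.dump(embed, f)
    return embed
```

Once everything is cached, the forward pass only ever touches storage, so the text encoders (and, with the same pattern applied to images, the VAE) never need to be resident in VRAM during training.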
## Stable Diffusion 2.0 / 2.1
Stable Diffusion 2.1 is notoriously difficult to fine-tune. Many of the default scripts do not make the smartest choices, resulting in poor-quality outputs:
* Not enforcing zero SNR on the terminal timestep and using offset noise instead, which results in a noisier image.
* Training on only square 768x768 images, which results in the model losing the ability to generalise across aspect ratios (or, at the very least, not improving it).
## Hardware Requirements
All testing of this script has been done using:
* A100-80G
* A6000 48G
* 4090 24G

Despite optimisations, SDXL training **will not work on a 24G GPU**, though SD 2.1 training works fantastically well there.
### SDXL 1.0
At 1024x1024 batch size 10, we can nearly saturate a single 80G A100's entire VRAM pool!

At 1024x1024 batch size 4, we can begin to make use of a 48G A6000 GPU, which substantially reduces the cost of multi-GPU training!

With a resolution reduction down to 768 pixels, you can shift requirements down to an A100-40G.

For further reductions, training at a resolution of `256x256` still lets the model generalise the training data quite well, and supports a much higher batch size of around 15 if the VRAM is available.
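These resolution figures track the quadratic growth of the latents: Stable Diffusion VAEs downsample each spatial dimension by 8 into 4 latent channels, so lowering the base resolution shrinks every per-image tensor quadratically. A back-of-the-envelope helper (the 8x/4-channel figures are the standard SD/SDXL VAE layout; the helper itself is just illustrative, not part of this repo):

```python
def latent_shape(height: int, width: int, channels: int = 4, downsample: int = 8):
    """Shape of one precomputed VAE latent for an image of the given size."""
    return (channels, height // downsample, width // downsample)


def latent_values(height: int, width: int) -> int:
    """Number of values in one latent; activation memory scales with this."""
    c, h, w = latent_shape(height, width)
    return c * h * w


for res in (1024, 768, 256):
    print(res, latent_shape(res, res), latent_values(res, res))
# 1024px latents carry (1024/768)**2 ≈ 1.78x the values of 768px ones,
# which is roughly why dropping to 768px fits a smaller card.
```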
### Stable Diffusion 2.x
Generally, a batch size of 4-8 for aspect-bucketed data at a 768px base resolution was achievable within 24G of VRAM.

On an A100-80G, a batch size of 15 could be reached with nearly all of the VRAM in use.

For 1024px training, the VRAM requirement goes up substantially, but it is still doable in roughly an equivalent footprint to an _optimised_ SDXL setup.

Optimisations from the SDXL trainer (the text embed cache and precomputed latents) could be ported to the legacy trainer to bring this down substantially and make 1024px training more viable on consumer kit.
## Scripts
* `ubuntu.sh` - This is a basic "installer" that makes it quick to deploy on a Vast.ai instance.
* `train_sdxl.sh` - This is where the magic happens.
* `training.sh` - This is the legacy Stable Diffusion 1.x / 2.x trainer. The last stable version was before SDXL support was introduced. 😞
* `sdxl-env.sh.example` - These are the SDXL training parameters; copy this file to `sdxl-env.sh`.
* `sd21-env.sh.example` - These are the SD 2.1 training parameters; copy this file to `env.sh`.