SDNQ Quantization
SD.Next Quantization provides full cross-platform quantization to reduce memory usage and increase performance for any device.
- Go into Settings -> Quantization Settings
- Enable the desired Quantization options under the SDNQ menu
  Model, TE and LLM are the main targets for most use cases
- If a model is already loaded, reload the model
  Once quantization options are set, they will be applied to any model loaded after that
- SDNQ is fully cross-platform, supports all GPUs and CPUs and includes many quantization methods:
- 8-bit, 7-bit, 6-bit, 5-bit, 4-bit, 3-bit, 2-bit and 1-bit int and uint
- 8-bit e5, e4 and fnuz float
  Note: int8 is very close to the original 16 bit quality
- Supports nearly all model types
- Supports compute optimizations using Triton via torch.compile
- Supports Quantized MatMul with significant speedups on INT8 or FP8 supported GPUs
- Supports on the fly quantization during model load with little to no overhead (called pre mode)
- Supports quantization for the convolutional layers with UNet models
- Supports post load quantization for any model
- Supports on the fly usage of LoRA models
- Supports SVD Quantization
- Supports balanced offload
Benchmarks are available in the Quantization Wiki.
- Dequantize using torch.compile
  Highly recommended for much better performance if Triton is available
- Use Quantized MatMul
  Recommended for much better performance if Triton is available on supported GPUs
  Supported GPUs for quantized matmul are listed in the Use Quantized MatMul section.
- Recommended quantization dtype is INT8 for its fast speed and almost no loss in quality
  You can use INT6 with little quality loss to save more memory and UINT4 to save even more memory
  float8_e4m3fn is another option for fast speed and high quality but FP8 has slightly lower quality and performance than INT8
Triton enables the use of optimized kernels for much better performance.
Triton is not required for SDNQ, but it is highly recommended.
SDNQ will use Triton by default via torch.compile if Triton is available. You can override this with the Dequantize using torch.compile option.
- Nvidia
  Linux: Triton comes built-in on Linux, you can use the Triton optimizations out of the box.
  Windows: requires manual installation of Triton.
  Installation steps are available in the Quantization Wiki.
- AMD
  Linux: Triton comes built-in on Linux, you can use the Triton optimizations out of the box.
  Windows: requires manual installation of Triton and is not guaranteed to work with ZLUDA.
  Experimental installation steps are available in the ZLUDA Wiki.
- Intel
  Triton comes built-in with Intel on both Windows and Linux, you can use the Triton optimizations out of the box.
  Windows might require additional installation of MSVC if it is not already installed and activated.
  Installation steps are available in the PyTorch Inductor Windows wiki.
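After installation, a quick way to confirm that Triton is importable from the Python environment SD.Next uses is a check like the following (a minimal sketch; SDNQ performs its own detection):

```python
# Minimal sketch: check whether the Triton package can be found in the current
# environment. This only confirms the package is installed; SDNQ does its own checks.
import importlib.util

def triton_available() -> bool:
    return importlib.util.find_spec("triton") is not None

print("Triton available:", triton_available())
```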
Used to decide which parts of the model will get quantized.
Recommended options are Model and TE.
Default is none.
- Model is used to quantize the Diffusion Models.
- TE is used to quantize the Text Encoders.
- LLM is used to quantize the LLMs with Prompt Enhance.
- Control is used to quantize ControlNets.
- VAE is used to quantize the VAE. Using the VAE option is not recommended.
Note
VAE Upcast has to be set to false if you use the VAE option with FP16.
If you get black images with SDXL models, use the FP16 Fixed VAE.
Used to decide when the quantization step will happen on model load.
Default is auto.
- Auto mode will choose pre or post automatically depending on the model.
- Pre mode will quantize the model while the model is loading. Reduces system RAM usage.
- Post mode will quantize the model after the model is loaded into system RAM.
Pre mode is compatible with DiT and Video models like Flux but older UNet models like SDXL are only compatible with post mode.
Used to decide the data type used to store the model weights.
Recommended types are int8 for 8 bit, int6 for 6 bit, float8_e4m3fn for fp8 and uint4 for 4 bit.
Default is int8.
INT8 quants have very similar quality to the full 16 bit precision while using 2 times less memory.
INT6 quants are the middle ground. Similar quality to the full 16 bit precision while using 2.7 times less memory.
INT4 quants have lower quality and less performance but use 3.6 times less memory.
FP8 quants have similar quality to INT6 but with the same memory usage as INT8.
Unsigned quants have the extra u added to the start of their name while the symmetric quants don't have any prefix.
Unsigned (asymmetric) types: uint8, uint7, uint6, uint5, uint4, uint3, uint2 and uint1
Symmetric types: int8, int7, int6, int5, int4, int3, int2, float8_e4m3fn, float8_e5m2, float8_e4m3fnuz and float8_e5m2fnuz
Asymmetric quants use unsigned integers, meaning they can't store negative values and use an additional variable called the zero point for this purpose.
Symmetric quants can store both negative and positive values, so they don't need an extra zero point value and run faster than unsigned quants because of this.
Quality difference between asymmetric and symmetric quantization is very small for 8 to 6 bits but you should use asymmetric methods below 5 bits.
- int8 uses int8 and has -128 to 127 range.
- int7 uses eight int7 values packed into seven uint8 values and has -64 to 63 range.
- int6 uses four int6 values packed into three uint8 values and has -32 to 31 range.
- int5 uses eight int5 values packed into five uint8 values and has -16 to 15 range.
- int4 uses two int4 values packed into a single uint8 value and has -8 to 7 range.
- int3 uses eight int3 values packed into three uint8 values and has -4 to 3 range.
- int2 uses four int2 values packed into a single uint8 value and has -2 to 1 range.
- uint8 uses uint8 and has 0 to 255 range.
- uint7 uses eight uint7 values packed into seven uint8 values and has 0 to 127 range.
- uint6 uses four uint6 values packed into three uint8 values and has 0 to 63 range.
- uint5 uses eight uint5 values packed into five uint8 values and has 0 to 31 range.
- uint4 uses two uint4 values packed into a single uint8 value and has 0 to 15 range.
- uint3 uses eight uint3 values packed into three uint8 values and has 0 to 7 range.
- uint2 uses four uint2 values packed into a single uint8 value and has 0 to 3 range.
- uint1 uses eight uint1 values packed into a single uint8 value and has 0 to 1 range.
- float8_e4m3fn uses float8_e4m3fn and has -448 to 448 range.
- float8_e5m2 uses float8_e5m2 and has -57344 to 57344 range.
- float8_e4m3fnuz uses float8_e4m3fnuz and has -240 to 240 range.
- float8_e5m2fnuz uses float8_e5m2fnuz and has -57344 to 57344 range.
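To make the symmetric vs. asymmetric distinction concrete, here is a minimal PyTorch sketch of per-tensor int8 (symmetric) and uint8 (asymmetric, with a zero point) quantization. The helper names are illustrative only; SDNQ's real implementation additionally uses group sizes and bit packing as described above.

```python
# Minimal sketch of symmetric (int8) and asymmetric (uint8) per-tensor quantization.
import torch

def quantize_symmetric_int8(w: torch.Tensor):
    scale = w.abs().max() / 127.0                      # map the max magnitude onto the int8 range
    q = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
    return q, scale                                    # dequantize: q.float() * scale

def quantize_asymmetric_uint8(w: torch.Tensor):
    w_min, w_max = w.min(), w.max()
    scale = (w_max - w_min) / 255.0                    # spread the full value range over 0..255
    zero_point = torch.round(-w_min / scale)           # the extra zero point variable
    q = torch.clamp(torch.round(w / scale) + zero_point, 0, 255).to(torch.uint8)
    return q, scale, zero_point                        # dequantize: (q.float() - zero_point) * scale

w = torch.randn(4, 4)
q_s, s = quantize_symmetric_int8(w)
q_a, s_a, zp = quantize_asymmetric_uint8(w)
print((w - q_s.float() * s).abs().max())               # symmetric reconstruction error
print((w - (q_a.float() - zp) * s_a).abs().max())      # asymmetric reconstruction error
```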
Same as Quantization type but for the Text Encoders.
The default option will use the same type as Quantization type.
A comma separated list of module names to skip quantization.
Modules listed in this option will not be quantized and will be kept in full precision.
An example list: transformer_blocks.0.img_mod.1.weight, transformer_blocks.0.*, img_in
Default is empty.
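Wildcard patterns such as transformer_blocks.0.* match whole groups of modules by name. The small sketch below uses Python's fnmatch purely to illustrate the idea; it is not necessarily SDNQ's exact matching logic.

```python
# Illustrative sketch of matching module names against skip patterns.
from fnmatch import fnmatch

skip_patterns = ["transformer_blocks.0.img_mod.1.weight", "transformer_blocks.0.*", "img_in"]

def should_skip(module_name: str) -> bool:
    # A module is skipped if any pattern matches its full name.
    return any(fnmatch(module_name, pattern) for pattern in skip_patterns)

print(should_skip("transformer_blocks.0.attn.to_q"))  # True, matches "transformer_blocks.0.*"
print(should_skip("transformer_blocks.1.attn.to_q"))  # False
```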
A JSON dictionary of quantization types and module names list used to quantize the model with mixed quantization types.
Quantization types can be any valid quantization type supported by SDNQ or it can also be minimum_Xbit.
minimum_Xbit will quantize the specified modules to the specified bit width if the main quantization dtype has less precision.
For example, minimum_6bit will quantize the specified modules to int6 if you are using int5 or below but won't do anything if you are using int6 or above.
Default is empty.
An example dict:
{
  "int8": ["transformer_blocks.0.img_mod.1.weight", "transformer_blocks.0.*"],
  "minimum_6bit": ["img_in"]
}
Used to decide how many elements of a tensor will share the same quantization group.
Higher values have better performance and less memory usage but with less quality.
Default is 0, meaning it will decide the group size based on your quantization type setting.
Linear layers will use this formula to find the group size: 2 ** (2 + number_of_bits)
Convolutions will use this formula to find the group size: 2 ** (1 + number_of_bits)
Setting the group size to -1 will disable grouping.
Using Quantized MatMul with FP8 quant types will disable group sizes.
Using Quantized MatMul with int or uint quant types will continue to use group sizes by default if the number of bits is less than 6.
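The default group-size formulas above can be written as a tiny helper (hypothetical name, shown only to make the arithmetic explicit):

```python
# Minimal sketch of the default group-size formulas quoted above.
def default_group_size(number_of_bits: int, is_conv: bool = False) -> int:
    if is_conv:
        return 2 ** (1 + number_of_bits)    # convolutional layers
    return 2 ** (2 + number_of_bits)        # linear layers

print(default_group_size(8))                # int8 linear layer  -> 1024
print(default_group_size(4))                # int4/uint4 linear  -> 64
print(default_group_size(4, is_conv=True))  # int4/uint4 conv    -> 32
```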
The rank size to use for SVD quantization.
Higher values have better quality but with less performance and more memory usage.
Default is 32.
The number of steps to use in the lowrank SVD estimation.
Higher values have better quality but takes longer to quantize.
Default is 8.
Enabling this option will apply SVD quantization on top of SDNQ quantization.
SVD has much higher quality but runs slower.
SVD also makes LoRAs usable with 4 bit quantization.
More info on SVD quantization: https://arxiv.org/abs/2411.05007
Disabled by default.
Note: SVD lowrank used by SDNQ is not deterministic.
Meaning that you will get slightly different quantization results every time.
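Conceptually, SVD quantization keeps a low-rank portion of each weight in high precision and quantizes only the residual. The sketch below uses torch.svd_lowrank with the rank and steps settings described above; it is a simplified illustration, not SDNQ's actual code (see the linked paper for the full method).

```python
# Simplified sketch of SVD-assisted quantization: low-rank part stays in high
# precision, only the residual gets quantized.
import torch

def svd_quantize(w: torch.Tensor, rank: int = 32, steps: int = 8):
    # Randomized low-rank estimate; torch.svd_lowrank is not deterministic,
    # which is why results can vary slightly between runs.
    u, s, v = torch.svd_lowrank(w, q=rank, niter=steps)
    low_rank = u @ torch.diag(s) @ v.T                 # kept in high precision
    residual = w - low_rank
    scale = residual.abs().max() / 127.0               # simple symmetric int8 quantization
    q = torch.clamp(torch.round(residual / scale), -128, 127).to(torch.int8)
    return q, scale, low_rank                          # reconstruct: q.float() * scale + low_rank

w = torch.randn(256, 256)
q, scale, low_rank = svd_quantize(w)
print((w - (q.float() * scale + low_rank)).abs().max())
```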
Enabling this option will quantize convolutional layers in UNet models too.
Has much better memory savings but lower quality.
Convolutions will use uint4 when using quants with less than 4 bits.
Disabled by default.
Uses Triton via torch.compile on the dequantization step.
Has significantly higher performance.
This setting requires a full restart of the webui to apply.
Enabled by default if Triton is available.
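As an illustration, compiling a simplified dequantization function with torch.compile looks roughly like this; the dequantize function here is a stand-in, not SDNQ's actual kernel.

```python
# Minimal sketch: compile a dequantization step so Inductor/Triton can fuse it.
import torch

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float16) * scale

# torch.compile lowers the function through Inductor, using Triton when available.
dequantize_compiled = torch.compile(dequantize)

q = torch.randint(-128, 127, (64, 64), dtype=torch.int8)
scale = torch.tensor(0.01, dtype=torch.float16)
print(dequantize_compiled(q, scale).dtype)  # torch.float16
```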
Enabling this option will use quantized INT8 or FP8 MatMul instead of BF16 / FP16.
Has significantly higher performance on GPUs with INT8 or FP8 support.
Requires Triton. Disabled by default.
Supported GPUs
- Nvidia
  Requires Turing (RTX 2000) or newer GPUs for INT8 matmul.
  Requires Ada (RTX 4000) or newer GPUs for FP8 matmul.
- AMD
  Requires RDNA2 (RX 6000) or newer GPUs for INT8 matmul.
  Requires MI300X or RDNA4 (RX 9000) for FP8 matmul.
  - RDNA3 (RX 7000) supports INT8 matmul but runs at the same speed as FP16.
  - RDNA2 (RX 6000) and older GPUs are supported via Triton.
- Intel
  Requires Alchemist (Arc A) or newer GPUs for INT8 matmul.
  Intel doesn't support FP8 matmul.
Quantized INT8 MatMul is compatible with any int or uint quant type.
Quantized FP8 MatMul is only compatible with float8_e4m3fn quant type on most GPUs. CPUs and some GPUs can use the other FP8 types too.
The recommended quant type to use with this option is int8, for quality and because INT8 matmul tends to be faster than FP8.
The recommended quant type for FP8 matmul is float8_e4m3fn, for quality and better hardware support.
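Conceptually, quantized MatMul quantizes both operands, multiplies them in integer precision and rescales the result back to floating point. The sketch below simulates the int8 matmul in float for portability; real INT8 kernels (cuBLAS or Triton) accumulate in int32.

```python
# Conceptual sketch of replacing a BF16/FP16 matmul with a quantized INT8 matmul.
import torch

def quantize_int8(x: torch.Tensor):
    scale = x.abs().amax() / 127.0
    return torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8), scale

a = torch.randn(64, 128, dtype=torch.float16)   # activations
w = torch.randn(128, 256, dtype=torch.float16)  # weights

a_q, a_scale = quantize_int8(a.float())
w_q, w_scale = quantize_int8(w.float())

# Simulated int8 matmul followed by rescaling back to floating point.
y_int8 = (a_q.to(torch.float32) @ w_q.to(torch.float32)) * (a_scale * w_scale)
y_ref = a.float() @ w.float()
print((y_int8 - y_ref).abs().mean())            # small quantization error vs. the FP reference
```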
Same as Use Quantized MatMul but for the convolutional layers with UNets like SDXL.
Disabled by default.
Enabling this option will use the GPU for quantization calculations on model load.
Can be faster with weak CPUs but can also be slower because of the GPU to CPU communication overhead.
Enabled by default.
When Model load device map in the Models & Loading settings is set to default or cpu, this option will send a part of the model weights to the GPU, quantize it, then send it back to the CPU right away.
If the device map is set to gpu, model weights will be loaded directly onto the GPU and the quantized model weights will be kept on the GPU until the quantization of the current model part is over.
If Model offload mode is set to none, quantized model weights will be sent to the GPU after quantization and will stay in the GPU.
If Model offload mode is set to model, quantized model weights will be sent to the GPU after quantization and will be sent back to the CPU after the quantization of the current model part is over.
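A minimal sketch of the default/cpu device-map behaviour described above, with hypothetical helper names: one part of the model is sent to the GPU, quantized there and immediately moved back to the CPU.

```python
# Illustrative sketch of GPU-assisted quantization during model load.
import torch

def quantize_int8(x: torch.Tensor):
    scale = x.abs().amax() / 127.0
    return torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8), scale

def quantize_on_gpu(weight_cpu: torch.Tensor):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    w = weight_cpu.to(device)            # send this part of the model weights to the GPU
    q, scale = quantize_int8(w)          # quantize on the GPU
    return q.cpu(), scale.cpu()          # send the quantized weights back right away

q, scale = quantize_on_gpu(torch.randn(1024, 1024))
```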
Enabling this option will use FP32 on the dequantization step.
Has higher quality outputs but lower performance.
Disabled by default.