* Dropdown menu for quickly switching between different models.
* Large number of extensions (built-in and user-contributed), including Coqui TTS for realistic voice outputs, Whisper STT for voice inputs, translation, [multimodal pipelines](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/multimodal), vector databases, Stable Diffusion integration, and a lot more. See [the wiki](https://github.com/oobabooga/text-generation-webui/wiki/07-%E2%80%90-Extensions) and [the extensions directory](https://github.com/oobabooga/text-generation-webui-extensions) for details.
* [Chat with custom characters](https://github.com/oobabooga/text-generation-webui/wiki/03-%E2%80%90-Parameters-Tab#character).
--loader LOADER  Choose the model loader manually, otherwise, it will get autodetected. Valid options: Transformers, llama.cpp, llamacpp_HF, ExLlamav2_HF, ExLlamav2,
-                AutoGPTQ, AutoAWQ, GPTQ-for-LLaMa, QuIP#.
+                AutoGPTQ, AutoAWQ.
Transformers/Accelerate:
--cpu Use the CPU to generate text. Warning: Training on CPU is extremely slow.
@@ -293,21 +293,16 @@ ExLlamaV2:
AutoGPTQ:
--triton Use triton.
-  --no_inject_fused_attention  Disable the use of fused attention, which will use less VRAM at the cost of slower inference.
--no_inject_fused_mlp Triton mode only: disable the use of fused MLP, which will use less VRAM at the cost of slower inference.
--no_use_cuda_fp16 This can make models faster on some systems.
--desc_act For models that do not have a quantize_config.json, this parameter is used to define whether to set desc_act or not in BaseQuantizeConfig.
--disable_exllama Disable ExLlama kernel, which can improve inference speed on some systems.
--disable_exllamav2 Disable ExLlamav2 kernel.
-
-GPTQ-for-LLaMa:
--wbits WBITS Load a pre-quantized model with specified precision in bits. 2, 3, 4 and 8 are supported.
-  --model_type MODEL_TYPE  Model type of pre-quantized model. Currently LLaMA, OPT, and GPT-J are supported.
--groupsize GROUPSIZE Group size.
-  --pre_layer PRE_LAYER [PRE_LAYER ...]  The number of layers to allocate to the GPU. Setting this parameter enables CPU offloading for 4-bit models. For multi-gpu, write the numbers separated
-                                         by spaces, eg --pre_layer 30 60.
-  --checkpoint CHECKPOINT  The path to the quantized checkpoint file. If not specified, it will be automatically detected.
-  --monkey-patch  Apply the monkey patch for using LoRAs with quantized models.
+
+AutoAWQ:
+  --no_inject_fused_attention  Disable the use of fused attention, which will use less VRAM at the cost of slower inference.
HQQ:
--hqq-backend HQQ_BACKEND Backend for the HQQ loader. Valid options: PYTORCH, PYTORCH_COMPILE, ATEN.
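As the `--loader` help text above notes, the loader is autodetected when the flag is omitted. As a rough illustration only, here is a hypothetical extension-based detection helper; the project's actual detection logic is more involved (it also inspects the model's config files), so treat this as a sketch of the idea, not the real code:

```python
from pathlib import Path

def guess_loader(model_path: str) -> str:
    """Hypothetical sketch: map a model path to a likely loader name."""
    suffix = Path(model_path).suffix.lower()
    if suffix == ".gguf":
        return "llama.cpp"      # GGUF files are llama.cpp models
    if suffix == ".safetensors":
        return "ExLlamav2_HF"   # assumption: single-file quantized model
    return "Transformers"       # default: Hugging Face folder layout
```

The loader names are taken from the valid options listed above; the mapping itself is an assumption for illustration.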
docs/04 - Model Tab.md (0 additions & 8 deletions)
@@ -64,14 +64,6 @@ Loads: GPTQ models.
* **no_use_cuda_fp16**: On some systems, the performance can be very bad with this unset. Can usually be ignored.
* **desc_act**: For ancient models without proper metadata, sets the model "act-order" parameter manually. Can usually be ignored.
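The GPTQ parameters in the flag listing (`wbits`, `groupsize`, `desc_act`) all describe group-wise quantization: weights are split into groups of `groupsize` values, and each group shares one scale so its entries can be stored in `wbits` bits. A toy sketch of that storage scheme follows; it is a simple round-to-nearest illustration, not the actual GPTQ algorithm, which chooses quantized values much more carefully:

```python
def quantize_groupwise(weights, wbits=4, groupsize=128):
    """Toy group-wise quantizer (illustration only, NOT real GPTQ)."""
    qmax = 2 ** (wbits - 1) - 1  # e.g. 7 for signed 4-bit values
    out = []
    for start in range(0, len(weights), groupsize):
        group = weights[start:start + groupsize]
        # One shared scale per group, chosen so the largest value fits.
        scale = max(abs(w) for w in group) / qmax
        if scale == 0:
            scale = 1.0  # all-zero group: any scale works
        # Round each weight to the nearest representable level.
        out.extend(round(w / scale) * scale for w in group)
    return out
```

With `wbits=4` and `groupsize=128` (the "4-bit-128g" models mentioned elsewhere in these docs), each group's maximum rounding error is half a quantization step.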
-### GPTQ-for-LLaMa
-
-Loads: GPTQ models.
-
-Ancient loader, the first one to implement 4-bit quantization. It works on older GPUs for which ExLlamaV2 and AutoGPTQ do not work, and it doesn't work with "act-order", so you should use it with simple 4-bit-128g models.
-
-* **pre_layer**: Used for CPU offloading. The higher the number, the more layers will be sent to the GPU. GPTQ-for-LLaMa CPU offloading was faster than the one implemented in AutoGPTQ the last time I checked.
-
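For context on the removed `pre_layer` option: the help text's example `--pre_layer 30 60` reads the numbers as cumulative layer boundaries, so layers 0-29 would run on the first GPU, 30-59 on the second, and any remaining layers on the CPU. A hypothetical sketch of that assignment, assuming the cumulative interpretation suggested by the example:

```python
def assign_layers(num_layers, pre_layer):
    """Hypothetical sketch of --pre_layer semantics (illustration only).

    `pre_layer` is read as cumulative per-GPU layer boundaries; layers
    past the last boundary are offloaded to CPU.
    """
    devices = []
    prev = 0
    for gpu_index, boundary in enumerate(pre_layer):
        devices.extend([f"gpu{gpu_index}"] * (boundary - prev))
        prev = boundary
    devices.extend(["cpu"] * (num_layers - prev))
    return devices
```

For a 4-bit model with 80 layers, `assign_layers(80, [30, 60])` would place 30 layers on gpu0, 30 on gpu1, and offload the remaining 20 to the CPU.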
### llama.cpp
Loads: GGUF models. Note: GGML models have been deprecated and do not work anymore.
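GGUF files can be recognized by their header: the format begins with the four ASCII bytes `GGUF`, which makes distinguishing them from deprecated GGML files straightforward. A minimal check (hypothetical helper name):

```python
def is_gguf(path: str) -> bool:
    """Return True if the file starts with the GGUF magic bytes."""
    with open(path, "rb") as f:
        return f.read(4) == b"GGUF"
```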
\* Training LoRAs with GPTQ models also works with the Transformers loader. Make sure to check "auto-devices" and "disable_exllama" before loading the model.
-\*\* Requires the monkey-patch. The instructions can be found [here](https://github.com/oobabooga/text-generation-webui/wiki/08-%E2%80%90-Additional-Tips#using-loras-with-gptq-for-llama).
-
-\*\*\* Multi-LoRA in PEFT is tricky and the current implementation does not work reliably in all cases.
+\*\* Multi-LoRA in PEFT is tricky and the current implementation does not work reliably in all cases.