Commit bd7cc42

Backend cleanup (#6025)

1 parent 6a1682a

23 files changed: 57 additions & 442 deletions

README.md

Lines changed: 11 additions & 16 deletions
```diff
@@ -11,7 +11,7 @@ Its goal is to become the [AUTOMATIC1111/stable-diffusion-webui](https://github.
 ## Features
 
 * 3 interface modes: default (two columns), notebook, and chat.
-* Multiple model backends: [Transformers](https://github.com/huggingface/transformers), [llama.cpp](https://github.com/ggerganov/llama.cpp) (through [llama-cpp-python](https://github.com/abetlen/llama-cpp-python)), [ExLlamaV2](https://github.com/turboderp/exllamav2), [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ), [AutoAWQ](https://github.com/casper-hansen/AutoAWQ), [GPTQ-for-LLaMa](https://github.com/qwopqwop200/GPTQ-for-LLaMa), [QuIP#](https://github.com/Cornell-RelaxML/quip-sharp).
+* Multiple model backends: [Transformers](https://github.com/huggingface/transformers), [llama.cpp](https://github.com/ggerganov/llama.cpp) (through [llama-cpp-python](https://github.com/abetlen/llama-cpp-python)), [ExLlamaV2](https://github.com/turboderp/exllamav2), [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ), [AutoAWQ](https://github.com/casper-hansen/AutoAWQ).
 * Dropdown menu for quickly switching between different models.
 * Large number of extensions (built-in and user-contributed), including Coqui TTS for realistic voice outputs, Whisper STT for voice inputs, translation, [multimodal pipelines](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/multimodal), vector databases, Stable Diffusion integration, and a lot more. See [the wiki](https://github.com/oobabooga/text-generation-webui/wiki/07-%E2%80%90-Extensions) and [the extensions directory](https://github.com/oobabooga/text-generation-webui-extensions) for details.
 * [Chat with custom characters](https://github.com/oobabooga/text-generation-webui/wiki/03-%E2%80%90-Parameters-Tab#character).
@@ -208,12 +208,12 @@ usage: server.py [-h] [--multi-user] [--character CHARACTER] [--model MODEL] [--
 [--tensorcores] [--n_ctx N_CTX] [--threads THREADS] [--threads-batch THREADS_BATCH] [--no_mul_mat_q] [--n_batch N_BATCH] [--no-mmap] [--mlock] [--n-gpu-layers N_GPU_LAYERS]
 [--tensor_split TENSOR_SPLIT] [--numa] [--logits_all] [--no_offload_kqv] [--cache-capacity CACHE_CAPACITY] [--row_split] [--streaming-llm] [--attention-sink-size ATTENTION_SINK_SIZE]
 [--gpu-split GPU_SPLIT] [--autosplit] [--max_seq_len MAX_SEQ_LEN] [--cfg-cache] [--no_flash_attn] [--cache_8bit] [--cache_4bit] [--num_experts_per_token NUM_EXPERTS_PER_TOKEN]
-[--triton] [--no_inject_fused_attention] [--no_inject_fused_mlp] [--no_use_cuda_fp16] [--desc_act] [--disable_exllama] [--disable_exllamav2] [--wbits WBITS] [--model_type MODEL_TYPE]
-[--groupsize GROUPSIZE] [--pre_layer PRE_LAYER [PRE_LAYER ...]] [--checkpoint CHECKPOINT] [--monkey-patch] [--hqq-backend HQQ_BACKEND] [--deepspeed]
-[--nvme-offload-dir NVME_OFFLOAD_DIR] [--local_rank LOCAL_RANK] [--alpha_value ALPHA_VALUE] [--rope_freq_base ROPE_FREQ_BASE] [--compress_pos_emb COMPRESS_POS_EMB] [--listen]
-[--listen-port LISTEN_PORT] [--listen-host LISTEN_HOST] [--share] [--auto-launch] [--gradio-auth GRADIO_AUTH] [--gradio-auth-path GRADIO_AUTH_PATH] [--ssl-keyfile SSL_KEYFILE]
-[--ssl-certfile SSL_CERTFILE] [--api] [--public-api] [--public-api-id PUBLIC_API_ID] [--api-port API_PORT] [--api-key API_KEY] [--admin-key ADMIN_KEY] [--nowebui]
-[--multimodal-pipeline MULTIMODAL_PIPELINE]
+[--triton] [--no_inject_fused_mlp] [--no_use_cuda_fp16] [--desc_act] [--disable_exllama] [--disable_exllamav2] [--wbits WBITS] [--groupsize GROUPSIZE] [--no_inject_fused_attention]
+[--hqq-backend HQQ_BACKEND] [--deepspeed] [--nvme-offload-dir NVME_OFFLOAD_DIR] [--local_rank LOCAL_RANK] [--alpha_value ALPHA_VALUE] [--rope_freq_base ROPE_FREQ_BASE]
+[--compress_pos_emb COMPRESS_POS_EMB] [--listen] [--listen-port LISTEN_PORT] [--listen-host LISTEN_HOST] [--share] [--auto-launch] [--gradio-auth GRADIO_AUTH]
+[--gradio-auth-path GRADIO_AUTH_PATH] [--ssl-keyfile SSL_KEYFILE] [--ssl-certfile SSL_CERTFILE] [--api] [--public-api] [--public-api-id PUBLIC_API_ID] [--api-port API_PORT]
+[--api-key API_KEY] [--admin-key ADMIN_KEY] [--nowebui] [--multimodal-pipeline MULTIMODAL_PIPELINE] [--model_type MODEL_TYPE] [--pre_layer PRE_LAYER [PRE_LAYER ...]]
+[--checkpoint CHECKPOINT] [--monkey-patch]
 
 Text generation web UI
 
@@ -237,7 +237,7 @@ Basic settings:
 
 Model loader:
 --loader LOADER Choose the model loader manually, otherwise, it will get autodetected. Valid options: Transformers, llama.cpp, llamacpp_HF, ExLlamav2_HF, ExLlamav2,
-AutoGPTQ, AutoAWQ, GPTQ-for-LLaMa, QuIP#.
+AutoGPTQ, AutoAWQ.
 
 Transformers/Accelerate:
 --cpu Use the CPU to generate text. Warning: Training on CPU is extremely slow.
@@ -293,21 +293,16 @@ ExLlamaV2:
 
 AutoGPTQ:
 --triton Use triton.
---no_inject_fused_attention Disable the use of fused attention, which will use less VRAM at the cost of slower inference.
 --no_inject_fused_mlp Triton mode only: disable the use of fused MLP, which will use less VRAM at the cost of slower inference.
 --no_use_cuda_fp16 This can make models faster on some systems.
 --desc_act For models that do not have a quantize_config.json, this parameter is used to define whether to set desc_act or not in BaseQuantizeConfig.
 --disable_exllama Disable ExLlama kernel, which can improve inference speed on some systems.
 --disable_exllamav2 Disable ExLlamav2 kernel.
-
-GPTQ-for-LLaMa:
 --wbits WBITS Load a pre-quantized model with specified precision in bits. 2, 3, 4 and 8 are supported.
---model_type MODEL_TYPE Model type of pre-quantized model. Currently LLaMA, OPT, and GPT-J are supported.
 --groupsize GROUPSIZE Group size.
---pre_layer PRE_LAYER [PRE_LAYER ...] The number of layers to allocate to the GPU. Setting this parameter enables CPU offloading for 4-bit models. For multi-gpu, write the numbers separated
-by spaces, eg --pre_layer 30 60.
---checkpoint CHECKPOINT The path to the quantized checkpoint file. If not specified, it will be automatically detected.
---monkey-patch Apply the monkey patch for using LoRAs with quantized models.
+
+AutoAWQ:
+--no_inject_fused_attention Disable the use of fused attention, which will use less VRAM at the cost of slower inference.
 
 HQQ:
 --hqq-backend HQQ_BACKEND Backend for the HQQ loader. Valid options: PYTORCH, PYTORCH_COMPILE, ATEN.
```
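To summarize the flag shuffle in the usage string above: `--wbits` and `--groupsize` now sit under the AutoGPTQ group, `--no_inject_fused_attention` survives only for AutoAWQ, and the GPTQ-for-LLaMa-only flags (`--model_type`, `--pre_layer`, `--checkpoint`, `--monkey-patch`) are gone. A minimal argparse sketch of the new grouping (this is an illustration, not the project's actual `modules/shared.py`):

```python
import argparse

parser = argparse.ArgumentParser(description='Text generation web UI')

# AutoGPTQ group: absorbs --wbits and --groupsize from the deleted loader.
autogptq = parser.add_argument_group('AutoGPTQ')
autogptq.add_argument('--triton', action='store_true', help='Use triton.')
autogptq.add_argument('--no_inject_fused_mlp', action='store_true',
                      help='Triton mode only: disable fused MLP.')
autogptq.add_argument('--wbits', type=int, default=0,
                      help='Load a pre-quantized model with the given precision in bits.')
autogptq.add_argument('--groupsize', type=int, default=-1, help='Group size.')

# AutoAWQ group: the one place --no_inject_fused_attention still applies.
autoawq = parser.add_argument_group('AutoAWQ')
autoawq.add_argument('--no_inject_fused_attention', action='store_true',
                     help='Disable fused attention (less VRAM, slower inference).')

args = parser.parse_args(['--wbits', '4', '--groupsize', '128'])
print(args.wbits, args.groupsize, args.no_inject_fused_attention)
```

`add_argument_group` only affects how `--help` is organized; the parsed namespace is flat, which is why the regrouping needs no changes at call sites.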

docs/04 - Model Tab.md

Lines changed: 0 additions & 8 deletions
```diff
@@ -64,14 +64,6 @@ Loads: GPTQ models.
 * **no_use_cuda_fp16**: On some systems, the performance can be very bad with this unset. Can usually be ignored.
 * **desc_act**: For ancient models without proper metadata, sets the model "act-order" parameter manually. Can usually be ignored.
 
-### GPTQ-for-LLaMa
-
-Loads: GPTQ models.
-
-Ancient loader, the first one to implement 4-bit quantization. It works on older GPUs for which ExLlamaV2 and AutoGPTQ do not work, and it doesn't work with "act-order", so you should use it with simple 4-bit-128g models.
-
-* **pre_layer**: Used for CPU offloading. The higher the number, the more layers will be sent to the GPU. GPTQ-for-LLaMa CPU offloading was faster than the one implemented in AutoGPTQ the last time I checked.
-
 ### llama.cpp
 
 Loads: GGUF models. Note: GGML models have been deprecated and do not work anymore.
```

docs/08 - Additional Tips.md

Lines changed: 0 additions & 22 deletions
```diff
@@ -13,28 +13,6 @@ Source: https://github.com/AUTOMATIC1111/stable-diffusion-webui/pull/1126
 
 This file will be automatically detected the next time you start the web UI.
 
-## Using LoRAs with GPTQ-for-LLaMa
-
-This requires using a monkey patch that is supported by this web UI: https://github.com/johnsmith0031/alpaca_lora_4bit
-
-To use it:
-
-Install alpaca_lora_4bit using pip
-
-```
-git clone https://github.com/johnsmith0031/alpaca_lora_4bit.git
-cd alpaca_lora_4bit
-git fetch origin winglian-setup_pip
-git checkout winglian-setup_pip
-pip install .
-```
-
-Start the UI with the --monkey-patch flag:
-
-```
-python server.py --model llama-7b-4bit-128g --listen --lora tloen_alpaca-lora-7b --monkey-patch
-```
-
 ## DeepSpeed
 
 `DeepSpeed ZeRO-3` is an alternative offloading strategy for full-precision (16-bit) transformers models.
```

docs/What Works.md

Lines changed: 2 additions & 6 deletions
```diff
@@ -2,15 +2,13 @@
 
 | Loader | Loading 1 LoRA | Loading 2 or more LoRAs | Training LoRAs | Multimodal extension | Perplexity evaluation |
 |----------------|----------------|-------------------------|----------------|----------------------|-----------------------|
-| Transformers | ✅ | ✅\*\*\* | ✅\* | ✅ | ✅ |
+| Transformers | ✅ | ✅\*\* | ✅\* | ✅ | ✅ |
 | llama.cpp | ❌ | ❌ | ❌ | ❌ | use llamacpp_HF |
 | llamacpp_HF | ❌ | ❌ | ❌ | ❌ | ✅ |
 | ExLlamav2_HF | ✅ | ✅ | ❌ | ❌ | ✅ |
 | ExLlamav2 | ✅ | ✅ | ❌ | ❌ | use ExLlamav2_HF |
 | AutoGPTQ | ✅ | ❌ | ❌ | ✅ | ✅ |
 | AutoAWQ | ? | ❌ | ? | ? | ✅ |
-| GPTQ-for-LLaMa | ✅\*\* | ❌\*\*\* | ✅ | ✅ | ✅ |
-| QuIP# | ? | ? | ? | ? | ✅ |
 | HQQ | ? | ? | ? | ? | ✅ |
 
 ❌ = not implemented
@@ -19,6 +17,4 @@
 
 \* Training LoRAs with GPTQ models also works with the Transformers loader. Make sure to check "auto-devices" and "disable_exllama" before loading the model.
 
-\*\* Requires the monkey-patch. The instructions can be found [here](https://github.com/oobabooga/text-generation-webui/wiki/08-%E2%80%90-Additional-Tips#using-loras-with-gptq-for-llama).
-
-\*\*\* Multi-LoRA in PEFT is tricky and the current implementation does not work reliably in all cases.
+\*\* Multi-LoRA in PEFT is tricky and the current implementation does not work reliably in all cases.
```

modules/AutoGPTQ_loader.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -44,7 +44,7 @@ def load_quantized(model_name):
         'model_basename': pt_path.stem,
         'device': "xpu:0" if is_xpu_available() else "cuda:0" if not shared.args.cpu else "cpu",
         'use_triton': shared.args.triton,
-        'inject_fused_attention': not shared.args.no_inject_fused_attention,
+        'inject_fused_attention': False,
         'inject_fused_mlp': not shared.args.no_inject_fused_mlp,
         'use_safetensors': use_safetensors,
         'trust_remote_code': shared.args.trust_remote_code,
```
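The one-line change above hard-codes fused attention off for AutoGPTQ instead of deriving it from the removed `--no_inject_fused_attention` flag. A simplified sketch of the resulting keyword dict (`SimpleNamespace` stands in for `shared.args`, and only the fields relevant to this hunk are shown; the real loader passes these on to AutoGPTQ's `from_quantized`):

```python
from types import SimpleNamespace

# Stand-in for shared.args with assumed default values.
args = SimpleNamespace(triton=False, no_inject_fused_mlp=False, trust_remote_code=False)

params = {
    'use_triton': args.triton,
    'inject_fused_attention': False,  # hard-coded by this commit; no CLI flag anymore
    'inject_fused_mlp': not args.no_inject_fused_mlp,
    'trust_remote_code': args.trust_remote_code,
}
print(params)
```

Note the asymmetry after the change: fused MLP is still user-toggleable, while fused attention is always disabled for this loader.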

modules/GPTQ_loader.py

Lines changed: 0 additions & 171 deletions
This file was deleted.

modules/loaders.py

Lines changed: 0 additions & 34 deletions
```diff
@@ -105,7 +105,6 @@
     ],
     'AutoGPTQ': [
         'triton',
-        'no_inject_fused_attention',
         'no_inject_fused_mlp',
         'no_use_cuda_fp16',
         'wbits',
@@ -131,21 +130,6 @@
         'trust_remote_code',
         'no_use_fast',
     ],
-    'GPTQ-for-LLaMa': [
-        'wbits',
-        'groupsize',
-        'model_type',
-        'pre_layer',
-        'trust_remote_code',
-        'no_use_fast',
-        'gptq_for_llama_info',
-    ],
-    'QuIP#': [
-        'trust_remote_code',
-        'no_use_fast',
-        'no_flash_attn',
-        'quipsharp_info',
-    ],
     'HQQ': [
         'hqq_backend',
         'trust_remote_code',
@@ -205,9 +189,7 @@ def transformers_samplers():
 loaders_samplers = {
     'Transformers': transformers_samplers(),
     'AutoGPTQ': transformers_samplers(),
-    'GPTQ-for-LLaMa': transformers_samplers(),
     'AutoAWQ': transformers_samplers(),
-    'QuIP#': transformers_samplers(),
     'HQQ': transformers_samplers(),
     'ExLlamav2': {
         'temperature',
@@ -339,15 +321,6 @@ def transformers_samplers():
     },
 }
 
-loaders_model_types = {
-    'GPTQ-for-LLaMa': [
-        "None",
-        "llama",
-        "opt",
-        "gptj"
-    ],
-}
-
 
 @functools.cache
 def list_all_samplers():
@@ -375,13 +348,6 @@ def blacklist_samplers(loader, dynamic_temperature):
     return output
 
 
-def get_model_types(loader):
-    if loader in loaders_model_types:
-        return loaders_model_types[loader]
-
-    return ["None"]
-
-
 def get_gpu_memory_keys():
     return [k for k in shared.gradio if k.startswith('gpu_memory')]
```
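With GPTQ-for-LLaMa gone, `loaders_model_types` lost its only entry, so `get_model_types()` (deleted above) had nothing left to look up and every call would have returned `["None"]`. The removed helper was just a dict lookup with a default, as this reproduction of the old code shows:

```python
# The helper removed from modules/loaders.py, reproduced for reference.
loaders_model_types = {
    'GPTQ-for-LLaMa': ["None", "llama", "opt", "gptj"],
}

def get_model_types(loader):
    if loader in loaders_model_types:
        return loaders_model_types[loader]
    return ["None"]

# Equivalent one-liner; once the dict's last entry was deleted, the default
# branch was the only one left, so the whole helper could be dropped.
def get_model_types_short(loader):
    return loaders_model_types.get(loader, ["None"])
```

Callers that previously branched on the returned model types can now assume the `["None"]` default unconditionally, which is what lets the commit delete the UI's model-type dropdown plumbing along with the loader.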
