21 commits
- 6b546a2 llama.cpp: increase the max threads from 32 to 256 (#5889) (chraac, May 19, 2024)
- 2de586f Update accelerate requirement from ==0.27.* to ==0.30.* (#5989) (dependabot[bot], May 19, 2024)
- b63dc4e UI: Warn user if they are trying to load a model from no path (#6006) (poshul, May 19, 2024)
- 8456d13 [docs] small docker changes (#5917) (jvanmelckebeke, May 19, 2024)
- 5cb5970 fix: grammar not support utf-8 (#5900) (A0nameless0man, May 19, 2024)
- d7bd3da Add Llama 3 instruction template (#5891) (Touch-Night, May 19, 2024)
- 907702c Fix gguf multipart file loading (#5857) (Tisjwlf, May 19, 2024)
- 818b4e0 Let grammar escape backslashes (#5865) (altoiddealer, May 19, 2024)
- 9f77ed1 --idle-timeout flag to unload the model if unused for N minutes (#6026) (oobabooga, May 20, 2024)
- 852c943 DRY: A modern repetition penalty that reliably prevents looping (#5677) (p-e-w, May 20, 2024)
- 6a1682a README: update command-line flags with raw --help output (oobabooga, May 20, 2024)
- bd7cc42 Backend cleanup (#6025) (oobabooga, May 21, 2024)
- ae86292 Fix getting Phi-3-small-128k-instruct logits (oobabooga, May 21, 2024)
- 9e18994 Minor fix after bd7cc4234d0d2cc890c5e023f67741615c44484a (thanks @bel…) (oobabooga, May 21, 2024)
- 8aaa0a6 Fixed minor typo in docs - Training Tab.md (#6038) (iamrohitanshu, May 21, 2024)
- 5499bc9 Fix stopping strings for llama-3 and phi (#6043) (oobabooga, May 22, 2024)
- ad54d52 Revert "Fix stopping strings for llama-3 and phi (#6043)" (oobabooga, May 23, 2024)
- 4f1e96b Downloader: Add --model-dir argument, respect --model-dir in the UI (oobabooga, May 24, 2024)
- 8df68b0 Remove MinPLogitsWarper (it's now a transformers built-in) (oobabooga, May 27, 2024)
- a363cdf Fix missing bos token for some models (including Llama-3) (#6050) (belladoreai, May 27, 2024)
- 9a33e4f Accelerate DRYLogitsProcessor by using numpy on cpu (fix #5677) (jojje, Jun 3, 2024)
304 changes: 141 additions & 163 deletions README.md

Large diffs are not rendered by default.

8 changes: 0 additions & 8 deletions docs/04 - Model Tab.md
@@ -64,14 +64,6 @@ Loads: GPTQ models.
* **no_use_cuda_fp16**: On some systems, the performance can be very bad with this unset. Can usually be ignored.
* **desc_act**: For ancient models without proper metadata, sets the model "act-order" parameter manually. Can usually be ignored.

### GPTQ-for-LLaMa

Loads: GPTQ models.

Ancient loader, the first one to implement 4-bit quantization. It works on older GPUs for which ExLlamaV2 and AutoGPTQ do not work, and it doesn't work with "act-order", so you should use it with simple 4-bit-128g models.

* **pre_layer**: Used for CPU offloading. The higher the number, the more layers will be sent to the GPU. GPTQ-for-LLaMa CPU offloading was faster than the one implemented in AutoGPTQ the last time I checked.

### llama.cpp

Loads: GGUF models. Note: GGML models have been deprecated and do not work anymore.
2 changes: 1 addition & 1 deletion docs/05 - Training Tab.md
@@ -124,7 +124,7 @@ When you're running training, the WebUI's console window will log reports that i

"Loss" in the world of AI training theoretically means "how close is the model to perfect", with `0` meaning "absolutely perfect". This is calculated by measuring the difference between the model outputting exactly the text you're training it to output, and what it actually outputs.

In practice, a good LLM should have a very complex variable range of ideas running in its artificial head, so a loss of `0` would indicate that the model has broken and forgotten to how think about anything other than what you trained it.
In practice, a good LLM should have a very complex variable range of ideas running in its artificial head, so a loss of `0` would indicate that the model has broken and forgotten how to think about anything other than what you trained it on.

So, in effect, Loss is a balancing game: you want to get it low enough that it understands your data, but high enough that it isn't forgetting everything else. Generally, if it goes below `1.0`, it's going to start forgetting its prior memories, and you should stop training. In some cases you may prefer to take it as low as `0.5` (if you want it to be very very predictable). Different goals have different needs, so don't be afraid to experiment and see what works best for you.
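The relationship between per-token probability and loss described above can be sketched as a cross-entropy calculation. This is an illustrative toy, not the trainer's actual code; the token ids and probabilities are made up:

```python
import math

def cross_entropy_loss(predicted_probs, target_ids):
    """Average negative log-likelihood of the target tokens.

    predicted_probs: one dict per position, mapping token id -> probability
    target_ids: the token ids the model was supposed to produce
    """
    total = 0.0
    for probs, target in zip(predicted_probs, target_ids):
        total += -math.log(probs[target])
    return total / len(target_ids)

# A model that is certain of every target token has loss 0 ("perfect"):
print(cross_entropy_loss([{1: 1.0}], [1]))  # 0.0

# Putting ~37% probability on each correct token gives a loss near 1.0,
# the rough "stop training" threshold mentioned above:
print(round(cross_entropy_loss([{1: 0.37}, {2: 0.37}], [1, 2]), 2))
```

So "loss below 1.0" roughly means the model assigns more than about a third of its probability mass to every training token, which is why very low values indicate memorization.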

22 changes: 0 additions & 22 deletions docs/08 - Additional Tips.md
@@ -13,28 +13,6 @@ Source: https://github.com/AUTOMATIC1111/stable-diffusion-webui/pull/1126

This file will be automatically detected the next time you start the web UI.

## Using LoRAs with GPTQ-for-LLaMa

This requires using a monkey patch that is supported by this web UI: https://github.com/johnsmith0031/alpaca_lora_4bit

To use it:

Install alpaca_lora_4bit using pip

```
git clone https://github.com/johnsmith0031/alpaca_lora_4bit.git
cd alpaca_lora_4bit
git fetch origin winglian-setup_pip
git checkout winglian-setup_pip
pip install .
```

Start the UI with the --monkey-patch flag:

```
python server.py --model llama-7b-4bit-128g --listen --lora tloen_alpaca-lora-7b --monkey-patch
```

## DeepSpeed

`DeepSpeed ZeRO-3` is an alternative offloading strategy for full-precision (16-bit) transformers models.
2 changes: 1 addition & 1 deletion docs/09 - Docker.md
@@ -19,7 +19,7 @@ Use these commands to launch the image:

```
cd text-generation-webui
ln -s docker/{nvidia/Dockerfile,docker-compose.yml,.dockerignore} .
ln -s docker/{nvidia/Dockerfile,nvidia/docker-compose.yml,.dockerignore} .
cp docker/.env.example .env
# Edit .env and set TORCH_CUDA_ARCH_LIST based on your GPU model
docker compose up --build
8 changes: 2 additions & 6 deletions docs/What Works.md
@@ -2,15 +2,13 @@

| Loader | Loading 1 LoRA | Loading 2 or more LoRAs | Training LoRAs | Multimodal extension | Perplexity evaluation |
|----------------|----------------|-------------------------|----------------|----------------------|-----------------------|
| Transformers | ✅ | ✅\*\*\* | ✅\* | ✅ | ✅ |
| Transformers | ✅ | ✅\*\* | ✅\* | ✅ | ✅ |
| llama.cpp | ❌ | ❌ | ❌ | ❌ | use llamacpp_HF |
| llamacpp_HF | ❌ | ❌ | ❌ | ❌ | ✅ |
| ExLlamav2_HF | ✅ | ✅ | ❌ | ❌ | ✅ |
| ExLlamav2 | ✅ | ✅ | ❌ | ❌ | use ExLlamav2_HF |
| AutoGPTQ | ✅ | ❌ | ❌ | ✅ | ✅ |
| AutoAWQ | ? | ❌ | ? | ? | ✅ |
| GPTQ-for-LLaMa | ✅\*\* | ✅\*\*\* | ✅ | ✅ | ✅ |
| QuIP# | ? | ? | ? | ? | ✅ |
| HQQ | ? | ? | ? | ? | ✅ |

❌ = not implemented
@@ -19,6 +17,4 @@

\* Training LoRAs with GPTQ models also works with the Transformers loader. Make sure to check "auto-devices" and "disable_exllama" before loading the model.

\*\* Requires the monkey-patch. The instructions can be found [here](https://github.com/oobabooga/text-generation-webui/wiki/08-%E2%80%90-Additional-Tips#using-loras-with-gptq-for-llama).

\*\*\* Multi-LoRA in PEFT is tricky and the current implementation does not work reliably in all cases.
\*\* Multi-LoRA in PEFT is tricky and the current implementation does not work reliably in all cases.
12 changes: 8 additions & 4 deletions download-model.py
@@ -167,8 +167,11 @@ def get_download_links_from_huggingface(self, model, branch, text_only=False, sp
is_llamacpp = has_gguf and specific_file is not None
return links, sha256, is_lora, is_llamacpp

def get_output_folder(self, model, branch, is_lora, is_llamacpp=False):
base_folder = 'models' if not is_lora else 'loras'
def get_output_folder(self, model, branch, is_lora, is_llamacpp=False, model_dir=None):
if model_dir:
base_folder = model_dir
else:
base_folder = 'models' if not is_lora else 'loras'

# If the model is of type GGUF, save directly in the base_folder
if is_llamacpp:
@@ -304,7 +307,8 @@ def check_model_files(self, model, branch, links, sha256, output_folder):
parser.add_argument('--threads', type=int, default=4, help='Number of files to download simultaneously.')
parser.add_argument('--text-only', action='store_true', help='Only download text files (txt/json).')
parser.add_argument('--specific-file', type=str, default=None, help='Name of the specific file to download (if not provided, downloads all).')
parser.add_argument('--output', type=str, default=None, help='The folder where the model should be saved.')
parser.add_argument('--output', type=str, default=None, help='Save the model files to this folder.')
parser.add_argument('--model-dir', type=str, default=None, help='Save the model files to a subfolder of this folder instead of the default one (text-generation-webui/models).')
parser.add_argument('--clean', action='store_true', help='Does not resume the previous download.')
parser.add_argument('--check', action='store_true', help='Validates the checksums of model files.')
parser.add_argument('--max-retries', type=int, default=5, help='Max retries count when get error in download time.')
@@ -333,7 +337,7 @@ def check_model_files(self, model, branch, links, sha256, output_folder):
if args.output:
output_folder = Path(args.output)
else:
output_folder = downloader.get_output_folder(model, branch, is_lora, is_llamacpp=is_llamacpp)
output_folder = downloader.get_output_folder(model, branch, is_lora, is_llamacpp=is_llamacpp, model_dir=args.model_dir)

if args.check:
# Check previously downloaded files
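The precedence introduced by this change can be sketched in isolation: `--output` (handled at the call site) wins outright, `--model-dir` replaces the default base folder, and GGUF downloads skip the per-model subfolder. This is a simplified re-implementation for illustration, not the module itself, and the `org_model` subfolder naming is an assumption about the surrounding code:

```python
from pathlib import Path

def get_output_folder(model, branch, is_lora, is_llamacpp=False, model_dir=None):
    # --model-dir overrides the default base folder; --output (handled by the
    # caller) bypasses this function entirely.
    if model_dir:
        base_folder = model_dir
    else:
        base_folder = 'models' if not is_lora else 'loras'

    # Assumed naming scheme: "org_model", with the branch appended if not main
    output_folder = '_'.join(model.split('/')[-2:])
    if branch != 'main':
        output_folder += f'_{branch}'

    # GGUF files are saved directly in the base folder, without a subfolder
    if is_llamacpp:
        return Path(base_folder)
    return Path(base_folder) / output_folder

print(get_output_folder('org/model', 'main', False, model_dir='/data/models'))
```

With this layering, `python download-model.py org/model --model-dir /data/models` would place files under `/data/models/org_model` instead of `text-generation-webui/models/org_model`.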
4 changes: 4 additions & 0 deletions extensions/openai/typing.py
@@ -33,6 +33,10 @@ class GenerationOptions(BaseModel):
seed: int = -1
encoder_repetition_penalty: float = 1
no_repeat_ngram_size: int = 0
dry_multiplier: float = 0
dry_base: float = 1.75
dry_allowed_length: int = 2
dry_sequence_breakers: str = '"\\n", ":", "\\"", "*"'
truncation_length: int = 0
max_tokens_second: int = 0
prompt_lookup_num_tokens: int = 0
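With these fields on `GenerationOptions`, the DRY sampler becomes controllable through the OpenAI-compatible API. A minimal request body might look like the following; the parameter values are illustrative, not recommendations:

```python
import json

# DRY penalizes extending n-grams the output has already produced;
# dry_multiplier = 0 (the default above) disables it entirely.
payload = {
    "prompt": "Once upon a time",
    "max_tokens": 200,
    "dry_multiplier": 0.8,    # overall penalty strength; 0 = off
    "dry_base": 1.75,         # growth rate as a repeated sequence gets longer
    "dry_allowed_length": 2,  # repeats up to this length go unpenalized
    # tokens that interrupt sequence matching, as a quoted comma-separated list
    "dry_sequence_breakers": '"\\n", ":", "\\"", "*"',
}

print(json.dumps(payload, indent=2))
```

The payload could then be POSTed to the server's completions endpoint (for example with `requests.post(url, json=payload)`, where `url` points at your local API instance).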
13 changes: 13 additions & 0 deletions instruction-templates/Llama-v3.yaml
@@ -0,0 +1,13 @@
instruction_template: |-
{%- set ns = namespace(found=false) -%}
{%- for message in messages -%}
{%- if message['role'] == 'system' -%}
{%- set ns.found = true -%}
{%- endif -%}
{%- endfor -%}
{%- for message in messages -%}
{{- '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n' + message['content'].rstrip() + '<|eot_id|>' -}}
{%- endfor -%}
{%- if add_generation_prompt -%}
{{-'<|start_header_id|>assistant<|end_header_id|>\n\n'-}}
{%- endif -%}
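The prompt this Jinja template produces can be sketched as a plain-Python equivalent, for illustration only (note that the template's `ns.found` flag is set but never read in the snippet shown, so it is omitted here):

```python
def render_llama3_prompt(messages, add_generation_prompt=True):
    # Mirrors the template above: every message becomes
    # <|start_header_id|>role<|end_header_id|>\n\ncontent<|eot_id|>
    out = ""
    for m in messages:
        out += (
            f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n"
            f"{m['content'].rstrip()}<|eot_id|>"
        )
    # Open an assistant header so generation continues as the assistant
    if add_generation_prompt:
        out += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return out

print(render_llama3_prompt([
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Hi!"},
]))
```

Running this with a system and a user message shows the Llama 3 wire format: each turn wrapped in header tokens and terminated by `<|eot_id|>`, with a dangling assistant header at the end when a generation prompt is requested.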
2 changes: 1 addition & 1 deletion modules/AutoGPTQ_loader.py
@@ -44,7 +44,7 @@ def load_quantized(model_name):
'model_basename': pt_path.stem,
'device': "xpu:0" if is_xpu_available() else "cuda:0" if not shared.args.cpu else "cpu",
'use_triton': shared.args.triton,
'inject_fused_attention': not shared.args.no_inject_fused_attention,
'inject_fused_attention': False,
'inject_fused_mlp': not shared.args.no_inject_fused_mlp,
'use_safetensors': use_safetensors,
'trust_remote_code': shared.args.trust_remote_code,
171 changes: 0 additions & 171 deletions modules/GPTQ_loader.py

This file was deleted.

8 changes: 0 additions & 8 deletions modules/chat.py
@@ -308,9 +308,6 @@ def chatbot_wrapper(text, state, regenerate=False, _continue=False, loading_mess
'internal': output['internal']
}

if shared.model_name == 'None' or shared.model is None:
raise ValueError("No model is loaded! Select one in the Model tab.")

# Generate the prompt
kwargs = {
'_continue': _continue,
@@ -355,11 +352,6 @@ def impersonate_wrapper(text, state):

static_output = chat_html_wrapper(state['history'], state['name1'], state['name2'], state['mode'], state['chat_style'], state['character_menu'])

if shared.model_name == 'None' or shared.model is None:
logger.error("No model is loaded! Select one in the Model tab.")
yield '', static_output
return

prompt = generate_chat_prompt('', state, impersonate=True)
stopping_strings = get_stopping_strings(state)
