73 changes: 0 additions & 73 deletions docs/08 - Additional Tips.md
@@ -58,79 +58,6 @@ pip install -U deepspeed
deepspeed --num_gpus=1 server.py --deepspeed --chat --model gpt-j-6B
```

> RWKV: RNN with Transformer-level LLM Performance
>
> It combines the best of RNN and transformer - great performance, fast inference, saves VRAM, fast training, "infinite" ctx_len, and free sentence embedding (using the final hidden state).

https://github.com/BlinkDL/RWKV-LM

https://github.com/BlinkDL/ChatRWKV

## Using RWKV in the web UI

### Hugging Face weights

Simply download the weights from https://huggingface.co/RWKV and load them as you would for any other model.

There is a bug in transformers==4.29.2 that prevents RWKV from being loaded in 8-bit mode. You can install the development branch, which contains the fix: `pip install git+https://github.com/huggingface/transformers`

### Original .pth weights

The instructions below are from before RWKV was supported in transformers, and they are kept for legacy purposes. The old implementation is possibly faster, but it lacks the full range of samplers that the transformers library offers.

#### 0. Install the RWKV library

```
pip install rwkv
```

`0.7.3` was the last version that I tested. If you experience any issues, try `pip install rwkv==0.7.3`.

#### 1. Download the model

It is available in different sizes:

* https://huggingface.co/BlinkDL/rwkv-4-pile-3b/
* https://huggingface.co/BlinkDL/rwkv-4-pile-7b/
* https://huggingface.co/BlinkDL/rwkv-4-pile-14b/

There are also older releases in smaller sizes, such as:

* https://huggingface.co/BlinkDL/rwkv-4-pile-169m/resolve/main/RWKV-4-Pile-169M-20220807-8023.pth

Download the chosen `.pth` and put it directly in the `models` folder.

#### 2. Download the tokenizer

[20B_tokenizer.json](https://raw.githubusercontent.com/BlinkDL/ChatRWKV/main/v2/20B_tokenizer.json)

Also put it directly in the `models` folder, and make sure not to rename it: it must be called `20B_tokenizer.json`.

#### 3. Launch the web UI

No additional steps are required. Just launch it as you would with any other model.

```
python server.py --listen --no-stream --model RWKV-4-Pile-169M-20220807-8023.pth
```

#### Setting a custom strategy

The `--rwkv-strategy` flag gives you fine-grained control over offloading and precision for the model. Possible values include:

```
"cpu fp32" # CPU mode
"cuda fp16" # GPU mode with float16 precision
"cuda fp16 *30 -> cpu fp32" # GPU+CPU offloading. The higher the number after *, the higher the GPU allocation.
"cuda fp16i8" # GPU mode with 8-bit precision
```
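
For example, to run the model from step 1 with GPU+CPU offloading, you could combine the launch command above with a custom strategy (an illustrative command; tune the number after `*` to your available VRAM):

```
python server.py --listen --no-stream --model RWKV-4-Pile-169M-20220807-8023.pth --rwkv-strategy "cuda fp16 *30 -> cpu fp32"
```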

See the README for the PyPI package for more details: https://pypi.org/project/rwkv/

#### Compiling the CUDA kernel

You can compile the CUDA kernel for the model with `--rwkv-cuda-on`. This should improve performance considerably, but I haven't been able to get it to work yet.

## Miscellaneous info

### You can train LoRAs in CPU mode
11 changes: 8 additions & 3 deletions extensions/gallery/script.py
@@ -3,9 +3,14 @@
import gradio as gr

from modules.html_generator import get_image_cache
from modules.shared import gradio, settings
from modules.shared import gradio


params = {
    'items_per_page': 50,
    'open': False,
}

cards = []


@@ -104,7 +109,7 @@ def custom_js():


def ui():
with gr.Accordion("Character gallery", open=settings["gallery-open"], elem_id='gallery-extension'):
with gr.Accordion("Character gallery", open=params["open"], elem_id='gallery-extension'):
gr.HTML(value="<style>" + generate_css() + "</style>")
with gr.Row():
filter_box = gr.Textbox(label='', placeholder='Filter', lines=1, max_lines=1, container=False, elem_id='gallery-filter-box')
@@ -116,7 +121,7 @@ def ui():
label="",
samples=generate_html(),
elem_classes=["character-gallery"],
samples_per_page=settings["gallery-items_per_page"]
samples_per_page=params["items_per_page"]
)

filter_box.change(lambda: None, None, None, js=f'() => {{{custom_js()}; gotoFirstPage()}}').success(
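Note: assuming the web UI's usual extension-params convention, where `shared.settings` keys of the form `{extension}-{param}` override an extension's `params` dict, these defaults should still be customizable from `settings.yaml`. A hypothetical override:

```
gallery-items_per_page: 100
gallery-open: true
```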
26 changes: 26 additions & 0 deletions instruction-templates/Command-R.yaml
@@ -0,0 +1,26 @@
instruction_template: |-
  {%- if messages[0]['role'] == 'system' -%}
  {%- set loop_messages = messages[1:] -%}
  {%- set system_message = messages[0]['content'] -%}
  {%- elif false == true -%}
  {%- set loop_messages = messages -%}
  {%- set system_message = 'You are Command-R, a brilliant, sophisticated, AI-assistant trained to assist human users by providing thorough responses. You are trained by Cohere.' -%}
  {%- else -%}
  {%- set loop_messages = messages -%}
  {%- set system_message = false -%}
  {%- endif -%}
  {%- if system_message != false -%}
  {{ '<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>' + system_message + '<|END_OF_TURN_TOKEN|>' }}
  {%- endif -%}
  {%- for message in loop_messages -%}
  {%- set content = message['content'] -%}
  {%- if message['role'] == 'user' -%}
  {{ '<|START_OF_TURN_TOKEN|><|USER_TOKEN|>' + content.strip() + '<|END_OF_TURN_TOKEN|>' }}
  {%- elif message['role'] == 'assistant' -%}
  {{ '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>' + content.strip() + '<|END_OF_TURN_TOKEN|>' }}
  {%- endif -%}
  {%- endfor -%}
  {%- if add_generation_prompt -%}
  {{ '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>' }}
  {%- endif -%}
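
For reference, rendering this template with a system message, one user turn, and `add_generation_prompt` enabled produces a single line shaped roughly like this (message text is illustrative):

```
<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>You are a helpful assistant.<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|USER_TOKEN|>Hello!<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>
```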

19 changes: 19 additions & 0 deletions js/main.js
@@ -464,3 +464,22 @@ function handleVisibilityChange(isVisible) {
}

respondToRenameVisibility(renameTextArea, handleVisibilityChange);

//------------------------------------------------
// Adjust the chat tab margin if no extension UI
// is present at the bottom
//------------------------------------------------

if (document.getElementById('extensions') === null) {
  document.getElementById("chat-tab").style.marginBottom = "-29px";
}

//------------------------------------------------
// Focus on the chat input after starting a new chat
//------------------------------------------------

document.querySelectorAll('.focus-on-chat-input').forEach(element => {
  element.addEventListener('click', function() {
    document.querySelector('#chat-input textarea').focus();
  });
});
2 changes: 2 additions & 0 deletions models/config.yaml
@@ -192,3 +192,5 @@
instruction_template: 'Synthia'
.*(hercules|hyperion):
instruction_template: 'ChatML'
.*command-r:
instruction_template: 'Command-R'
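
For context, a minimal sketch of how such a key is presumably applied (the keys are regular expressions matched against the lowercased model name; the function name and matching details here are illustrative):

```
import re

def pick_instruction_template(model_name):
    # Illustrative only: each key in models/config.yaml is a regex
    # pattern; the first one matching the model name wins.
    rules = {r'.*command-r': 'Command-R'}
    for pattern, template in rules.items():
        if re.match(pattern, model_name.lower()):
            return template
    return None

print(pick_instruction_template('c4ai-command-r-v01'))  # -> Command-R
```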
2 changes: 1 addition & 1 deletion modules/chat.py
@@ -125,7 +125,7 @@ def generate_chat_prompt(user_input, state, **kwargs):
    messages.append({"role": "user", "content": user_input})

    def remove_extra_bos(prompt):
        for bos_token in ['<s>', '<|startoftext|>']:
        for bos_token in ['<s>', '<|startoftext|>', '<BOS_TOKEN>', '<|endoftext|>']:
            while prompt.startswith(bos_token):
                prompt = prompt[len(bos_token):]

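For reference, a self-contained sketch of the updated helper (the trailing `return` sits outside the visible hunk and is assumed):

```
def remove_extra_bos(prompt):
    # Strip duplicated beginning-of-sequence tokens, including the
    # Command-R style '<BOS_TOKEN>' and the GPT-style '<|endoftext|>'.
    for bos_token in ['<s>', '<|startoftext|>', '<BOS_TOKEN>', '<|endoftext|>']:
        while prompt.startswith(bos_token):
            prompt = prompt[len(bos_token):]
    return prompt

print(remove_extra_bos('<BOS_TOKEN><BOS_TOKEN>Hello'))  # -> Hello
```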
23 changes: 18 additions & 5 deletions modules/models.py
@@ -1,6 +1,7 @@
import gc
import logging
import os
import pprint
import re
import time
import traceback
@@ -126,15 +127,19 @@ def huggingface_loader(model_name):
path_to_model = Path(f'{shared.args.model_dir}/{model_name}')
params = {
'low_cpu_mem_usage': True,
'trust_remote_code': shared.args.trust_remote_code,
'torch_dtype': torch.bfloat16 if shared.args.bf16 else torch.float16,
'use_safetensors': True if shared.args.force_safetensors else None
}

if shared.args.trust_remote_code:
params['trust_remote_code'] = True

if shared.args.use_flash_attention_2:
params['use_flash_attention_2'] = True

config = AutoConfig.from_pretrained(path_to_model, trust_remote_code=params['trust_remote_code'])
if shared.args.force_safetensors:
params['force_safetensors'] = True

config = AutoConfig.from_pretrained(path_to_model, trust_remote_code=shared.args.trust_remote_code)

if 'chatglm' in model_name.lower():
LoaderClass = AutoModel
@@ -147,6 +152,10 @@ def huggingface_loader(model_name):

# Load the model without any special settings
if not any([shared.args.cpu, shared.args.load_in_8bit, shared.args.load_in_4bit, shared.args.auto_devices, shared.args.disk, shared.args.deepspeed, shared.args.gpu_memory is not None, shared.args.cpu_memory is not None, shared.args.compress_pos_emb > 1, shared.args.alpha_value > 1, shared.args.disable_exllama, shared.args.disable_exllamav2]):
logger.info("TRANSFORMERS_PARAMS=")
pprint.PrettyPrinter(indent=4, sort_dicts=False).pprint(params)
print()

model = LoaderClass.from_pretrained(path_to_model, **params)
if not (hasattr(model, 'is_loaded_in_4bit') and model.is_loaded_in_4bit):
if torch.backends.mps.is_available():
@@ -175,7 +184,9 @@ def huggingface_loader(model_name):
params['torch_dtype'] = torch.float32
else:
params['device_map'] = 'auto'
params['max_memory'] = get_max_memory_dict()
if x := get_max_memory_dict():
params['max_memory'] = x

if shared.args.load_in_4bit:
# See https://github.com/huggingface/transformers/pull/23479/files
# and https://huggingface.co/blog/4bit-transformers-bitsandbytes
@@ -186,7 +197,6 @@ def huggingface_loader(model_name):
'bnb_4bit_use_double_quant': shared.args.use_double_quant,
}

logger.info('Using the following 4-bit params: ' + str(quantization_config_params))
params['quantization_config'] = BitsAndBytesConfig(**quantization_config_params)

elif shared.args.load_in_8bit:
@@ -230,6 +240,9 @@ def huggingface_loader(model_name):
elif shared.args.alpha_value > 1:
params['rope_scaling'] = {'type': 'dynamic', 'factor': RoPE.get_alpha_value(shared.args.alpha_value, shared.args.rope_freq_base)}

logger.info("TRANSFORMERS_PARAMS=")
pprint.PrettyPrinter(indent=4, sort_dicts=False).pprint(params)
print()
model = LoaderClass.from_pretrained(path_to_model, **params)

return model
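With these changes, the final keyword arguments are printed just before `from_pretrained` is called, so a load along the default path would log something roughly like this (values illustrative):

```
TRANSFORMERS_PARAMS=
{'low_cpu_mem_usage': True, 'torch_dtype': torch.float16}
```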
3 changes: 3 additions & 0 deletions modules/models_settings.py
@@ -117,6 +117,9 @@ def get_model_metadata(model):
metadata = json.loads(open(path, 'r', encoding='utf-8').read())
if 'chat_template' in metadata:
template = metadata['chat_template']
if isinstance(template, list):
template = template[0]['template']

for k in ['eos_token', 'bos_token']:
if k in metadata:
value = metadata[k]
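Some newer `tokenizer_config.json` files store `chat_template` as a list of named templates rather than a single string; a minimal sketch of the shape this branch handles (contents illustrative):

```
# Hypothetical metadata parsed from tokenizer_config.json.
metadata = {
    'chat_template': [
        {'name': 'default', 'template': '{% for message in messages %}...{% endfor %}'},
    ]
}

template = metadata['chat_template']
if isinstance(template, list):
    # Use the first named template when a list is provided.
    template = template[0]['template']
```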
4 changes: 1 addition & 3 deletions modules/shared.py
@@ -62,9 +62,7 @@
'chat_template_str': "{%- for message in messages %}\n {%- if message['role'] == 'system' -%}\n {%- if message['content'] -%}\n {{- message['content'] + '\\n\\n' -}}\n {%- endif -%}\n {%- if user_bio -%}\n {{- user_bio + '\\n\\n' -}}\n {%- endif -%}\n {%- else -%}\n {%- if message['role'] == 'user' -%}\n {{- name1 + ': ' + message['content'] + '\\n'-}}\n {%- else -%}\n {{- name2 + ': ' + message['content'] + '\\n' -}}\n {%- endif -%}\n {%- endif -%}\n{%- endfor -%}",
'chat-instruct_command': 'Continue the chat dialogue below. Write a single reply for the character "<|character|>".\n\n<|prompt|>',
'autoload_model': False,
'gallery-items_per_page': 50,
'gallery-open': False,
'default_extensions': ['gallery'],
'default_extensions': [],
}

default_settings = copy.deepcopy(settings)
12 changes: 6 additions & 6 deletions modules/ui_chat.py
@@ -68,14 +68,14 @@ def create_ui():
with gr.Row():
shared.gradio['rename_chat'] = gr.Button('Rename', elem_classes='refresh-button', interactive=not mu)
shared.gradio['delete_chat'] = gr.Button('🗑️', elem_classes='refresh-button', interactive=not mu)
shared.gradio['delete_chat-confirm'] = gr.Button('Confirm', variant='stop', visible=False, elem_classes='refresh-button')
shared.gradio['delete_chat-cancel'] = gr.Button('Cancel', visible=False, elem_classes='refresh-button')
shared.gradio['Start new chat'] = gr.Button('New chat', elem_classes='refresh-button')
shared.gradio['delete_chat-confirm'] = gr.Button('Confirm', variant='stop', visible=False, elem_classes=['refresh-button', 'focus-on-chat-input'])
shared.gradio['delete_chat-cancel'] = gr.Button('Cancel', visible=False, elem_classes=['refresh-button', 'focus-on-chat-input'])
shared.gradio['Start new chat'] = gr.Button('New chat', elem_classes=['refresh-button', 'focus-on-chat-input'])

with gr.Row(elem_id='rename-row'):
shared.gradio['rename_to'] = gr.Textbox(label='Rename to:', placeholder='New name', visible=False, elem_classes=['no-background'])
shared.gradio['rename_to-confirm'] = gr.Button('Confirm', visible=False, elem_classes='refresh-button')
shared.gradio['rename_to-cancel'] = gr.Button('Cancel', visible=False, elem_classes='refresh-button')
shared.gradio['rename_to-confirm'] = gr.Button('Confirm', visible=False, elem_classes=['refresh-button', 'focus-on-chat-input'])
shared.gradio['rename_to-cancel'] = gr.Button('Cancel', visible=False, elem_classes=['refresh-button', 'focus-on-chat-input'])

with gr.Row(elem_id='chat-controls', elem_classes=['pretty_scrollbar']):
with gr.Column():
@@ -378,4 +378,4 @@ def create_event_handlers():
partial(chat.generate_chat_prompt, '', _continue=True), gradio('interface_state'), gradio('textbox-notebook')).then(
lambda: None, None, None, js=f'() => {{{ui.switch_tabs_js}; switch_to_notebook()}}')

shared.gradio['show_controls'].change(None, gradio('show_controls'), None, js=f'(x) => {{{ui.show_controls_js}; toggle_controls(x)}}')
shared.gradio['show_controls'].change(lambda x: None, gradio('show_controls'), None, js=f'(x) => {{{ui.show_controls_js}; toggle_controls(x)}}')
28 changes: 14 additions & 14 deletions requirements.txt
@@ -1,10 +1,10 @@
accelerate==0.27.*
aqlm[gpu,cpu]==1.1.2; platform_system == "Linux"
aqlm[gpu,cpu]==1.1.3; platform_system == "Linux"
bitsandbytes==0.43.*
colorama
datasets
einops
gradio==4.23.*
gradio==4.25.*
hqq==0.1.5
jinja2==3.1.2
lm_eval==0.3.0
@@ -33,22 +33,22 @@ sse-starlette==1.6.5
tiktoken

# llama-cpp-python (CPU only, AVX2)
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.59+cpuavx2-cp311-cp311-manylinux_2_31_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.59+cpuavx2-cp310-cp310-manylinux_2_31_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.59+cpuavx2-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.59+cpuavx2-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.60+cpuavx2-cp311-cp311-manylinux_2_31_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.60+cpuavx2-cp310-cp310-manylinux_2_31_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.60+cpuavx2-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.60+cpuavx2-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"

# llama-cpp-python (CUDA, no tensor cores)
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.2.59+cu121-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.2.59+cu121-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.2.59+cu121-cp311-cp311-manylinux_2_31_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.2.59+cu121-cp310-cp310-manylinux_2_31_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.2.60+cu121-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.2.60+cu121-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.2.60+cu121-cp311-cp311-manylinux_2_31_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.2.60+cu121-cp310-cp310-manylinux_2_31_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"

# llama-cpp-python (CUDA, tensor cores)
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda_tensorcores-0.2.59+cu121-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda_tensorcores-0.2.59+cu121-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda_tensorcores-0.2.59+cu121-cp311-cp311-manylinux_2_31_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda_tensorcores-0.2.59+cu121-cp310-cp310-manylinux_2_31_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda_tensorcores-0.2.60+cu121-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda_tensorcores-0.2.60+cu121-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda_tensorcores-0.2.60+cu121-cp311-cp311-manylinux_2_31_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda_tensorcores-0.2.60+cu121-cp310-cp310-manylinux_2_31_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"

# CUDA wheels
https://github.com/jllllll/AutoGPTQ/releases/download/v0.6.0/auto_gptq-0.6.0+cu121-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"