
Commit 9b623b8

Bump llama-cpp-python to 0.2.64, use official wheels (#5921)

1 parent: 0877741

16 files changed: +53 additions, -325 deletions

README.md

Lines changed: 5 additions & 9 deletions

@@ -107,14 +107,11 @@ pip install -r <requirements file according to table below>
 
 Requirements file to use:
 
-| GPU | CPU | requirements file to use |
+| GPU | requirements file to use |
 |--------|---------|---------|
-| NVIDIA | has AVX2 | `requirements.txt` |
-| NVIDIA | no AVX2 | `requirements_noavx2.txt` |
-| AMD | has AVX2 | `requirements_amd.txt` |
-| AMD | no AVX2 | `requirements_amd_noavx2.txt` |
-| CPU only | has AVX2 | `requirements_cpu_only.txt` |
-| CPU only | no AVX2 | `requirements_cpu_only_noavx2.txt` |
+| NVIDIA | `requirements.txt` |
+| AMD | `requirements_amd.txt` |
+| CPU only | `requirements_cpu_only.txt` |
 | Apple | Intel | `requirements_apple_intel.txt` |
 | Apple | Apple Silicon | `requirements_apple_silicon.txt` |

@@ -132,7 +129,7 @@ Then browse to
 
 ##### AMD GPU on Windows
 
-1) Use `requirements_cpu_only.txt` or `requirements_cpu_only_noavx2.txt` in the command above.
+1) Use `requirements_cpu_only.txt` in the command above.
 
 2) Manually install llama-cpp-python using the appropriate command for your hardware: [Installation from PyPI](https://github.com/abetlen/llama-cpp-python#installation-with-hardware-acceleration).
 * Use the `LLAMA_HIPBLAS=on` toggle.

@@ -255,7 +252,6 @@ List of command-line flags
 
 | Flag | Description |
 |-------------|-------------|
-| `--tensorcores` | Use llama-cpp-python compiled with tensor cores support. This increases performance on RTX cards. NVIDIA only. |
 | `--n_ctx N_CTX` | Size of the prompt context. |
 | `--threads` | Number of threads to use. |
 | `--threads-batch THREADS_BATCH` | Number of threads to use for batches/prompt processing. |

docs/04 - Model Tab.md

Lines changed: 1 addition & 3 deletions

@@ -21,7 +21,7 @@ Options:
 * **alpha_value**: Used to extend the context length of a model with a minor loss in quality. I have measured 1.75 to be optimal for 1.5x context, and 2.5 for 2x context. That is, with alpha = 2.5 you can make a model with 4096 context length go to 8192 context length.
 * **rope_freq_base**: Originally another way to write "alpha_value", it ended up becoming a necessary parameter for some models like CodeLlama, which was fine-tuned with this set to 1000000 and hence needs to be loaded with it set to 1000000 as well.
 * **compress_pos_emb**: The first and original context-length extension method, discovered by [kaiokendev](https://kaiokendev.github.io/til). When set to 2, the context length is doubled, 3 and it's tripled, etc. It should only be used for models that have been fine-tuned with this parameter set to different than 1. For models that have not been tuned to have greater context length, alpha_value will lead to a smaller accuracy loss.
-* **cpu**: Loads the model in CPU mode using Pytorch. The model will be loaded in 32-bit precision, so a lot of RAM will be used. CPU inference with transformers is older than llama.cpp and it works, but it's a lot slower. Note: this parameter has a different interpretation in the llama.cpp loader (see below).
+* **cpu**: Loads the model in CPU mode using Pytorch. The model will be loaded in 32-bit precision, so a lot of RAM will be used. CPU inference with transformers is older than llama.cpp and it works, but it's a lot slower.
 * **load-in-8bit**: Load the model in 8-bit precision using bitsandbytes. The 8-bit kernel in that library has been optimized for training and not inference, so load-in-8bit is slower than load-in-4bit (but more accurate).
 * **bf16**: Use bfloat16 precision instead of float16 (the default). Only applies when quantization is not used.
 * **auto-devices**: When checked, the backend will try to guess a reasonable value for "gpu-memory" to allow you to load a model with CPU offloading. I recommend just setting "gpu-memory" manually instead. This parameter is also needed for loading GPTQ models, in which case it needs to be checked before loading the model.

@@ -84,9 +84,7 @@ Example: https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF
 * **n_batch**: Batch size for prompt processing. Higher values are supposed to make generation faster, but I have never obtained any benefit from changing this value.
 * **threads**: Number of threads. Recommended value: your number of physical cores.
 * **threads_batch**: Number of threads for batch processing. Recommended value: your total number of cores (physical + virtual).
-* **tensorcores**: Use llama.cpp compiled with "tensor cores" support, which improves performance on NVIDIA RTX cards in most cases.
 * **streamingllm**: Experimental feature to avoid re-evaluating the entire prompt when part of it is removed, for instance, when you hit the context length for the model in chat mode and an old message is removed.
-* **cpu**: Force a version of llama.cpp compiled without GPU acceleration to be used. Can usually be ignored. Only set this if you want to use CPU only and llama.cpp doesn't work otherwise.
 * **no_mul_mat_q**: Disable the mul_mat_q kernel. This kernel usually improves generation speed significantly. This option to disable it is included in case it doesn't work on some system.
 * **no-mmap**: Loads the model into memory at once, possibly preventing I/O operations later on at the cost of a longer load time.
 * **mlock**: Force the system to keep the model in RAM rather than swapping or compressing (no idea what this means, never used it).

modules/llama_cpp_python_hijack.py

Lines changed: 2 additions & 16 deletions

@@ -1,25 +1,11 @@
 from typing import Sequence
 
+import llama_cpp
 from tqdm import tqdm
 
 from modules import shared
 from modules.cache_utils import process_llamacpp_cache
 
-try:
-    import llama_cpp
-except:
-    llama_cpp = None
-
-try:
-    import llama_cpp_cuda
-except:
-    llama_cpp_cuda = None
-
-try:
-    import llama_cpp_cuda_tensorcores
-except:
-    llama_cpp_cuda_tensorcores = None
-
 
 def eval_with_progress(self, tokens: Sequence[int]):
     """

@@ -81,7 +67,7 @@ def my_generate(self, *args, **kwargs):
     lib.Llama.generate = my_generate
 
 
-for lib in [llama_cpp, llama_cpp_cuda, llama_cpp_cuda_tensorcores]:
+for lib in [llama_cpp]:
     if lib is not None:
         lib.Llama.eval = eval_with_progress
         monkey_patch_generate(lib)
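The try/except chain deleted above implements a common "first importable module wins" pattern: each wheel variant was probed and missing ones were set to `None`. As a sketch (not part of the commit), the same pattern can be written generically with `importlib`; with only the official wheel installed, the probe and a plain `import llama_cpp` resolve to the same module:

```python
import importlib


def first_available(*names):
    """Return the first module in `names` that imports cleanly, else None.

    Mirrors the try/except import chain this commit removes, where
    llama_cpp_cuda_tensorcores, llama_cpp_cuda, and llama_cpp were
    probed in turn and missing variants became None.
    """
    for name in names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    return None


# With a single official wheel there is nothing to probe; this returns
# llama_cpp if it is installed and None otherwise, without raising.
lib = first_available("llama_cpp_cuda_tensorcores", "llama_cpp_cuda", "llama_cpp")
```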

modules/llamacpp_hf.py

Lines changed: 3 additions & 28 deletions

@@ -2,6 +2,7 @@
 from pathlib import Path
 from typing import Any, Dict, Optional, Union
 
+import llama_cpp
 import torch
 from torch.nn import CrossEntropyLoss
 from transformers import GenerationConfig, PretrainedConfig, PreTrainedModel

@@ -10,32 +11,6 @@
 from modules import RoPE, llama_cpp_python_hijack, shared
 from modules.logging_colors import logger
 
-try:
-    import llama_cpp
-except:
-    llama_cpp = None
-
-try:
-    import llama_cpp_cuda
-except:
-    llama_cpp_cuda = None
-
-try:
-    import llama_cpp_cuda_tensorcores
-except:
-    llama_cpp_cuda_tensorcores = None
-
-
-def llama_cpp_lib():
-    if shared.args.cpu and llama_cpp is not None:
-        return llama_cpp
-    elif shared.args.tensorcores and llama_cpp_cuda_tensorcores is not None:
-        return llama_cpp_cuda_tensorcores
-    elif llama_cpp_cuda is not None:
-        return llama_cpp_cuda
-    else:
-        return llama_cpp
-
 
 class LlamacppHF(PreTrainedModel):
     def __init__(self, model, path):

@@ -57,7 +32,7 @@ def __init__(self, model, path):
             'n_tokens': self.model.n_tokens,
             'input_ids': self.model.input_ids.copy(),
             'scores': self.model.scores.copy(),
-            'ctx': llama_cpp_lib().llama_new_context_with_model(model.model, model.context_params)
+            'ctx': llama_cpp.llama_new_context_with_model(model.model, model.context_params)
         }
 
     def _validate_model_class(self):

@@ -220,7 +195,7 @@ def from_pretrained(cls, pretrained_model_name_or_path: Optional[Union[str, os.P
             'split_mode': 1 if not shared.args.row_split else 2
         }
 
-        Llama = llama_cpp_lib().Llama
+        Llama = llama_cpp.Llama
         model = Llama(**params)
 
         return LlamacppHF(model, model_file)
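The deleted `llama_cpp_lib()` helper was a runtime backend selector: honor an explicit `--cpu` request, otherwise fall through to the most capable wheel that imported. A condensed, hypothetical sketch of that dispatch (names and the flag handling are simplified relative to the original, which also checked `--tensorcores`):

```python
from types import SimpleNamespace


def pick_backend(args, backends):
    """Pick a backend from an ordered {name: module-or-None} map.

    Condenses the deleted llama_cpp_lib(): an explicit CPU request wins,
    then the first backend that actually imported, in capability order.
    """
    if args.cpu and backends.get("cpu") is not None:
        return backends["cpu"]
    for name in ("tensorcores", "cuda", "cpu"):
        if backends.get(name) is not None:
            return backends[name]
    return None


# With the single official wheel, every slot but "cpu" is None, so the
# selection collapses to the one module -- which is why the commit can
# replace llama_cpp_lib().Llama with plain llama_cpp.Llama.
args = SimpleNamespace(cpu=False)
chosen = pick_backend(args, {"cpu": "llama_cpp", "tensorcores": None, "cuda": None})
# chosen == "llama_cpp"
```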

modules/llamacpp_model.py

Lines changed: 5 additions & 30 deletions

@@ -1,6 +1,7 @@
 import re
 from functools import partial
 
+import llama_cpp
 import numpy as np
 import torch
 

@@ -9,32 +10,6 @@
 from modules.logging_colors import logger
 from modules.text_generation import get_max_prompt_length
 
-try:
-    import llama_cpp
-except:
-    llama_cpp = None
-
-try:
-    import llama_cpp_cuda
-except:
-    llama_cpp_cuda = None
-
-try:
-    import llama_cpp_cuda_tensorcores
-except:
-    llama_cpp_cuda_tensorcores = None
-
-
-def llama_cpp_lib():
-    if shared.args.cpu and llama_cpp is not None:
-        return llama_cpp
-    elif shared.args.tensorcores and llama_cpp_cuda_tensorcores is not None:
-        return llama_cpp_cuda_tensorcores
-    elif llama_cpp_cuda is not None:
-        return llama_cpp_cuda
-    else:
-        return llama_cpp
-
 
 def ban_eos_logits_processor(eos_token, input_ids, logits):
     logits[eos_token] = -float('inf')

@@ -60,8 +35,8 @@ def __del__(self):
     @classmethod
     def from_pretrained(self, path):
 
-        Llama = llama_cpp_lib().Llama
-        LlamaCache = llama_cpp_lib().LlamaCache
+        Llama = llama_cpp.Llama
+        LlamaCache = llama_cpp.LlamaCache
 
         result = self()
         cache_capacity = 0

@@ -126,12 +101,12 @@ def load_grammar(self, string):
        if string != self.grammar_string:
            self.grammar_string = string
            if string.strip() != '':
-               self.grammar = llama_cpp_lib().LlamaGrammar.from_string(string)
+               self.grammar = llama_cpp.LlamaGrammar.from_string(string)
            else:
                self.grammar = None
 
    def generate(self, prompt, state, callback=None):
-       LogitsProcessorList = llama_cpp_lib().LogitsProcessorList
+       LogitsProcessorList = llama_cpp.LogitsProcessorList
        prompt = prompt if type(prompt) is str else prompt.decode()
 
        # Handle truncation

modules/loaders.py

Lines changed: 0 additions & 4 deletions

@@ -41,11 +41,9 @@
         'alpha_value',
         'rope_freq_base',
         'compress_pos_emb',
-        'cpu',
         'numa',
         'no_offload_kqv',
         'row_split',
-        'tensorcores',
         'streaming_llm',
         'attention_sink_size',
     ],

@@ -62,15 +60,13 @@
         'alpha_value',
         'rope_freq_base',
         'compress_pos_emb',
-        'cpu',
         'numa',
         'cfg_cache',
         'trust_remote_code',
         'no_use_fast',
         'logits_all',
         'no_offload_kqv',
         'row_split',
-        'tensorcores',
         'streaming_llm',
         'attention_sink_size',
         'llamacpp_HF_info',

modules/shared.py

Lines changed: 3 additions & 3 deletions

@@ -113,7 +113,6 @@
 
 # llama.cpp
 group = parser.add_argument_group('llama.cpp')
-group.add_argument('--tensorcores', action='store_true', help='Use llama-cpp-python compiled with tensor cores support. This increases performance on RTX cards. NVIDIA only.')
 group.add_argument('--n_ctx', type=int, default=2048, help='Size of the prompt context.')
 group.add_argument('--threads', type=int, default=0, help='Number of threads to use.')
 group.add_argument('--threads-batch', type=int, default=0, help='Number of threads to use for batches/prompt processing.')

@@ -204,7 +203,8 @@
 group.add_argument('--multimodal-pipeline', type=str, default=None, help='The multimodal pipeline to use. Examples: llava-7b, llava-13b.')
 
 # Deprecated parameters
-# group = parser.add_argument_group('Deprecated')
+group = parser.add_argument_group('Deprecated')
+group.add_argument('--tensorcores', action='store_true', help='DEPRECATED')
 
 args = parser.parse_args()
 args_defaults = parser.parse_args([])

@@ -214,7 +214,7 @@
     if hasattr(args, arg):
         provided_arguments.append(arg)
 
-deprecated_args = []
+deprecated_args = ['tensorcores']
 
 
 def do_cmd_flags_warnings():
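Rather than dropping `--tensorcores` outright, the change above keeps the flag so existing command lines still parse, moves it into a "Deprecated" argument group, and lists it in `deprecated_args` so a warning fires when it is supplied. A minimal, self-contained sketch of that pattern (the parser and `warn_deprecated` helper here are illustrative, not the project's actual code):

```python
import argparse

# Keep the old flag parseable, but file it under "Deprecated".
parser = argparse.ArgumentParser()
group = parser.add_argument_group('Deprecated')
group.add_argument('--tensorcores', action='store_true', help='DEPRECATED')


def warn_deprecated(args, deprecated_args):
    """Print a warning for each deprecated flag actually supplied,
    and return the list of offending flag names."""
    hit = [name for name in deprecated_args if getattr(args, name, False)]
    for name in hit:
        print(f'Warning: --{name} is deprecated and does nothing.')
    return hit


args = parser.parse_args(['--tensorcores'])
warn_deprecated(args, ['tensorcores'])  # warns, since the flag was passed
```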

one_click.py

Lines changed: 5 additions & 30 deletions

@@ -58,32 +58,6 @@ def is_x86_64():
     return platform.machine() == "x86_64"
 
 
-def cpu_has_avx2():
-    try:
-        import cpuinfo
-
-        info = cpuinfo.get_cpu_info()
-        if 'avx2' in info['flags']:
-            return True
-        else:
-            return False
-    except:
-        return True
-
-
-def cpu_has_amx():
-    try:
-        import cpuinfo
-
-        info = cpuinfo.get_cpu_info()
-        if 'amx' in info['flags']:
-            return True
-        else:
-            return False
-    except:
-        return True
-
-
 def torch_version():
     site_packages_path = None
     for sitedir in site.getsitepackages():

@@ -305,7 +279,7 @@ def install_webui():
 
     # Install Git and then Pytorch
     print_big_message("Installing PyTorch.")
-    run_cmd(f"conda install -y -k ninja git && {install_pytorch} && python -m pip install py-cpuinfo==9.0.0", assert_success=True, environment=True)
+    run_cmd(f"conda install -y -k ninja git && {install_pytorch}", assert_success=True, environment=True)
 
     if selected_gpu == "INTEL":
         # Install oneAPI dependencies via conda

@@ -372,13 +346,13 @@ def update_requirements(initial_installation=False, pull=True):
     is_cpu = '+cpu' in torver  # 2.0.1+cpu
 
     if is_rocm:
-        base_requirements = "requirements_amd" + ("_noavx2" if not cpu_has_avx2() else "") + ".txt"
+        base_requirements = "requirements_amd.txt"
     elif is_cpu or is_intel:
-        base_requirements = "requirements_cpu_only" + ("_noavx2" if not cpu_has_avx2() else "") + ".txt"
+        base_requirements = "requirements_cpu_only.txt"
     elif is_macos():
         base_requirements = "requirements_apple_" + ("intel" if is_x86_64() else "silicon") + ".txt"
     else:
-        base_requirements = "requirements" + ("_noavx2" if not cpu_has_avx2() else "") + ".txt"
+        base_requirements = "requirements.txt"
 
     requirements_file = base_requirements
 

@@ -389,6 +363,7 @@ def update_requirements(initial_installation=False, pull=True):
     textgen_requirements = open(requirements_file).read().splitlines()
     if is_cuda118:
         textgen_requirements = [req.replace('+cu121', '+cu118').replace('+cu122', '+cu118') for req in textgen_requirements]
+        textgen_requirements = [req for req in textgen_requirements if '-cu121' not in req]
     if is_windows() and is_cuda118:  # No flash-attention on Windows for CUDA 11
         textgen_requirements = [req for req in textgen_requirements if 'oobabooga/flash-attention' not in req]

requirements.txt

Lines changed: 5 additions & 17 deletions
Original file line numberDiff line numberDiff line change
@@ -33,23 +33,11 @@ flask_cloudflared==0.0.14
3333
sse-starlette==1.6.5
3434
tiktoken
3535

36-
# llama-cpp-python (CPU only, AVX2)
37-
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.61+cpuavx2-cp311-cp311-manylinux_2_31_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
38-
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.61+cpuavx2-cp310-cp310-manylinux_2_31_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
39-
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.61+cpuavx2-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
40-
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.61+cpuavx2-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
41-
42-
# llama-cpp-python (CUDA, no tensor cores)
43-
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.2.61+cu121-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
44-
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.2.61+cu121-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
45-
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.2.61+cu121-cp311-cp311-manylinux_2_31_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
46-
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.2.61+cu121-cp310-cp310-manylinux_2_31_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
47-
48-
# llama-cpp-python (CUDA, tensor cores)
49-
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda_tensorcores-0.2.61+cu121-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
50-
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda_tensorcores-0.2.61+cu121-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
51-
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda_tensorcores-0.2.61+cu121-cp311-cp311-manylinux_2_31_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
52-
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda_tensorcores-0.2.61+cu121-cp310-cp310-manylinux_2_31_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
36+
# llama-cpp-python (CUDA)
37+
https://github.com/abetlen/llama-cpp-python/releases/download/v0.2.64-cu121/llama_cpp_python-0.2.64-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
38+
https://github.com/abetlen/llama-cpp-python/releases/download/v0.2.64-cu121/llama_cpp_python-0.2.64-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
39+
https://github.com/abetlen/llama-cpp-python/releases/download/v0.2.64-cu121/llama_cpp_python-0.2.64-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
40+
https://github.com/abetlen/llama-cpp-python/releases/download/v0.2.64-cu121/llama_cpp_python-0.2.64-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
5341

5442
# CUDA wheels
5543
https://github.com/jllllll/AutoGPTQ/releases/download/v0.6.0/auto_gptq-0.6.0+cu121-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
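Each wheel line above carries a PEP 508 environment marker after the `;`, so pip installs exactly one wheel per OS, architecture, and Python version. As an illustration (my own toy evaluator, far simpler than pip's full PEP 508 grammar), the conjunctive `name == "value"` markers used in this file can be checked like so:

```python
import platform


def marker_matches(requirement_line):
    """Decide whether a requirement line's environment marker applies
    to the current interpreter.

    Handles only the marker shape used in this requirements file:
    conjunctions of `name == "value"` over platform_system,
    platform_machine, and python_version. Real pip implements the
    complete PEP 508 grammar.
    """
    if ';' not in requirement_line:
        return True  # no marker: always applies
    _, marker = requirement_line.split(';', 1)
    env = {
        'platform_system': platform.system(),
        'platform_machine': platform.machine(),
        'python_version': '.'.join(platform.python_version_tuple()[:2]),
    }
    for clause in marker.split(' and '):
        name, _, value = clause.strip().partition(' == ')
        if env.get(name.strip()) != value.strip().strip('"'):
            return False
    return True
```

On any given machine at most one of the four llama-cpp-python lines above matches, which is how a single requirements file serves both Windows and Linux across Python 3.10 and 3.11.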
