README.md: 5 additions & 9 deletions
@@ -107,14 +107,11 @@ pip install -r <requirements file according to table below>
 
 Requirements file to use:
 
-| GPU | CPU | requirements file to use |
+| GPU | requirements file to use |
 |--------|---------|---------|
-| NVIDIA | has AVX2 | `requirements.txt` |
-| NVIDIA | no AVX2 | `requirements_noavx2.txt` |
-| AMD | has AVX2 | `requirements_amd.txt` |
-| AMD | no AVX2 | `requirements_amd_noavx2.txt` |
-| CPU only | has AVX2 | `requirements_cpu_only.txt` |
-| CPU only | no AVX2 | `requirements_cpu_only_noavx2.txt` |
+| NVIDIA | `requirements.txt` |
+| AMD | `requirements_amd.txt` |
+| CPU only | `requirements_cpu_only.txt` |
 | Apple | Intel | `requirements_apple_intel.txt` |
 | Apple | Apple Silicon | `requirements_apple_silicon.txt` |
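The hunk above drops the AVX2 / no-AVX2 split from the requirements table. If you still want to check whether your CPU reports AVX2 support, a minimal sketch (assuming Linux, where `/proc/cpuinfo` lists CPU flags; other platforms need a different probe):

```python
# Sketch: detect AVX2 support by reading /proc/cpuinfo (Linux only).
def has_avx2(cpuinfo_path="/proc/cpuinfo"):
    try:
        with open(cpuinfo_path) as f:
            return "avx2" in f.read()
    except OSError:
        # File missing (non-Linux system): report no AVX2 detected.
        return False

print("AVX2 supported" if has_avx2() else "no AVX2 detected")
```

On macOS the equivalent information comes from `sysctl` rather than `/proc/cpuinfo`.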
@@ -132,7 +129,7 @@ Then browse to
 
 ##### AMD GPU on Windows
 
-1) Use `requirements_cpu_only.txt` or `requirements_cpu_only_noavx2.txt` in the command above.
+1) Use `requirements_cpu_only.txt` in the command above.
 
 2) Manually install llama-cpp-python using the appropriate command for your hardware: [Installation from PyPI](https://github.com/abetlen/llama-cpp-python#installation-with-hardware-acceleration).
   * Use the `LLAMA_HIPBLAS=on` toggle.
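A sketch of how the `LLAMA_HIPBLAS=on` toggle from step 2 might be applied when building the install command. The assumption here (taken from llama-cpp-python's own installation docs, not from this README) is that the flag is passed to the build via the `CMAKE_ARGS` environment variable:

```python
# Sketch: construct a pip install of llama-cpp-python with ROCm support.
# Assumption: CMAKE_ARGS="-DLLAMA_HIPBLAS=on" is how the toggle is passed.
import os

env = dict(os.environ, CMAKE_ARGS="-DLLAMA_HIPBLAS=on")
cmd = ["pip", "install", "llama-cpp-python", "--no-cache-dir"]

# To actually run the install:
# import subprocess; subprocess.run(cmd, env=env, check=True)
print(env["CMAKE_ARGS"], " ".join(cmd))
```

The exact CMake flag may differ between llama-cpp-python versions, so check the linked installation page for your release.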
@@ -255,7 +252,6 @@ List of command-line flags
 
 | Flag | Description |
 |-------------|-------------|
-| `--tensorcores` | Use llama-cpp-python compiled with tensor cores support. This increases performance on RTX cards. NVIDIA only. |
 | `--n_ctx N_CTX` | Size of the prompt context. |
 | `--threads` | Number of threads to use. |
 | `--threads-batch THREADS_BATCH` | Number of threads to use for batches/prompt processing. |
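The remaining flags in the table above could be mapped to a command-line parser roughly as follows. This is a hypothetical sketch, not the project's real parser, and the default values shown are illustrative assumptions:

```python
# Sketch: an argparse mapping of the llama.cpp flags listed above.
# Flag names come from the table; defaults are illustrative guesses.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--n_ctx", type=int, default=2048,
                    help="Size of the prompt context.")
parser.add_argument("--threads", type=int, default=0,
                    help="Number of threads to use.")
parser.add_argument("--threads-batch", type=int, default=0,
                    help="Number of threads for batches/prompt processing.")

args = parser.parse_args(["--n_ctx", "4096", "--threads", "8"])
print(args.n_ctx, args.threads, args.threads_batch)
```

Note that argparse converts the hyphen in `--threads-batch` to an underscore, so the value is read as `args.threads_batch`.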
docs/04 - Model Tab.md: 1 addition & 3 deletions
@@ -21,7 +21,7 @@ Options:
 * **alpha_value**: Used to extend the context length of a model with a minor loss in quality. I have measured 1.75 to be optimal for 1.5x context, and 2.5 for 2x context. That is, with alpha = 2.5 you can make a model with 4096 context length go to 8192 context length.
 * **rope_freq_base**: Originally another way to write "alpha_value", it ended up becoming a necessary parameter for some models like CodeLlama, which was fine-tuned with this set to 1000000 and hence needs to be loaded with it set to 1000000 as well.
 * **compress_pos_emb**: The first and original context-length extension method, discovered by [kaiokendev](https://kaiokendev.github.io/til). When set to 2, the context length is doubled; when set to 3, it is tripled; and so on. It should only be used for models that have been fine-tuned with this parameter set to a value different from 1. For models that have not been tuned to have greater context length, alpha_value will lead to a smaller accuracy loss.
-* **cpu**: Loads the model in CPU mode using PyTorch. The model will be loaded in 32-bit precision, so a lot of RAM will be used. CPU inference with transformers is older than llama.cpp and it works, but it's a lot slower. Note: this parameter has a different interpretation in the llama.cpp loader (see below).
+* **cpu**: Loads the model in CPU mode using PyTorch. The model will be loaded in 32-bit precision, so a lot of RAM will be used. CPU inference with transformers is older than llama.cpp and it works, but it's a lot slower.
 * **load-in-8bit**: Load the model in 8-bit precision using bitsandbytes. The 8-bit kernel in that library has been optimized for training and not inference, so load-in-8bit is slower than load-in-4bit (but more accurate).
 * **bf16**: Use bfloat16 precision instead of float16 (the default). Only applies when quantization is not used.
 * **auto-devices**: When checked, the backend will try to guess a reasonable value for "gpu-memory" to allow you to load a model with CPU offloading. I recommend just setting "gpu-memory" manually instead. This parameter is also needed for loading GPTQ models, in which case it needs to be checked before loading the model.
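The relationship between alpha_value and rope_freq_base described in the options above can be sketched numerically. The `base * alpha ** (64/63)` mapping used here is an assumption borrowed from common NTK-aware RoPE scaling code, not necessarily this project's exact formula:

```python
# Sketch: derive a RoPE frequency base from alpha_value.
# Assumption: the common NTK-aware mapping base * alpha ** (64/63).
def rope_freq_base(alpha_value: float, base: float = 10000.0) -> float:
    return base * alpha_value ** (64 / 63)

print(rope_freq_base(1.0))   # alpha = 1 leaves the base unchanged
print(rope_freq_base(2.5))   # roughly 25000-26000, for ~2x context
```

This is consistent with the text above: alpha = 1 corresponds to the stock base of 10000, and larger alpha values raise the frequency base to stretch the usable context.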
@@ -84,9 +84,7 @@
 * **n_batch**: Batch size for prompt processing. Higher values are supposed to make generation faster, but I have never obtained any benefit from changing this value.
 * **threads**: Number of threads. Recommended value: your number of physical cores.
 * **threads_batch**: Number of threads for batch processing. Recommended value: your total number of cores (physical + virtual).
-* **tensorcores**: Use llama.cpp compiled with "tensor cores" support, which improves performance on NVIDIA RTX cards in most cases.
 * **streamingllm**: Experimental feature to avoid re-evaluating the entire prompt when part of it is removed, for instance, when you hit the context length for the model in chat mode and an old message is removed.
-* **cpu**: Force a version of llama.cpp compiled without GPU acceleration to be used. Can usually be ignored. Only set this if you want to use CPU only and llama.cpp doesn't work otherwise.
 * **no_mul_mat_q**: Disable the mul_mat_q kernel. This kernel usually improves generation speed significantly. The option to disable it is included in case the kernel doesn't work on some systems.
 * **no-mmap**: Loads the model into memory at once, possibly preventing I/O operations later on at the cost of a longer load time.
 * **mlock**: Force the system to keep the model in RAM rather than swapping or compressing (I have never used this option myself).
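The threads and threads_batch recommendations above can be sketched with the standard library. `os.cpu_count()` returns logical cores (physical + virtual); halving it to estimate physical cores assumes 2-way SMT, which is an assumption, not a detected fact:

```python
# Sketch: thread counts following the recommendations above.
# os.cpu_count() gives logical cores; dividing by 2 assumes 2-way SMT.
import os

logical = os.cpu_count() or 1
threads_batch = logical            # total cores (physical + virtual)
threads = max(1, logical // 2)     # rough estimate of physical cores

print(f"threads={threads} threads_batch={threads_batch}")
```

A library that detects physical cores directly (such as psutil) would avoid the SMT guess, at the cost of a third-party dependency.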