I'm running this on a Mac mini (M2 Pro, 16 GB). I used the macOS one-click installer and copied the vicuna-13b-v1.5-16k.Q4_K_M model into the models directory. When I select this model, the llama.cpp loader is selected automatically.
If I set the n-gpu-layers parameter to 0, everything works, but the GPU isn't used.
If I set it to 1 (or any value other than 0), loading the model produces the following:
2023-09-17 18:38:19 INFO:Loading vicuna-13b-v1.5-16k.Q4_K_M.gguf...
2023-09-17 18:38:19 INFO:llama.cpp weights detected: models/vicuna-13b-v1.5-16k.Q4_K_M.gguf
2023-09-17 18:38:19 INFO:Cache capacity is 0 bytes
llama_model_loader: loaded meta data with 20 key-value pairs and 363 tensors from models/vicuna-13b-v1.5-16k.Q4_K_M.gguf (version GGUF V2 (latest))
llama_model_loader: - tensor 0: token_embd.weight q4_K [ 5120, 32000, 1, 1 ]
llama_model_loader: - tensor 1: blk.0.attn_q.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 2: blk.0.attn_k.weight q4_K [ 5120, 5120, 1, 1 ]
...
llama_model_loader: - tensor 362: output.weight q6_K [ 5120, 32000, 1, 1 ]
llama_model_loader: - kv 0: general.architecture str
llama_model_loader: - kv 1: general.name str
llama_model_loader: - kv 2: llama.context_length u32
llama_model_loader: - kv 3: llama.embedding_length u32
llama_model_loader: - kv 4: llama.block_count u32
llama_model_loader: - kv 5: llama.feed_forward_length u32
llama_model_loader: - kv 6: llama.rope.dimension_count u32
llama_model_loader: - kv 7: llama.attention.head_count u32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32
llama_model_loader: - kv 10: llama.rope.scale_linear f32
llama_model_loader: - kv 11: general.file_type u32
llama_model_loader: - kv 12: tokenizer.ggml.model str
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr
llama_model_loader: - kv 14: tokenizer.ggml.scores arr
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32
llama_model_loader: - kv 19: general.quantization_version u32
llama_model_loader: - type f32: 81 tensors
llama_model_loader: - type q4_K: 241 tensors
llama_model_loader: - type q6_K: 41 tensors
llm_load_print_meta: format = GGUF V2 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 16384
llm_load_print_meta: n_ctx = 1048
llm_load_print_meta: n_embd = 5120
llm_load_print_meta: n_head = 40
llm_load_print_meta: n_head_kv = 40
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: f_norm_eps = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: n_ff = 13824
llm_load_print_meta: freq_base = 10000.0
llm_load_print_meta: freq_scale = 0.25
llm_load_print_meta: model type = 13B
llm_load_print_meta: model ftype = mostly Q4_K - Medium
llm_load_print_meta: model size = 13.02 B
llm_load_print_meta: general.name = lmsys_vicuna-13b-v1.5-16k
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.12 MB
llm_load_tensors: mem required = 7500.97 MB (+ 818.75 MB per state)
...................................................................................................
llama_new_context_with_model: kv self size = 818.75 MB
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2 Pro
ggml_metal_init: picking default device: Apple M2 Pro
ggml_metal_init: loading '(null)'
ggml_metal_init: error: Error Domain=NSCocoaErrorDomain Code=258 "The file name is invalid."
llama_new_context_with_model: ggml_metal_init() failed
2023-09-17 18:38:19 ERROR:Failed to load the model.
Traceback (most recent call last):
  File "/Users/jhandl/oobabooga_macos/text-generation-webui/modules/ui_model_menu.py", line 194, in load_model_wrapper
    shared.model, shared.tokenizer = load_model(shared.model_name, loader)
  File "/Users/jhandl/oobabooga_macos/text-generation-webui/modules/models.py", line 77, in load_model
    output = load_func_map[loader](model_name)
  File "/Users/jhandl/oobabooga_macos/text-generation-webui/modules/models.py", line 245, in llamacpp_loader
    model, tokenizer = LlamaCppModel.from_pretrained(model_file)
  File "/Users/jhandl/oobabooga_macos/text-generation-webui/modules/llamacpp_model.py", line 87, in from_pretrained
    result.model = Llama(**params)
  File "/Users/jhandl/oobabooga_macos/installer_files/env/lib/python3.10/site-packages/llama_cpp/llama.py", line 334, in __init__
    assert self.ctx is not None
AssertionError
Exception ignored in: <function LlamaCppModel.__del__ at 0x157ee39a0>
Traceback (most recent call last):
  File "/Users/jhandl/oobabooga_macos/text-generation-webui/modules/llamacpp_model.py", line 46, in __del__
    self.model.__del__()
AttributeError: 'LlamaCppModel' object has no attribute 'model'
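
The traceback ends inside llama_cpp's `Llama.__init__`, so one way to narrow this down is to try the same load directly from the installer's Python environment, without the web UI. The following is only a minimal sketch under assumptions: it assumes the `llama_cpp` package shown in the traceback is importable, that the script is run from the text-generation-webui directory so the relative model path from the log resolves, and `try_load` and `main` are hypothetical helper names. It does not fix anything; it only shows whether n_gpu_layers=0 and n_gpu_layers=1 behave differently outside the UI.

```python
from pathlib import Path

# Relative path taken from the log above; assumes the script is run from
# the text-generation-webui directory.
MODEL_PATH = Path("models/vicuna-13b-v1.5-16k.Q4_K_M.gguf")


def try_load(model_path, n_gpu_layers):
    """Hypothetical helper: return True if the model loads, False otherwise."""
    # Imported lazily so a missing model file can be reported even when
    # llama-cpp-python is not installed in the current environment.
    from llama_cpp import Llama

    try:
        Llama(model_path=str(model_path), n_gpu_layers=n_gpu_layers)
    except (AssertionError, ValueError) as exc:
        # The traceback above shows an AssertionError; also catching
        # ValueError is an assumption about other llama-cpp-python versions.
        print(f"n_gpu_layers={n_gpu_layers}: load failed ({type(exc).__name__})")
        return False
    print(f"n_gpu_layers={n_gpu_layers}: load succeeded")
    return True


def main(model_path=MODEL_PATH):
    """Try a CPU-only load and a one-layer GPU load; return the results."""
    if not Path(model_path).is_file():
        print(f"model file not found: {model_path}")
        return None
    # 0 keeps everything on the CPU; 1 is the smallest non-zero value,
    # which the report says already fails.
    return {layers: try_load(model_path, layers) for layers in (0, 1)}


if __name__ == "__main__":
    main()
```

If the one-layer load fails here as well, with the same `ggml_metal_init: loading '(null)'` line, the problem would appear to sit in the llama-cpp-python build rather than in the web UI code.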