Replies: 2 comments
- I've now found the GitHub repo for the llama-cpp-python interface, and I've asked this question there:
- This is an old thread, but I couldn't find the answer anywhere, so just in case someone else could use the info: the syntax for passing the flag to llama-cpp-python from Oobabooga is given in the links here. For completeness: the 3 is the number of experts desired.
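For reference, llama.cpp itself can override GGUF metadata at load time with its `--override-kv` flag, which is the mechanism the linked answer relies on. A sketch follows; the key name `llama.expert_used_count` and the `int:` type prefix are assumptions based on common llama.cpp usage for Mixtral, so verify against your build's `--help` and your model's metadata:

```shell
# Hedged sketch: ask llama.cpp to route each token through 3 experts
# instead of Mixtral's default 2, by overriding the GGUF metadata key
# at load time. Flag spelling and key name are assumptions; check
# ./main --help for your build.
./main -m mixtral-8x7b-instruct.Q8_0.gguf \
    --override-kv llama.expert_used_count=int:3 \
    -p "Hello"
```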
I'm using the ooba webui, and I notice that the ExLlamaV2 model loader has a 'Number of experts per token' option for Mixtral that lets you set a value other than the usual 2.
But that option isn't available when I use the llama.cpp loader (which I need because I'm running an 8-bit GGUF of Mixtral).
I want to see how good a response I can get from Mixtral, so I don't want to switch to a lower-bit quant just to fit the model on my GPU; that would degrade the responses in a different way.
Is there any way to get a higher number of experts while still using a GGUF?
I asked this question on the llama.cpp Discussion tab here:
ggml-org/llama.cpp#5114
and I got this reply:

It's useful info, but I can't see how to apply it in ooba webui's Python interface to llama.cpp.
The closest I got was editing this installed file:
text-generation-webui/installer_files/env/lib/python3.11/site-packages/llama_cpp_cuda_tensorcores/llama.py
Just before this bit:
...there's a variable called self.model_params.kv_overrides of this type:

but my Python skills aren't good enough to know what to do with that. Any ideas? I don't need a full UI to change the number of experts whenever I like; being able to set it to a value other than 2 in code is fine for now. Although it would be nice if a UI control were hooked up later, like the one ExLlamaV2 has.
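In case it helps: llama-cpp-python's `Llama` constructor accepts a `kv_overrides` dict that overrides GGUF metadata keys at load time, which should be a cleaner route than patching the installed `llama.py`. A minimal sketch follows, assuming the metadata key `llama.expert_used_count` controls the expert count for Mixtral (verify the exact key name in your model's metadata):

```python
# Hedged sketch: build constructor kwargs for llama_cpp.Llama that
# request a different number of experts per token via kv_overrides.
# The metadata key "llama.expert_used_count" is an assumption; check
# your GGUF's metadata for the exact key your model uses.

def build_llama_kwargs(model_path: str, n_experts: int = 3) -> dict:
    """Return kwargs for llama_cpp.Llama overriding the expert count."""
    return {
        "model_path": model_path,
        "n_gpu_layers": -1,  # offload all layers; adjust for your VRAM
        "kv_overrides": {"llama.expert_used_count": n_experts},
    }

if __name__ == "__main__":
    # Usage (requires llama-cpp-python and a local GGUF file):
    # from llama_cpp import Llama
    # llm = Llama(**build_llama_kwargs("mixtral-8x7b.Q8_0.gguf", n_experts=3))
    print(build_llama_kwargs("mixtral-8x7b.Q8_0.gguf"))
```

The same dict could be passed wherever ooba's llama.cpp loader constructs its `Llama` object, rather than editing `llama.py` inside `installer_files`.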