Replies: 2 comments
- I've now found the GitHub repo for the llama-cpp-python interface, and I've asked this question there:
- This is an old thread, but I couldn't find the answer anywhere, so just in case someone else could use the info: the syntax for passing the flag to llama-cpp-python from Oobabooga is given in the links here. For completeness: the 3 is the number of experts desired.
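For reference, llama.cpp itself can override GGUF metadata at load time with its `--override-kv` flag, which is the mechanism the linked answer relies on. A sketch follows; the key name `llama.expert_used_count` and the `int:` type prefix are assumptions based on common llama.cpp usage for Mixtral, so verify against your build's `--help` and your model's metadata:

```shell
# Hedged sketch: ask llama.cpp to route each token through 3 experts
# instead of Mixtral's default 2, by overriding the GGUF metadata key
# at load time. Flag spelling and key name are assumptions; check
# ./main --help for your build.
./main -m mixtral-8x7b-instruct.Q8_0.gguf \
    --override-kv llama.expert_used_count=int:3 \
    -p "Hello"
```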
I'm using the ooba webui, and I notice that the ExLlamaV2 model loader has a 'Number of experts per token' option for Mixtral that lets you set a value other than the usual 2.
But that option isn't available when I use the llama.cpp loader (which I need because I'm running an 8-bit GGUF of Mixtral).
I want to see how good a response I can get from Mixtral, so I don't want to switch to a lower-bit quant just to fit the model on my GPU; that would degrade the responses in a different way.
Is there any way to get a higher number of experts while still using a GGUF?
I asked this question on the llama.cpp Discussion tab here:
ggml-org/llama.cpp#5114
and I got this reply:

It's useful info, but I can't see how to apply it in ooba webui's Python interface to llama.cpp.
The closest I got was editing this installed file:
text-generation-webui/installer_files/env/lib/python3.11/site-packages/llama_cpp_cuda_tensorcores/llama.py
Just before this bit:
...there's a variable called self.model_params.kv_overrides of this type:

but my Python skills aren't good enough to know what to do with that. Any ideas? I don't need a full UI to change the number of experts whenever I like; being able to set it to a value other than 2 in code is fine for now. Although it would be nice if a UI control were hooked up later, like the one ExLlamaV2 has.
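In case it helps: llama-cpp-python's `Llama` constructor accepts a `kv_overrides` dict that overrides GGUF metadata keys at load time, which should be a cleaner route than patching the installed `llama.py`. A minimal sketch follows, assuming the metadata key `llama.expert_used_count` controls the expert count for Mixtral (verify the exact key name in your model's metadata):

```python
# Hedged sketch: build constructor kwargs for llama_cpp.Llama that
# request a different number of experts per token via kv_overrides.
# The metadata key "llama.expert_used_count" is an assumption; check
# your GGUF's metadata for the exact key your model uses.

def build_llama_kwargs(model_path: str, n_experts: int = 3) -> dict:
    """Return kwargs for llama_cpp.Llama overriding the expert count."""
    return {
        "model_path": model_path,
        "n_gpu_layers": -1,  # offload all layers; adjust for your VRAM
        "kv_overrides": {"llama.expert_used_count": n_experts},
    }

if __name__ == "__main__":
    # Usage (requires llama-cpp-python and a local GGUF file):
    # from llama_cpp import Llama
    # llm = Llama(**build_llama_kwargs("mixtral-8x7b.Q8_0.gguf", n_experts=3))
    print(build_llama_kwargs("mixtral-8x7b.Q8_0.gguf"))
```

The same dict could be passed wherever ooba's llama.cpp loader constructs its `Llama` object, rather than editing `llama.py` inside `installer_files`.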