Create parameters overview#1269
Conversation
mcm007
commented
Feb 14, 2026
- I have read the contributing guidelines
- Self-reported review complexity:
  - Low
  - Medium
  - High
- format as table - sections
- quickstart - build and run
- other tools examples
Oh, that is nice. If we could have some detailed documentation somewhere about -ot (--override-tensor), that would be absolutely fantastic (how to list all the tensors, what they mean, what the strategy should be for putting which ones on GPUs and CPUs, etc.). It can wait though, that's already a very nice step forward.
This is starting to look good. I can add that for OT, it's good to explicitly put up/down/gate onto the GPU for speedups. IIRC, up/gate matter for prompt processing and down for text generation. Up and gate shouldn't be on separate GPU devices because it might cause a bit of a deadlock. For models with shared experts, those should end up on the GPU, e.g. in the case of GPT-OSS.
If you want to list all tensors, you can just click on any quant of any GGUF on Hugging Face. (In this particular case the GGUF is split in 4; this is part 2 of 4.) The -ot value is simply a regex matched against the tensor names you see on that page.

As for the strategy for CPU + GPU: you put anything that says "exps" in your slowest memory, and anything else in your fastest memory (VRAM). Those ffn "exps" are the sparse expert tensors, the ones that actually get used only 2-5% of the time (depending on the model). If you then have extra VRAM to spare, you start putting some of the exps into VRAM too, because why not in the end. At least that's what I do.

Some layers (layers are called blk.n in GGUF) are different in some models. For example, in this one the first three layers are different: they don't have exps, they have dense ffn, so they should all go in VRAM. Dense layers are very good for speeding up mixed inference systems, as a much larger share of the active parameters is fixed, and hence you know which to put in faster VRAM.

In general, on a single GPU + CPU system, you just do something like this: `-ngl 999 -ot "blk.(?:[0-9]|[1-7][0-9]|[8][0-7]).ffn.*_exps.*=CPU"`

As for multi-GPU... that's for richer people than me to figure out, lol.
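Since the tensor-override value is just a regex matched against GGUF tensor names, it can be sanity-checked before loading a model. A minimal sketch, with the `.*` wildcards written out explicitly (markdown rendering sometimes swallows bare asterisks) and a few typical MoE tensor names, not dumped from any specific model:

```python
import re

# Regex part of the -ot value discussed above: experts of layers 0-87 go to CPU.
pattern = re.compile(r"blk.(?:[0-9]|[1-7][0-9]|[8][0-7]).ffn.*_exps.*")

names = [
    "blk.0.ffn_up_exps.weight",     # layer 0 expert tensor  -> matches (CPU)
    "blk.87.ffn_down_exps.weight",  # layer 87 expert tensor -> matches (CPU)
    "blk.88.ffn_gate_exps.weight",  # layer 88 -> no match, stays on GPU
    "blk.5.attn_q.weight",          # attention tensor -> no match, stays on GPU
]
for name in names:
    print(name, "-> CPU" if pattern.search(name) else "-> GPU")
```

Note that without the `.*` wildcards the pattern would fail to match `ffn_up_exps` at all, which is one reason to test the regex rather than trust a rendered forum post.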
You're making me think I should look at some models with different initial blocks and check what's in there. Maybe I can get another speedup by placing them in VRAM over later up/down/gate layers. In some quants the layers aren't uniform, so it can be better to skip larger layers if more smaller blocks will fit without leaving empty space where nothing fits.
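The point about non-uniform layer sizes can be illustrated with a toy packing sketch (sizes in GiB are hypothetical, not taken from any real quant):

```python
def first_fit(sizes, budget):
    """Walk layers front to back, keeping any layer that still fits."""
    used, chosen = 0.0, []
    for i, s in enumerate(sizes):
        if used + s <= budget:
            chosen.append(i)
            used += s
    return chosen, used

def smallest_first(sizes, budget):
    """Prefer small layers so less of the budget is left as an unusable gap."""
    used, chosen = 0.0, []
    for i, s in sorted(enumerate(sizes), key=lambda t: t[1]):
        if used + s <= budget:
            chosen.append(i)
            used += s
    return sorted(chosen), used

# Layer 0 is an outsized block; with a 4 GiB budget, skipping it
# lets three smaller layers fit and uses more of the budget overall.
sizes = [3.0, 1.2, 1.2, 1.2, 1.2]
print(first_fit(sizes, 4.0))       # keeps layer 0 only, ~3.0 GiB used
print(smallest_first(sizes, 4.0))  # keeps three 1.2 GiB layers, ~3.6 GiB used
```

Real placement also has to respect which tensors belong together, so this is only the budget-filling intuition, not a drop-in strategy.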
README.md
Outdated

### Prerequisites

```
apt-get update && apt-get install build-essential git libcurl4-openssl-dev curl libgomp1 cmake
```
This should be generic or reference platforms like Windows, since as it stands it makes it seem like there is no Windows support.
Or, if you already have the quant locally, you can just run gguf_dump.py.
See, all of this post is fantastic information (thanks!). And it's going to be lost in the replies of an obscure pull request... Consolidating the information into some kind of guide somewhere for advanced users would be amazing.
- description
- add jargon section
- add suggestions from feedback
Thanks for all the replies! All suggestions were included.
Yes, I did misunderstand things. Never mind, it's not what -cache-ram-similarity does.
Hey @MrHills-rs, you can disable it with […]. Also, in the logs, look for entries like: […]
docs/parameters.md
Outdated

```
python3 gguf-py/scripts/gguf_dump.py /models/Qwen_Qwen3-0.6B-IQ4_NL.gguf
```

- `-ngl`, `-ot`, `--cpu-moe`, `--n-cpu-moe N`, `-ooae`
-ooae is not related to where the model weights get stored. Instead, once some MoE tensors (`ffn_(up|gate|down)_exps.weight`) are on the CPU and, during batch processing, the scheduler decides to copy them to a GPU to perform the corresponding matrix multiplications, -ooae tells the scheduler to offload only the activated experts. The -ooae option is actually ON by default, and one uses -no-ooae to turn it off.

Offloading only the activated experts is useful for some models, where often the number of activated experts is much smaller than the total number of experts, so -ooae reduces the amount of RAM -> VRAM data transfer. A model where this makes a significant difference for hybrid CPU/GPU inference is GPT-OSS-120B. For many MoE models and large batches, basically all experts are activated, so this option makes no difference (or can even slightly lower performance, because it costs some time to determine which experts are active; if all experts turn out to be active, that time was spent for nothing).
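The batch-size effect described above can be estimated with a back-of-the-envelope model: if each token independently activated k of E experts uniformly at random, the expected fraction of experts touched by a batch of B tokens would be 1 - (1 - k/E)^B. This is an idealization (real routing is neither uniform nor independent), and E=128, k=4 is only roughly GPT-OSS-120B-like:

```python
def expected_active_fraction(n_experts: int, top_k: int, batch: int) -> float:
    """Expected fraction of experts activated by at least one token in a batch,
    under idealized uniform, independent routing."""
    return 1.0 - (1.0 - top_k / n_experts) ** batch

E, k = 128, 4  # roughly GPT-OSS-120B-like: 128 experts, 4 active per token
for batch in (1, 8, 64, 512):
    frac = expected_active_fraction(E, k, batch)
    print(f"batch={batch:4d}: ~{frac:.1%} of experts need RAM->VRAM transfer")
```

For batch=1 only k/E of the experts move, which is why -ooae helps token generation; as the batch grows, the fraction approaches 100% and the option stops mattering.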
`-ot "blk.(?:[0-9]|[1-7][0-9]|[8][0-7]).ffn.*_exps.*=CPU"` creates exceptions that put back into RAM anything that has "ffn" and "_exps" in its name and sits in layers called "blk.n", where "n" (the layer number) is any match between 0 and 9, or 1 to 7 followed by 0 to 9 (i.e. a number between 10 and 79), or 8 followed by 0 to 7 (i.e. a number between 80 and 87).

Basically a complicated way of saying: put all experts from layers 0 to 87 in RAM. Experts from layers 88 to 93 (there are 93 layers in Qwen3-VL 235B) can still sit in VRAM. (That's all I can load on a 5090.)
I think this section would greatly benefit from some practical examples. E.g., take some common GPU configurations and some popular models that don't fit into VRAM, and give specific examples of how one should use tensor overrides (or --cpu-moe, --n-cpu-moe) for these configurations to end up with meaningful VRAM and GPU utilization.
I still see people using -ngl N (with N less than the number of layers) for MoE models, so one needs to hammer home that this is basically never useful for MoE models.
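A hedged sketch of what such a practical example might look like (the model path, VRAM assumptions, and layer split are hypothetical; the flags are the ones this PR documents):

```
# MoE model larger than VRAM, single GPU + CPU.
# Offload ALL layers, then keep every expert tensor on the CPU:
llama-server -m /models/moe-model.gguf -ngl 999 --cpu-moe

# Spare VRAM left over? Keep only the first 80 layers' experts on the CPU,
# so the remaining layers' experts live in VRAM:
llama-server -m /models/moe-model.gguf -ngl 999 --n-cpu-moe 80

# Usually NOT what you want for MoE models: partial layer offload.
llama-server -m /models/moe-model.gguf -ngl 40
```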
Added "Common GPU configurations and popular models"; it still needs to be filled in.
docs/parameters.md
Outdated

3. Offload less to the GPU. Try to find a mix of parameters that suits your system better than the defaults.

- Use `--no-kv-offload` to keep the KV cache on the CPU.
I have never found an actually useful application of --no-kv-offload. Hence, at least in my book, it shouldn't get so much attention. Or, if it does, let's have an example of where this is useful.
```
llama-sweep-bench -m /models/model.gguf -c 12288 -ub 512 -rtr -fa -ctk q8_0 -ctv q8_0
```
Perhaps mention that llama-sweep-bench understands all parameters that one would use in llama-server or llama-cli (but obviously not all get used, only those related to loading the model, setting up the context parameters, and running the benchmark).
On Linux, install the required packages:
Strictly speaking, this is Debian/Ubuntu specific, so perhaps mention that they need to find the corresponding packages in the package manager of their Linux distro.
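For instance, a rough mapping for other package managers might look like this (package names are from memory and should be double-checked against the distro's repositories):

```
# Debian/Ubuntu (as in the README):
apt-get install build-essential git libcurl4-openssl-dev curl libgomp1 cmake

# Fedora (approximate equivalents):
dnf install gcc-c++ git libcurl-devel curl libgomp cmake

# Arch (approximate equivalents; base-devel covers the compiler toolchain):
pacman -S base-devel git curl cmake
```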
| `--no-warmup` | Skip warming up the model with an empty run | - | |
| `--mlock` | Force the system to keep the model in RAM rather than swapping or compressing | - | |
| `--no-mmap` | Do not memory-map the model (slower load but may reduce pageouts) | - | |
| `-rtr, --run-time-repack` | Repack tensors if an interleaved variant is available | - | May improve performance on some systems. [PR 147](https://github.com/ikawrakow/ik_llama.cpp/pull/147) |
As there are many people who have been using llama.cpp and decided to try ik_llama.cpp, I think it would be very useful to explicitly mark the parameters that are ik_llama.cpp specific. It can also be useful to have a section on llama.cpp parameters that are not available in ik_llama.cpp.
Added "Unique parameters" section, still needs to be populated...
- no-ooae
- placeholder for common commands
- no-kv-offload
- llama-sweep-bench
- placeholder for unique parameters