
Create parameters overview#1269

Merged
ikawrakow merged 18 commits into ikawrakow:main from mcm007:create_parameters_overview
Feb 20, 2026

Conversation

mcm007 (Contributor) commented Feb 14, 2026

@mcm007 mcm007 mentioned this pull request Feb 14, 2026
@alexisnaveros

Oh, that is nice.

If we could have some detailed documentation somewhere about -ot (--override-tensor), that would be absolutely fantastic. (how to list all the tensors, what they mean, what should be the strategy about putting which ones on GPUs and CPUs, etc.) It can wait though, that's already a very nice step forward.


Ph0rk0z commented Feb 14, 2026

This is starting to look good. I can add that for OT, it's good to explicitly put up/down/gate onto the GPU for speedups. IIRC, up/gate are used for prompt processing and down for text generation. Up/gate shouldn't be on separate GPU devices because it might cause a bit of a deadlock. For models with shared experts, those should end up on GPU, e.g. in the case of GPT-OSS.
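This kind of placement advice translates into `-ot` regex overrides. A minimal Python sketch of the matching idea (the tensor names below are illustrative, written in the usual GGUF MoE naming style, not dumped from a real model):

```python
import re

# Illustrative tensor names in the usual GGUF MoE naming style.
tensors = [
    "blk.0.attn_q.weight",
    "blk.0.ffn_up_exps.weight",    # routed experts: up projection
    "blk.0.ffn_gate_exps.weight",  # routed experts: gate projection
    "blk.0.ffn_down_exps.weight",  # routed experts: down projection
    "blk.0.ffn_up_shexp.weight",   # shared expert: always active
]

# A pattern targeting only the routed-expert up/gate/down tensors,
# leaving the shared expert ("shexp") alone.
pattern = re.compile(r"blk\.\d+\.ffn_(up|gate|down)_exps\.")

print([t for t in tensors if pattern.search(t)])
```

An override built this way would move exactly the routed-expert tensors; the shared-expert tensor does not match, so it stays wherever the default placement put it.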


MrHills-rs commented Feb 14, 2026

> Oh, that is nice.
>
> If we could have some detailed documentation somewhere about -ot (--override-tensor), that would be absolutely fantastic. (how to list all the tensors, what they mean, what should be the strategy about putting which ones on GPUs and CPUs, etc.) It can wait though, that's already a very nice step forward.

If you want to list all tensors, you can just click on any quant of any GGUF on Hugging Face.

https://huggingface.co/ubergarm/MiniMax-M2.5-GGUF?show_file_info=IQ4_XS%2FMiniMax-M2.5-IQ4_XS-00002-of-00004.gguf

(In this particular case the GGUF is split into 4 parts; this is part 2 of 4.)

The -ot value is simply a regex matched against the tensor names you see on that page.

As for the strategy for CPU + GPU: you put anything that says "exps" in your slowest memory, and everything else in your fastest memory (VRAM). Those ffn "exps" are the sparse (routed) expert tensors, the ones that actually get used only 2-5% of the time (depending on the model). If you then have extra VRAM to spare, you start putting some of the exps into VRAM too, because why not in the end.

At least that's what I do.

Some layers (layers are called blk.n in GGUF) are different in some models. For example this one:

https://huggingface.co/unsloth/GLM-5-GGUF?show_file_info=UD-IQ3_XXS%2FGLM-5-UD-IQ3_XXS-00002-of-00008.gguf

The first three layers are different: they don't have exps, they have a dense FFN, so they should all go in VRAM. Dense layers are very good for speeding up mixed-inference systems, as a much larger share of active parameters is fixed, and hence you know which ones to put in faster VRAM.
Also, the layers from the fourth onwards have shared experts ("shexp"); those go to VRAM too, as they are always active.

In general, in a single GPU + CPU system, you just do something like this:

-ngl 999
to put all layers in VRAM by default

-ot "blk.(?:[0-9]|[1-7][0-9]|[8][0-7]).ffn.*_exps.*=CPU"
to create exceptions and put back in RAM anything that has "ffn" and "_exps" in its name and sits in layers called "blk.n", where "n" (the layer number) matches 0 to 9, or 1-7 followed by 0-9 (i.e. a number between 10 and 79), or 8 followed by 0-7 (i.e. a number between 80 and 87).
Basically a complicated way of saying: put all experts from layers 0 to 87 in RAM. Experts from layers 88 to 93 (there are 93 layers in Qwen3-VL 235B) can still sit in VRAM. (That's all I can load on a 5090.)
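The layer-range part of this override can be sanity-checked in Python. (Dots are escaped here for strictness, and the expert part is assumed to be `ffn.*_exps.*`, since Markdown often eats the asterisks; the tensor name used is illustrative.)

```python
import re

# The layer-range override pattern, with dots escaped for strictness.
pattern = re.compile(r"blk\.(?:[0-9]|[1-7][0-9]|[8][0-7])\.ffn.*_exps.*")

# Which layer indices does it capture, against an illustrative tensor name?
matched = [n for n in range(94)
           if pattern.search(f"blk.{n}.ffn_up_exps.weight")]
print(matched[0], matched[-1], len(matched))  # 0 87 88
```

So the alternation captures exactly layers 0 through 87, leaving 88-93 untouched, as described above.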

As for multi-GPU setups.. that's for richer people than me to figure out lol


Ph0rk0z commented Feb 14, 2026

You're making me think I should look at some models with different initial blocks and check what's in there. Maybe I can get another speedup placing them in VRAM over later up/down/gate layers. In some quants the layers aren't uniform, so it can be better to skip larger layers if more smaller blocks will fit without leaving empty space where nothing fits.

README.md Outdated
### Prerequisites

```
apt-get update && apt-get install build-essential git libcurl4-openssl-dev curl libgomp1 cmake
```
Collaborator

This should be generic or reference platforms like Windows, since as it stands it makes it seem like there is no Windows support.

saood06 (Collaborator) commented Feb 14, 2026

> If you want to list all tensors, you can just click on any quant on any GGUF on hugging face.

Or if you already have the quant locally you can just run gguf_dump.py.

@alexisnaveros

> In general, in a single GPU + CPU system, you just do something like this:
>
> -ngl 999 to put all layers in VRAM by default
>
> -ot "blk.(?:[0-9]|[1-7][0-9]|[8][0-7]).ffn.*_exps.*=CPU" to create exceptions and put back in RAM anything that has "ffn" and "_exps" in its name and sits in layers called "blk.n", where "n" (the layer number) matches 0 to 9, or 1-7 followed by 0-9 (a number between 10 and 79), or 8 followed by 0-7 (a number between 80 and 87). Basically a complicated way of saying: put all experts from layers 0 to 87 in RAM. Experts from layers 88 to 93 (there are 93 layers in Qwen3-VL 235B) can still sit in VRAM. (That's all I can load on a 5090.)

@MrHills-rs

See, all of this post is fantastic information (thanks!). And it's going to be lost in the replies to an obscure pull request...

Consolidating the information into some kind of guide somewhere for advanced users would be amazing.

mcm007 (Contributor, Author) commented Feb 15, 2026

Thanks for all replies!

All suggestions were included.


MrHills-rs commented Feb 15, 2026

> Thanks for all replies!
>
> All suggestions were included.

I see that the default --cache-ram-similarity is 0.5. Isn't that a bit problematic?

I'm not sure I understand this well but does that mean that as long as token embeddings are 50% similar they will be reused?

I actually had a problem with MiniMax-M2.5 IQ4_XS where it struggled to understand identities. I wrote "you" in my own message and it kept thinking that I was talking about myself. This happened after a typing error:
- I wrote the wrong word
- I pressed send
- I stopped, realizing my mistake
- I rewrote the message, yet it didn't look like the correction had any effect. Multiple refreshes kept ignoring my correction.

This might or might not have anything to do with it, but after restarting ik_llama.cpp and rebuilding the KV cache from zero the problem went away.

I just realized this reading parameters.md. If I understand the situation right, this might be a trap many will fall into, no?

Am I misunderstanding something?

Yes, I did misunderstand things. Never mind, it's not what --cache-ram-similarity does.

#954

mcm007 (Contributor, Author) commented Feb 15, 2026

> This might or might not have anything to do with it, but after restarting ik_llama.cpp and rebuilding the KV cache from zero the problem went away.

Hey @MrHills-rs, you can disable it with -cram 0 then re-test.

Also, check the logs for entries like:

```
slot get_availabl: id  0 | task -1 | selected slot by LCP similarity, sim_best = 0.300 (> 0.100 thold), f_keep = 0.028
srv  get_availabl: updating prompt cache
srv   prompt_save:  - saving prompt with length 109, total state size = 6.335 MiB
srv          load:  - looking for better prompt, base f_keep = 0.028, sim = 0.300
srv        update:  - cache state: 2 prompts, 11.915 MiB (limits: 4096.000 MiB, 20224 tokens, 70470 est)
```
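The "LCP similarity" in those log lines can be pictured as the longest-common-prefix fraction between a cached token sequence and the incoming prompt. A rough Python sketch of the idea (this is an illustration only, not ik_llama.cpp's actual scoring code):

```python
def lcp_similarity(cached: list[int], new: list[int]) -> float:
    """Longest common prefix of the two token sequences, as a fraction
    of the incoming prompt length (a sketch of the idea only)."""
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n / max(len(new), 1)

# A cached prompt sharing a 3-token prefix with a 10-token incoming prompt:
print(lcp_similarity([1, 2, 3, 9, 9], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]))  # 0.3
```

A slot is only reused when this score clears the configured threshold, which is why a similarity of 0.300 against a 0.100 threshold selects the slot in the log above.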

@mcm007 mcm007 marked this pull request as ready for review February 17, 2026 18:49
```
python3 gguf-py/scripts/gguf_dump.py /models/Qwen_Qwen3-0.6B-IQ4_NL.gguf
```

- `-ngl`, `-ot`, `--cpu-moe`, `--n-cpu-moe N`, `-ooae`
Owner

-ooae is not related to where the model weights get stored. Instead, once we have some MoE tensors (ffn_(up|gate|down)_exps.weight) on the CPU, and during batch processing the scheduler decides to copy them to a GPU to perform the corresponding matrix multiplications, -ooae tells the scheduler to offload only the activated experts. The -ooae option is actually ON by default, and one uses -no-ooae to turn it off.

Offloading only the activated experts is useful for some models, where often the number of activated experts is much smaller than the total number of experts, so -ooae reduces the amount of RAM -> VRAM data transfer. A model where this makes a significant difference for hybrid CPU/GPU inference is GPT-OSS-120B.

For many MoE models and large batches basically all experts are activated, so this option makes no difference (or can even slightly lower performance, because it costs some time to determine which experts are active; if all experts turn out to be active, that time was spent for nothing).
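Why this depends on batch size can be sketched with a back-of-the-envelope simulation: the union of activated experts grows quickly with the number of tokens in a batch. The counts below are illustrative (top-4-of-128 routing, chosen uniformly at random, which real routers are not):

```python
import random

random.seed(0)
NUM_EXPERTS, TOP_K = 128, 4  # illustrative top-4-of-128 routing

def active_fraction(batch_tokens: int) -> float:
    """Fraction of all experts a batch activates at one layer,
    assuming (unrealistically) uniform random routing."""
    active = set()
    for _ in range(batch_tokens):
        active.update(random.sample(range(NUM_EXPERTS), TOP_K))
    return len(active) / NUM_EXPERTS

for n in (1, 8, 64, 512):
    print(f"batch={n:3d}  active experts: {active_fraction(n):.0%}")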


> `-ot "blk.(?:[0-9]|[1-7][0-9]|[8][0-7]).ffn.*_exps.*=CPU"` to create exceptions and put back in RAM anything that has "ffn" and "_exps" in its name and sits in layers called "blk.n", where "n" (the layer number) matches 0 to 9, or 1-7 followed by 0-9 (a number between 10 and 79), or 8 followed by 0-7 (a number between 80 and 87).
> Basically a complicated way of saying: put all experts from layers 0 to 87 in RAM. Experts from layers 88 to 93 (there are 93 layers in Qwen3-VL 235B) can still sit in VRAM. (That's all I can load on a 5090.)

Owner

I think this section would greatly benefit from some practical examples. E.g., take some common GPU configurations and some popular models that don't fit into VRAM, and give specific examples for these configurations of how one should use tensor overrides (or --cpu-moe, --n-cpu-moe) to end up with meaningful VRAM and GPU utilization.

I still see people using -ngl N (with N less than the number of layers) for MoE models, so one needs to hammer home that this is basically never useful for MoE models.

Contributor (Author)

Added a "Common GPU configurations and popular models" section; it still needs to be filled in.


3. Offload less to the GPU. Try to find a mix of parameters that suits your system better than the defaults.

- Use `--no-kv-offload` to keep KV cache on CPU.
Owner

I have never found an actually useful application of --no-kv-offload. Hence, at least in my book, it shouldn't get so much attention. Or, if it does, let's have an example of where it is useful.


```
llama-sweep-bench -m /models/model.gguf -c 12288 -ub 512 -rtr -fa -ctk q8_0 -ctv q8_0
```
Owner

Perhaps mention that llama-sweep-bench understands all parameters that one would use in llama-server or llama-cli (but obviously not all get used, only those that are related to loading the model, setting up the context parameters, and running the benchmark).

Contributor (Author)

Updated.


On Linux, install the required packages:

Owner

Strictly speaking, this is Debian/Ubuntu specific, so perhaps mention that they need to find the corresponding packages in the package manager of their Linux distro.

Contributor (Author)

Made it clearer.

| `--no-warmup` | Skip warming up the model with an empty run | - | |
| `--mlock` | Force system to keep model in RAM rather than swapping or compressing | - | |
| `--no-mmap` | Do not memory-map model (slower load but may reduce pageouts) | - | |
| `-rtr, --run-time-repack` | Repack tensors if interleaved variant is available | - | May improve performance on some systems. [PR 147](https://github.com/ikawrakow/ik_llama.cpp/pull/147) |
Owner

As there are many people who have been using llama.cpp and decided to try ik_llama.cpp, I think it would be very useful to explicitly mark the parameters that are ik_llama.cpp specific. It can also be useful to have a section on llama.cpp parameters that are not available in ik_llama.cpp.

Contributor (Author)

Added "Unique parameters" section, still needs to be populated...

@mcm007 mcm007 marked this pull request as draft February 19, 2026 18:40
- no-ooae
- placeholder for common commands
- no-kv-offload
- llama-sweep-bench
- placeholder for unique parameters
@ikawrakow ikawrakow marked this pull request as ready for review February 20, 2026 06:20
@ikawrakow ikawrakow merged commit b2cb451 into ikawrakow:main Feb 20, 2026

6 participants