
Create parameters overview#1269

Merged
ikawrakow merged 18 commits into ikawrakow:main from mcm007:create_parameters_overview
Feb 20, 2026

Conversation

mcm007 (Contributor) commented Feb 14, 2026

@mcm007 mcm007 mentioned this pull request Feb 14, 2026
@alexisnaveros

Oh, that is nice.

If we could have some detailed documentation somewhere about -ot (--override-tensor), that would be absolutely fantastic. (how to list all the tensors, what they mean, what should be the strategy about putting which ones on GPUs and CPUs, etc.) It can wait though, that's already a very nice step forward.


Ph0rk0z commented Feb 14, 2026

This is starting to look good. I can add that for OT, it's good to explicitly put up/down/gate onto the GPU for speedups. IIRC, up/gate are used for prompt processing and down for text generation. Up/gate shouldn't be on separate GPU devices because it might cause a bit of a deadlock. For models with shared experts, those should end up on GPU, e.g. in the case of GPT-OSS.
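This kind of placement advice translates into `-ot` regex overrides. A minimal Python sketch of the matching idea (the tensor names below are illustrative, written in the usual GGUF MoE naming style, not dumped from a real model):

```python
import re

# Illustrative tensor names in the usual GGUF MoE naming style.
tensors = [
    "blk.0.attn_q.weight",
    "blk.0.ffn_up_exps.weight",    # routed experts: up projection
    "blk.0.ffn_gate_exps.weight",  # routed experts: gate projection
    "blk.0.ffn_down_exps.weight",  # routed experts: down projection
    "blk.0.ffn_up_shexp.weight",   # shared expert: always active
]

# A pattern targeting only the routed-expert up/gate/down tensors,
# leaving the shared expert ("shexp") alone.
pattern = re.compile(r"blk\.\d+\.ffn_(up|gate|down)_exps\.")

print([t for t in tensors if pattern.search(t)])
```

An override built this way would move exactly the routed-expert tensors; the shared-expert tensor does not match, so it stays wherever the default placement put it.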


MrHills-rs commented Feb 14, 2026

> Oh, that is nice.
>
> If we could have some detailed documentation somewhere about -ot (--override-tensor), that would be absolutely fantastic. (how to list all the tensors, what they mean, what should be the strategy about putting which ones on GPUs and CPUs, etc.) It can wait though, that's already a very nice step forward.

If you want to list all tensors, you can just click on any quant of any GGUF on Hugging Face.

https://huggingface.co/ubergarm/MiniMax-M2.5-GGUF?show_file_info=IQ4_XS%2FMiniMax-M2.5-IQ4_XS-00002-of-00004.gguf

(In this particular case the GGUF is split into 4 parts; this is part 2 of 4.)

The -ot value is simply a regex matched against the tensor names you see on that page.

As for the strategy for CPU + GPU: you put anything that says "exps" in your slowest memory, and everything else in your fastest memory (VRAM). Those ffn "exps" are the sparse (routed) expert tensors, the ones that actually get used only 2-5% of the time (depending on the model). If you then have extra VRAM to spare, you start putting some of the exps into VRAM too, because why not in the end.

At least that's what I do.

Some layers (layers are called blk.n in GGUF) are different in some models. For example this one:

https://huggingface.co/unsloth/GLM-5-GGUF?show_file_info=UD-IQ3_XXS%2FGLM-5-UD-IQ3_XXS-00002-of-00008.gguf

The first three layers are different: they don't have exps, they have a dense FFN, so they should all go in VRAM. Dense layers are very good for speeding up mixed-inference systems, as a much larger share of active parameters is fixed, and hence you know which ones to put in faster VRAM.
Also, the layers from the fourth onwards have shared experts ("shexp"); those go to VRAM too, as they are always active.

In general, in a single GPU + CPU system, you just do something like this:

-ngl 999
to put all layers in VRAM by default

-ot "blk.(?:[0-9]|[1-7][0-9]|[8][0-7]).ffn.*_exps.*=CPU"
to create exceptions and put back in RAM anything that has "ffn" and "_exps" in its name and sits in layers called "blk.n", where "n" (the layer number) matches 0 to 9, or 1-7 followed by 0-9 (i.e. a number between 10 and 79), or 8 followed by 0-7 (i.e. a number between 80 and 87).
Basically a complicated way of saying: put all experts from layers 0 to 87 in RAM. Experts from layers 88 to 93 (there are 93 layers in Qwen3-VL 235B) can still sit in VRAM. (That's all I can load on a 5090.)
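The layer-range part of this override can be sanity-checked in Python. (Dots are escaped here for strictness, and the expert part is assumed to be `ffn.*_exps.*`, since Markdown often eats the asterisks; the tensor name used is illustrative.)

```python
import re

# The layer-range override pattern, with dots escaped for strictness.
pattern = re.compile(r"blk\.(?:[0-9]|[1-7][0-9]|[8][0-7])\.ffn.*_exps.*")

# Which layer indices does it capture, against an illustrative tensor name?
matched = [n for n in range(94)
           if pattern.search(f"blk.{n}.ffn_up_exps.weight")]
print(matched[0], matched[-1], len(matched))  # 0 87 88
```

So the alternation captures exactly layers 0 through 87, leaving 88-93 untouched, as described above.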

As for multi-GPU setups.. that's for richer people than me to figure out lol


Ph0rk0z commented Feb 14, 2026

You're making me think I should look at some models with different initial blocks and check what's in there. Maybe I can get another speedup placing them in VRAM over later up/down/gate layers. In some quants the layers aren't uniform, so it can be better to skip larger layers if more smaller blocks will fit without leaving empty space where nothing fits.

README.md Outdated
### Prerequisites

```
apt-get update && apt-get install build-essential git libcurl4-openssl-dev curl libgomp1 cmake
```
Collaborator

This should be generic or reference platforms like Windows, since as it stands it makes it seem like there is no Windows support.

saood06 (Collaborator) commented Feb 14, 2026

> If you want to list all tensors, you can just click on any quant on any GGUF on hugging face.

Or if you already have the quant locally you can just run gguf_dump.py.

@alexisnaveros

> In general, in a single GPU + CPU system, you just do something like this:
>
> -ngl 999 to put all layers in VRAM by default
>
> -ot "blk.(?:[0-9]|[1-7][0-9]|[8][0-7]).ffn.*_exps.*=CPU" to create exceptions and put back in RAM anything that has "ffn" and "_exps" in its name and sits in layers called "blk.n", where "n" (the layer number) matches 0 to 9, or 1-7 followed by 0-9 (a number between 10 and 79), or 8 followed by 0-7 (a number between 80 and 87). Basically a complicated way of saying: put all experts from layers 0 to 87 in RAM. Experts from layers 88 to 93 (there are 93 layers in Qwen3-VL 235B) can still sit in VRAM. (That's all I can load on a 5090.)

@MrHills-rs

See, all of this post is fantastic information (thanks!). And it's going to be lost in the replies to an obscure pull request...

Consolidating the information into some kind of guide somewhere for advanced users would be amazing.

mcm007 (Contributor, Author) commented Feb 15, 2026

Thanks for all replies!

All suggestions were included.


MrHills-rs commented Feb 15, 2026

> Thanks for all replies!
>
> All suggestions were included.

I see that the default --cache-ram-similarity is 0.5. Isn't that a bit problematic?

I'm not sure I understand this well but does that mean that as long as token embeddings are 50% similar they will be reused?

I actually had a problem with MiniMax-M2.5 IQ4_XS where it struggled to understand identities. I wrote "you" in my own message and it kept thinking that I was talking about myself. This happened after a typing error:
- I wrote the wrong word
- I pressed send
- I stopped, realizing my mistake
- I rewrote the message, yet it didn't look like the correction had any effect. Multiple refreshes kept ignoring my correction.

This might or might not have anything to do with it, but after restarting ik_llama.cpp and rebuilding the KV cache from zero the problem went away.

I just realized this reading parameters.md. If I understand the situation right, this might be a trap many will fall into, no?

Am I misunderstanding something?

Yes, I did misunderstand things. Never mind, it's not what --cache-ram-similarity does.

#954

mcm007 (Contributor, Author) commented Feb 15, 2026

> This might or might not have anything to do with it, but after restarting ik_llama.cpp and rebuilding the KV cache from zero the problem went away.

Hey @MrHills-rs, you can disable it with -cram 0 then re-test.

Also, check the logs for entries like:

```
slot get_availabl: id  0 | task -1 | selected slot by LCP similarity, sim_best = 0.300 (> 0.100 thold), f_keep = 0.028
srv  get_availabl: updating prompt cache
srv   prompt_save:  - saving prompt with length 109, total state size = 6.335 MiB
srv          load:  - looking for better prompt, base f_keep = 0.028, sim = 0.300
srv        update:  - cache state: 2 prompts, 11.915 MiB (limits: 4096.000 MiB, 20224 tokens, 70470 est)
```
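The "LCP similarity" in those log lines can be pictured as the longest-common-prefix fraction between a cached token sequence and the incoming prompt. A rough Python sketch of the idea (this is an illustration only, not ik_llama.cpp's actual scoring code):

```python
def lcp_similarity(cached: list[int], new: list[int]) -> float:
    """Longest common prefix of the two token sequences, as a fraction
    of the incoming prompt length (a sketch of the idea only)."""
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n / max(len(new), 1)

# A cached prompt sharing a 3-token prefix with a 10-token incoming prompt:
print(lcp_similarity([1, 2, 3, 9, 9], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]))  # 0.3
```

A slot is only reused when this score clears the configured threshold, which is why a similarity of 0.300 against a 0.100 threshold selects the slot in the log above.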

@mcm007 mcm007 marked this pull request as ready for review February 17, 2026 18:49
```
python3 gguf-py/scripts/gguf_dump.py /models/Qwen_Qwen3-0.6B-IQ4_NL.gguf
```

- `-ngl`, `-ot`, `--cpu-moe`, `--n-cpu-moe N`, `-ooae`
Owner

-ooae is not related to where the model weights get stored. Instead, once we have some MoE tensors (ffn_(up|gate|down)_exps.weight) on the CPU, and during batch processing the scheduler decides to copy them to a GPU to perform the corresponding matrix multiplications, -ooae tells the scheduler to offload only the activated experts. The -ooae option is actually ON by default, and one uses -no-ooae to turn it off.

Offloading only the activated experts is useful for some models, where often the number of activated experts is much smaller than the total number of experts, so -ooae reduces the amount of RAM -> VRAM data transfer. A model where this makes a significant difference for hybrid CPU/GPU inference is GPT-OSS-120B.

For many MoE models and large batches basically all experts are activated, so this option makes no difference (or can even slightly lower performance, because it costs some time to determine which experts are active; if all experts turn out to be active, that time was spent for nothing).
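Why this depends on batch size can be sketched with a back-of-the-envelope simulation: the union of activated experts grows quickly with the number of tokens in a batch. The counts below are illustrative (top-4-of-128 routing, chosen uniformly at random, which real routers are not):

```python
import random

random.seed(0)
NUM_EXPERTS, TOP_K = 128, 4  # illustrative top-4-of-128 routing

def active_fraction(batch_tokens: int) -> float:
    """Fraction of all experts a batch activates at one layer,
    assuming (unrealistically) uniform random routing."""
    active = set()
    for _ in range(batch_tokens):
        active.update(random.sample(range(NUM_EXPERTS), TOP_K))
    return len(active) / NUM_EXPERTS

for n in (1, 8, 64, 512):
    print(f"batch={n:3d}  active experts: {active_fraction(n):.0%}")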


> `-ot "blk.(?:[0-9]|[1-7][0-9]|[8][0-7]).ffn.*_exps.*=CPU"` to create exceptions and put back in RAM anything that has "ffn" and "_exps" in its name and sits in layers called "blk.n", where "n" (the layer number) matches 0 to 9, or 1-7 followed by 0-9 (a number between 10 and 79), or 8 followed by 0-7 (a number between 80 and 87).
> Basically a complicated way of saying: put all experts from layers 0 to 87 in RAM. Experts from layers 88 to 93 (there are 93 layers in Qwen3-VL 235B) can still sit in VRAM. (That's all I can load on a 5090.)

Owner

I think this section would greatly benefit from some practical examples. E.g., take some common GPU configurations and some popular models that don't fit into VRAM, and give specific examples for these configurations of how one should use tensor overrides (or --cpu-moe, --n-cpu-moe) to end up with meaningful VRAM and GPU utilization.

I still see people using -ngl N (with N less than the number of layers) for MoE models, so one needs to hammer home that this is basically never useful for MoE models.

Contributor (Author)

Added a "Common GPU configurations and popular models" section; it still needs to be filled in.


3. Offload less to the GPU. Try to find a mix of parameters that suits your system better than the defaults.

- Use `--no-kv-offload` to keep KV cache on CPU.
Owner

I have never found an actually useful application of --no-kv-offload. Hence, at least in my book, it shouldn't get so much attention. Or, if it does, let's have an example of where it is useful.


```
llama-sweep-bench -m /models/model.gguf -c 12288 -ub 512 -rtr -fa -ctk q8_0 -ctv q8_0
```
Owner

Perhaps mention that llama-sweep-bench understands all parameters that one would use in llama-server or llama-cli (but obviously not all get used, only those that are related to loading the model, setting up the context parameters, and running the benchmark).

Contributor (Author)

Updated.


On Linux, install the required packages:

Owner

Strictly speaking, this is Debian/Ubuntu specific, so perhaps mention that they need to find the corresponding packages in the package manager of their Linux distro.

Contributor (Author)

Made it clearer.

| `--no-warmup` | Skip warming up the model with an empty run | - | |
| `--mlock` | Force system to keep model in RAM rather than swapping or compressing | - | |
| `--no-mmap` | Do not memory-map model (slower load but may reduce pageouts) | - | |
| `-rtr, --run-time-repack` | Repack tensors if interleaved variant is available | - | May improve performance on some systems. [PR 147](https://github.com/ikawrakow/ik_llama.cpp/pull/147) |
Owner

As there are many people who have been using llama.cpp and decided to try ik_llama.cpp, I think it would be very useful to explicitly mark the parameters that are ik_llama.cpp specific. It can also be useful to have a section on llama.cpp parameters that are not available in ik_llama.cpp.

Contributor (Author)

Added "Unique parameters" section, still needs to be populated...

@mcm007 mcm007 marked this pull request as draft February 19, 2026 18:40
- no-ooae
- placeholder for common commands
- no-kv-offload
- llama-sweep-bench
- placeholder for unique parameters
@ikawrakow ikawrakow marked this pull request as ready for review February 20, 2026 06:20
@ikawrakow ikawrakow merged commit b2cb451 into ikawrakow:main Feb 20, 2026

6 participants