
Fix slot selection logic in get_available_slot #1332

Open
CamNoob wants to merge 3 commits into ikawrakow:main from CamNoob:fix/slot-selection-logic

Conversation


@CamNoob CamNoob commented Feb 27, 2026

The bug was likely introduced in PR #973, when the similarity calculation was changed from longest common prefix (LCP) to token-level similarity but sim_best was left initialized to 0 instead of -1.0f.

When the slot_prompt_similarity threshold was set high (e.g., 0.8) and no slot met it, sim_best stayed at 0, so ret remained nullptr and the server got stuck without selecting any slot.

This fix:

  • Changed sim_best initialization from 0 to -1.0f
  • Added best_slot variable to track the best slot found during similarity search
  • Only set ret = best_slot after the loop completes
  • Removed redundant ret == nullptr check

This ensures that even when no slot meets the slot_prompt_similarity threshold, the system still identifies the best available slot and falls back to LRU correctly.

Related: PR #973 (Server: Handle context shift better), PR #1285 (Fix slot prompt updating)
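
The fixed loop can be sketched roughly as follows. This is an illustrative stand-in, not the actual server.cpp code; Slot and get_best_slot are hypothetical names:

```cpp
#include <cassert>
#include <vector>

struct Slot {
    int   id;
    bool  available;
    float similarity; // token-level similarity to the incoming prompt
};

// Returns the most similar available slot, or nullptr only when no slot
// is available at all.
Slot * get_best_slot(std::vector<Slot> & slots) {
    float  sim_best  = -1.0f;  // was 0 before the fix
    Slot * best_slot = nullptr;
    for (auto & s : slots) {
        if (!s.available) {
            continue;
        }
        // With sim_best initialized to 0, a slot whose similarity is
        // exactly 0 would never be recorded here, leaving the result
        // nullptr even though an available slot exists.
        if (s.similarity > sim_best) {
            sim_best  = s.similarity;
            best_slot = &s;
        }
    }
    // Commit only after the loop: even if no slot met the similarity
    // threshold, the caller still gets the best candidate to fall back
    // on (LRU handling omitted in this sketch).
    return best_slot;
}
```

With the -1.0f initialization, a pool of available slots that all have similarity 0 still yields a non-null result instead of leaving the request unassigned.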

When MoE layers run on CPU with self-extend (-ga or -ger), the KV cache
operations may not be properly synchronized between CPU and GPU backends.

This fix:
- Added comment explaining the issue
- Added llama_kv_cache_defrag(ctx) call after context extension
  to ensure all layers (including CPU-based MoE) have consistent state

The defrag call ensures the KV cache is properly organized before
the next batch processing, which is critical when MoE layers run on CPU
while other layers run on GPU.
@firecoperana (Collaborator)

I personally never experienced this bug. When a new prompt that is entirely different from the current slot arrives, it successfully selects the slot and caches the prompt. If no slot meets slot_prompt_similarity, it uses the last used slot unless that slot is also busy, so it shouldn't get stuck. I only use one slot, by the way. Can you give me more details on when this bug occurs?

@CamNoob (Author)

CamNoob commented Feb 28, 2026

Well, I ran into the issue with Qwen3.5 122B, so it might be model-specific. In an agentic workflow (I used pi-agent), it would overwrite the slot every time, forcing the prompt to be processed from scratch on each request.

@firecoperana (Collaborator)

Do you have logs? In theory, on the latest main it should cache the prompt so it is not reprocessed every time.

@CamNoob (Author)

CamNoob commented Feb 28, 2026

Let me revert to main and test again; I'll report back later tonight.

@usrlocalben (Contributor)

usrlocalben commented Mar 4, 2026

Something seems wrong in main: running prefixes of a cached prompt causes reprocessing of the full prompt.

prompt ABCD -> compute ABCD -> done
prompt AB -> cache-hit AB -> done
prompt ABCD -> cache-hit AB (why not ABCD?) -> compute CD !?
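
This behaviour is consistent with plain longest-common-prefix matching against whatever is left in the cache; a minimal illustration (hypothetical helper, not the server code):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Length of the shared token prefix between a cached sequence and an
// incoming prompt (illustrative only).
size_t common_prefix(const std::vector<int> & cached, const std::vector<int> & prompt) {
    size_t n = 0;
    while (n < cached.size() && n < prompt.size() && cached[n] == prompt[n]) {
        n++;
    }
    return n;
}
```

If the second request truncated the slot's cache down to AB, the third ABCD request can only hit the AB prefix, so CD must be recomputed.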

@firecoperana (Collaborator)

What happened is that after you sent the second prompt AB, it erased CD from the cached ABCD, so only AB remained cached. You can increase --cache-ram-similarity to 0.8, which will save ABCD to the RAM cache. Then prompt AB -> process AB; send prompt ABCDE again and it loads ABCD and processes only E.

@jukofyork (Contributor)

> What happened is that after you sent the second prompt AB, it erased CD from the cached ABCD, so only AB remained cached. You can increase --cache-ram-similarity to 0.8, which will save ABCD to the RAM cache. Then prompt AB -> process AB; send prompt ABCDE again and it loads ABCD and processes only E.

Is there any downside to increasing --cache-ram-similarity like this? I often run into this same problem in opencode when the primary agent spawns parallel sub-agents that all have the same system message (e.g. they all "steal" each other's cached tokens and end up doing lots of reprocessing).

I saw in mainline they actually reduced this to 0.1, and I left off playing with the value as I couldn't understand why that would be helpful.

@firecoperana (Collaborator)

firecoperana commented Mar 20, 2026

The only downside is the increase in RAM usage, I guess. cache-ram-similarity is hard-coded as 0.5 in mainline. It means that if the fraction of cached tokens kept is less than 0.5, it triggers saving the prompt cache. In your case the similarity is probably 0.8 or 0.9, so you want cache-ram-similarity above that value to cache the prompt to RAM.
The 0.1 value is related to slot selection, not the prompt cache.
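
As a rough sketch of that decision (the function name and parameters here are made up for illustration; kept/total is the fraction of cached tokens that would survive the new prompt):

```cpp
#include <cassert>

// Hypothetical model of the cache-ram-similarity check: save the prompt
// cache to RAM only when the fraction of cached tokens kept drops below
// the configured threshold.
bool should_save_to_ram(int kept_tokens, int total_tokens, float cache_ram_similarity) {
    return (float) kept_tokens / (float) total_tokens < cache_ram_similarity;
}
```

With ABCD (4 tokens) followed by AB, 2/4 = 0.5 of the cache survives; at the default 0.5 threshold this is not strictly below, so nothing is saved, while raising the threshold to 0.8 triggers the save.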

@jukofyork (Contributor)

> The only downside is the increase in RAM usage, I guess. cache-ram-similarity is hard-coded as 0.5 in mainline. It means that if the fraction of cached tokens kept is less than 0.5, it triggers saving the prompt cache. In your case the similarity is probably 0.8 or 0.9, so you want cache-ram-similarity above that value to cache the prompt to RAM. The 0.1 value is related to slot selection, not the prompt cache.

Ah, yeah I got confused about the 0.1 thing after seeing they changed it in mainline.

I will try 0.9 as I have plenty of RAM spare.
