
Fix slot selection logic in get_available_slot #1332

Open
CamNoob wants to merge 3 commits into ikawrakow:main from CamNoob:fix/slot-selection-logic

Conversation


@CamNoob CamNoob commented Feb 27, 2026

The bug was likely introduced in PR #973, when the similarity calculation was changed from longest common prefix (LCP) to token-level similarity but sim_best was left initialized to 0 instead of -1.0f.

When the slot_prompt_similarity threshold was set high (e.g., 0.8) and no slot met it, sim_best stayed at 0, so ret remained nullptr and the server got stuck without selecting any slot.

This fix:

  • Changed sim_best initialization from 0 to -1.0f
  • Added best_slot variable to track the best slot found during similarity search
  • Only set ret = best_slot after the loop completes
  • Removed redundant ret == nullptr check

This ensures that even when no slot meets the slot_prompt_similarity threshold, the system still identifies the best available slot and falls back to LRU correctly.

Related: PR #973 (Server: Handle context shift better), PR #1285 (Fix slot prompt updating)
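
The fixed loop can be sketched roughly as follows. This is an illustrative stand-in, not the actual server.cpp code; Slot and get_best_slot are hypothetical names:

```cpp
#include <cassert>
#include <vector>

struct Slot {
    int   id;
    bool  available;
    float similarity; // token-level similarity to the incoming prompt
};

// Returns the most similar available slot, or nullptr only when no slot
// is available at all.
Slot * get_best_slot(std::vector<Slot> & slots) {
    float  sim_best  = -1.0f;  // was 0 before the fix
    Slot * best_slot = nullptr;
    for (auto & s : slots) {
        if (!s.available) {
            continue;
        }
        // With sim_best initialized to 0, a slot whose similarity is
        // exactly 0 would never be recorded here, leaving the result
        // nullptr even though an available slot exists.
        if (s.similarity > sim_best) {
            sim_best  = s.similarity;
            best_slot = &s;
        }
    }
    // Commit only after the loop: even if no slot met the similarity
    // threshold, the caller still gets the best candidate to fall back
    // on (LRU handling omitted in this sketch).
    return best_slot;
}
```

With the -1.0f initialization, a pool of available slots that all have similarity 0 still yields a non-null result instead of leaving the request unassigned.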

When MoE layers run on CPU with self-extend (-ga or -ger), the KV cache
operations may not be properly synchronized between CPU and GPU backends.

This fix:
- Added comment explaining the issue
- Added llama_kv_cache_defrag(ctx) call after context extension
  to ensure all layers (including CPU-based MoE) have consistent state

The defrag call ensures the KV cache is properly organized before
the next batch processing, which is critical when MoE layers run on CPU
while other layers run on GPU.
@firecoperana (Collaborator)

I personally never experienced this bug. When a new prompt that is entirely different from the current slot arrives, it successfully selects the slot and caches the prompt. If no slot meets slot_prompt_similarity, it uses the last used slot unless that slot is also busy, so it shouldn't get stuck. I only use one slot, by the way. Can you give me more details on when this bug occurs?

@CamNoob (Author)

CamNoob commented Feb 28, 2026

Well, I ran into the issue with Qwen3.5 122B, so it might be model-specific. In an agentic workflow (I used pi-agent), it would overwrite the slot every time, forcing the prompt to be processed from scratch on each request.

@firecoperana (Collaborator)

Do you have logs? In theory, on the latest main it should cache the prompt so it is not reprocessed every time.

@CamNoob (Author)

CamNoob commented Feb 28, 2026

Let me revert to main and test again; I'll report back later tonight.

@usrlocalben (Contributor)

usrlocalben commented Mar 4, 2026

Something seems wrong in main: running prefixes of a cached prompt causes reprocessing of the full prompt.

prompt ABCD -> compute ABCD -> done
prompt AB -> cache-hit AB -> done
prompt ABCD -> cache-hit AB (why not ABCD?) -> compute CD !?
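
This behaviour is consistent with plain longest-common-prefix matching against whatever is left in the cache; a minimal illustration (hypothetical helper, not the server code):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Length of the shared token prefix between a cached sequence and an
// incoming prompt (illustrative only).
size_t common_prefix(const std::vector<int> & cached, const std::vector<int> & prompt) {
    size_t n = 0;
    while (n < cached.size() && n < prompt.size() && cached[n] == prompt[n]) {
        n++;
    }
    return n;
}
```

If the second request truncated the slot's cache down to AB, the third ABCD request can only hit the AB prefix, so CD must be recomputed.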

@firecoperana (Collaborator)

What happened is that after you sent the second prompt AB, it erased CD from the cached ABCD, so only AB remained cached. You can increase --cache-ram-similarity to 0.8, which will save ABCD to the RAM cache. Then prompt AB -> process AB; send prompt ABCDE again and it loads ABCD and processes only E.

@jukofyork (Contributor)

> What happened is that after you sent the second prompt AB, it erased CD from the cached ABCD, so only AB remained cached. You can increase --cache-ram-similarity to 0.8, which will save ABCD to the RAM cache. Then prompt AB -> process AB; send prompt ABCDE again and it loads ABCD and processes only E.

Is there any downside to increasing --cache-ram-similarity like this? I often run into this same problem in opencode when the primary agent spawns parallel sub-agents that all have the same system message (e.g. they all "steal" each other's cached tokens and end up doing lots of reprocessing).

I saw in mainline they actually reduced this to 0.1, and I left off playing with the value as I couldn't understand why that would be helpful.

@firecoperana (Collaborator)

firecoperana commented Mar 20, 2026

The only downside is the increase in RAM usage, I guess. cache-ram-similarity is hard-coded as 0.5 in mainline. It means that if the fraction of cached tokens kept is less than 0.5, it triggers saving the prompt cache. In your case the similarity is probably 0.8 or 0.9, so you want cache-ram-similarity above that value to cache the prompt to RAM.
The 0.1 value is related to slot selection, not the prompt cache.
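
As a rough sketch of that decision (the function name and parameters here are made up for illustration; kept/total is the fraction of cached tokens that would survive the new prompt):

```cpp
#include <cassert>

// Hypothetical model of the cache-ram-similarity check: save the prompt
// cache to RAM only when the fraction of cached tokens kept drops below
// the configured threshold.
bool should_save_to_ram(int kept_tokens, int total_tokens, float cache_ram_similarity) {
    return (float) kept_tokens / (float) total_tokens < cache_ram_similarity;
}
```

With ABCD (4 tokens) followed by AB, 2/4 = 0.5 of the cache survives; at the default 0.5 threshold this is not strictly below, so nothing is saved, while raising the threshold to 0.8 triggers the save.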

@jukofyork (Contributor)

> The only downside is the increase in RAM usage, I guess. cache-ram-similarity is hard-coded as 0.5 in mainline. It means that if the fraction of cached tokens kept is less than 0.5, it triggers saving the prompt cache. In your case the similarity is probably 0.8 or 0.9, so you want cache-ram-similarity above that value to cache the prompt to RAM. The 0.1 value is related to slot selection, not the prompt cache.

Ah, yeah I got confused about the 0.1 thing after seeing they changed it in mainline.

I will try 0.9 as I have plenty of RAM spare.
