Fix slot selection logic in get_available_slot (#1332)
CamNoob wants to merge 3 commits into ikawrakow:main
Conversation
The bug was likely introduced in PR ikawrakow#973, when the similarity calculation was changed from LCP to token-level similarity but `sim_best` was still initialized to 0 instead of -1.0f. When the `slot_prompt_similarity` threshold was set high (e.g., 0.8) and no slot met it, `sim_best` stayed at 0, causing `ret` to remain nullptr. This led to the system getting stuck without selecting any slot.

This fix:
- Changed the `sim_best` initialization from 0 to -1.0f
- Added a `best_slot` variable to track the best slot found during the similarity search
- Only sets `ret = best_slot` after the loop completes
- Removed the redundant `ret == nullptr` check

This ensures that even when no slot meets the `slot_prompt_similarity` threshold, the system still identifies the best available slot and falls back to LRU correctly.

Related: PR ikawrakow#973 (Server: Handle context shift better), PR ikawrakow#1285 (Fix slot prompt updating)
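The fixed selection logic can be modeled as a standalone sketch. This is not the actual `get_available_slot()` code from server.cpp; the `Slot` struct, `pick_slot` function, and its fields are hypothetical stand-ins that illustrate why `sim_best` must start at -1.0f and why the LRU fallback now always runs when the threshold is missed.

```cpp
#include <cassert>
#include <vector>

// Hypothetical standalone model of the fixed slot-selection logic.
// Names (Slot, pick_slot, t_last) are illustrative, not the real types.
struct Slot {
    int       id;
    float     sim;    // token-level similarity to the incoming prompt
    long long t_last; // last-used timestamp, for the LRU fallback
};

// Returns the id of the chosen slot, or -1 if there are no slots at all.
int pick_slot(const std::vector<Slot> & slots, float threshold) {
    // Initialize to -1.0f (not 0) so a best slot is tracked even when
    // every similarity is exactly 0.
    float        sim_best  = -1.0f;
    const Slot * best_slot = nullptr;
    for (const auto & s : slots) {
        if (s.sim > sim_best) {
            sim_best  = s.sim;
            best_slot = &s;
        }
    }
    // Only accept the similarity winner after the loop, and only if it
    // clears the threshold...
    if (best_slot && sim_best >= threshold) {
        return best_slot->id;
    }
    // ...otherwise fall back to the least-recently-used slot.
    const Slot * lru = nullptr;
    for (const auto & s : slots) {
        if (!lru || s.t_last < lru->t_last) {
            lru = &s;
        }
    }
    return lru ? lru->id : -1;
}
```

With the old `sim_best = 0` initialization and a 0.8 threshold, a request whose similarity to every slot is 0 would leave the return value null; here the LRU fallback always produces a slot.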
When MoE layers run on the CPU with self-extend (-ga or -ger), the KV cache operations may not be properly synchronized between the CPU and GPU backends.

This fix:
- Added a comment explaining the issue
- Added a llama_kv_cache_defrag(ctx) call after context extension to ensure all layers (including CPU-based MoE layers) have consistent state

The defrag call ensures the KV cache is properly organized before the next batch is processed, which is critical when MoE layers run on the CPU while other layers run on the GPU.
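Conceptually, defragmentation compacts the occupied KV cache cells so that free space is contiguous. The toy function below (`defrag_cells` is an invented name; the real llama_kv_cache_defrag schedules this work on the actual cache structures) shows the effect on a cache where context shifting has left holes:

```cpp
#include <cassert>
#include <vector>

// Toy model of KV cache defragmentation: each cell holds a sequence
// position, with -1 marking a hole left behind by context shifting.
// This is a conceptual sketch, not the llama.cpp implementation.
std::vector<int> defrag_cells(const std::vector<int> & cells) {
    std::vector<int> out;
    out.reserve(cells.size());
    for (int pos : cells) {
        if (pos != -1) {
            out.push_back(pos); // keep occupied cells, preserving order
        }
    }
    out.resize(cells.size(), -1); // free cells end up contiguous at the tail
    return out;
}
```

After compaction, every backend sees the same contiguous layout, which is the consistency property the fix relies on when CPU and GPU layers share the cache.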
I personally never experienced this bug. When a new prompt arrives that is entirely different from the current slot's, it successfully selects the slot and caches the prompt. If no slot meets slot_prompt_similarity, it uses the last-used slot unless that slot is also busy, so it shouldn't get stuck. I only use one slot, by the way. Can you give me more details on when this bug occurs?
This reverts commit cc069f8.
Well, I ran into the issue with Qwen3.5 122B, so it might be a model-specific issue. In an agentic workflow (I used pi-agent), it would overwrite the slot every time, forcing me to process the prompt from scratch on every request.
Do you have logs? In theory, it should cache the prompt so it is not reprocessed every time on the latest main.
Let me revert to main and test again, will report back later tonight |
Something seems wrong in main: running prefixes of a cached prompt causes reprocessing of the full prompt. Prompt ABCD -> compute ABCD -> done.
What happened is that after you sent the second prompt AB, you erased CD from the cached ABCD prompt, so only AB remained cached. You can increase --cache-ram-similarity to 0.8; this will keep ABCD cached. Then prompt AB -> process AB. Prompt ABCDE again -> it loads ABCD and processes only E.
Is there any downside to increasing --cache-ram-similarity? I saw in mainline they actually reduced this to 0.5.
The only downside is the increase in RAM usage, I guess. cache-ram-similarity is hard-coded to 0.5 in mainline. It means that if the fraction of cached tokens kept is less than 0.5, it will trigger the prompt cache. In your case, the similarity is probably 0.8 or 0.9, so you want cache-ram-similarity to be above that value to cache it to RAM.
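The decision rule described above can be sketched as a small predicate. The function name `should_save_to_ram` and its parameters are hypothetical; this only models the stated rule that the cache is saved when the kept fraction falls below the threshold:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical model of the --cache-ram-similarity check described above:
// when a new prompt would keep fewer than `threshold` of the currently
// cached tokens, the old cache is saved to RAM instead of being erased.
bool should_save_to_ram(std::size_t n_cached, std::size_t n_kept, float threshold) {
    if (n_cached == 0) {
        return false; // nothing cached, nothing to save
    }
    const float kept_fraction = float(n_kept) / float(n_cached);
    return kept_fraction < threshold;
}
```

For the ABCD/AB example: 2 of 4 cached tokens are kept (fraction 0.5), so the default threshold of 0.5 does not save the cache (0.5 < 0.5 is false), while raising it to 0.8 does.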
Ah, yeah, I got confused about the cache-ram-similarity behavior. I will try increasing it.