[Hybrid] Create checkpoints while processing the prompt #17428
whoreson wants to merge 2 commits into ggml-org:master from
Conversation
…e loading request for a non-existent file
You can simply start llama-server with

most uninformed comment of the month award
My apologies. I mistakenly thought Granite was using sliding window attention (for which my comment would have made sense), but it's not. Disregard my previous comment. Having said that, there is no need to be unfriendly towards people trying to be helpful, even if they made a mistake.
i can still feel very friendly even when mildly annoyed (useful trait when keeping women around)
@whoreson please stay on topic and refrain from needlessly agitating specific persons or groups of people.
well cudadev sir. anyway, this is the wrong place for this; creating a usable cache is a library task, it doesn't belong in an HTTP client.

on that note, it would make a lot of sense if the (horrible, incompatible, and screen-space-wasting) web UI would finally be separated out of llama-server and communicate with it via HTTP. but i don't think there's any chance of that happening, considering that llama.cpp has had exactly zero feature-freeze/release periods in all of its existence, just code being shovelled in endlessly.
Now that Qwen 3.5 is out, can you guys FINALLY fix this functionality so that I won't have to track all the server code reorganizations locally? Jesus |
Hi,

The second commit is fully orthogonal; perhaps it should be moved to a different PR, so we can focus on converging on an optimal solution for the KV-cache truncation with checkpoints.

My reason for checkpointing at a fixed (albeit configurable) interval in my PR is that the token interval controls the worst-case maximum latency that you get. If the context is small I would like fewer checkpoints; if it's large I would like more checkpoints. I think the interval should be fixed in tokens. Still, it's suboptimal in my approach that ctx-checkpoints is left to the user to set right.

I would prefer if there were no reservation and checkpoints were dynamically allocated, with a lifetime tracking the KV cache, and freed when the KV cache is dropped below the checkpoint. Then the parameter and the ctx-checkpoints number would disappear. I have no idea if that's feasible or if there's some blocker to achieving it. So I did the minimal tweak that would provide checkpoints at a fixed token interval, leveraging the configurable logical batch size.

Our PRs are needed every time you modify a file in context while coding; without them, all code-editor enhancements like the gptel PR to add context LRU management and opt-in KV-cache preloading won't work at all.

Thanks!
Currently, llama-server creates a single checkpoint after e.g. processing 600000 tokens with IBM Granite, which will likely get erased very soon.
I have deemed this sad state of affairs insufferable, so I instructed Gemini to solve the issue and generate half of the requested checkpoints during prompt processing (PP).
Now it's actually usable.
Here is the result; please review.