ik_llama.cpp is important to me :3 With this PR I can run an IQ3 quant of the model at 600 t/s prefill and 11 t/s decode on my 8 GB VRAM / 64 GB RAM laptop. Thank you for your work!
Yeah, gotta say, I'm seeing a 2x speed uplift over mainline on Qwen3.5-27B for PP. ik_llama.cpp remains undefeated and incredibly important.
@ikawrakow, would you be interested in making a CPU WASM target? Given how far ahead of mainline you are for CPU inference, it could make in-browser small agents much better (or even feasible).

ik_llama.cpp is not important or famous, so we cannot have day-0 support for new models. But in this particular case we get day-1: CPU and CUDA, including flash attention for the new head size combination of 320, 256.
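The 320/256 pair means the attention kernel must handle query/key vectors and value vectors of different lengths per head. As a rough illustration of the shapes involved (this is a plain NumPy sketch of non-flash scaled dot-product attention, not the PR's kernel; all names are illustrative):

```python
import numpy as np

def sdpa(q, k, v):
    """Scaled dot-product attention for one head.
    q, k: (T, d_k); v: (T, d_v). Returns (T, d_v)."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)               # (T, T) attention logits
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)            # softmax over keys
    return w @ v                                  # weighted sum of values

rng = np.random.default_rng(0)
T, d_k, d_v = 8, 320, 256                         # the new head-size combination
q = rng.standard_normal((T, d_k))
k = rng.standard_normal((T, d_k))
v = rng.standard_normal((T, d_v))
out = sdpa(q, k, v)
print(out.shape)  # (8, 256): output width follows the value head size
```

A fused flash-attention kernel computes the same result tile by tile without materializing the (T, T) score matrix, which is why each supported head-size combination needs its own kernel instantiation.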
Tested with UD_IQ2_XSS from Unsloth (because I wanted to have full GPU offload on a 2x3090 system).
CUDA performance
llama.cpp does not have CUDA FA support as of this writing, so we cannot get very far with context length. Here is as far as it gets on the 2x3090 system:

CPU performance
Running on a Ryzen-3995WX CPU.
And here is what we get with llama.cpp. On the CPU, FA is enabled.