
Better CPU prompt processing performance for SWA models #696

Merged
ikawrakow merged 3 commits into main from ik/cpu_swa_v1
Aug 17, 2025

@ikawrakow
Owner

This PR is a follow-up to #692 and uses the same technique to improve prompt processing performance for models that use sliding window attention (SWA). Like #682, it is implemented only on the CPU and requires flash attention (FA).
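The description does not spell out the technique, so the following is only general background on why SWA leaves room for prompt processing savings, not the code in this PR. It is a minimal Python sketch that assumes a causal sliding window in which a query at position `q` sees KV positions `q - window + 1` through `q`, and a KV cache processed in fixed-size blocks. The helper names, the window convention, and the block size are all hypothetical.

```python
# Illustrative sketch only: not taken from this PR or from the ik_llama.cpp source.
# Assumed convention: a query at position q attends to KV positions
# max(0, q - window + 1) .. q (causal, window includes the query itself).


def swa_visible_range(q_pos: int, window: int) -> range:
    """KV positions visible to a single query under the assumed SWA mask."""
    start = max(0, q_pos - window + 1)
    return range(start, q_pos + 1)


def kv_blocks_needed(q_start: int, q_end: int, window: int, block: int) -> range:
    """Indices of KV blocks that hold at least one unmasked entry for the
    query batch [q_start, q_end). Blocks outside this range are fully masked
    and contribute nothing to the attention result for this batch."""
    lowest_visible = max(0, q_start - window + 1)  # oldest key seen by first query
    highest_visible = q_end - 1                    # newest key seen by last query
    return range(lowest_visible // block, highest_visible // block + 1)


def causal_blocks_needed(q_end: int, block: int) -> range:
    """Same count for plain causal attention (no window), for comparison."""
    return range(0, (q_end - 1) // block + 1)


if __name__ == "__main__":
    # Hypothetical numbers: queries 3072..3583, window 1024, KV blocks of 256.
    swa = kv_blocks_needed(3072, 3584, window=1024, block=256)
    full = causal_blocks_needed(3584, block=256)
    print(f"SWA touches {len(swa)} KV blocks, plain causal touches {len(full)}")
```

With these hypothetical numbers, only 6 of the 14 causally visible KV blocks contain unmasked entries. The further into a long prompt the batch sits, the larger the share of blocks that is fully masked.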

Here are some performance comparisons on a Ryzen-7950X CPU:

Gemma3-270M-it, Q8_0

[Image: g3_swa_pp]

GPT-OSS-20B, MXFP4

[Image: gpt_oss_swa_pp]

@ikawrakow
Owner Author

Just for fun, here is a CPU-only comparison with mainline llama.cpp for GPT-OSS-20B-MXFP4 with Q8_0 KV cache:

[Image: gpt_oss_swa_pp1]

@ikawrakow ikawrakow merged commit 93a4f60 into main Aug 17, 2025
ikawrakow pushed a commit that referenced this pull request Aug 17, 2025
ikawrakow added a commit that referenced this pull request Aug 17, 2025
…" (#701)

This reverts commit 93a4f60.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>