Better CPU prompt processing performance for SWA models by ikawrakow · Pull Request #702 · ikawrakow/ik_llama.cpp

ikawrakow · 2025-08-18T05:13:51Z

This PR is a fixed version of #696; see that PR for details.

The crashes we were getting in #696 occurred because the back-end does not allocate a buffer for the tensor containing the mask bounds when that tensor is not used in the graph. Now we only create the mask bounds when they are actually used (SWA models with FA enabled).

If we allocate the tensor for the mask bounds, but then don't use it, we get a crash in the back-end. Hence, we only allocate the bounds tensor when using FA.
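
The idea of creating the bounds only when something consumes them can be sketched in isolation. The following is a minimal, self-contained illustration and not the code from this PR: `MaskBounds`, `compute_swa_bounds`, and `maybe_make_bounds` are hypothetical names, and the window semantics (a KV cell at position `p` is visible to a query at position `q` when `q - window < p <= q`, with KV cells stored in position order starting at 0) are an assumption made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical illustration only; none of these names come from ik_llama.cpp.
// Assumption: a KV cell at position p is visible to a query at position q iff
//   q - window < p <= q   (causal sliding window),
// and KV cells are stored in position order starting at 0, so the visible
// cells of each query row form one contiguous index range.
struct MaskBounds {
    std::vector<int32_t> first; // first visible KV index per query row
    std::vector<int32_t> last;  // last visible KV index per query row (inclusive)
};

// Compute the per-row visible range at the same time the mask would be built.
MaskBounds compute_swa_bounds(const std::vector<int32_t>& q_pos,
                              int32_t n_kv, int32_t window) {
    assert(window > 0 && n_kv > 0);
    MaskBounds b;
    b.first.reserve(q_pos.size());
    b.last.reserve(q_pos.size());
    for (int32_t pos : q_pos) {
        b.first.push_back(std::max<int32_t>(0, pos - window + 1));
        b.last.push_back(std::min<int32_t>(n_kv - 1, pos));
    }
    return b;
}

// Create the bounds only when a consumer exists (here: SWA model with FA on).
// Returning an empty optional models "the tensor is never created", so there
// is nothing left unallocated or unused for the back-end to trip over.
std::optional<MaskBounds> maybe_make_bounds(bool is_swa, bool use_fa,
                                            const std::vector<int32_t>& q_pos,
                                            int32_t n_kv, int32_t window) {
    if (!(is_swa && use_fa)) {
        return std::nullopt;
    }
    return compute_swa_bounds(q_pos, n_kv, window);
}
```

With a window of 4 and 10 KV cells, a query at position 5 would get the range [2, 5] under these assumptions, and calling `maybe_make_bounds` with FA disabled yields no bounds at all.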

ikawrakow · 2025-09-04T10:11:03Z

Closing in favor of #757

Iwan Kawrakow added 4 commits August 18, 2025 07:58

This does the trick for PP

43096be

Compute mask bounds when creating the mask

6aaeb81

Set mask bounds for all supported SWA models

e9899c0

Fix crash

41d346a

If we allocate the tensor for the mask bounds, but then don't use it, we get a crash in the back-end. Hence, we only allocate the bounds tensor when using FA.

This was referenced Sep 2, 2025

CUDA: FA optimization for models using SWA #752

Closed

Better CPU SWA #757

Merged

ikawrakow closed this Sep 4, 2025

ikawrakow wants to merge 4 commits into main from ik/cpu_swa_v1
