ggml-blas: refactor BLAS backend #18027

Closed
taronaeo wants to merge 24 commits into ggml-org:master from taronaeo:feat/blas_mmid

Conversation

@taronaeo
Member

@taronaeo taronaeo commented Dec 14, 2025

I saw this comment and felt a bit sad since we use the BLAS backend a lot: #14909 (comment)

Currently, the BLAS backend dequantizes the weight tensors on the fly, which runs on every matrix multiplication operation. This is not ideal, and more performance gains can be attained by pre-dequantizing the weight tensors and running the matrix multiplication calculation on the dequantized buffers.

We currently see a 53.56% performance improvement for Prompt Processing, and 107.79% for Token Generation.

Next up on the roadmap will probably be the following:

  1. LibXSMM for smaller matrix multiplications

But this PR focuses on the performance improvement first.

Changes

  1. Dequantization for weight tensors now runs only once, during tensor init.
  2. test-backend-ops.cpp now sets weight tensors to GGML_BACKEND_BUFFER_USAGE_WEIGHTS instead of GGML_BACKEND_BUFFER_USAGE_ANY; the backend uses this flag to detect when to pre-dequantize a tensor.

Performance Benchmark

$ build/bin/llama-bench -m ~/Library/Caches/llama.cpp/LiquidAI_LFM2-8B-A1B-GGUF_LFM2-8B-A1B-Q4_K_M.gguf -r 1 -t 8

This PR

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| lfm2moe 8B.A1B Q4_K - Medium | 4.70 GiB | 8.34 B | BLAS | 8 | pp512 | 221.20 ± 0.00 |
| lfm2moe 8B.A1B Q4_K - Medium | 4.70 GiB | 8.34 B | BLAS | 8 | tg128 | 77.75 ± 0.00 |

build: 717531b (7364)

Upstream

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| lfm2moe 8B.A1B Q4_K - Medium | 4.70 GiB | 8.34 B | BLAS | 8 | pp512 | 127.74 ± 0.00 |
| lfm2moe 8B.A1B Q4_K - Medium | 4.70 GiB | 8.34 B | BLAS | 8 | tg128 | 23.29 ± 0.00 |

build: 254098a (7399)

AI Declaration: AI has been used to modify the test-backend-ops.cpp code to set weight tensors usage to GGML_BACKEND_BUFFER_USAGE_WEIGHTS because I couldn't figure it out without it crashing. Modifications to the BLAS backend were written by a human.

@github-actions github-actions bot added the testing (Everything test related) and ggml (changes relating to the ggml tensor library for machine learning) labels Dec 14, 2025
@taronaeo taronaeo marked this pull request as ready for review December 14, 2025 15:23
@taronaeo taronaeo requested a review from ggerganov as a code owner December 14, 2025 15:23
@taronaeo taronaeo marked this pull request as draft December 14, 2025 15:23
@taronaeo
Member Author

Made the PR ready for review by accident - pressed the wrong button.

@taronaeo
Member Author

I can't seem to reproduce the error found in CI / ggml-ci-mac-metal. All tests are passing locally, which is odd.

 /Users/ggml/actions-runner/_work/llama.cpp/llama.cpp/ggml/src/ggml-backend.cpp:1423: ggml_backend_sched_alloc_splits: unexpected graph reallocation (graph size = 1246, nodes = 1140, leafs = 373), debug_realloc = 1

WARNING: Using native backtrace. Set GGML_BACKTRACE_LLDB for more info.
WARNING: GGML_BACKTRACE_LLDB may cause native MacOS Terminal.app to crash.
See: https://github.com/ggml-org/llama.cpp/pull/17869
0   libggml-base.0.9.4.dylib            0x00000001006cd36c ggml_print_backtrace + 276
1   libggml-base.0.9.4.dylib            0x00000001006cd558 ggml_abort + 156
2   libggml-base.0.9.4.dylib            0x00000001006ea0e4 ggml_backend_sched_graph_compute + 0
3   libllama.0.0.1.dylib                0x000000010087b638 _ZN13llama_context14process_ubatchERK12llama_ubatch14llm_graph_typeP22llama_memory_context_iR11ggml_status + 540
4   libllama.0.0.1.dylib                0x000000010087cc78 _ZN13llama_context6decodeERK11llama_batch + 1736
5   libllama.0.0.1.dylib                0x00000001008810ac llama_decode + 20
6   llama-save-load-state               0x00000001001aef2c main + 596
7   dyld                                0x0000000198829d54 start + 7184
./ci/run.sh: line 438: 55931 Abort trap: 6           ./bin/llama-save-load-state --model ${model_q4_0} -ngl 10 -c 1024 -fa off --no-op-offload

I'll unfortunately have to use this PR as a test playground to figure out what is wrong with the refactor.

@pwilkin
Member

pwilkin commented Dec 20, 2025

@taronaeo I think I can help here :) you're missing the GGML_SCHED_NO_REALLOC=ON flag, which is probably set in the CI.

For more details on what that's doing there, see #17276 and #17617
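If I read the linked PRs right, reproducing the CI behaviour locally might look like the following. This is an assumption on my part (check #17276 and #17617 for the authoritative flag name and mechanism):

```shell
# Assumed CMake invocation: GGML_SCHED_NO_REALLOC makes an unexpected graph
# reallocation during scheduling a hard error, matching the CI abort above.
cmake -B build -DGGML_BLAS=ON -DGGML_METAL=OFF -DGGML_SCHED_NO_REALLOC=ON
cmake --build build --config Release -j
```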

@taronaeo
Member Author

taronaeo commented Dec 21, 2025

@taronaeo I think I can help here :) you're missing the GGML_SCHED_NO_REALLOC=ON flag, which is probably set in the CI.

For more details on what that's doing there, see #17276 and #17617

Thanks! I can reproduce this locally now, but I can't seem to fix the issue or tell where it originates.

Reading through all the linked PRs, does this have something to do with making the tensor->extra size consistent?

I understand that @/slaren is currently on hiatus and won't be able to help. Does anyone else have a better understanding of this problem? I'm a little lost.

Edit: Found the problem here: https://github.com/ggml-org/llama.cpp/pull/18027/changes#diff-2227850fec93779a72739b3ba38bfcb962c7a1e4fbfe183de91c39f0dd72b21dR511

Working on a fix...

@Djip007
Contributor

Djip007 commented Dec 24, 2025

What hardware did you use for the benchmark?
What did you get with the standard CPU backend?

@taronaeo
Member Author

What hardware did you use for the benchmark?

The benchmark was run on an Apple M1 MacBook Pro (M1 Pro processor) with 32 GB of RAM and only BLAS compiled. The Metal backend was explicitly disabled via -DGGML_METAL=OFF (default is ON).

What did you get with the standard CPU backend?

I have yet to measure it, but the intention of this PR is to improve the speed of the BLAS backend, not to compare it against the CPU backend.

@Djip007
Contributor

Djip007 commented Dec 24, 2025

Currently, the BLAS backend dequantizes the weight tensors on the fly, which runs on every matrix multiplication operation. This is not ideal, and more performance gains can be attained by pre-dequantizing the weight tensors and running the matrix multiplication calculation on the dequantized buffers.

We could just use a non-quantized (i.e. FP32) model, so what is the advantage here? And it needs a lot more RAM to work.

Did you find a way to use sgemm for GGML_OP_MUL_MAT_ID with this MoE model?

@taronaeo
Member Author

F32 models will run as-is without the need for the dequantization step during tensor init. No additional memory cost because no dequantized buffer is allocated.

F16 and BF16 models will still need to go through the dequantization process because cblas_sgemm operates on single-precision floats only.

Did you find a way to use sgemm for GGML_OP_MUL_MAT_ID with this MoE model?

GGML_OP_MUL_MAT_ID runs faster on CPU, so there is no need to implement it for the BLAS backend. It will automatically fall back to the CPU backend.

@Djip007
Contributor

Djip007 commented Dec 25, 2025

I reviewed quickly, so there may be some wrong comments ;)

@DaAwesomeP
Contributor

@taronaeo please also see #18205 which fixes some CMake options with the different BLAS backends.

@Djip007
Contributor

Djip007 commented Feb 25, 2026

https://github.com/OpenMathLib/OpenBLAS/blob/18638c70eff1f0d08e2833b2724deaa128d6a334/cblas.h#L483-L486

I had missed that OpenBLAS (and maybe other BLAS libraries) now has other ops:

  • sbgemm / sbgemm_batch for BF16, and even type conversion...
  • shgemm for FP16...

and maybe more later...

So there is some improvement possible.

@taronaeo
Member Author

Closing this PR as abandoned. While the performance gains were good, too many resources were required to keep the F32 weights in memory.

For now, we'll keep the BLAS backend as-is where it dequantizes the weights on-the-fly until a better strategy can be made.

@taronaeo taronaeo closed this Feb 27, 2026
