ggml-blas: refactor BLAS backend #18027

Closed
taronaeo wants to merge 24 commits into ggml-org:master from taronaeo:feat/blas_mmid

Conversation

@taronaeo
Member

@taronaeo taronaeo commented Dec 14, 2025

I saw this comment and felt a bit sad since we use the BLAS backend a lot: #14909 (comment)

Currently, the BLAS backend dequantizes the weight tensors on the fly, which runs on every matrix multiplication operation. This is not ideal, and more performance gains can be attained by pre-dequantizing the weight tensors and running the matrix multiplication calculation on the dequantized buffers.

We currently see a 53.56% performance improvement for Prompt Processing, and 107.79% for Token Generation.

Next up on the roadmap will probably be the following:

  1. LibXSMM for smaller matrix multiplications

But this PR focuses on the performance improvement first.

Changes

  1. Dequantization for weight tensors now runs only once, during tensor init.
  2. test-backend-ops.cpp now sets weight tensors to GGML_BACKEND_BUFFER_USAGE_WEIGHTS instead of GGML_BACKEND_BUFFER_USAGE_ANY; the backend uses this flag to detect when to pre-dequantize a tensor.

Performance Benchmark

$ build/bin/llama-bench -m ~/Library/Caches/llama.cpp/LiquidAI_LFM2-8B-A1B-GGUF_LFM2-8B-A1B-Q4_K_M.gguf -r 1 -t 8

This PR

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| lfm2moe 8B.A1B Q4_K - Medium | 4.70 GiB | 8.34 B | BLAS | 8 | pp512 | 221.20 ± 0.00 |
| lfm2moe 8B.A1B Q4_K - Medium | 4.70 GiB | 8.34 B | BLAS | 8 | tg128 | 77.75 ± 0.00 |

build: 717531b (7364)

Upstream

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| lfm2moe 8B.A1B Q4_K - Medium | 4.70 GiB | 8.34 B | BLAS | 8 | pp512 | 127.74 ± 0.00 |
| lfm2moe 8B.A1B Q4_K - Medium | 4.70 GiB | 8.34 B | BLAS | 8 | tg128 | 23.29 ± 0.00 |

build: 254098a (7399)

AI Declaration: AI has been used to modify the test-backend-ops.cpp code to set weight tensors usage to GGML_BACKEND_BUFFER_USAGE_WEIGHTS because I couldn't figure it out without it crashing. Modifications to the BLAS backend were written by a human.

@github-actions github-actions bot added the testing (Everything test related) and ggml (changes relating to the ggml tensor library for machine learning) labels Dec 14, 2025
@taronaeo taronaeo marked this pull request as ready for review December 14, 2025 15:23
@taronaeo taronaeo requested a review from ggerganov as a code owner December 14, 2025 15:23
@taronaeo taronaeo marked this pull request as draft December 14, 2025 15:23
@taronaeo
Member Author

Made the PR ready for review by accident - pressed the wrong button.

@taronaeo
Member Author

I can't seem to reproduce the error found in CI / ggml-ci-mac-metal. All tests are passing locally, which is odd.

 /Users/ggml/actions-runner/_work/llama.cpp/llama.cpp/ggml/src/ggml-backend.cpp:1423: ggml_backend_sched_alloc_splits: unexpected graph reallocation (graph size = 1246, nodes = 1140, leafs = 373), debug_realloc = 1

WARNING: Using native backtrace. Set GGML_BACKTRACE_LLDB for more info.
WARNING: GGML_BACKTRACE_LLDB may cause native MacOS Terminal.app to crash.
See: https://github.com/ggml-org/llama.cpp/pull/17869
0   libggml-base.0.9.4.dylib            0x00000001006cd36c ggml_print_backtrace + 276
1   libggml-base.0.9.4.dylib            0x00000001006cd558 ggml_abort + 156
2   libggml-base.0.9.4.dylib            0x00000001006ea0e4 ggml_backend_sched_graph_compute + 0
3   libllama.0.0.1.dylib                0x000000010087b638 _ZN13llama_context14process_ubatchERK12llama_ubatch14llm_graph_typeP22llama_memory_context_iR11ggml_status + 540
4   libllama.0.0.1.dylib                0x000000010087cc78 _ZN13llama_context6decodeERK11llama_batch + 1736
5   libllama.0.0.1.dylib                0x00000001008810ac llama_decode + 20
6   llama-save-load-state               0x00000001001aef2c main + 596
7   dyld                                0x0000000198829d54 start + 7184
./ci/run.sh: line 438: 55931 Abort trap: 6           ./bin/llama-save-load-state --model ${model_q4_0} -ngl 10 -c 1024 -fa off --no-op-offload

I'll unfortunately have to use this PR as a test playground to figure out what is wrong with the refactor.

@pwilkin
Member

pwilkin commented Dec 20, 2025

@taronaeo I think I can help here :) you're missing the GGML_SCHED_NO_REALLOC=ON flag, which is probably set in the CI.

For more details on what that's doing there, see #17276 and #17617
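If I read the linked PRs right, reproducing the CI behaviour locally might look like the following. This is an assumption on my part (check #17276 and #17617 for the authoritative flag name and mechanism):

```shell
# Assumed CMake invocation: GGML_SCHED_NO_REALLOC makes an unexpected graph
# reallocation during scheduling a hard error, matching the CI abort above.
cmake -B build -DGGML_BLAS=ON -DGGML_METAL=OFF -DGGML_SCHED_NO_REALLOC=ON
cmake --build build --config Release -j
```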

@taronaeo
Member Author

taronaeo commented Dec 21, 2025

@taronaeo I think I can help here :) you're missing the GGML_SCHED_NO_REALLOC=ON flag, which is probably set in the CI.

For more details on what that's doing there, see #17276 and #17617

Thanks! I can reproduce this locally now, but I can't seem to fix the issue or tell where it originates.

Reading through all the linked PRs, does this have something to do with making the tensor->extra size consistent?

I understand that @/slaren is currently on hiatus and won't be able to help. Does anyone else have a better understanding of this problem? I'm a little lost.

Edit: Found the problem here: https://github.com/ggml-org/llama.cpp/pull/18027/changes#diff-2227850fec93779a72739b3ba38bfcb962c7a1e4fbfe183de91c39f0dd72b21dR511

Working on a fix...

@Djip007
Contributor

Djip007 commented Dec 24, 2025

What hardware did you use for the benchmark?
What did you get with the standard CPU backend?

@taronaeo
Member Author

What hardware did you use for the benchmark?

The benchmark was run on an Apple M1 MacBook Pro (M1 Pro processor) with 32 GB of RAM and only BLAS compiled. The Metal backend was explicitly disabled via -DGGML_METAL=OFF (default is ON).

What did you get with the standard CPU backend?

I have yet to measure it, but the intention of this PR is to improve the speed of the BLAS backend, not to compare it against the CPU backend.

@Djip007
Contributor

Djip007 commented Dec 24, 2025

Currently, the BLAS backend dequantizes the weight tensors on the fly, which runs on every matrix multiplication operation. This is not ideal, and more performance gains can be attained by pre-dequantizing the weight tensors and running the matrix multiplication calculation on the dequantized buffers.

We could just use a non-quantized (i.e. FP32) model, so what is the advantage here? And it needs a lot more RAM to work.

Did you find a way to use sgemm for GGML_OP_MUL_MAT_ID with this MoE model?

@taronaeo
Member Author

F32 models will run as-is without the need for the dequantization step during tensor init. No additional memory cost because no dequantized buffer is allocated.

F16 and BF16 models will still need to go through the dequantization process because cblas_sgemm operates on single-precision floats only.

Did you find a way to use sgemm for GGML_OP_MUL_MAT_ID with this MoE model?

GGML_OP_MUL_MAT_ID runs faster on CPU, so there is no need to implement it for the BLAS backend. It will automatically fall back to the CPU backend.

@Djip007
Contributor

Djip007 commented Dec 25, 2025

I reviewed quickly, so there may be some wrong comments ;)

@DaAwesomeP
Contributor

@taronaeo please also see #18205 which fixes some CMake options with the different BLAS backends.

@Djip007
Contributor

Djip007 commented Feb 25, 2026

https://github.com/OpenMathLib/OpenBLAS/blob/18638c70eff1f0d08e2833b2724deaa128d6a334/cblas.h#L483-L486

I had missed that OpenBLAS (and maybe other BLAS libraries) now has other ops:

  • sbgemm / sbgemm_batch for BF16, and even type conversion...
  • shgemm for FP16...

and maybe more later...

So there is some improvement possible.

@taronaeo
Member Author

Closing this PR as abandoned. While the performance gains were good, too many resources were required to keep the F32 weights in memory.

For now, we'll keep the BLAS backend as-is where it dequantizes the weights on-the-fly until a better strategy can be made.

@taronaeo taronaeo closed this Feb 27, 2026
