ggml-blas: refactor BLAS backend #18027
Conversation
Signed-off-by: Aaron Teo <[email protected]>
only for mul_mat and mul_mat_id ops
Made the PR ready for review by accident (pressed the wrong button).
I can't seem to reproduce the same error. I'll unfortunately have to use this PR as a test playground to figure out what is wrong with the refactor.
Thanks! I can reproduce this locally now, but I can't seem to fix the issue / don't know where it's originating from.
Edit: Found the problem here: https://github.com/ggml-org/llama.cpp/pull/18027/changes#diff-2227850fec93779a72739b3ba38bfcb962c7a1e4fbfe183de91c39f0dd72b21dR511 Working on a fix...
What hardware did you use for the benchmark?
The benchmark was run on an Apple MacBook Pro (M1 Pro, 32 GB RAM) with only BLAS compiled. The Metal backend was explicitly disabled via
I have yet to measure that. The intention of this PR is to improve the speed of the BLAS backend itself, not to compare it against the CPU backend.
this took unnecessarily long to debug
We can already use a non-quantized (i.e. F32) model, so what is the advantage here? Did you find a way to use sgemm for GGML_OP_MUL_MAT_ID with this MoE model?
F32 models will run as-is, without the dequantization step during tensor init, and with no additional memory cost since no dequantized buffer is allocated. F16 and BF16 models will still need to go through the dequantization process because
I reviewed quickly, so some comments may be wrong ;)
I had missed that; OpenBLAS (and maybe other BLAS libraries) now provide other ops, and maybe more later, so there is room for further improvement.
Closing this PR as abandoned. While the performance gains were good, too much memory was required to keep the F32 weights resident. For now, we'll keep the BLAS backend as-is, dequantizing the weights on the fly, until a better strategy can be found.
I saw this comment and felt a bit sad since we use the BLAS backend a lot: #14909 (comment)
Currently, the BLAS backend dequantizes the weight tensors on the fly, on every matrix multiplication operation. This is not ideal: more performance can be gained by dequantizing the weight tensors once up front and running the matrix multiplications on the pre-dequantized buffers.
We currently see a 53.56% performance improvement for Prompt Processing, and 107.79% for Token Generation.
Next on the roadmap will probably be the following:
But this PR focuses more on improving the performance first.
Changes
`test-backend-ops.cpp` now sets weight tensors to `GGML_BACKEND_BUFFER_USAGE_WEIGHTS` instead of `GGML_BACKEND_BUFFER_USAGE_ANY`, which is used to detect when to dequantize a tensor.

Performance Benchmark
$ build/bin/llama-bench -m ~/Library/Caches/llama.cpp/LiquidAI_LFM2-8B-A1B-GGUF_LFM2-8B-A1B-Q4_K_M.gguf -r 1 -t 8

This PR
build: 717531b (7364)
Upstream
build: 254098a (7399)
AI Declaration: AI has been used to modify the `test-backend-ops.cpp` code to set weight tensor usage to `GGML_BACKEND_BUFFER_USAGE_WEIGHTS`, because I couldn't figure it out without it crashing. Modifications to the BLAS backend were written by a human.