ggml-cpu: arm64: q6_K repack gemm and gemv (and generic) implementations (dotprod)#19360
Alcpz merged 5 commits into ggml-org:master from
Conversation
68e615e to 8d1d4b3
ggerganov left a comment
You should be able to merge now (use squash + merge).
    UNUSED(bs);
    UNUSED(nr);

    float sumf[8];
@Alcpz, this should use the templated parameter N.
Will address in a different PR. Thanks for flagging.
ggml-cpu: arm64: q6_K repack gemm and gemv (and generic) implementations (dotprod) (ggml-org#19360)
* First working version of GEMM and GEMV
* interleave loads and compute
* Clang-format
* Added missing fallback. Removed tested TODO.
* Swap M and N to be consistent with the repack template convention
May I know why we have the same repack format for all ISAs? Wouldn't it be better to specialise it per ISA? I see that at least on x86 it could be made much faster if we didn't need permute and shuffle. cc @ggerganov
What do you mean? I think we select the repack type per ISA here: llama.cpp/ggml/src/ggml-cpu/repack.cpp, lines 3392 to 3522 (at e2763a6).
Oh right, then technically I can create a separate repack for x86 and see if it helps. Thanks!
Basically the idea was that we do a fixup for getting
Yes, it should be possible to specialize the repacks any way you need. It's just a balance of code complexity, and the lack of testing infrastructure makes it a bit difficult to validate the repack implementations.
Also, while we're on the topic: do you have any ideas to optimize the mul-mat-id vec implementation used in hybrid inference? That path doesn't use repack, and it could benefit from the generally faster mul-mat-vec impl. Two ideas I have: one is to accumulate multiple rows together to reuse the src1 activation, and the other is to fuse the mul-mat together with the gate (like we do in CUDA).
I did some work on this in #14918. It can be improved for sure, but offloading to the GPU is pretty much always better. I just don't see any interesting use cases for CPU inference, so it hasn't been a priority for me.
I think the very relevant use-case is still
I see. Yes, the vec path for mmid is useful to optimize, but I haven't looked into that.
Same as #19356, but for Q6_K.
PR contents:
Same testing methodology: llama-cli output, outputs of the GEMM and GEMV kernels, and perplexity to double-check prompt processing.
Performance
(built with -mcpu=cortex-a76+dotprod+noi8mm+nosve)

Perplexity

llama-cli
llama-cli using repack
llama-cli using generic