Conversation
ggml/src/ggml-cpu/ggml-cpu-hbm.cpp
I'm not sure this is correct, and I cannot test it. It may not work or build on the master branch either.
It could probably be removed; the normal CPU buffer type calls ggml_aligned_malloc, which already uses HBM, so at the moment this buffer type serves no purpose.
ggml/src/ggml-cpu/ggml-cpu.c
It looks to me like this was not written for dynamic repacking, only for the "native" Q4_0_N_M packing.
I left it commented out; it needs some work to be usable with dynamic repacking.
We should try to fix this to keep support for the aarch64 types in models with experts.
(force-pushed 36a0406 to 655a3fb)
Overall looks good. I am not sure about removing support for current Q4_0_x_x models, but I guess if we are going to do it, it is better to do it sooner rather than later.
Yes, that will be the main/difficult choice:
@slaren I still need your expertise so as not to make too many mistakes. I was looking for where the size is calculated. If that is the place, it looks to me like the size is not calculated correctly for llamafile and Q4_0 repacking:
Note: I'm trying to make it more generic to make it easier to reintegrate the AMX backend, so it may not be useful to fix it for now.
It's OK if we over-allocate a bit of memory for
Isn't
Yes, that is the case for Q4_0_N_M, so it is not critical for now, even if internally it is more of a Q8_0_N. But it may not work in other/future cases.
If we remove the old API and make the CPU backend accessible only through ggml-backend, then there will be a context that can be used to store the work buffer. Then the work buffer could simply be a |
So you confirm that, for now, this is where the size is calculated.
Yes, the size is calculated in the function |
(force-pushed 655a3fb to e772df4)
(force-pushed a411d95 to fd768e0)
(force-pushed fd768e0 to 16154eb)
I can't find how to enable C++17 for macOS-latest-swift / Xcode...
(force-pushed 16154eb to dc8adeb)
OK, now that #10570 is merged and C++17 is the default, I need to do some more work.
(force-pushed dc8adeb to 1b29245)
(force-pushed 1b29245 to 733f891)
@slaren @ggerganov what do you think of this refactor? I tried to make adding a "cpu-extra-buffer" simpler and more general. 🤞
slaren
left a comment
The design looks good, this is a good improvement. Just reformat the code according to the .clang-format file and remove outdated comments.
(force-pushed 8e5bd04 to b14b471)
I updated the size checks; it should be better like this. 🤞
We should add debug logs about what repacks are applied to what tensors, so that when running with |
Co-authored-by: Georgi Gerganov <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
(force-pushed be3c64b to 1221d13)
@ggerganov: I added two logs, is that what you were thinking?
Yes, perfect. Should we merge this or is there anything else you are planning to do?
For me, we can merge it (if the CI succeeds 🤞).
@slaren @ggerganov thanks for all your reviews and time.
Just saw the opened bug. Is the implication of this change that I should no longer be making Q4_0_N_M quants? They seem to be fully removed.
Yes, support for |
* rename ggml-cpu-aarch64.c to .cpp
* reformat extra cpu backend.
- clean Q4_0_N_M and IQ4_0_N_M
- remove from "file" tensor type
- allow only with dynamic repack
- extract cpu extra bufts and convert to C++
- hbm
- "aarch64"
- more generic use of extra buffer
- generalise extra_supports_op
- new API for "cpu-accel":
- amx
- aarch64
* clang-format
* Clean Q4_0_N_M ref
Enable restrict on C++
* add op GGML_OP_MUL_MAT_ID for Q4_0_N_M with runtime repack
* added/corrected control on tensor size for Q4 repacking.
* Update ggml/src/ggml-cpu/ggml-cpu-aarch64.cpp
Co-authored-by: Georgi Gerganov <[email protected]>
* Update ggml/src/ggml-cpu/ggml-cpu-aarch64.cpp
Co-authored-by: Georgi Gerganov <[email protected]>
* add debug logs on repacks.
---------
Co-authored-by: Georgi Gerganov <[email protected]>
goal: consolidation of the cpu backend for reintegration of the AMX backend.