get_rows & dequantize function implementation for repacked weights of type q6_K (q6_Kx8) #16743
swetha097 wants to merge 2 commits into ggml-org:master from
Conversation
@swetha097 I've come across an issue due to the lack of support for GET_ROWS in CPU_REPACK (described in #20396). I think your PR essentially solves the problem, but it's been open for a while. Can I help you move this forward somehow (testing, rebasing)? This seemed blocked by the lack of support for Q6_K repack, but despite #15275 still being open, other contributions have already enabled q6_K, so it should be fine to review and merge this in.
Hi @Alcpz |
I can give a hand on the other PR. This one should be pretty straightforward. As I mentioned, #20396 is essentially this PR, adapted a bit. Once you are confident it's good to go, ping ggerganov. Slaren is taking a break, so he won't be able to help with the review. Edit: Also ping me if you need help here as well.
@Alcpz It's difficult to extend the repack logic without having a testing infrastructure for the extra buffer types in place first (see the discussion in ggml-org/whisper.cpp#3223). So for now we avoid such changes.
I understand. It's been a pain to test my PRs and, honestly, I'm grateful that you agreed to merge those (which add a few new repack types). I had a chat with @tdakhran and he found that models with tied embeddings currently duplicate the tensors; after investigating, I landed on this PR. The lack of GET_ROWS causes this duplication, as we now need one repacked copy of the tensor and one without repacking. For small models this results in a very high memory footprint, which heavily affects low-memory devices, so I was looking into whether I could help move this forward somehow.
I think so, though it's quite low priority on my end. We basically need a mechanism to exercise and verify all of the repack logic. This also requires CI workflows and respective hardware that would run it regularly. |
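To illustrate why GET_ROWS needs dedicated support here: with x8 repacking, the quantized blocks of 8 consecutive rows are interleaved into one stream, so a plain row-wise copy no longer works and the operation has to de-interleave before (or while) dequantizing. Below is a minimal, hypothetical Python sketch of the layout idea only; the constants, helper names, and block contents are illustrative and are not the ggml implementation:

```python
# Conceptual sketch (NOT the actual ggml code): in an x8 repacked buffer,
# block b of rows r..r+7 is stored consecutively, for b = 0, 1, 2, ...
INTERLEAVE = 8       # rows interleaved per group (the "x8" in q6_Kx8)
BLOCKS_PER_ROW = 4   # blocks per row; toy value for illustration

def repack(rows):
    """Interleave the blocks of every group of 8 rows."""
    packed = []
    for g in range(0, len(rows), INTERLEAVE):
        group = rows[g:g + INTERLEAVE]
        for b in range(BLOCKS_PER_ROW):
            for row in group:
                packed.append(row[b])
    return packed

def get_row(packed, i):
    """Recover row i from the interleaved stream (what GET_ROWS must do)."""
    g, r = divmod(i, INTERLEAVE)
    base = g * INTERLEAVE * BLOCKS_PER_ROW
    return [packed[base + b * INTERLEAVE + r] for b in range(BLOCKS_PER_ROW)]

rows = [[f"r{i}b{b}" for b in range(BLOCKS_PER_ROW)] for i in range(16)]
packed = repack(rows)
assert all(get_row(packed, i) == rows[i] for i in range(16))
```

In the real kernel the "blocks" are quantized q6_K super-blocks and the extraction is fused with dequantization, but the indexing problem is the same.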
NOTE: Creating the PR with the changes required for whisper.cpp here, as llama.cpp already includes test-backend-ops.
system_info: n_threads = 4 / 12 | WHISPER : COREML = 0 | OPENVINO = 0 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
master branch commit - swetha097/whisper.cpp@fc45bb8
q6_K repacking commit (block interleaving approach for Q6_K quantization on x64/x86 SIMD architectures) - swetha097/whisper.cpp@d89aaf2
development (get_rows) branch commit - swetha097/whisper.cpp@de9839e
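For reference, the repacked (q6_Kx8) path has to reproduce the scalar Q6_K dequantization, which works on 256-value super-blocks: 6-bit quants split into 128 low-nibble bytes (`ql`) and 64 bytes holding the upper 2 bits (`qh`), plus 16 signed 8-bit scales and one fp16 super-block scale `d`. Below is a Python sketch of that scalar loop, modeled after ggml's reference dequantization; treat the exact offsets as an approximation rather than authoritative:

```python
QK_K = 256  # values per Q6_K super-block

def dequantize_q6_K(d, ql, qh, scales):
    """Scalar dequantization of one Q6_K super-block (sketch of the
    reference layout: 128 low-nibble bytes, 64 high-2-bit bytes,
    16 int8 scales, fp16 super-block scale d)."""
    y = [0.0] * QK_K
    yo, qlo, qho, so = 0, 0, 0, 0
    for _ in range(0, QK_K, 128):  # two 128-value halves
        for l in range(32):
            s = l // 16
            # reassemble each 6-bit value from 4 low bits + 2 high bits,
            # then shift to the signed range [-32, 31]
            q1 = ((ql[qlo + l]      & 0xF) | (((qh[qho + l] >> 0) & 3) << 4)) - 32
            q2 = ((ql[qlo + l + 32] & 0xF) | (((qh[qho + l] >> 2) & 3) << 4)) - 32
            q3 = ((ql[qlo + l]      >> 4)  | (((qh[qho + l] >> 4) & 3) << 4)) - 32
            q4 = ((ql[qlo + l + 32] >> 4)  | (((qh[qho + l] >> 6) & 3) << 4)) - 32
            y[yo + l]      = d * scales[so + s]     * q1
            y[yo + l + 32] = d * scales[so + s + 2] * q2
            y[yo + l + 64] = d * scales[so + s + 4] * q3
            y[yo + l + 96] = d * scales[so + s + 6] * q4
        yo += 128; qlo += 64; qho += 32; so += 8
    return y
```

The repacked kernels perform this same reconstruction with AVX2/AVX512 intrinsics over the interleaved block layout.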
Model for performance tests downloaded from: https://huggingface.co/ggerganov/whisper.cpp/blob/main/ggml-base.en.bin and quantized to q6_K
This patch was also tested with the llama.cpp repository, and the perplexity of Q6_K models was verified to be the same before and after the changes:
Final estimate: PPL = 5.3669 +/- 0.13305
Model used for the perplexity test, quantized from: https://huggingface.co/meta-llama/Llama-2-7b
This PR is to be merged after: Q6_K - Block Interleaving Implementation for x86 SIMD (AVX512/AVX2) #15275